
Triton Performance Engineering

The goal is not maximum throughput. The goal is maximum useful throughput while meeting latency, correctness, cost, and reliability constraints.

flowchart LR
  SLO[p99 and error SLO] --> Load[Realistic load profile]
  Load --> Baseline[Baseline measurement]
  Baseline --> Hypothesis[One tuning hypothesis]
  Hypothesis --> Experiment[Controlled experiment]
  Experiment --> Gate{Goodput improves without SLO regression}
  Gate -- yes --> Keep[Keep change]
  Gate -- no --> Revert[Revert and record]
  Keep --> Next[Next bottleneck]
  Revert --> Next

Use the word goodput:

Goodput is the rate of successful inferences completed within the SLO and correctness constraints.
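
To make the definition concrete, here is a minimal sketch of how goodput could be computed from per-request results. The `RequestResult` record, its field names, and the 100 ms SLO are assumptions for illustration; they are not part of Triton or any of its tools.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    # Hypothetical record; field names are illustrative only.
    latency_ms: float
    succeeded: bool  # transport and server reported success
    correct: bool    # passed whatever correctness check applies


def goodput(results, window_s, slo_ms):
    """Requests per second that succeeded, were correct, and met the latency SLO."""
    good = sum(
        1
        for r in results
        if r.succeeded and r.correct and r.latency_ms <= slo_ms
    )
    return good / window_s


results = [
    RequestResult(40.0, True, True),
    RequestResult(120.0, True, True),  # too slow for a 100 ms SLO
    RequestResult(35.0, False, True),  # failed
    RequestResult(50.0, True, False),  # wrong answer
]

raw_throughput = len(results) / 1.0
print(raw_throughput, goodput(results, window_s=1.0, slo_ms=100.0))
```

In this toy window the raw throughput is four requests per second, but only one request counts toward goodput.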

| Tool | Use |
| --- | --- |
| Prometheus metrics | Production request/GPU/server visibility. |
| perf_analyzer | Generate load and compare latency/throughput behavior. |
| Model Analyzer | Search model config space: batch size, instance count, GPU memory, concurrency. |
| GenAI-Perf | Benchmark LLM-style serving behavior. |
| DCGM | GPU health/utilization/memory/ECC/Xid telemetry. |
| Nsight / backend profilers | Deep kernel/runtime analysis when needed. |

For OOM triage, accelerator utilization, and “is the GPU actually busy?” interview drills, use the companion runbook: GPU Profiling, OOM, and Utilization.

Triton exposes Prometheus metrics, commonly at:

http://localhost:8002/metrics
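
As a rough sketch, the endpoint can be read with the standard library alone. The parser below handles only the simple `name{labels} value` lines of the Prometheus text format (it assumes label values contain no `}` or escaped quotes), and the sample text is made up for illustration. In production you would normally let Prometheus scrape the endpoint rather than poll it by hand.

```python
import re
import urllib.request

# Simplified matcher for Prometheus text-format sample lines.
LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = LINE.match(line)
        if m:
            labels = dict(LABEL.findall(m.group("labels") or ""))
            samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def fetch_metrics(url="http://localhost:8002/metrics", timeout_s=2.0):
    """Fetch and parse the metrics page; needs a running server at `url`."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative sample text, not captured from a real server.
SAMPLE = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet",version="1"} 1200
nv_inference_queue_duration_us{model="resnet",version="1"} 3600000
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```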

Production scrape labels should let you split by:

  • Cluster.
  • Namespace.
  • Pod.
  • Model.
  • Version.
  • GPU.
  • Node.
  • Tenant or workload class if safely bounded.

Avoid unbounded labels like request ID or user ID.

Track:

  • Request success/error counts.
  • Queue duration.
  • Compute input duration.
  • Inference compute duration.
  • Compute output duration.
  • Request duration.
  • Model load/unload status.
  • GPU utilization.
  • GPU memory.
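
Triton's duration metrics are cumulative counters in microseconds, so the useful number is usually a delta between two scrapes divided by the number of requests in that window. The sketch below assumes you have already collected counter values for one model and version into plain dictionaries; the helper name and the sample numbers are hypothetical.

```python
def per_request_averages(before, after):
    """Average per-request timings (ms) between two scrapes of cumulative counters.

    `before` and `after` map Triton counter names to values for one
    model/version. Returns None when no requests completed in the window.
    """
    requests = (
        after["nv_inference_request_success"]
        - before["nv_inference_request_success"]
    )
    if requests <= 0:
        return None

    def avg_ms(name):
        # Counter delta is in microseconds; convert to ms per request.
        return (after[name] - before[name]) / requests / 1000.0

    return {
        "requests": requests,
        "queue_ms": avg_ms("nv_inference_queue_duration_us"),
        "compute_infer_ms": avg_ms("nv_inference_compute_infer_duration_us"),
        "request_ms": avg_ms("nv_inference_request_duration_us"),
    }


# Made-up counter values for two scrapes.
before = {
    "nv_inference_request_success": 1000,
    "nv_inference_queue_duration_us": 2_000_000,
    "nv_inference_compute_infer_duration_us": 10_000_000,
    "nv_inference_request_duration_us": 13_000_000,
}
after = {
    "nv_inference_request_success": 1100,
    "nv_inference_queue_duration_us": 2_500_000,
    "nv_inference_compute_infer_duration_us": 11_200_000,
    "nv_inference_request_duration_us": 15_000_000,
}

print(per_request_averages(before, after))
```

These are averages, not percentiles. For p99 you still need client-side measurement or histogram-style metrics.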

Senior interpretation:

  • High queue + low GPU = scheduling/batching/concurrency problem.
  • Low queue + high compute = model/GPU execution problem.
  • High CPU + low GPU = preprocessing/serialization bottleneck.
  • GPU memory near limit = batch/instance/KV/cache risk.
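
The four rules above can be written down as a small classifier. Every threshold here is a placeholder assumption, not a recommended value; real cutoffs depend on the model and the SLO.

```python
def interpret(queue_ms, compute_ms, gpu_util, cpu_util, gpu_mem_frac,
              queue_high_ms=10.0, compute_high_ms=50.0,
              util_high=0.8, util_low=0.4, mem_high=0.9):
    """Map coarse signals to the likely bottleneck classes listed above.

    Utilization and memory arguments are fractions in [0, 1].
    All default thresholds are illustrative placeholders.
    """
    findings = []
    if queue_ms >= queue_high_ms and gpu_util <= util_low:
        findings.append("scheduling/batching/concurrency")
    if queue_ms < queue_high_ms and compute_ms >= compute_high_ms:
        findings.append("model/GPU execution")
    if cpu_util >= util_high and gpu_util <= util_low:
        findings.append("preprocessing/serialization")
    if gpu_mem_frac >= mem_high:
        findings.append("GPU memory headroom (batch/instance/KV cache)")
    return findings


# High queue time with an idle GPU points at scheduling, not the model.
print(interpret(queue_ms=25, compute_ms=8, gpu_util=0.2,
                cpu_util=0.3, gpu_mem_frac=0.5))
```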

Use perf_analyzer to test:

  • Concurrency sweeps.
  • Batch size behavior.
  • p50/p95/p99 latency.
  • Request payload realism.
  • HTTP vs gRPC overhead.
  • Streaming or sequence behavior if relevant.
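
A concurrency sweep might be scripted along these lines. The helper only builds the command line and does not run it. The model name `my_model` and the payload file `payloads.json` are placeholders, and the ports assume Triton's default HTTP (8000) and gRPC (8001) endpoints. Check the flags against the `perf_analyzer --help` output of your installed version before relying on them.

```python
import shlex


def perf_analyzer_cmd(model, protocol="grpc", concurrency=(1, 16, 2),
                      batch_size=1, input_data=None, percentile=99,
                      csv_path=None, host="localhost"):
    """Build (but do not run) a perf_analyzer command for a concurrency sweep."""
    port = {"http": 8000, "grpc": 8001}[protocol]  # assumed default ports
    cmd = [
        "perf_analyzer",
        "-m", model,
        "-i", protocol,
        "-u", f"{host}:{port}",
        "-b", str(batch_size),
        # start:end:step sweep over concurrent in-flight requests
        "--concurrency-range", ":".join(str(c) for c in concurrency),
        f"--percentile={percentile}",
    ]
    if input_data:
        cmd += ["--input-data", input_data]  # realistic payloads from a file
    if csv_path:
        cmd += ["-f", csv_path]  # write results as CSV
    return cmd


# Same sweep over both protocols to compare HTTP and gRPC overhead.
for proto in ("http", "grpc"):
    cmd = perf_analyzer_cmd("my_model", protocol=proto,
                            input_data="payloads.json",
                            csv_path=f"sweep_{proto}.csv")
    print(shlex.join(cmd))
```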

Do not benchmark only with tiny synthetic payloads.

Senior caveat:

A benchmark that does not match request shape and client deadline is a story about the benchmark, not production.

Model Analyzer helps explore:

  • Instance count.
  • Batch size.
  • Dynamic batching.
  • GPU memory tradeoffs.
  • Throughput/latency Pareto frontier.

Use it to build a deployment recommendation, then validate in staging/canary with production-like traffic.
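
Picking a point on the throughput/latency Pareto frontier can be sketched as follows. The configuration records and their numbers are invented for illustration. Model Analyzer produces its own reports, and this is not its API.

```python
def pareto_frontier(configs):
    """Keep configs that no other config beats on both throughput and p99."""
    frontier = []
    for c in configs:
        dominated = any(
            o is not c
            and o["throughput"] >= c["throughput"]
            and o["p99_ms"] <= c["p99_ms"]
            and (o["throughput"] > c["throughput"] or o["p99_ms"] < c["p99_ms"])
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c["p99_ms"])


def pick(configs, p99_budget_ms):
    """Highest-throughput frontier point that stays within the p99 budget."""
    eligible = [c for c in pareto_frontier(configs) if c["p99_ms"] <= p99_budget_ms]
    return max(eligible, key=lambda c: c["throughput"]) if eligible else None


# Hypothetical sweep results.
configs = [
    {"name": "A", "instances": 1, "max_batch": 4, "throughput": 400, "p99_ms": 20},
    {"name": "B", "instances": 2, "max_batch": 8, "throughput": 700, "p99_ms": 35},
    {"name": "C", "instances": 2, "max_batch": 16, "throughput": 650, "p99_ms": 60},
    {"name": "D", "instances": 4, "max_batch": 16, "throughput": 900, "p99_ms": 80},
]

print([c["name"] for c in pareto_frontier(configs)])
print(pick(configs, p99_budget_ms=50)["name"])
```

Here C drops out because B is both faster and higher-throughput, and D is excluded by the 50 ms budget even though it has the highest peak throughput.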

| Lever | Helps | Can hurt |
| --- | --- | --- |
| Dynamic batching | Throughput, GPU utilization. | p99, fairness. |
| Instance count | Concurrency. | Memory, contention. |
| TensorRT engine | GPU efficiency. | Build complexity, shape rigidity. |
| Precision FP16/INT8 | Throughput/memory. | Accuracy, calibration. |
| CPU thread pools | Pre/post speed. | CPU contention. |
| gRPC | Efficient binary protocol. | Client/proxy complexity. |
| Shared memory | Lower copy overhead. | Operational complexity. |
| Separate deployments | Isolation by SLO. | More capacity overhead. |

TensorRT can improve latency/throughput but introduces:

  • Engine build time.
  • Hardware/driver/runtime compatibility.
  • Precision and calibration choices.
  • Static/dynamic shape profiles.
  • Rebuild requirements for some changes.

Interview answer:

TensorRT is a strong production path when shapes and precision are understood. I would version engine build metadata with the model artifact so rollback and compatibility are clear.
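
One way to version engine build metadata alongside the artifact is a small sidecar record, sketched below. The `EngineBuildInfo` class, its fields, and the version strings are assumptions for illustration. The exact-match check is deliberately conservative: some TensorRT releases offer relaxed version or hardware compatibility modes, which this sketch does not model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EngineBuildInfo:
    # Hypothetical sidecar schema stored next to the engine file.
    model_version: str
    tensorrt_version: str
    cuda_version: str
    compute_capability: str  # GPU architecture the engine was built on
    precision: str           # "fp32", "fp16", or "int8"
    min_shape: tuple
    opt_shape: tuple
    max_shape: tuple


def incompatibilities(build, runtime_tensorrt, runtime_compute_capability):
    """List reasons the engine should not be loaded on this runtime."""
    problems = []
    if build.tensorrt_version != runtime_tensorrt:
        problems.append(
            f"TensorRT {runtime_tensorrt} != build {build.tensorrt_version}"
        )
    if build.compute_capability != runtime_compute_capability:
        problems.append(
            f"GPU sm {runtime_compute_capability} != build {build.compute_capability}"
        )
    return problems


# Illustrative values only.
info = EngineBuildInfo(
    model_version="7",
    tensorrt_version="10.0.1",
    cuda_version="12.4",
    compute_capability="8.0",
    precision="fp16",
    min_shape=(1, 3, 224, 224),
    opt_shape=(8, 3, 224, 224),
    max_shape=(32, 3, 224, 224),
)

print(json.dumps(asdict(info), indent=2))  # write this next to the engine
print(incompatibilities(info, "10.0.1", "8.6"))
```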

Variable shapes are a production issue:

  • Too narrow profiles reject valid traffic.
  • Too wide profiles can reduce optimization quality.
  • Rare huge requests can dominate p99 and memory.

Mitigation:

  • Request shape telemetry.
  • Cost classes.
  • Admission limits.
  • Separate deployments for large requests.
  • Explicit shape profiles.
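
Cost classes and admission limits could be combined in a small gate in front of the server, roughly as sketched here. The element-count thresholds and route names are made-up placeholders. A real cost model would more likely use tokens, pixels, or measured compute time.

```python
import math


def admit(shape, small_max=200_000, hard_max=2_000_000):
    """Classify a request by input size and decide where it goes.

    `shape` is the full input tensor shape including the batch dimension.
    Thresholds are illustrative placeholders, not recommendations.
    """
    elements = math.prod(shape)
    if elements > hard_max:
        return ("reject", elements)       # admission limit
    if elements > small_max:
        return ("route_large", elements)  # separate deployment for big requests
    return ("route_default", elements)


for shape in [(1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224)]:
    print(shape, admit(shape))
```

Logging the returned element count per request also provides the request shape telemetry needed to tune the thresholds later.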

When p99 is high, check:

  • Queue delay.
  • Batch wait.
  • Tenant mix.
  • Long-tail request shapes.
  • Retries.
  • Noisy co-located models.
  • CPU/GPU memory pressure.

flowchart TD
  Alert[p99 high] --> Errors{Errors or retries rising}
  Errors -- yes --> Transport[Check gateway, client deadlines, overload]
  Errors -- no --> Queue{Triton queue time high}
  Queue -- yes --> Batching[Batching, admission, instance count]
  Queue -- no --> Compute{GPU busy high}
  Compute -- yes --> Saturation[Scale, optimize engine, reduce batch]
  Compute -- no --> CPU[CPU preprocess, tokenizer, serialization, network]
  CPU --> Trace[Use traces or span timing]
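
The same decision tree can be expressed as a function, which is handy for runbook automation or for rehearsing the order of checks. The boolean inputs are assumed to come from your own alert thresholds, which this sketch does not define.

```python
def triage_p99(errors_or_retries_rising, queue_time_high, gpu_busy_high):
    """Walk the p99 triage flowchart and return the next area to inspect."""
    if errors_or_retries_rising:
        return "transport: check gateway, client deadlines, overload"
    if queue_time_high:
        return "queueing: batching, admission, instance count"
    if gpu_busy_high:
        return "saturation: scale, optimize engine, reduce batch"
    # GPU is not the limit: look upstream and confirm with traces.
    return "cpu path: preprocess, tokenizer, serialization, network; use traces"


print(triage_p99(errors_or_retries_rising=False,
                 queue_time_high=True,
                 gpu_busy_high=False))
```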

When throughput is low and the GPU is not busy, check:

  • CPU preprocessing.
  • Client/gateway bottleneck.
  • Serialization.
  • Model repository/artifact IO.
  • Single-threaded Python backend.
  • Dependency calls in Python backend.

When requests fail or time out, check:

  • Memory OOM.
  • Timeout threshold.
  • Thermal/power throttling.
  • ECC/Xid.
  • Batch too large.
  • Instance count too high.

Before production expansion:

  • p99 below threshold.
  • Error rate below threshold.
  • Queue time stable.
  • GPU memory headroom.
  • No Xid/ECC anomaly.
  • Correctness samples pass.
  • No retry storm.
  • Rollback tested.
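
The checklist can be turned into an explicit gate so that a canary is promoted only when every item passes. The observation keys and limit values below are hypothetical; substitute whatever your monitoring actually exposes.

```python
def expansion_gate(obs, limits):
    """Return the names of failed checks; an empty list means safe to expand."""
    checks = {
        "p99": obs["p99_ms"] <= limits["p99_ms"],
        "error_rate": obs["error_rate"] <= limits["error_rate"],
        "queue_stable": obs["queue_ms_slope"] <= limits["queue_ms_slope"],
        "gpu_mem_headroom": obs["gpu_mem_frac"] <= limits["gpu_mem_frac"],
        "no_xid_ecc": obs["xid_ecc_events"] == 0,
        "correctness": obs["correctness_pass_rate"] >= limits["correctness_pass_rate"],
        "no_retry_storm": obs["retry_ratio"] <= limits["retry_ratio"],
        "rollback_tested": bool(obs["rollback_tested"]),
    }
    return [name for name, ok in checks.items() if not ok]


# Placeholder limits and canary observations.
limits = {
    "p99_ms": 100.0, "error_rate": 0.001, "queue_ms_slope": 0.0,
    "gpu_mem_frac": 0.85, "correctness_pass_rate": 0.999, "retry_ratio": 0.02,
}
canary = {
    "p99_ms": 82.0, "error_rate": 0.0004, "queue_ms_slope": 0.0,
    "gpu_mem_frac": 0.91, "xid_ecc_events": 0,
    "correctness_pass_rate": 1.0, "retry_ratio": 0.01,
    "rollback_tested": True,
}

print(expansion_gate(canary, limits))
```

In this example the canary meets its latency and error targets but is held back because GPU memory headroom is too thin.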

Senior Answer To “How Would You Tune Triton?”

I would first define the SLO and workload class. Then I would establish a baseline with realistic payloads, sweep concurrency/batching/instance count using perf_analyzer and Model Analyzer, validate GPU and request metrics, pick a Pareto point, and canary it under real traffic. I would optimize for goodput and p99, not just peak throughput.