Triton Performance Engineering

Performance Goal

The goal is not maximum throughput. The goal is maximum useful throughput while meeting latency, correctness, cost, and reliability constraints.

flowchart LR
  SLO[p99 and error SLO] --> Load[Realistic load profile]
  Load --> Baseline[Baseline measurement]
  Baseline --> Hypothesis[One tuning hypothesis]
  Hypothesis --> Experiment[Controlled experiment]
  Experiment --> Gate{Goodput improves without SLO regression}
  Gate -- yes --> Keep[Keep change]
  Gate -- no --> Revert[Revert and record]
  Keep --> Next[Next bottleneck]
  Revert --> Next

Use the word goodput:

The rate of successful inferences completed within SLO and correctness constraints.
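As a minimal sketch of that definition, the snippet below counts only requests that succeeded, passed a correctness check, and met the latency SLO. The `RequestResult` record and its fields are hypothetical names for illustration; in practice these values would come from load-test logs or traces.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    latency_ms: float
    succeeded: bool  # server returned a non-error response
    correct: bool    # hypothetical flag: output passed a correctness check


def throughput(results, window_s):
    """All completed requests per second, good or not."""
    return len(results) / window_s


def goodput(results, window_s, slo_ms):
    """Requests per second that succeeded, were correct, and met the SLO."""
    good = sum(
        1 for r in results
        if r.succeeded and r.correct and r.latency_ms <= slo_ms
    )
    return good / window_s


results = [
    RequestResult(50.0, True, True),    # counts toward goodput
    RequestResult(150.0, True, True),   # too slow for a 100 ms SLO
    RequestResult(40.0, False, True),   # failed
    RequestResult(30.0, True, False),   # wrong answer
]
print(throughput(results, window_s=2.0), goodput(results, window_s=2.0, slo_ms=100.0))
```

A tuning change that raises `throughput` while `goodput` stays flat or drops has only moved work past the SLO.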

Measurement Tools

Tool	Use
Prometheus metrics	Production request/GPU/server visibility.
`perf_analyzer`	Generate load and compare latency/throughput behavior.
Model Analyzer	Search the model config space (batch size, instance count, concurrency) and compare GPU memory, latency, and throughput.
GenAI-Perf	Benchmark LLM-style serving behavior.
DCGM	GPU health/utilization/memory/ECC/Xid telemetry.
Nsight / backend profilers	Deep kernel/runtime analysis when needed.

For OOM triage, accelerator utilization, and “is the GPU actually busy?” interview drills, use the companion runbook: GPU Profiling, OOM, and Utilization.

Triton Metrics

Triton exposes Prometheus metrics, by default on port 8002:

http://localhost:8002/metrics

Production scrape labels should let you split by:

Cluster.
Namespace.
Pod.
Model.
Version.
GPU.
Node.
Tenant or workload class if safely bounded.

Avoid unbounded labels like request ID or user ID.
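To make the label discussion concrete, here is a rough sketch of parsing the Prometheus text format into (name, labels, value) samples. The sample text and its label values are made up for illustration, and the parser is deliberately simplified (it does not handle escaped quotes or braces inside label values); a production setup would normally rely on a Prometheus scraper instead.

```python
import re

# Illustrative sample only; the values are invented.
SAMPLE = '''# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet",version="1"} 1200
nv_inference_request_failure{model="resnet",version="1"} 3
nv_inference_queue_duration_us{model="resnet",version="1"} 4800000
'''

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = LINE.match(line)
        if m is None:
            continue  # ignore lines this simplified parser cannot read
        labels = dict(LABEL.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("model"), value)
```

The `model` and `version` labels here are the kind of bounded dimensions worth keeping; a per-request label would multiply the number of series without limit.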

Metrics To Know

Track:

Request success/error counts.
Queue duration.
Compute input duration.
Inference compute duration.
Compute output duration.
Request duration.
Model load/unload status.
GPU utilization.
GPU memory.

Senior interpretation:

High queue + low GPU = scheduling/batching/concurrency problem.
Low queue + high compute = model/GPU execution problem.
High CPU + low GPU = preprocessing/serialization bottleneck.
GPU memory near limit = OOM risk from batch size, instance count, or KV cache.

perf_analyzer Strategy

Use perf_analyzer to test:

Concurrency sweeps.
Batch size behavior.
p50/p95/p99 latency.
Request payload realism.
HTTP vs gRPC overhead.
Streaming or sequence behavior if relevant.

Do not benchmark only with tiny synthetic payloads.

Senior caveat:

A benchmark that does not match request shape and client deadline is a story about the benchmark, not production.
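One way to keep sweeps repeatable is to build the perf_analyzer command line in code and store it with the results. The helper below is a sketch: the model name, the `payloads.json` file, and the default concurrency range are placeholders, and the endpoint assumes Triton's default gRPC port 8001.

```python
def perf_analyzer_cmd(model, url="localhost:8001", protocol="grpc",
                      concurrency=(1, 16, 1), batch_size=1,
                      percentile=99, input_data=None, csv_out=None):
    """Build an argv list for a perf_analyzer concurrency sweep."""
    start, end, step = concurrency
    cmd = [
        "perf_analyzer",
        "-m", model,
        "-u", url,
        "-i", protocol,
        "-b", str(batch_size),
        "--concurrency-range", f"{start}:{end}:{step}",
        "--percentile", str(percentile),  # report this percentile, not the average
    ]
    if csv_out:
        cmd += ["-f", csv_out]            # write results as CSV
    if input_data:
        cmd += ["--input-data", input_data]  # realistic payloads, not random data
    return cmd


# "my_model" and "payloads.json" are hypothetical names.
cmd = perf_analyzer_cmd("my_model", input_data="payloads.json")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a host where perf_analyzer is installed; running the same sweep over HTTP only requires changing `protocol` and `url`.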

Model Analyzer Strategy

Model Analyzer helps explore:

Instance count.
Batch size.
Dynamic batching.
GPU memory tradeoffs.
Throughput/latency Pareto frontier.

Use it to build a deployment recommendation, then validate in staging/canary with production-like traffic.
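Picking a point on the throughput/latency Pareto frontier can be sketched in a few lines. The configurations and numbers below are invented for illustration, not Model Analyzer output; the logic keeps only configurations that no other configuration beats on both axes, then chooses the highest-throughput one that meets the p99 SLO.

```python
def pareto_frontier(points):
    """points: list of (name, throughput, p99_ms).

    A point is dominated if another point has throughput at least as high
    and p99 at least as low, and is strictly better on one of the two.
    """
    frontier = []
    for name, tput, p99 in points:
        dominated = any(
            (t2 >= tput and l2 <= p99) and (t2 > tput or l2 < p99)
            for _, t2, l2 in points
        )
        if not dominated:
            frontier.append((name, tput, p99))
    return sorted(frontier, key=lambda p: p[2])


def pick(points, p99_slo_ms):
    """Highest-throughput frontier point that meets the p99 SLO, or None."""
    ok = [p for p in pareto_frontier(points) if p[2] <= p99_slo_ms]
    return max(ok, key=lambda p: p[1]) if ok else None


# Hypothetical sweep results: (config name, inferences/sec, p99 in ms).
points = [("a", 100, 20), ("b", 200, 35), ("c", 180, 40), ("d", 300, 80)]
print(pareto_frontier(points))
print(pick(points, p99_slo_ms=50))
```

The chosen point is only a recommendation; it still has to survive staging or canary traffic.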

Performance Tuning Levers

Lever	Helps	Can hurt
Dynamic batching	Throughput, GPU utilization.	p99, fairness.
Instance count	Concurrency.	Memory, contention.
TensorRT engine	GPU efficiency.	Build complexity, shape rigidity.
Precision FP16/INT8	Throughput/memory.	Accuracy, calibration.
CPU thread pools	Pre/post speed.	CPU contention.
gRPC	Efficient binary protocol.	Client/proxy complexity.
Shared memory	Lower copy overhead.	Operational complexity.
Separate deployments	Isolation by SLO.	More capacity overhead.
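The first two levers map to fields in a model's `config.pbtxt`. The helper below renders just that fragment so that a sweep can generate variants; it is a sketch, the parameter values are placeholders, and a real config also needs the model's name, backend or platform, and input/output definitions.

```python
def render_batching_config(max_batch_size, preferred_batch_sizes,
                           max_queue_delay_us, instance_count):
    """Render the batching and instance-group part of a config.pbtxt."""
    preferred = ", ".join(str(b) for b in preferred_batch_sizes)
    return (
        f"max_batch_size: {max_batch_size}\n"
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {preferred} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        "instance_group [\n"
        "  {\n"
        f"    count: {instance_count}\n"
        "    kind: KIND_GPU\n"
        "  }\n"
        "]\n"
    )


# Placeholder values; the right ones come from measurement, not from this sketch.
fragment = render_batching_config(
    max_batch_size=8,
    preferred_batch_sizes=[4, 8],
    max_queue_delay_us=500,
    instance_count=2,
)
print(fragment)
```

A longer `max_queue_delay_microseconds` gives the batcher more time to fill a batch, which is exactly the throughput-versus-p99 trade named in the table.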

TensorRT Tuning

TensorRT can improve latency/throughput but introduces:

Engine build time.
Hardware/driver/runtime compatibility.
Precision and calibration choices.
Static/dynamic shape profiles.
Rebuild requirements for some changes.

Interview answer:

TensorRT is a strong production path when shapes and precision are understood. I would version engine build metadata with the model artifact so rollback and compatibility are clear.
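A sketch of what versioned build metadata might look like: a small record stored next to the engine file, plus a check run before loading or rolling back. All field names and values are hypothetical; the check covers only TensorRT version and GPU compute capability, and a real pipeline may need to track more.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EngineBuildMetadata:
    model_version: str
    tensorrt_version: str
    compute_capability: str  # of the GPU the engine was built on
    precision: str
    shape_profile: str       # free-form description of min/opt/max shapes


def incompatibilities(meta, runtime):
    """Names of checked fields where the runtime differs from the build."""
    checked = ("tensorrt_version", "compute_capability")
    return [f for f in checked if getattr(meta, f) != runtime.get(f)]


# Hypothetical values for illustration.
meta = EngineBuildMetadata(
    model_version="7",
    tensorrt_version="10.0.1",
    compute_capability="8.9",
    precision="fp16",
    shape_profile="min=1x3x224x224 opt=4x3x224x224 max=8x3x224x224",
)

# Store this JSON alongside the engine in the model artifact.
print(json.dumps(asdict(meta), indent=2))
print(incompatibilities(meta, {"tensorrt_version": "10.0.1",
                               "compute_capability": "9.0"}))
```

Running the check at deploy time turns a confusing engine load failure into an explicit, named mismatch.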

Shape Profiles

Variable shapes are a production issue:

Profiles that are too narrow reject valid traffic.
Profiles that are too wide can reduce optimization quality.
Rare huge requests can dominate p99 and memory.

Mitigation:

Request shape telemetry.
Cost classes.
Admission limits.
Separate deployments for large requests.
Explicit shape profiles.
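The cost-class and admission-limit ideas can be sketched as a small router in front of the server. The element count is used here as a rough cost proxy, and the class names and thresholds are hypothetical; real limits should come from request shape telemetry.

```python
import math

# Hypothetical limits in tensor elements, ordered from smallest to largest.
COST_CLASSES = (
    (1_000_000, "small"),    # served by the default deployment
    (16_000_000, "large"),   # routed to a separate deployment
)


def classify_request(shape, classes=COST_CLASSES):
    """Map an input shape to a cost class, or "reject" above the largest limit."""
    elements = math.prod(shape)
    for upper, name in classes:
        if elements <= upper:
            return name
    return "reject"


print(classify_request((1, 3, 224, 224)))    # 150,528 elements
print(classify_request((4, 3, 1024, 1024)))  # 12,582,912 elements
print(classify_request((8, 3, 1024, 1024)))  # 25,165,824 elements
```

Logging the class for every request also provides the shape telemetry needed to revisit the thresholds later.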

Performance Debug Playbooks

p99 High, Throughput Fine

Check:

Queue delay.
Batch wait.
Tenant mix.
Long-tail request shapes.
Retries.
Noisy co-located models.
CPU/GPU memory pressure.

flowchart TD
  Alert[p99 high] --> Errors{Errors or retries rising}
  Errors -- yes --> Transport[Check gateway, client deadlines, overload]
  Errors -- no --> Queue{Triton queue time high}
  Queue -- yes --> Batching[Batching, admission, instance count]
  Queue -- no --> Compute{GPU busy high}
  Compute -- yes --> Saturation[Scale, optimize engine, reduce batch]
  Compute -- no --> CPU[CPU preprocess, tokenizer, serialization, network]
  CPU --> Trace[Use traces or span timing]
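The same decision order can be written as a function, which is handy for an alert annotation or a runbook script. This is a direct transcription of the flowchart; how each boolean is derived from metrics (thresholds, windows) is left out and would have to be decided per service.

```python
def triage_p99(errors_or_retries_rising, queue_time_high, gpu_busy_high):
    """Return the next area to investigate, in the flowchart's order."""
    if errors_or_retries_rising:
        return "transport: check gateway, client deadlines, overload"
    if queue_time_high:
        return "batching: batching, admission, instance count"
    if gpu_busy_high:
        return "saturation: scale, optimize engine, reduce batch"
    return "cpu: preprocess, tokenizer, serialization, network; use traces"


print(triage_p99(False, True, False))
print(triage_p99(False, False, False))
```

The order matters: errors and retries are checked first because they can inflate both queue time and GPU load.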

GPU Low, Latency High

Check:

CPU preprocessing.
Client/gateway bottleneck.
Serialization.
Model repository/artifact IO.
Single-threaded Python backend.
Dependency calls in Python backend.

GPU High, Errors High

Check:

GPU out-of-memory (OOM).
Timeout threshold.
Thermal/power throttling.
ECC/Xid.
Batch too large.
Instance count too high.

Canary Gates

Before production expansion:

p99 below threshold.
Error rate below threshold.
Queue time stable.
GPU memory headroom.
No Xid/ECC anomaly.
Correctness samples pass.
No retry storm.
Rollback tested.
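The gates above can be expressed as an automated check that returns the names of the gates that failed. The `CanaryStats` fields and all default thresholds are hypothetical; "queue time stable" is interpreted here as staying within a ratio of the baseline, which is one possible reading.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    p99_ms: float
    error_rate: float
    queue_ms: float
    baseline_queue_ms: float
    gpu_mem_used_frac: float
    xid_ecc_events: int
    correctness_pass_rate: float
    retry_rate: float
    rollback_tested: bool


def failed_gates(s, p99_slo_ms, max_error_rate, max_queue_ratio=1.2,
                 max_gpu_mem_frac=0.9, min_correctness=1.0,
                 max_retry_rate=0.01):
    """Return the names of gates the canary did not pass (empty means go)."""
    checks = {
        "p99": s.p99_ms <= p99_slo_ms,
        "error_rate": s.error_rate <= max_error_rate,
        "queue_time": s.queue_ms <= max_queue_ratio * s.baseline_queue_ms,
        "gpu_memory_headroom": s.gpu_mem_used_frac <= max_gpu_mem_frac,
        "xid_ecc": s.xid_ecc_events == 0,
        "correctness": s.correctness_pass_rate >= min_correctness,
        "retries": s.retry_rate <= max_retry_rate,
        "rollback_tested": s.rollback_tested,
    }
    return [name for name, ok in checks.items() if not ok]


healthy = CanaryStats(120.0, 0.0005, 4.0, 3.8, 0.75, 0, 1.0, 0.002, True)
degraded = CanaryStats(250.0, 0.0005, 4.0, 3.8, 0.75, 1, 1.0, 0.002, True)
print(failed_gates(healthy, p99_slo_ms=200.0, max_error_rate=0.001))
print(failed_gates(degraded, p99_slo_ms=200.0, max_error_rate=0.001))
```

Returning gate names rather than a single boolean makes the reason for a blocked rollout visible in the deployment log.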

Senior Answer To “How Would You Tune Triton?”

I would first define the SLO and workload class. Then I would establish a baseline with realistic payloads, sweep concurrency/batching/instance count using perf_analyzer and Model Analyzer, validate GPU and request metrics, pick a Pareto point, and canary it under real traffic. I would optimize for goodput and p99, not just peak throughput.