Triton Performance Engineering
Performance Goal
The goal is not maximum throughput. The goal is maximum useful throughput while meeting latency, correctness, cost, and reliability constraints.
flowchart LR
SLO[p99 and error SLO] --> Load[Realistic load profile]
Load --> Baseline[Baseline measurement]
Baseline --> Hypothesis[One tuning hypothesis]
Hypothesis --> Experiment[Controlled experiment]
Experiment --> Gate{Goodput improves without SLO regression}
Gate -- yes --> Keep[Keep change]
Gate -- no --> Revert[Revert and record]
Keep --> Next[Next bottleneck]
Revert --> Next
Use the word goodput:
Successful inference completed within SLO and correctness constraints.
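To make the definition concrete, here is a minimal sketch that separates goodput from raw throughput. The `RequestResult` record and its fields are hypothetical names for illustration; a real pipeline would derive them from client logs or traces.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    latency_ms: float  # end-to-end latency as seen by the client
    ok: bool           # request completed without a transport/server error
    correct: bool      # output passed whatever correctness check applies


def throughput_rps(results, window_s):
    """All completed requests per second, good or not."""
    return len(results) / window_s


def goodput_rps(results, slo_ms, window_s):
    """Requests per second that succeeded, were correct, and met the SLO."""
    good = sum(
        1 for r in results
        if r.ok and r.correct and r.latency_ms <= slo_ms
    )
    return good / window_s


sample = [
    RequestResult(40.0, True, True),    # counts toward goodput
    RequestResult(250.0, True, True),   # too slow for a 100 ms SLO
    RequestResult(30.0, False, True),   # failed
    RequestResult(35.0, True, False),   # wrong answer
]
```

With a 100 ms SLO over a 2 second window, the sample has four completions but only one useful one, which is the gap the tuning loop should close.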
Measurement Tools
| Tool | Use |
|---|---|
| Prometheus metrics | Production request/GPU/server visibility. |
| perf_analyzer | Generate load and compare latency/throughput behavior. |
| Model Analyzer | Search model config space: batch size, instance count, GPU memory, concurrency. |
| GenAI-Perf | Benchmark LLM-style serving behavior. |
| DCGM | GPU health/utilization/memory/ECC/Xid telemetry. |
| Nsight / backend profilers | Deep kernel/runtime analysis when needed. |
For OOM triage, accelerator utilization, and “is the GPU actually busy?” interview drills, use the companion runbook: GPU Profiling, OOM, and Utilization.
Triton Metrics
Triton exposes Prometheus metrics, commonly at:
http://localhost:8002/metrics
Production scrape labels should let you split by:
- Cluster.
- Namespace.
- Pod.
- Model.
- Version.
- GPU.
- Node.
- Tenant or workload class if safely bounded.
Avoid unbounded labels like request ID or user ID.
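For ad hoc inspection of the metrics endpoint, a small parser is often enough. The sketch below is a simplified reader for the Prometheus text format, not a full implementation: it assumes no timestamps and no escaped quotes inside label values. The function names are illustrative.

```python
import re
from urllib.request import urlopen

# Matches lines such as: metric_name{label="value",...} 123
# The label block is optional.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:8002/metrics", timeout=5.0):
    """Fetch and parse the metrics page; the URL is the common default."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

In production, let Prometheus do the scraping and attach cluster/namespace/pod labels at scrape time; a parser like this is only for quick checks from a shell or notebook.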
Metrics To Know
Track:
- Request success/error counts.
- Queue duration.
- Compute input duration.
- Inference compute duration.
- Compute output duration.
- Request duration.
- Model load/unload status.
- GPU utilization.
- GPU memory.
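Triton's duration metrics are cumulative counters, so a single scrape says little; the useful number is the change between two scrapes divided by the change in request count. A minimal sketch, assuming the counter names `nv_inference_queue_duration_us` and `nv_inference_request_success` (verify them against your server's `/metrics` output) and snapshots already filtered to one model and version:

```python
def avg_us_per_request(before, after, duration_metric,
                       count_metric="nv_inference_request_success"):
    """Average microseconds per request between two counter snapshots.

    `before` and `after` map metric name -> cumulative value for a single
    model/version. Returns None if no requests completed in the interval.
    """
    delta_count = after[count_metric] - before[count_metric]
    if delta_count <= 0:
        return None  # no traffic, or the counters were reset by a restart
    delta_duration = after[duration_metric] - before[duration_metric]
    return delta_duration / delta_count


before = {
    "nv_inference_request_success": 100,
    "nv_inference_queue_duration_us": 50_000,
}
after = {
    "nv_inference_request_success": 200,
    "nv_inference_queue_duration_us": 250_000,
}
avg_queue_us = avg_us_per_request(
    before, after, "nv_inference_queue_duration_us"
)
```

This gives a mean, not a percentile. For p99 you still need histograms, traces, or client-side measurement.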
Senior interpretation:
- High queue + low GPU = scheduling/batching/concurrency problem.
- Low queue + high compute = model/GPU execution problem.
- High CPU + low GPU = preprocessing/serialization bottleneck.
- GPU memory near limit = OOM risk from batch size, instance count, or KV cache growth.
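The rules above can be written as a first-pass triage function. This is a sketch only: every threshold is a hypothetical placeholder that should come from your own baseline, and utilization values are fractions between 0 and 1.

```python
def classify_bottleneck(queue_ms, compute_ms, gpu_util, cpu_util, gpu_mem_frac,
                        high_ms=20.0, busy=0.8, idle=0.4, mem_limit=0.9):
    """Map coarse signals to the likely problem areas listed above."""
    findings = []
    if queue_ms >= high_ms and gpu_util <= idle:
        findings.append("scheduling/batching/concurrency")
    if queue_ms < high_ms and compute_ms >= high_ms:
        findings.append("model/GPU execution")
    if cpu_util >= busy and gpu_util <= idle:
        findings.append("preprocessing/serialization")
    if gpu_mem_frac >= mem_limit:
        findings.append("GPU memory risk")
    return findings
```

More than one finding can apply at once, which is why the function returns a list rather than a single label.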
perf_analyzer Strategy
Use perf_analyzer to test:
- Concurrency sweeps.
- Batch size behavior.
- p50/p95/p99 latency.
- Request payload realism.
- HTTP vs gRPC overhead.
- Streaming or sequence behavior if relevant.
Do not benchmark only with tiny synthetic payloads.
Senior caveat:
A benchmark that does not match request shape and client deadline is a story about the benchmark, not production.
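One way to keep sweeps reproducible is to build the perf_analyzer command in code and record it with the results. The sketch below assembles a concurrency sweep with a realistic input file. The model name, endpoint, and file name are placeholders, and flag behavior should be confirmed against `perf_analyzer --help` for your installed version.

```python
import shlex
import subprocess


def perf_analyzer_cmd(model, url="localhost:8001", protocol="grpc",
                      concurrency="1:16:2", batch_size=1,
                      input_data=None, percentile=99):
    """Build an argv list for a perf_analyzer concurrency sweep."""
    cmd = [
        "perf_analyzer",
        "-m", model,                         # model name
        "-u", url,                           # server endpoint
        "-i", protocol,                      # "http" or "grpc"
        "-b", str(batch_size),               # client-side batch size
        "--concurrency-range", concurrency,  # start:end:step
        "--percentile", str(percentile),     # stabilize on this percentile
    ]
    if input_data is not None:
        cmd += ["--input-data", input_data]  # JSON file of realistic payloads
    return cmd


def run_sweep(cmd):
    """Run the sweep; requires perf_analyzer on PATH and a live server."""
    return subprocess.run(cmd, check=True)


# Placeholder model and payload file names.
cmd = perf_analyzer_cmd("my_model", input_data="realistic_requests.json")
print(shlex.join(cmd))
```

Running the same function with `protocol="http"` and the HTTP port gives a like-for-like comparison of HTTP vs gRPC overhead.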
Model Analyzer Strategy
Model Analyzer helps explore:
- Instance count.
- Batch size.
- Dynamic batching.
- GPU memory tradeoffs.
- Throughput/latency Pareto frontier.
Use it to build a deployment recommendation, then validate in staging/canary with production-like traffic.
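The Pareto step is simple enough to reproduce by hand when reviewing a sweep. This sketch is not Model Analyzer's own logic; it assumes each candidate is a dict with hypothetical `throughput` and `p99_ms` keys taken from measured results.

```python
def pareto_frontier(candidates):
    """Keep configs that no other config beats on both throughput and p99."""
    frontier = []
    for c in candidates:
        dominated = any(
            o["throughput"] >= c["throughput"]
            and o["p99_ms"] <= c["p99_ms"]
            and (o["throughput"] > c["throughput"] or o["p99_ms"] < c["p99_ms"])
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return frontier


def pick_config(candidates, p99_slo_ms):
    """Highest-throughput frontier point that still meets the p99 SLO."""
    within_slo = [
        c for c in pareto_frontier(candidates) if c["p99_ms"] <= p99_slo_ms
    ]
    if not within_slo:
        return None
    return max(within_slo, key=lambda c: c["throughput"])


# Illustrative numbers only, not measurements.
candidates = [
    {"name": "A", "throughput": 100, "p99_ms": 10},
    {"name": "B", "throughput": 300, "p99_ms": 30},
    {"name": "C", "throughput": 280, "p99_ms": 45},   # dominated by B
    {"name": "D", "throughput": 500, "p99_ms": 120},  # fast but breaks SLO
]
```

With a 50 ms p99 SLO the pick is B: D has more throughput but misses the SLO, and C is worse than B on both axes.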
Performance Tuning Levers
| Lever | Helps | Can hurt |
|---|---|---|
| Dynamic batching | Throughput, GPU utilization. | p99, fairness. |
| Instance count | Concurrency. | Memory, contention. |
| TensorRT engine | GPU efficiency. | Build complexity, shape rigidity. |
| Precision FP16/INT8 | Throughput/memory. | Accuracy, calibration. |
| CPU thread pools | Pre/post speed. | CPU contention. |
| gRPC | Efficient binary protocol. | Client/proxy complexity. |
| Shared memory | Lower copy overhead. | Operational complexity. |
| Separate deployments | Isolation by SLO. | More capacity overhead. |
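The first two levers live in the model's `config.pbtxt`. As a sketch, the helper below renders only the batching and instance-count fragment so that each experiment changes one value and the rendered text can be stored with the results. The helper name and the example values are hypothetical; a complete config also needs the backend or platform and the input/output definitions.

```python
def render_batching_fragment(name, max_batch_size, preferred_batch_sizes,
                             max_queue_delay_us, instance_count):
    """Render the dynamic batching and instance group part of config.pbtxt."""
    preferred = ", ".join(str(b) for b in preferred_batch_sizes)
    return (
        f'name: "{name}"\n'
        f"max_batch_size: {max_batch_size}\n"
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {preferred} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        "instance_group [\n"
        "  {\n"
        f"    count: {instance_count}\n"
        "    kind: KIND_GPU\n"
        "  }\n"
        "]\n"
    )


# Placeholder values; pick real ones from a measured sweep.
fragment = render_batching_fragment(
    name="my_model",
    max_batch_size=16,
    preferred_batch_sizes=[4, 8],
    max_queue_delay_us=500,
    instance_count=2,
)
print(fragment)
```

A larger `max_queue_delay_microseconds` gives the batcher more time to fill a batch, which is exactly the throughput vs p99 tradeoff in the first table row.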
TensorRT Tuning
TensorRT can improve latency/throughput but introduces:
- Engine build time.
- Hardware/driver/runtime compatibility.
- Precision and calibration choices.
- Static/dynamic shape profiles.
- Rebuild requirements for some changes.
Interview answer:
TensorRT is a strong production path when shapes and precision are understood. I would version engine build metadata with the model artifact so rollback and compatibility are clear.
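One possible shape for that build metadata is a small record stored next to the engine file. Every field name and example value below is a hypothetical illustration, and the compatibility check is deliberately conservative (exact match), since the real rules depend on how the engine was built.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EngineBuildMetadata:
    model_version: str
    tensorrt_version: str
    gpu_compute_capability: str
    precision: str        # e.g. "fp16" or "int8"
    shape_profiles: dict  # input name -> {"min": [...], "opt": [...], "max": [...]}


def is_compatible(meta, runtime_tensorrt_version, runtime_compute_capability):
    """Conservative check: require an exact match on both axes."""
    return (
        meta.tensorrt_version == runtime_tensorrt_version
        and meta.gpu_compute_capability == runtime_compute_capability
    )


# Hypothetical example values.
meta = EngineBuildMetadata(
    model_version="7",
    tensorrt_version="10.0",
    gpu_compute_capability="8.0",
    precision="fp16",
    shape_profiles={
        "input": {"min": [1, 128], "opt": [8, 128], "max": [16, 512]},
    },
)
serialized = json.dumps(asdict(meta), sort_keys=True)
```

Writing `serialized` alongside the artifact means a rollback can verify the target engine against the node it will run on before any traffic shifts.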
Shape Profiles
Variable shapes are a production issue:
- Too narrow profiles reject valid traffic.
- Too wide profiles can reduce optimization quality.
- Rare huge requests can dominate p99 and memory.
Mitigation:
- Request shape telemetry.
- Cost classes.
- Admission limits.
- Separate deployments for large requests.
- Explicit shape profiles.
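Admission limits and cost classes can be checked at the gateway before a request reaches Triton. A minimal sketch, assuming element count is an acceptable cost proxy and using hypothetical limits:

```python
import math


def fits_profile(shape, min_shape, max_shape):
    """True if every dimension lies within the profile's min/max bounds."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(
        lo <= dim <= hi
        for dim, lo, hi in zip(shape, min_shape, max_shape)
    )


def cost_class(shape, small_limit, large_limit):
    """Bucket a request by element count; limits are deployment-specific."""
    elements = math.prod(shape)
    if elements <= small_limit:
        return "small"
    if elements <= large_limit:
        return "large"  # route to the large-request deployment
    return "reject"     # over the admission limit
```

For example, with a profile of min `(1, 1)` and max `(8, 512)`, a `(4, 128)` request is admitted and a `(4, 1024)` request is not. Rejecting at the gateway with a clear error is usually kinder to p99 than letting the backend fail on it.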
Performance Debug Playbooks
p99 High, Throughput Fine
Check:
- Queue delay.
- Batch wait.
- Tenant mix.
- Long-tail request shapes.
- Retries.
- Noisy co-located models.
- CPU/GPU memory pressure.
flowchart TD
Alert[p99 high] --> Errors{Errors or retries rising}
Errors -- yes --> Transport[Check gateway, client deadlines, overload]
Errors -- no --> Queue{Triton queue time high}
Queue -- yes --> Batching[Batching, admission, instance count]
Queue -- no --> Compute{GPU busy high}
Compute -- yes --> Saturation[Scale, optimize engine, reduce batch]
Compute -- no --> CPU[CPU preprocess, tokenizer, serialization, network]
CPU --> Trace[Use traces or span timing]
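The same decision order, written as a function so it can sit in a runbook or alert handler. The boolean inputs are assumed to come from your own alert thresholds; how each one is computed is left open.

```python
def triage_p99(errors_or_retries_rising, queue_time_high, gpu_busy_high):
    """Walk the flowchart above and return the next area to investigate."""
    if errors_or_retries_rising:
        return "Check gateway, client deadlines, overload"
    if queue_time_high:
        return "Batching, admission, instance count"
    if gpu_busy_high:
        return "Scale, optimize engine, reduce batch"
    return ("CPU preprocess, tokenizer, serialization, network; "
            "use traces or span timing")
```

The order matters: errors and retries are checked first because a retry storm inflates queue time and GPU load, which would otherwise send the investigation down the wrong branch.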
GPU Low, Latency High
Check:
- CPU preprocessing.
- Client/gateway bottleneck.
- Serialization.
- Model repository/artifact IO.
- Single-threaded Python backend.
- Dependency calls in Python backend.
GPU High, Errors High
Check:
- Memory OOM.
- Timeout threshold.
- Thermal/power throttling.
- ECC/Xid.
- Batch too large.
- Instance count too high.
Canary Gates
Before production expansion:
- p99 below threshold.
- Error rate below threshold.
- Queue time stable.
- GPU memory headroom.
- No Xid/ECC anomaly.
- Correctness samples pass.
- No retry storm.
- Rollback tested.
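These gates are easiest to enforce when they are evaluated by code rather than by eye. A sketch with hypothetical field names; "queue time stable" is simplified here to "queue time under a limit", which you may prefer to express relative to the baseline.

```python
def canary_gate_failures(obs, limits):
    """Return the names of failed gates; an empty list means safe to expand."""
    checks = {
        "p99": obs["p99_ms"] <= limits["p99_ms"],
        "error_rate": obs["error_rate"] <= limits["error_rate"],
        "queue_time": obs["queue_ms"] <= limits["queue_ms"],
        "gpu_memory_headroom": obs["gpu_mem_frac"] <= limits["gpu_mem_frac"],
        "xid_ecc": obs["xid_ecc_events"] == 0,
        "correctness": obs["correctness_failures"] == 0,
        "retry_rate": obs["retry_rate"] <= limits["retry_rate"],
        "rollback_tested": obs["rollback_tested"],
    }
    return [name for name, passed in checks.items() if not passed]


# Hypothetical limits and observations for illustration.
limits = {"p99_ms": 100, "error_rate": 0.001, "queue_ms": 10,
          "gpu_mem_frac": 0.85, "retry_rate": 0.01}
healthy = {"p99_ms": 80, "error_rate": 0.0005, "queue_ms": 4,
           "gpu_mem_frac": 0.7, "xid_ecc_events": 0,
           "correctness_failures": 0, "retry_rate": 0.002,
           "rollback_tested": True}
```

Returning the list of failed gates, rather than a single boolean, gives the revert record a reason attached to it.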
Senior Answer To “How Would You Tune Triton?”
I would first define the SLO and workload class. Then I would establish a baseline with realistic payloads, sweep concurrency/batching/instance count using perf_analyzer and Model Analyzer, validate GPU and request metrics, pick a Pareto point, and canary it under real traffic. I would optimize for goodput and p99, not just peak throughput.