Triton Production Operations
Production Architecture
Typical Kubernetes deployment:
flowchart LR
Client[Clients] --> Gateway[Gateway or load balancer]
Gateway --> Service[Kubernetes Service]
Service --> Pod[Triton pod]
Pod --> Models[Model repository cache]
Pod --> GPU[GPU node]
Pod --> Metrics[Prometheus metrics]
GPU --> DCGM[DCGM exporter]
CI[CI pipeline] --> Registry[Image and model registry]
Registry --> GitOps[GitOps promotion]
GitOps --> Pod
Metrics --> SLO[SLO dashboards]
DCGM --> SLO
Key production decisions:
- One model per deployment or multi-model server.
- Model repository storage and cache strategy.
- GPU SKU/pool placement.
- Dynamic batching policy.
- Model instance count.
- Rollout and rollback.
- Metrics/alerts.
One Model Per Triton vs Multi-Model
| Pattern | Pros | Cons |
|---|---|---|
| One model per deployment | Clear ownership, simple scaling, isolated failures. | More pods, lower sharing efficiency. |
| Multi-model Triton | Better sharing and centralization. | Noisy-neighbor risk, harder rollouts, blast radius. |
Senior answer:
For strict SLOs, I bias toward isolation. For many small models with compatible SLOs, multi-model serving can improve utilization if metrics and rollback are strong.
Kubernetes Probes
Give each probe a separate intent:
- Startup probe: allow time for model load and warmup.
- Readiness probe: receive traffic only when the expected model version is ready.
- Liveness probe: restart only an unrecoverably stuck process.
Do not make liveness too aggressive. It can create crash loops during slow model loads or transient GPU pressure.
Better Readiness
Readiness should verify:
- Triton server ready.
- Target model version ready.
- Synthetic inference works.
- Optional: expected config/version metadata.
For heavyweight models, use a lightweight but representative request.
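The first two readiness checks can be sketched in Python against Triton's KServe v2 HTTP health endpoints (`/v2/health/ready` and `/v2/models/<name>/versions/<version>/ready`). This is a minimal sketch, not a definitive probe: the base URL, model name, and version are placeholders, and the injectable `fetch_status` parameter is a hypothetical convenience so the logic can be tested without a live server. The synthetic inference step is left out because it depends on each model's IO schema.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a non-2xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def is_ready(base_url: str, model: str, version: str,
             fetch_status: Callable[[str], int] = http_status) -> bool:
    """Check server readiness first, then the specific model version."""
    checks = [
        f"{base_url}/v2/health/ready",
        f"{base_url}/v2/models/{model}/versions/{version}/ready",
    ]
    # all() short-circuits, so the model check is skipped if the server is not ready.
    return all(fetch_status(url) == 200 for url in checks)
```

A readiness script could call `is_ready("http://localhost:8000", "my_model", "3")` and exit non-zero on `False`; the port and names here are assumptions about the deployment, not values from this page.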
Rollout Pattern
- Build/publish runtime image.
- Publish immutable model artifact.
- Generate `config.pbtxt`.
- Validate repository layout.
- Run load/synthetic tests on target GPU.
- Deploy canary pod/pool.
- Warm model.
- Route small traffic.
- Gate on p99, queue time, errors, GPU memory, correctness.
- Expand or roll back.
stateDiagram-v2
[*] --> Build
Build --> Validate
Validate --> Canary
Canary --> Warmup
Warmup --> SmallTraffic
SmallTraffic --> Expand: gates pass
SmallTraffic --> Rollback: gates fail
Expand --> FullTraffic: gates pass
Expand --> Rollback: gates fail
FullTraffic --> [*]
Rollback --> PreviousGood
PreviousGood --> [*]
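The gate step (p99, queue time, errors, GPU memory, correctness) can be expressed as a small decision function. The sketch below is illustrative only: the class names, field names, and every threshold value are assumptions chosen for the example, and real limits should come from the model's SLO and load-test baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryMetrics:
    p99_ms: float
    queue_ms: float
    error_rate: float         # failed / total requests, 0.0 to 1.0
    gpu_mem_fraction: float   # used / total GPU memory, 0.0 to 1.0
    output_match_rate: float  # fraction of golden outputs reproduced


@dataclass(frozen=True)
class GateLimits:
    # Placeholder thresholds, not recommendations.
    max_p99_ms: float = 250.0
    max_queue_ms: float = 20.0
    max_error_rate: float = 0.001
    max_gpu_mem_fraction: float = 0.90
    min_output_match_rate: float = 0.999


def failed_gates(metrics: CanaryMetrics, limits: GateLimits) -> List[str]:
    """Return the names of all gates the canary failed, in a fixed order."""
    checks = {
        "p99": metrics.p99_ms <= limits.max_p99_ms,
        "queue_time": metrics.queue_ms <= limits.max_queue_ms,
        "errors": metrics.error_rate <= limits.max_error_rate,
        "gpu_memory": metrics.gpu_mem_fraction <= limits.max_gpu_mem_fraction,
        "correctness": metrics.output_match_rate >= limits.min_output_match_rate,
    }
    return [name for name, passed in checks.items() if not passed]


def next_step(metrics: CanaryMetrics, limits: GateLimits = GateLimits()) -> str:
    """Any failed gate means rollback; otherwise the rollout may expand."""
    return "rollback" if failed_gates(metrics, limits) else "expand"
```

Returning the list of failed gates, rather than a bare boolean, gives the rollout log a reason for each rollback.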
Rollback Pattern
Rollback must include:
- Traffic route.
- Model artifact.
- `config.pbtxt`.
- Runtime image.
- TensorRT engine if applicable.
- Tokenizer/pre/post config.
Trap:
Rolling back only the Kubernetes Deployment image may not roll back the model if the repository path is mutable.
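One way to avoid this trap is to treat a release as a single immutable record that pins every rollback component together, so rolling back means switching the whole record rather than one field. The sketch below assumes a hypothetical `Release` record and an in-memory `ReleaseHistory`; a real system would persist this in GitOps state or a registry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    """Everything that must move together on rollout and rollback."""
    image_digest: str            # runtime image pinned by digest, not by tag
    model_uri: str               # versioned, immutable artifact path
    config_sha256: str           # checksum of the generated config.pbtxt
    engine_uri: Optional[str]    # TensorRT engine, if applicable
    prepost_uri: Optional[str]   # tokenizer / pre / post config, if applicable


class ReleaseHistory:
    def __init__(self) -> None:
        self._releases: List[Release] = []

    def promote(self, release: Release) -> None:
        self._releases.append(release)

    def current(self) -> Release:
        if not self._releases:
            raise RuntimeError("no release has been promoted yet")
        return self._releases[-1]

    def rollback(self) -> Release:
        """Drop the current release and return the previous known-good one."""
        if len(self._releases) < 2:
            raise RuntimeError("no previous known-good release to roll back to")
        self._releases.pop()
        return self._releases[-1]
```

Because `Release` is frozen and the model path is versioned, a rollback cannot silently pick up a newer artifact written to the same location.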
Autoscaling
Better signals:
- Queue duration.
- Request duration p95/p99.
- Inference request concurrency.
- GPU utilization.
- GPU memory.
- Pending pods.
- Model-specific throughput.
HPA on CPU alone is a weak signal for Triton, because the bottleneck is usually the GPU or the request queue rather than host CPU.
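To make the queue-duration signal concrete, the sketch below applies the proportional rule the Kubernetes HPA documents, `desired = ceil(current * observed / target)`, to an average queue time. The function name, the replica bounds, and the 10% tolerance default are assumptions for illustration; in a real cluster the HPA controller does this arithmetic itself once the metric is exposed through a custom metrics adapter.

```python
import math


def desired_replicas(current_replicas: int,
                     observed_queue_ms: float,
                     target_queue_ms: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling on queue duration, clamped to replica bounds."""
    if target_queue_ms <= 0:
        raise ValueError("target_queue_ms must be positive")
    ratio = observed_queue_ms / target_queue_ms
    # Skip scaling when the metric is close to target, to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With 4 replicas, a 10 ms target, and 15 ms observed queue time, this asks for 6 replicas. Scaling up only helps if new pods arrive warm, which is why pending pods and model load time belong on the same dashboard.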
Model Repository In Kubernetes
Patterns:
- Init container downloads model and verifies checksum.
- Sidecar syncs model repository.
- CSI/object store mount.
- Baked model image.
- Read-only persistent volume.
Questions to ask:
- What happens if object storage is slow?
- Does every pod download simultaneously?
- Are artifacts immutable?
- How is checksum validated?
- How is rollback handled?
- Is model load observed?
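For the init-container pattern, checksum validation can be as small as the sketch below. It assumes the expected SHA-256 digest is published alongside the artifact (for example in a release manifest, which is not specified here); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise if the downloaded artifact does not match the published digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Failing the init container on a mismatch keeps a corrupt or partially downloaded artifact from ever reaching Triton's model load path.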
Observability
Dashboard sections:
- Request SLO by model/version.
- Queue duration.
- Compute duration.
- Batch size distribution if available.
- Error code breakdown.
- Model load status.
- GPU utilization/memory.
- Pod/node/GPU mapping.
- Rollout markers.
Alert on:
- SLO burn.
- Model unavailable.
- Queue duration sustained.
- GPU memory critical.
- Repeated model load failures.
- Triton process crash loops.
- Metrics endpoint missing.
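"SLO burn" is usually measured as a burn rate: the observed error rate divided by the error budget (1 minus the SLO target). The sketch below pages only when both a long and a short window burn fast, a common multi-window pattern. The 14.4 threshold is a commonly cited starting point for a 30-day SLO with 1-hour and 5-minute windows and is treated here as an assumption to tune, as are the function names.

```python
from typing import Tuple


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_page(long_window: Tuple[int, int],
                short_window: Tuple[int, int],
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the threshold.

    Each window is a (failed, total) pair of request counts.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

Requiring the short window as well means the alert clears quickly once the problem stops, instead of paging on budget that was burned an hour ago.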
Incident Drill: Model Load Failure
Symptoms:
- Pod starts but readiness fails.
- Triton logs model load errors.
- Model endpoint unavailable.
Debug:
- Repository layout.
- `config.pbtxt`.
- Backend availability.
- File permissions.
- Artifact checksum.
- TensorRT engine compatibility.
- GPU memory.
- Runtime image/backend version.
Mitigation:
- Stop rollout.
- Keep previous pods serving.
- Roll back artifact/config.
- Mark bad version.
- Add validation to promotion pipeline.
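A promotion-pipeline check for the first debug item, repository layout, might look like the sketch below. It assumes the conventional Triton layout of `<repo>/<model>/config.pbtxt` plus numeric version directories such as `<repo>/<model>/1/`. Requiring `config.pbtxt` is a policy choice that matches the rollout steps on this page rather than a universal Triton requirement, and the function name is hypothetical.

```python
import re
from pathlib import Path
from typing import List


def layout_problems(repo: Path, model: str) -> List[str]:
    """Return human-readable layout problems; an empty list means the layout looks sane."""
    model_dir = repo / model
    if not model_dir.is_dir():
        return [f"missing model directory: {model_dir}"]

    problems: List[str] = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version directories are named with plain integers, e.g. "1", "2".
    versions = [d for d in model_dir.iterdir()
                if d.is_dir() and re.fullmatch(r"[0-9]+", d.name)]
    if not versions:
        problems.append("no numeric version directory")

    for version_dir in sorted(versions, key=lambda d: int(d.name)):
        # Some backends store the model as a file, others as a subdirectory,
        # so only check that the version directory is not empty.
        if not any(version_dir.iterdir()):
            problems.append(f"version {version_dir.name} is empty")
    return problems
```

This catches only structural mistakes. Backend availability, engine compatibility, and GPU memory still need a real load test on the target GPU.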
Incident Drill: p99 Spike
Debug:
- Queue vs compute duration.
- Batch settings.
- Actual request shapes.
- Tenant mix.
- GPU memory/utilization.
- CPU preprocessing.
- Gateway retries/timeouts.
- Recent rollout/config changes.
Mitigation:
- Roll back config if tied to rollout.
- Tighten queue delay.
- Shift low-priority traffic.
- Add capacity if warm path exists.
- Disable noisy co-tenant model.
flowchart TD
Page[p99 page] --> Freeze[Freeze deploys and capture window]
Freeze --> Recent{Recent rollout or traffic shift}
Recent -- yes --> Mitigate[Rollback or reduce traffic]
Recent -- no --> Layer[Compare gateway, Triton, GPU, node]
Layer --> Queue{Queue time up}
Queue -- yes --> Admission[Lower admission or add warm capacity]
Queue -- no --> GPUPressure{GPU memory or SM saturated}
GPUPressure -- yes --> Capacity[Scale, reduce batch, or move tenants]
GPUPressure -- no --> External[Check DNS, network, client retries, storage]
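The "queue vs compute" split can be estimated from two scrapes of Triton's Prometheus metrics endpoint, because `nv_inference_queue_duration_us`, `nv_inference_compute_infer_duration_us`, and `nv_inference_request_success` are cumulative counters. The sketch below is a simplified parser that assumes every sample line carries a `model="..."` label and sums across versions and GPUs; the helper names are hypothetical.

```python
import re
from typing import Dict

# Matches lines like: metric_name{label="x",other="y"} 123
_SAMPLE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')


def scrape_totals(text: str, model: str) -> Dict[str, float]:
    """Sum each metric over all series belonging to one model."""
    totals: Dict[str, float] = {}
    for line in text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments (# HELP, # TYPE) and unlabeled samples
        name, labels, value = match.groups()
        if f'model="{model}"' in labels:
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def avg_us_per_request(before: Dict[str, float], after: Dict[str, float],
                       duration_metric: str,
                       count_metric: str = "nv_inference_request_success") -> float:
    """Average microseconds per successful request between two scrapes."""
    requests = after.get(count_metric, 0.0) - before.get(count_metric, 0.0)
    if requests <= 0:
        return 0.0
    spent = after.get(duration_metric, 0.0) - before.get(duration_metric, 0.0)
    return spent / requests
```

If the queue average rises while the compute average stays flat, the flowchart's "Queue time up" branch applies. If both rise together, look at GPU pressure or a changed request shape mix.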
Security
Production controls:
- Signed runtime image.
- Immutable model artifacts.
- Least privilege access to model store.
- Read-only mounts where possible.
- Network policy around metrics and inference ports.
- RBAC for repository control APIs.
- Audit logs for model load/unload.
Staff-Level Operating Model
Define the contract between ML and platform:
- Artifact format.
- IO schema.
- Runtime/backend.
- Resource envelope.
- Valid request shapes.
- Health check.
- Rollout policy.
- Owner and escalation path.
- Observability labels.
- Performance validation.
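The contract is easier to enforce when it is a machine-checkable record rather than a wiki page. As a rough sketch, the items above can be modeled as fields that an onboarding check refuses to accept while empty. The `ModelContract` class and its free-text string fields are assumptions for illustration; a real contract would likely use structured types for the IO schema and resource envelope.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ModelContract:
    # One field per contract item; empty string means "not yet agreed".
    artifact_format: str = ""
    io_schema: str = ""
    runtime_backend: str = ""
    resource_envelope: str = ""
    valid_request_shapes: str = ""
    health_check: str = ""
    rollout_policy: str = ""
    owner_and_escalation: str = ""
    observability_labels: str = ""
    performance_validation: str = ""


def missing_fields(contract: ModelContract) -> List[str]:
    """Names of contract items that are still blank."""
    return [f.name for f in fields(contract)
            if not getattr(contract, f.name).strip()]
```

A platform onboarding job could reject a model while `missing_fields(...)` is non-empty, which turns the contract into a promotion gate.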
What To Say In Interview
I would run Triton as a controlled production substrate, not as an opaque model container. The platform should validate repository layout, config, runtime compatibility, representative inference, metrics, and rollback before traffic shifts.