Triton Production Operations

Typical Kubernetes deployment:

flowchart LR
  Client[Clients] --> Gateway[Gateway or load balancer]
  Gateway --> Service[Kubernetes Service]
  Service --> Pod[Triton pod]
  Pod --> Models[Model repository cache]
  Pod --> GPU[GPU node]
  Pod --> Metrics[Prometheus metrics]
  GPU --> DCGM[DCGM exporter]
  CI[CI pipeline] --> Registry[Image and model registry]
  Registry --> GitOps[GitOps promotion]
  GitOps --> Pod
  Metrics --> SLO[SLO dashboards]
  DCGM --> SLO

Key production decisions:

  • One model per deployment or multi-model server.
  • Model repository storage and cache strategy.
  • GPU SKU/pool placement.
  • Dynamic batching policy.
  • Model instance count.
  • Rollout and rollback.
  • Metrics/alerts.

| Pattern | Pros | Cons |
| --- | --- | --- |
| One model per deployment | Clear ownership, simple scaling, isolated failures. | More pods, lower sharing efficiency. |
| Multi-model Triton | Better sharing and centralization. | Noisy-neighbor risk, harder rollouts, larger blast radius. |

Senior answer:

For strict SLOs, I bias toward isolation. For many small models with compatible SLOs, multi-model serving can improve utilization if metrics and rollback are strong.

Give each probe a distinct purpose:

  • Startup probe: allow time for model load and warmup.
  • Readiness probe: receive traffic only when the expected model version is ready.
  • Liveness probe: restart only when the process is unrecoverably stuck.

Do not make liveness too aggressive. It can create crash loops during slow model loads or transient GPU pressure.

Readiness should verify:

  • Triton server ready.
  • Target model version ready.
  • Synthetic inference works.
  • Optional: expected config/version metadata.

For heavyweight models, use a lightweight but representative request.
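
The readiness checks above can be sketched as a small Python probe. This is a minimal sketch, assuming Triton's HTTP endpoint follows the KServe v2 inference protocol paths (`/v2/health/ready`, `/v2/models/<name>/versions/<version>/ready`, and `.../infer`). The `is_ready` and `http_fetch` names, the injectable `fetch` parameter, and the shape of `probe_request` are illustrative assumptions, not part of Triton.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

# fetch(url, body) -> (status, body_bytes); body=None means GET, otherwise POST.
Fetch = Callable[[str, Optional[bytes]], Tuple[int, bytes]]


def http_fetch(url: str, body: Optional[bytes] = None, timeout: float = 2.0) -> Tuple[int, bytes]:
    """Default fetcher built on urllib; returns (0, b"") if the server is unreachable."""
    headers = {"Content-Type": "application/json"} if body is not None else {}
    req = urllib.request.Request(url, data=body, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()
    except (urllib.error.URLError, OSError):
        return 0, b""


def is_ready(base_url: str, model: str, version: str,
             probe_request: dict, fetch: Fetch = http_fetch) -> bool:
    """Return True only if server, model version, and a synthetic inference all pass."""
    # 1. Triton server ready.
    status, _ = fetch(f"{base_url}/v2/health/ready", None)
    if status != 200:
        return False

    # 2. Target model version ready.
    model_url = f"{base_url}/v2/models/{model}/versions/{version}"
    status, _ = fetch(f"{model_url}/ready", None)
    if status != 200:
        return False

    # 3. Synthetic inference works (small but representative request).
    status, raw = fetch(f"{model_url}/infer", json.dumps(probe_request).encode())
    if status != 200:
        return False
    try:
        outputs = json.loads(raw).get("outputs")
    except (ValueError, AttributeError):
        return False
    return bool(outputs)
```

Injecting `fetch` keeps the probe testable without a live server. In a pod, a script like this could back an exec-style readiness probe, at the cost of one real inference per probe interval.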

  1. Build/publish runtime image.
  2. Publish immutable model artifact.
  3. Generate config.pbtxt.
  4. Validate repository layout.
  5. Run load/synthetic tests on target GPU.
  6. Deploy canary pod/pool.
  7. Warm model.
  8. Route small traffic.
  9. Gate on p99, queue time, errors, GPU memory, correctness.
  10. Expand or roll back.

stateDiagram-v2
  [*] --> Build
  Build --> Validate
  Validate --> Canary
  Canary --> Warmup
  Warmup --> SmallTraffic
  SmallTraffic --> Expand: gates pass
  SmallTraffic --> Rollback: gates fail
  Expand --> FullTraffic: gates pass
  Expand --> Rollback: gates fail
  FullTraffic --> [*]
  Rollback --> PreviousGood
  PreviousGood --> [*]
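
The gated transitions in the diagram above (SmallTraffic to Expand to FullTraffic, with Rollback on any failed gate) could be expressed as a small pure function. This is a sketch only: the `CanaryStats` and `Gates` types and every threshold value are hypothetical and should be derived from the model's actual SLO.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryStats:
    p99_ms: float
    queue_ms: float
    error_rate: float             # fraction of failed requests, 0.0 to 1.0
    gpu_mem_fraction: float       # used / total GPU memory
    correctness_pass_rate: float  # fraction of golden requests that matched


@dataclass(frozen=True)
class Gates:
    # Hypothetical thresholds, for illustration only.
    max_p99_ms: float = 200.0
    max_queue_ms: float = 20.0
    max_error_rate: float = 0.001
    max_gpu_mem_fraction: float = 0.90
    min_correctness_pass_rate: float = 0.999


def failed_gates(stats: CanaryStats, gates: Gates) -> List[str]:
    """Return the names of the gates that the canary failed (empty list means pass)."""
    checks = [
        ("p99", stats.p99_ms <= gates.max_p99_ms),
        ("queue_time", stats.queue_ms <= gates.max_queue_ms),
        ("errors", stats.error_rate <= gates.max_error_rate),
        ("gpu_memory", stats.gpu_mem_fraction <= gates.max_gpu_mem_fraction),
        ("correctness", stats.correctness_pass_rate >= gates.min_correctness_pass_rate),
    ]
    return [name for name, ok in checks if not ok]


def next_state(current: str, stats: CanaryStats, gates: Gates) -> str:
    """Apply the gated transitions: SmallTraffic -> Expand -> FullTraffic, else Rollback."""
    if current not in ("SmallTraffic", "Expand"):
        raise ValueError(f"no gated transition from state {current!r}")
    if failed_gates(stats, gates):
        return "Rollback"
    return "Expand" if current == "SmallTraffic" else "FullTraffic"
```

Returning the list of failed gate names, rather than a bare boolean, gives the rollout log a reason for each rollback.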

Rollback must include:

  • Traffic route.
  • Model artifact.
  • config.pbtxt.
  • Runtime image.
  • TensorRT engine if applicable.
  • Tokenizer/pre/post config.

Trap:

Rolling back only the Kubernetes Deployment image may not roll back the model if the repository path is mutable.
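
One way to avoid this trap is to treat every rollback item as a field of a single immutable release record, so a rollback restores all of them together. The sketch below assumes a hypothetical `Release` record and a deliberately simple pinning heuristic (image referenced by digest, no `latest` in the artifact path, 64-character config hash). Real pinning rules depend on the registry and model store in use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    release_id: str
    traffic_route: str
    runtime_image: str        # expected to be pinned by digest
    model_artifact: str       # expected to be an immutable path or URI
    config_sha256: str        # hash of the config.pbtxt that shipped
    tokenizer_config: str     # tokenizer / pre / post-processing config version
    tensorrt_engine: Optional[str] = None
    known_good: bool = False


def is_fully_pinned(release: Release) -> bool:
    """Reject mutable references such as image tags or 'latest' model paths."""
    image_pinned = "@sha256:" in release.runtime_image
    artifact_pinned = "latest" not in release.model_artifact
    return image_pinned and artifact_pinned and len(release.config_sha256) == 64


def rollback_target(history: List[Release]) -> Release:
    """Return the most recent pinned, known-good release before the current (last) one."""
    for release in reversed(history[:-1]):
        if release.known_good and is_fully_pinned(release):
            return release
    raise LookupError("no pinned known-good release to roll back to")
```

Because the record is frozen and every reference is pinned, rolling back to `rollback_target(history)` restores the image, artifact, config, and tokenizer settings as one unit.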

Better autoscaling signals:

  • Queue duration.
  • Request duration p95/p99.
  • Inference request concurrency.
  • GPU utilization.
  • GPU memory.
  • Pending pods.
  • Model-specific throughput.

HPA on CPU alone is a weak signal for Triton: GPU-bound inference can queue and miss its SLO while CPU utilization stays low.
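
To see how a queue-based signal would drive scaling, the sketch below applies the replica formula documented for the Kubernetes HPA, `desired = ceil(current * currentMetric / targetMetric)`, to an average queue duration. The function name, the 10% tolerance default, and the replica bounds are illustrative assumptions. A real setup would expose the metric through a custom or external metrics adapter.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     tolerance: float = 0.1, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """HPA-style scaling decision for a single metric such as mean queue time (ms)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: do not scale, which avoids flapping.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Example: 4 pods, queue time at 30 ms against a 10 ms target.
print(desired_replicas(4, 30.0, 10.0, max_replicas=20))
```

Scaling out helps only if new pods become ready quickly, so slow model loads weaken any reactive signal.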

Model repository delivery patterns:

  • Init container downloads model and verifies checksum.
  • Sidecar syncs model repository.
  • CSI/object store mount.
  • Baked model image.
  • Read-only persistent volume.

Questions to ask:

  • What happens if object storage is slow?
  • Does every pod download simultaneously?
  • Are artifacts immutable?
  • How is checksum validated?
  • How is rollback handled?
  • Is model load observed?
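
For the init-container pattern, checksum validation can be as small as the sketch below: hash the downloaded file in chunks and compare it with the digest recorded at publish time. The function names are hypothetical, and where the expected digest comes from (release manifest, registry metadata) is an assumption left to the pipeline.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large model artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's SHA-256 matches the expected hex digest."""
    actual = sha256_of(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

An init container would exit non-zero when `verify_artifact` returns False, so the Triton container never starts against a corrupt or unexpected artifact.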

Dashboard sections:

  • Request SLO by model/version.
  • Queue duration.
  • Compute duration.
  • Batch size distribution if available.
  • Error code breakdown.
  • Model load status.
  • GPU utilization/memory.
  • Pod/node/GPU mapping.
  • Rollout markers.

Alert on:

  • SLO burn.
  • Model unavailable.
  • Sustained high queue duration.
  • GPU memory critical.
  • Repeated model load failures.
  • Triton process crash loops.
  • Metrics endpoint missing.
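
"SLO burn" can be made concrete as a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below pages only when both a long and a short window burn fast, which is a common multi-window pattern. The 14.4 threshold is an illustrative value often paired with a 1-hour window on a 30-day SLO, not something taken from this document.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Error-budget burn rate: 1.0 means the budget is being spent exactly on schedule."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (bad_events / total_events) / error_budget


def should_page(long_window_burn: float, short_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only if the burn is both sustained (long window) and still happening (short)."""
    return long_window_burn >= threshold and short_window_burn >= threshold
```

For a latency SLO, `bad_events` would count requests slower than the threshold, per model and version.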

Symptoms of a model load failure:

  • Pod starts but readiness fails.
  • Triton logs model load errors.
  • Model endpoint unavailable.

Debug:

  • Repository layout.
  • config.pbtxt.
  • Backend availability.
  • File permissions.
  • Artifact checksum.
  • TensorRT engine compatibility.
  • GPU memory.
  • Runtime image/backend version.

Mitigation:

  • Stop rollout.
  • Keep previous pods serving.
  • Roll back artifact/config.
  • Mark bad version.
  • Add validation to promotion pipeline.
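
As one example of validation in the promotion pipeline, the sketch below checks a repository against the layout Triton documents: one directory per model, containing a `config.pbtxt` and numerically named version subdirectories. Requiring `config.pbtxt` is treated here as a pipeline policy, since some backends can work without one. The function name and the problem strings are hypothetical.

```python
from pathlib import Path
from typing import List


def validate_repository(repo: Path) -> List[str]:
    """Return a list of layout problems; an empty list means the layout looks valid."""
    problems: List[str] = []
    model_dirs = [d for d in sorted(repo.iterdir()) if d.is_dir()]
    if not model_dirs:
        problems.append("no model directories found")

    for model_dir in model_dirs:
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")

        # Version directories are numeric and must not start with zero.
        versions = [
            d for d in sorted(model_dir.iterdir())
            if d.is_dir() and d.name.isdigit() and not d.name.startswith("0")
        ]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
        for version_dir in versions:
            if not any(version_dir.iterdir()):
                problems.append(f"{model_dir.name}/{version_dir.name}: empty version directory")
    return problems
```

A layout check like this catches only structural mistakes. Backend availability and engine compatibility still need a real load on the target GPU.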

Debug a p99 latency regression:

  • Queue vs compute duration.
  • Batch settings.
  • Actual request shapes.
  • Tenant mix.
  • GPU memory/utilization.
  • CPU preprocessing.
  • Gateway retries/timeouts.
  • Recent rollout/config changes.
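
The first item, queue versus compute duration, can be estimated from two scrapes of Triton's cumulative counters. The sketch below assumes the samples are already parsed into dictionaries for one model and version, keyed by the metric names `nv_inference_request_success`, `nv_inference_queue_duration_us`, and `nv_inference_compute_infer_duration_us`. The function name and return shape are illustrative.

```python
from typing import Dict, Optional

REQUESTS = "nv_inference_request_success"
QUEUE_US = "nv_inference_queue_duration_us"
COMPUTE_US = "nv_inference_compute_infer_duration_us"


def per_request_averages(before: Dict[str, float],
                         after: Dict[str, float]) -> Optional[Dict[str, object]]:
    """Average queue and compute time per request between two counter samples."""
    requests = after[REQUESTS] - before[REQUESTS]
    if requests <= 0:
        # No traffic in the window, or the counters reset (pod restart).
        return None
    queue_ms = (after[QUEUE_US] - before[QUEUE_US]) / requests / 1000.0
    compute_ms = (after[COMPUTE_US] - before[COMPUTE_US]) / requests / 1000.0
    return {
        "queue_ms": queue_ms,
        "compute_ms": compute_ms,
        # Queue-dominated latency points at batching or capacity, not the model itself.
        "dominant": "queue" if queue_ms > compute_ms else "compute",
    }
```

These are averages, so they show where time goes but not the tail. Pair them with request-duration percentiles from the gateway or client side.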

Mitigation:

  • Roll back config if tied to rollout.
  • Tighten queue delay.
  • Shift low-priority traffic.
  • Add capacity if warm path exists.
  • Disable noisy co-tenant model.

flowchart TD
  Page[p99 page] --> Freeze[Freeze deploys and capture window]
  Freeze --> Recent{Recent rollout or traffic shift}
  Recent -- yes --> Mitigate[Rollback or reduce traffic]
  Recent -- no --> Layer[Compare gateway, Triton, GPU, node]
  Layer --> Queue{Queue time up}
  Queue -- yes --> Admission[Lower admission or add warm capacity]
  Queue -- no --> GPUPressure{GPU memory or SM saturated}
  GPUPressure -- yes --> Capacity[Scale, reduce batch, or move tenants]
  GPUPressure -- no --> External[Check DNS, network, client retries, storage]
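
The decision tree above can also be written as a plain function, which is handy for runbook automation or for unit-testing the triage order. This is a direct transcription of the diagram's branches. The boolean inputs are assumed to come from whatever dashboards or queries the team already has.

```python
def triage(recent_change: bool, queue_time_up: bool, gpu_saturated: bool) -> str:
    """Return the next action for a p99 page, following the flowchart's branch order."""
    # Deploys are assumed frozen and the incident window captured before this runs.
    if recent_change:
        return "Rollback or reduce traffic"
    # No recent change: compare gateway, Triton, GPU, and node layers.
    if queue_time_up:
        return "Lower admission or add warm capacity"
    if gpu_saturated:
        return "Scale, reduce batch, or move tenants"
    return "Check DNS, network, client retries, storage"
```

The order matters: a recent rollout or traffic shift is checked first because rollback is the cheapest mitigation.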

Production controls:

  • Signed runtime image.
  • Immutable model artifacts.
  • Least privilege access to model store.
  • Read-only mounts where possible.
  • Network policy around metrics and inference ports.
  • RBAC for repository control APIs.
  • Audit logs for model load/unload.

Define the contract between ML and platform:

  • Artifact format.
  • IO schema.
  • Runtime/backend.
  • Resource envelope.
  • Valid request shapes.
  • Health check.
  • Rollout policy.
  • Owner and escalation path.
  • Observability labels.
  • Performance validation.
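
One way to make this contract enforceable is to capture it as a typed record that the promotion pipeline refuses to accept while any field is empty. The sketch below is hypothetical throughout: the field names mirror the list above, and the field types are illustrative choices.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ModelContract:
    artifact_format: str                  # e.g. ONNX, TensorRT plan, TorchScript
    io_schema: Dict[str, str]             # tensor name -> dtype and shape description
    backend: str                          # runtime/backend the model targets
    gpu_memory_mib: int                   # resource envelope
    valid_shapes: List[Tuple[int, ...]]   # request shapes the model was validated on
    health_check_request: str             # reference to a representative probe request
    rollout_policy: str
    owner: str
    escalation_path: str
    observability_labels: Dict[str, str]
    performance_validated: bool = False


def contract_gaps(contract: ModelContract) -> List[str]:
    """Return the names of contract fields that are empty, zero, or not yet confirmed."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]
```

A pipeline step could fail promotion whenever `contract_gaps` returns a non-empty list, which keeps ownership and validation status from going missing.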

I would run Triton as a controlled production substrate, not as an opaque model container. The platform should validate repository layout, config, runtime compatibility, representative inference, metrics, and rollback before traffic shifts.