Triton Production Operations

Production Architecture

Typical Kubernetes deployment:

flowchart LR
  Client[Clients] --> Gateway[Gateway or load balancer]
  Gateway --> Service[Kubernetes Service]
  Service --> Pod[Triton pod]
  Pod --> Models[Model repository cache]
  Pod --> GPU[GPU node]
  Pod --> Metrics[Prometheus metrics]
  GPU --> DCGM[DCGM exporter]
  CI[CI pipeline] --> Registry[Image and model registry]
  Registry --> GitOps[GitOps promotion]
  GitOps --> Pod
  Metrics --> SLO[SLO dashboards]
  DCGM --> SLO

Key production decisions:

One model per deployment or multi-model server.
Model repository storage and cache strategy.
GPU SKU/pool placement.
Dynamic batching policy.
Model instance count.
Rollout and rollback.
Metrics/alerts.

One Model Per Triton vs Multi-Model

Pattern	Pros	Cons
One model per deployment	Clear ownership, simple scaling, isolated failures.	More pods, lower sharing efficiency.
Multi-model Triton	Better sharing and centralization.	Noisy-neighbor risk, harder rollouts, blast radius.

Senior answer:

For strict SLOs, I bias toward isolation. For many small models with compatible SLOs, multi-model serving can improve utilization if metrics and rollback are strong.

Kubernetes Probes

Give each probe a separate intent:

Startup probe: allow time for model load and warmup.
Readiness probe: receive traffic only once the expected model version is ready.
Liveness probe: restart only when the process is unrecoverably stuck.

Do not make liveness too aggressive. A short timeout or low failure threshold can create crash loops during slow model loads or transient GPU pressure.

Better Readiness

Readiness should verify:

Triton server ready.
Target model version ready.
Synthetic inference works.
Optional: config/version metadata matches what the rollout expects.

For heavyweight models, use a lightweight but representative request.
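
The readiness checks above can be sketched as a small script that an exec probe or sidecar might run. This is a minimal sketch, not a definitive probe: it assumes Triton's HTTP endpoint is reachable at `base_url` and uses the KServe v2 paths `/v2/health/ready`, `/v2/models/<model>/versions/<version>/ready`, and `.../infer`. The helper names `http_status` and `model_is_ready` are hypothetical, and the synthetic request body is model-specific and must be supplied by the caller.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, body: Optional[dict] = None, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET (no body) or JSON POST; 0 if unreachable."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def model_is_ready(
    base_url: str,
    model: str,
    version: str,
    synthetic_request: dict,
    status: Callable[..., int] = http_status,
) -> bool:
    """Server ready, target model version ready, synthetic inference succeeds."""
    model_url = f"{base_url}/v2/models/{model}/versions/{version}"
    if status(f"{base_url}/v2/health/ready") != 200:
        return False
    if status(f"{model_url}/ready") != 200:
        return False
    # Cheapest check last: one small but representative inference request.
    return status(f"{model_url}/infer", synthetic_request) == 200
```

The `status` parameter is injectable so the ordering and short-circuit logic can be tested without a running server. For heavyweight models, keep `synthetic_request` small so the probe itself does not add load.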

Rollout Pattern

Build/publish runtime image.
Publish immutable model artifact.
Generate config.pbtxt.
Validate repository layout.
Run load/synthetic tests on target GPU.
Deploy canary pod/pool.
Warm model.
Route small traffic.
Gate on p99, queue time, errors, GPU memory, correctness.
Expand or roll back.

stateDiagram-v2
  [*] --> Build
  Build --> Validate
  Validate --> Canary
  Canary --> Warmup
  Warmup --> SmallTraffic
  SmallTraffic --> Expand: gates pass
  SmallTraffic --> Rollback: gates fail
  Expand --> FullTraffic: gates pass
  Expand --> Rollback: gates fail
  FullTraffic --> [*]
  Rollback --> PreviousGood
  PreviousGood --> [*]
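
The gate step and the SmallTraffic/Expand transitions in the diagram can be sketched as a pure function. This is an illustrative sketch only: `CanaryStats`, `GateLimits`, and the threshold fields are hypothetical names, and real limits would come from the model's SLO and resource envelope.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryStats:
    p99_ms: float
    queue_ms: float
    error_rate: float             # fraction of failed requests, 0..1
    gpu_mem_fraction: float       # used / total GPU memory, 0..1
    correctness_pass_rate: float  # fraction of golden requests that match, 0..1


@dataclass(frozen=True)
class GateLimits:
    max_p99_ms: float
    max_queue_ms: float
    max_error_rate: float
    max_gpu_mem_fraction: float
    min_correctness_pass_rate: float


def failed_gates(stats: CanaryStats, limits: GateLimits) -> List[str]:
    """Return the names of every gate the canary violates (empty list = pass)."""
    failures = []
    if stats.p99_ms > limits.max_p99_ms:
        failures.append("p99")
    if stats.queue_ms > limits.max_queue_ms:
        failures.append("queue_time")
    if stats.error_rate > limits.max_error_rate:
        failures.append("errors")
    if stats.gpu_mem_fraction > limits.max_gpu_mem_fraction:
        failures.append("gpu_memory")
    if stats.correctness_pass_rate < limits.min_correctness_pass_rate:
        failures.append("correctness")
    return failures


def next_state(current: str, stats: CanaryStats, limits: GateLimits) -> str:
    """Mirror the gated transitions: SmallTraffic -> Expand -> FullTraffic, else Rollback."""
    if current not in ("SmallTraffic", "Expand"):
        raise ValueError(f"no gated transition from state {current!r}")
    if failed_gates(stats, limits):
        return "Rollback"
    return "Expand" if current == "SmallTraffic" else "FullTraffic"
```

Returning the list of failed gates, rather than a single boolean, gives the rollback record a reason that can be attached to the rollout marker.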

Rollback Pattern

Rollback must include:

Traffic route.
Model artifact.
config.pbtxt.
Runtime image.
TensorRT engine if applicable.
Tokenizer/pre/post config.

Trap:

Rolling back only the Kubernetes Deployment image may not roll back the model if the repository path is mutable.
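
One way to avoid this trap is to treat everything in the rollback list as a single pinned release record and roll back by switching records. The sketch below is an assumption-heavy illustration: the `Release` type and its field names are hypothetical, and it assumes every reference embeds a `sha256:` digest, which is one convention among several.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    """Everything that must move together on rollout and rollback."""
    route: str                 # traffic route name (not digest-pinned)
    runtime_image: str         # e.g. "registry.example/triton@sha256:<digest>"
    model_artifact: str        # e.g. "models/resnet/sha256:<digest>"
    config_pbtxt: str          # e.g. "configs/resnet/sha256:<digest>"
    pre_post_config: str       # tokenizer / pre / post processing config
    trt_engine: Optional[str] = None  # only when a TensorRT engine is used


def unpinned_fields(release: Release) -> List[str]:
    """List references that are mutable (no digest), so a rollback could drift."""
    refs = {
        "runtime_image": release.runtime_image,
        "model_artifact": release.model_artifact,
        "config_pbtxt": release.config_pbtxt,
        "pre_post_config": release.pre_post_config,
        "trt_engine": release.trt_engine,
    }
    return [name for name, ref in refs.items() if ref is not None and "sha256:" not in ref]


def rollback_target(history: List[Release]) -> Release:
    """Pick the most recent previous release that is fully pinned."""
    for release in reversed(history[:-1]):  # skip the current (last) release
        if not unpinned_fields(release):
            return release
    raise LookupError("no fully pinned previous release to roll back to")
```

A promotion pipeline could refuse any release where `unpinned_fields` is non-empty, which makes "roll back" mean "re-apply a known record" rather than "revert one of several moving parts".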

Autoscaling

Better signals:

Queue duration.
Request duration p95/p99.
Inference request concurrency.
GPU utilization.
GPU memory.
Pending pods.
Model-specific throughput.

HPA on CPU alone is a weak signal for Triton: GPU-bound inference can queue and miss latency targets while CPU utilization stays low.
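
As a worked example of scaling on queue duration instead of CPU, the sketch below derives an average queue time per request from two scrapes of cumulative counters and applies the standard HPA ratio, desired = ceil(current * metric / target). It assumes the counters are Triton's `nv_inference_queue_duration_us` and `nv_inference_request_success` for one model; the function names, target value, and replica bounds are illustrative assumptions.

```python
import math


def avg_queue_us(
    prev_queue_us: float, curr_queue_us: float, prev_requests: float, curr_requests: float
) -> float:
    """Average queue time per request between two scrapes of cumulative counters."""
    delta_requests = curr_requests - prev_requests
    if delta_requests <= 0:
        return 0.0  # no traffic (or a counter reset): report no queueing
    return (curr_queue_us - prev_queue_us) / delta_requests


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """HPA-style ratio: ceil(current * metric / target), clamped to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, if queue time grew by 3,000,000 µs over 1,000 requests, the average is 3,000 µs per request; with a 2,000 µs target and 4 replicas, the ratio asks for 6. In practice the metric would be exposed to the HPA through a custom or external metrics adapter rather than computed in-process.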

Model Repository In Kubernetes

Patterns:

Init container downloads model and verifies checksum.
Sidecar syncs model repository.
CSI/object store mount.
Baked model image.
Read-only persistent volume.

Questions to ask:

What happens if object storage is slow?
Does every pod download simultaneously?
Are artifacts immutable?
How is checksum validated?
How is rollback handled?
Is model load observed?
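
For the init-container pattern, checksum validation can be as small as the sketch below. It assumes the expected SHA-256 digest arrives with the release metadata; `sha256_of` and `verify_artifact` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model artifacts do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """True only if the downloaded artifact matches the published digest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

An init container would exit non-zero when `verify_artifact` returns False, so the Triton container never starts against a corrupt or substituted artifact.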

Observability

Dashboard sections:

Request SLO by model/version.
Queue duration.
Compute duration.
Batch size distribution if available.
Error code breakdown.
Model load status.
GPU utilization/memory.
Pod/node/GPU mapping.
Rollout markers.

Alert on:

SLO burn.
Model unavailable.
Sustained high queue duration.
GPU memory critical.
Repeated model load failures.
Triton process crash loops.
Metrics endpoint missing.
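
The SLO burn alert can be made concrete with a burn-rate calculation: observed error ratio divided by the error budget (1 minus the SLO target). The sketch below pages only when both a long and a short window exceed the threshold, a common multi-window pattern. The 14.4 default and the window contents are assumptions for illustration, not values from this document.

```python
from typing import Tuple


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad / total) / error_budget


def should_page(
    long_window: Tuple[int, int],
    short_window: Tuple[int, int],
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast: the long one filters blips,
    the short one confirms the problem is still happening."""
    long_burn = burn_rate(long_window[0], long_window[1], slo_target)
    short_burn = burn_rate(short_window[0], short_window[1], slo_target)
    return long_burn >= threshold and short_burn >= threshold
```

Each window is a `(bad_requests, total_requests)` pair, where "bad" can include both errors and requests slower than the latency objective, computed per model and version.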

Incident Drill: Model Load Failure

Symptoms:

Pod starts but readiness fails.
Triton logs model load errors.
Model endpoint unavailable.

Debug:

Repository layout.
config.pbtxt.
Backend availability.
File permissions.
Artifact checksum.
TensorRT engine compatibility.
GPU memory.
Runtime image/backend version.

Mitigation:

Stop rollout.
Keep previous pods serving.
Roll back artifact/config.
Mark bad version.
Add validation to promotion pipeline.
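
The validation added to the promotion pipeline could start with a repository layout check like the sketch below. It assumes the conventional layout of `<repository>/<model>/config.pbtxt` plus numeric version directories containing the model files, and it treats `config.pbtxt` as required because the rollout pattern here generates one. `layout_errors` is a hypothetical helper, and it does not parse the config or check backend compatibility.

```python
from pathlib import Path
from typing import List


def layout_errors(repository: Path, model: str) -> List[str]:
    """Return human-readable layout problems for one model (empty list = OK)."""
    model_dir = repository / model
    if not model_dir.is_dir():
        return [f"missing model directory: {model}"]

    errors = []
    if not (model_dir / "config.pbtxt").is_file():
        errors.append("missing config.pbtxt")

    # Version directories are named with integers, e.g. "1", "2".
    versions = sorted(
        (d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: int(d.name),
    )
    if not versions:
        errors.append("no numeric version directory")
    for version_dir in versions:
        if not any(version_dir.iterdir()):
            errors.append(f"version {version_dir.name} has no model files")
    return errors
```

Running this before the canary step turns a class of load failures into a failed pipeline stage instead of a failed readiness probe.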

Incident Drill: p99 Spike

Debug:

Queue vs compute duration.
Batch settings.
Actual request shapes.
Tenant mix.
GPU memory/utilization.
CPU preprocessing.
Gateway retries/timeouts.
Recent rollout/config changes.

Mitigation:

Roll back config if tied to rollout.
Tighten the dynamic batching queue delay (max_queue_delay_microseconds).
Shift low-priority traffic.
Add capacity if a warm path exists.
Disable the noisy co-tenant model.

flowchart TD
  Page[p99 page] --> Freeze[Freeze deploys and capture window]
  Freeze --> Recent{Recent rollout or traffic shift}
  Recent -- yes --> Mitigate[Rollback or reduce traffic]
  Recent -- no --> Layer[Compare gateway, Triton, GPU, node]
  Layer --> Queue{Queue time up}
  Queue -- yes --> Admission[Lower admission or add warm capacity]
  Queue -- no --> GPUPressure{GPU memory or SM saturated}
  GPUPressure -- yes --> Capacity[Scale, reduce batch, or move tenants]
  GPUPressure -- no --> External[Check DNS, network, client retries, storage]
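
The decision flow above can be written down as a small function, which is handy for runbook automation or for rehearsing the drill. This is a sketch: the `Signals` fields and the `queue_time_up` baseline factor of 2.0 are assumptions, and the returned strings are the node labels from the flowchart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    recent_change: bool   # rollout or traffic shift inside the incident window
    queue_time_up: bool   # queue duration elevated versus baseline
    gpu_saturated: bool   # GPU memory or SM saturated


def queue_time_up(current_us: float, baseline_us: float, factor: float = 2.0) -> bool:
    """Hypothetical rule: queue time counts as 'up' when it exceeds factor x baseline."""
    return current_us > baseline_us * factor


def triage(signals: Signals) -> str:
    """Walk the flowchart top to bottom and return the first matching action."""
    if signals.recent_change:
        return "Rollback or reduce traffic"
    if signals.queue_time_up:
        return "Lower admission or add warm capacity"
    if signals.gpu_saturated:
        return "Scale, reduce batch, or move tenants"
    return "Check DNS, network, client retries, storage"
```

The ordering matters: a recent change is checked first because rollback is usually the cheapest mitigation, and external causes are considered only after queue and GPU pressure are ruled out.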

Security

Production controls:

Signed runtime image.
Immutable model artifacts.
Least privilege access to model store.
Read-only mounts where possible.
Network policy around metrics and inference ports.
RBAC for repository control APIs.
Audit logs for model load/unload.

Staff-Level Operating Model

Define the contract between ML and platform:

Artifact format.
IO schema.
Runtime/backend.
Resource envelope.
Valid request shapes.
Health check.
Rollout policy.
Owner and escalation path.
Observability labels.
Performance validation.
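
Part of this contract can be made machine-checkable. The sketch below covers only the owner, backend, resource envelope, and valid request shapes; `ModelContract` and its fields are hypothetical, and it assumes `-1` marks a variable dimension, as in Triton's `config.pbtxt` dims.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


def shape_matches(shape: Shape, pattern: Shape) -> bool:
    """A request shape matches when ranks agree and each dim equals the
    pattern dim, or the pattern dim is -1 (variable) and the dim is positive."""
    if len(shape) != len(pattern):
        return False
    return all((p == -1 and s > 0) or p == s for s, p in zip(shape, pattern))


@dataclass(frozen=True)
class ModelContract:
    model: str
    owner: str
    backend: str
    max_gpu_memory_gib: float
    valid_shapes: Tuple[Shape, ...]

    def accepts(self, shape: Shape) -> bool:
        """True if the request shape is one the ML team has validated."""
        return any(shape_matches(shape, pattern) for pattern in self.valid_shapes)
```

For example, a contract with `valid_shapes=((-1, 3, 224, 224),)` accepts `(8, 3, 224, 224)` but rejects `(8, 3, 256, 256)`, which gives the platform a concrete basis for rejecting traffic that was never performance-validated.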

What To Say In An Interview

I would run Triton as a controlled production substrate, not as an opaque model container. The platform should validate repository layout, config, runtime compatibility, representative inference, metrics, and rollback before traffic shifts.