Inference Operations
Mental Model
An inference platform turns model artifacts plus hardware capacity into low-latency, reliable predictions. Operations must keep five things true:
- The right model version is loaded and serving.
- Requests meet latency and availability SLOs.
- GPU capacity is allocated efficiently.
- Rollouts and rollbacks are controlled.
- Failures are isolated, diagnosed, and repaired quickly.
flowchart LR
    Product[Product request] --> Gateway[Gateway]
    Gateway --> Service[Inference service]
    Service --> Queue[Admission and queue]
    Queue --> Runtime[Model runtime]
    Runtime --> GPU[GPU execution]
    GPU --> Response[Response]
    Registry[Model registry] --> Runtime
    Observability[Metrics logs traces] -. observes .-> Gateway
    Observability -. observes .-> Queue
    Observability -. observes .-> GPU
Core Concepts
| Concept | Interview-ready explanation |
|---|---|
| Latency vs throughput | Inference serving must balance tail latency against throughput (tokens, images, or requests per second). Batching can raise throughput but hurt p99 latency if the batch wait is not bounded. |
| Cold start | Model load, compilation, graph optimization, cache warmup, and GPU memory allocation can make new replicas slow or temporarily unavailable. |
| Dynamic batching | Groups requests to improve GPU utilization. Must respect max wait time and request shape compatibility. |
| GPU memory | Failures often come from fragmentation, model size, KV cache growth, or multiple processes contending for memory. |
| Model rollout | Needs artifact immutability, config validation, canarying, shadowing if possible, and fast rollback. |
| Multi-tenancy | Requires quotas, isolation, priority, noisy-neighbor controls, and fair scheduling. |
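
The dynamic batching row is easier to reason about with a toy simulation. The following is a minimal sketch, not how any particular model server implements batching: the helper names `form_batches` and `queue_waits` are hypothetical, arrival times are plain millisecond timestamps, and request shape compatibility is ignored.

```python
from typing import List


def form_batches(arrivals_ms: List[float], max_batch: int, max_wait_ms: float) -> List[List[float]]:
    """Group arrival times into batches (hypothetical helper).

    A batch closes when it already holds max_batch requests, or when the
    next request arrives after the first request's deadline
    (first arrival + max_wait_ms).
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        if current and (len(current) >= max_batch or t > current[0] + max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


def queue_waits(batches: List[List[float]], max_batch: int, max_wait_ms: float) -> List[float]:
    """Return how long each request waited before its batch was dispatched."""
    waits: List[float] = []
    for batch in batches:
        # A full batch dispatches when its last member arrives;
        # a partial batch dispatches when the first member's deadline expires.
        dispatch = batch[-1] if len(batch) == max_batch else batch[0] + max_wait_ms
        waits.extend(dispatch - t for t in batch)
    return waits


batches = form_batches([0, 1, 2, 10, 11, 30], max_batch=3, max_wait_ms=5)
waits = queue_waits(batches, max_batch=3, max_wait_ms=5)
```

With these inputs the sketch forms three batches, and no request waits longer than `max_wait_ms`. Raising `max_wait_ms` tends to produce fuller batches at the cost of a longer worst-case queue wait, which is the throughput versus p99 trade-off described in the table.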
NVIDIA-Relevant Stack Terms
You do not need to claim expertise you do not have, but you should know the vocabulary:
- Triton Inference Server: NVIDIA model serving system supporting multiple backends, dynamic batching, model repository patterns, metrics, and concurrent model execution.
- TensorRT: SDK/runtime for optimizing neural network inference on NVIDIA GPUs.
- CUDA: GPU programming platform and runtime layer.
- NVIDIA GPU Operator: Kubernetes operator that manages GPU driver, container toolkit, device plugin, DCGM exporter, and related GPU components.
- DCGM: Data Center GPU Manager, commonly used for GPU health and telemetry.
- MIG: Multi-Instance GPU, hardware partitioning on supported NVIDIA GPUs.
Operational Metrics
Track at these layers:
| Layer | Metrics |
|---|---|
| Request | RPS, error rate, p50/p95/p99 latency, timeout rate, queue time. |
| Model server | batch size, batch wait, inference time, model load failures, version status. |
| GPU | utilization, memory used/free, ECC errors, Xid errors, temperature, power, throttling. |
| Node | CPU, memory, disk, kernel logs, container runtime, driver health. |
| Scheduler | pending pods, unschedulable reasons, resource fragmentation, eviction events. |
| Deployment | canary health, rollback count, failed health checks, config validation failures. |
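
The request-layer percentiles can be computed from raw latency samples. This is a minimal sketch using the nearest-rank method; the function names and the sample data are illustrative assumptions, and production systems usually derive percentiles from histograms rather than raw lists.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize latency samples as p50/p95/p99 (hypothetical helper)."""
    return {name: percentile(samples_ms, pct) for name, pct in (("p50", 50), ("p95", 95), ("p99", 99))}


# Illustrative data: 98 fast requests and 2 slow outliers.
summary = latency_summary([10.0] * 98 + [500.0, 900.0])
```

In this made-up sample, p50 and p95 are both 10 ms while p99 is 500 ms, which is why tail percentiles are tracked separately from the median.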
Common Failure Modes
- Pods pending because GPU resource requests cannot fit due to fragmentation or taints.
- Model loads but fails under real request shape due to memory spikes.
- p99 latency regresses after enabling batching or changing max batch delay.
- GPU node reports Xid/ECC errors; workloads fail or silently slow down.
- Driver/runtime/container toolkit mismatch after node image update.
- Autoscaler adds nodes but model warmup is too slow to absorb burst.
- Network or storage path bottlenecks model artifact download.
- Health checks pass because the model is loaded, even though it is not actually producing valid responses.
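
The last failure mode above suggests a health check that exercises the model rather than only confirming it is loaded. The following is a minimal sketch under stated assumptions: `deep_health_check`, the `infer` callable, and the golden input/output pair are all hypothetical stand-ins for whatever client and reference data a real service has.

```python
import math
from typing import Callable, Sequence


def deep_health_check(
    infer: Callable[[Sequence[float]], Sequence[float]],
    golden_input: Sequence[float],
    expected_output: Sequence[float],
    tolerance: float = 1e-3,
) -> bool:
    """Return True only if the model answers a known input with a valid, expected output."""
    try:
        output = infer(golden_input)
    except Exception:
        return False  # failing on a known-good input counts as unhealthy
    if len(output) != len(expected_output):
        return False
    return all(
        math.isfinite(got) and abs(got - want) <= tolerance
        for got, want in zip(output, expected_output)
    )


def healthy_model(x):
    return [2.0 * v for v in x]


def nan_model(x):
    return [float("nan") for _ in x]  # loaded, but output is garbage


def crashing_model(x):
    raise RuntimeError("out of memory")


results = [
    deep_health_check(model, [1.0, 2.0], [2.0, 4.0])
    for model in (healthy_model, nan_model, crashing_model)
]
```

Only the first toy model passes. A check like this costs one real inference per probe, so the probe interval and the size of the golden input are a trade-off against serving capacity.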
Strong Debug Answer
Prompt: “p99 latency doubled after a rollout. What do you do?”
Answer frame:
- Freeze or slow rollout and compare canary vs baseline.
- Check request mix, traffic volume, error rate, queue time, and timeout rate.
- Split server time into preprocessing, queue/batch wait, GPU compute, postprocessing.
- Compare GPU utilization, memory, throttling, ECC/Xid, CPU saturation, and network/storage.
- Inspect model version/config changes: precision, batch settings, max sequence length, runtime image, driver compatibility.
- Roll back if the SLO impact is ongoing and the evidence points to the rollout.
- Preserve the evidence and add prevention: load test gate, canary metric, config diff, model warmup validation.
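
The "compare canary vs baseline" and "split server time" steps above can be sketched as a small comparison. This is an illustration only: the stage names mirror the answer frame, while the helper names, the 1.2x threshold, and all timing numbers are made-up assumptions.

```python
from typing import Dict, Tuple


def p99_regressed(baseline_p99_ms: float, canary_p99_ms: float, max_ratio: float = 1.2) -> bool:
    """Flag the canary if its p99 exceeds the baseline by more than max_ratio (assumed threshold)."""
    return canary_p99_ms > baseline_p99_ms * max_ratio


def largest_regression(baseline_ms: Dict[str, float], canary_ms: Dict[str, float]) -> Tuple[str, float]:
    """Return (stage, delta_ms) for the stage whose time grew the most."""
    deltas = {stage: canary_ms[stage] - baseline_ms[stage] for stage in baseline_ms}
    stage = max(deltas, key=deltas.get)
    return stage, deltas[stage]


# Hypothetical per-stage p99 timings in milliseconds.
baseline = {"preprocess": 5.0, "queue_wait": 10.0, "gpu_compute": 40.0, "postprocess": 5.0}
canary = {"preprocess": 5.0, "queue_wait": 65.0, "gpu_compute": 42.0, "postprocess": 5.0}

regressed = p99_regressed(sum(baseline.values()), sum(canary.values()))
culprit = largest_regression(baseline, canary)
```

In this invented data the regression sits in queue wait rather than GPU compute, which would point the investigation at batch settings or admission control before the model itself. Summing per-stage p99 values is a simplification; real end-to-end p99 should come from end-to-end measurements.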
Senior/Staff Close
I would treat inference reliability as a contract between model teams and platform teams: model artifact, runtime image, resource envelope, health semantics, rollout policy, and observability requirements. Automation should enforce that contract before production traffic sees the model.
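
One way to picture automation enforcing that contract is a pre-deployment validation step. The sketch below is purely illustrative: `ServingContract`, its fields, the required metric names, and every rule in `contract_violations` are hypothetical choices, not an established schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed minimum observability requirements for this sketch.
REQUIRED_METRICS = {"p99_latency_ms", "error_rate"}


@dataclass(frozen=True)
class ServingContract:
    artifact_digest: str              # immutable reference to the model artifact
    runtime_image: str                # container image for the model runtime
    gpu_memory_gib: float             # resource envelope
    health_check_kind: str            # health semantics, e.g. "golden_request"
    canary_percent: int               # rollout policy
    required_metrics: Tuple[str, ...]  # observability requirements


def contract_violations(contract: ServingContract) -> List[str]:
    """Return human-readable violations; an empty list means the contract passes."""
    problems: List[str] = []
    if not contract.artifact_digest.startswith("sha256:"):
        problems.append("artifact must be referenced by an immutable digest")
    if contract.runtime_image.endswith(":latest"):
        problems.append("runtime image must be pinned, not :latest")
    if contract.gpu_memory_gib <= 0:
        problems.append("resource envelope must declare GPU memory")
    if not contract.health_check_kind:
        problems.append("health semantics must be declared")
    if not 0 < contract.canary_percent <= 50:
        problems.append("canary percent must be between 1 and 50")
    missing = REQUIRED_METRICS - set(contract.required_metrics)
    if missing:
        problems.append("missing metrics: " + ", ".join(sorted(missing)))
    return problems


good = ServingContract("sha256:abc123", "runtime:1.4.2", 16.0, "golden_request", 5,
                       ("p99_latency_ms", "error_rate"))
bad = ServingContract("model-v2", "runtime:latest", 0.0, "", 100, ("error_rate",))
good_problems = contract_violations(good)
bad_problems = contract_violations(bad)
```

Running a check like this in the deployment pipeline turns the contract into a gate: the hypothetical `good` contract passes with no violations, while `bad` is rejected with one message per broken clause.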