Inference Operations

Mental Model

An inference platform turns model artifacts plus hardware capacity into low-latency, reliable predictions. Operations must keep five things true:

The right model version is loaded and serving.
Requests meet latency and availability SLOs.
GPU capacity is allocated efficiently.
Rollouts and rollbacks are controlled.
Failures are isolated, diagnosed, and repaired quickly.

flowchart LR
  Product[Product request] --> Gateway[Gateway]
  Gateway --> Service[Inference service]
  Service --> Queue[Admission and queue]
  Queue --> Runtime[Model runtime]
  Runtime --> GPU[GPU execution]
  GPU --> Response[Response]

  Registry[Model registry] --> Runtime
  Observability[Metrics, logs, traces] -. observes .-> Gateway
  Observability -. observes .-> Queue
  Observability -. observes .-> GPU

Core Concepts

Concept	Interview-ready explanation
Latency vs throughput	Inference must balance tail latency against throughput (tokens, images, or requests per second). Batching improves throughput but can hurt p99 if batch wait time is not bounded.
Cold start	Model load, compilation, graph optimization, cache warmup, and GPU memory allocation can make new replicas slow or temporarily unavailable.
Dynamic batching	Groups requests to improve GPU utilization. Must respect max wait time and request shape compatibility.
GPU memory	Failures often come from fragmentation, model size, KV cache growth, or multiple processes contending for memory.
Model rollout	Needs artifact immutability, config validation, canarying, shadowing if possible, and fast rollback.
Multi-tenancy	Requires quotas, isolation, priority, noisy-neighbor controls, and fair scheduling.
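The dynamic batching row above can be made concrete with a small simulation. This is a minimal sketch, not how Triton or any particular server implements batching. The names `Request`, `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical, and "shape compatibility" is simplified to "identical shape".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    arrival_ms: float  # time the request entered the queue
    shape: tuple       # input shape; only identical shapes share a batch (assumption)


def form_batches(requests, max_batch_size, max_wait_ms):
    """Group requests into per-shape batches.

    A batch is dispatched when it is full, or when its oldest request has
    waited max_wait_ms, whichever happens first.
    Returns a list of (dispatch_ms, [requests]) sorted by dispatch time.
    """
    open_batches = {}  # shape -> requests still waiting
    dispatched = []

    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Flush every open batch whose deadline passed before this arrival.
        for shape, batch in list(open_batches.items()):
            deadline = batch[0].arrival_ms + max_wait_ms
            if deadline <= req.arrival_ms:
                dispatched.append((deadline, batch))
                del open_batches[shape]

        batch = open_batches.setdefault(req.shape, [])
        batch.append(req)
        if len(batch) == max_batch_size:
            # Full batch: dispatch immediately, no extra waiting.
            dispatched.append((req.arrival_ms, batch))
            del open_batches[req.shape]

    # Anything still open is dispatched at its own deadline.
    for batch in open_batches.values():
        dispatched.append((batch[0].arrival_ms + max_wait_ms, batch))

    dispatched.sort(key=lambda item: item[0])
    return dispatched


if __name__ == "__main__":
    reqs = [Request(0, (1, 8)), Request(1, (1, 8)),
            Request(2, (1, 8)), Request(3, (1, 16))]
    for dispatch_ms, batch in form_batches(reqs, max_batch_size=2, max_wait_ms=5):
        print(dispatch_ms, [r.arrival_ms for r in batch])
```

In this sketch no request waits in the queue longer than `max_wait_ms`, which is the knob that trades throughput against p99: a larger value fills batches more often but adds queue time to lightly loaded shapes.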

NVIDIA-Relevant Stack Terms

You do not need to claim expertise you do not have, but you should know the vocabulary:

Triton Inference Server: NVIDIA model serving system supporting multiple backends, dynamic batching, model repository patterns, metrics, and concurrent model execution.
TensorRT: SDK/runtime for optimizing neural network inference on NVIDIA GPUs.
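CUDA: NVIDIA's parallel computing platform and programming model for GPUs, including the runtime and driver APIs that higher-level libraries build on.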
NVIDIA GPU Operator: Kubernetes operator that manages GPU driver, container toolkit, device plugin, DCGM exporter, and related GPU components.
DCGM: Data Center GPU Manager, commonly used for GPU health and telemetry.
MIG: Multi-Instance GPU, hardware partitioning on supported NVIDIA GPUs.

Operational Metrics

Track metrics at each of these layers:

Layer	Metrics
Request	RPS, error rate, p50/p95/p99 latency, timeout rate, queue time.
Model server	batch size, batch wait, inference time, model load failures, version status.
GPU	utilization, memory used/free, ECC errors, Xid errors, temperature, power, throttling.
Node	CPU, memory, disk, kernel logs, container runtime, driver health.
Scheduler	pending pods, unschedulable reasons, resource fragmentation, eviction events.
Deployment	canary health, rollback count, failed health checks, config validation failures.
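To make the request-layer metrics concrete, the sketch below computes p50/p95/p99 and a timeout rate from raw latency samples. It assumes the nearest-rank percentile definition (production metrics systems usually approximate percentiles from histograms instead), and `percentile` and `latency_summary` are hypothetical helper names.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(latencies_ms, timeout_ms):
    """Summarize one window of request latencies (milliseconds)."""
    timed_out = sum(1 for ms in latencies_ms if ms >= timeout_ms)
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "timeout_rate": timed_out / len(latencies_ms),
    }


if __name__ == "__main__":
    # 98 fast requests and 2 slow ones: the median hides the tail, p99 does not.
    window = [10] * 98 + [500] * 2
    print(latency_summary(window, timeout_ms=400))
```

With 98 requests at 10 ms and 2 at 500 ms, the mean is 19.8 ms and p50 and p95 are both 10 ms, while p99 is 500 ms. This is why tail percentiles are tracked separately from averages.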

Common Failure Modes

Pods pending because GPU resource requests cannot fit due to fragmentation or taints.
Model loads but fails under real request shapes because of memory spikes, for example from long sequences or large batches.
p99 latency regresses after enabling batching or changing max batch delay.
GPU node reports Xid/ECC errors; workloads fail or silently slow down.
Driver/runtime/container toolkit mismatch after node image update.
Autoscaler adds nodes but model warmup is too slow to absorb burst.
Network or storage path bottlenecks model artifact download.
Health checks pass because the model is loaded, even though it is not producing valid responses.
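The first failure mode in this list (pods pending because of fragmentation) comes down to simple arithmetic: a pod's GPUs must all come from one node, so cluster-wide free capacity can be sufficient while no single node fits the request. The sketch below illustrates that check. The node names and the `fragmentation_report` helper are hypothetical, and it deliberately ignores taints, topology, and MIG profiles.

```python
def placeable_nodes(free_gpus_by_node, gpus_per_pod):
    """Nodes with enough free GPUs to host one pod on their own."""
    return sorted(
        node for node, free in free_gpus_by_node.items() if free >= gpus_per_pod
    )


def fragmentation_report(free_gpus_by_node, gpus_per_pod):
    """Flag the case where total free GPUs would fit the pod,
    but no single node has enough of them."""
    total_free = sum(free_gpus_by_node.values())
    nodes = placeable_nodes(free_gpus_by_node, gpus_per_pod)
    return {
        "total_free_gpus": total_free,
        "placeable_nodes": nodes,
        "fragmented": total_free >= gpus_per_pod and not nodes,
    }


if __name__ == "__main__":
    free = {"node-a": 2, "node-b": 1, "node-c": 1}  # hypothetical cluster state
    print(fragmentation_report(free, gpus_per_pod=4))
    print(fragmentation_report(free, gpus_per_pod=2))
```

In the first call the cluster has four free GPUs but the four-GPU pod stays pending. In the second call, `node-a` can host the pod. A dashboard that only shows total free GPUs would hide the difference.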

Strong Debug Answer

Prompt: “p99 latency doubled after a rollout. What do you do?”

Answer frame:

Freeze or slow the rollout and compare canary vs baseline.
Check request mix, traffic volume, error rate, queue time, and timeout rate.
Split server time into preprocessing, queue/batch wait, GPU compute, postprocessing.
Compare GPU utilization, memory, throttling, ECC/Xid, CPU saturation, and network/storage.
Inspect model version/config changes: precision, batch settings, max sequence length, runtime image, driver compatibility.
Roll back if SLO impact is active and evidence points to the rollout.
Preserve diagnostic data and add prevention: load test gate, canary metric, config diff, model warmup validation.
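The stage-split and rollback steps of this answer frame can be sketched as a comparison of per-stage p99 between canary and baseline. This is an illustrative sketch only: the stage names, the 1.5x threshold, and the functions `stage_regressions` and `should_roll_back` are assumptions, not a standard tool. Note that per-stage p99 values do not sum to the end-to-end p99, so the end-to-end number is passed in separately.

```python
STAGES = ("preprocess", "queue_wait", "gpu_compute", "postprocess")


def stage_regressions(baseline_p99_ms, canary_p99_ms, threshold=1.5):
    """Return (stage, ratio) pairs where canary p99 is at least
    threshold times the baseline p99, worst regression first."""
    regressions = []
    for stage in STAGES:
        base = baseline_p99_ms[stage]
        if base <= 0:
            continue  # no meaningful ratio for an empty baseline
        ratio = canary_p99_ms[stage] / base
        if ratio >= threshold:
            regressions.append((stage, ratio))
    return sorted(regressions, key=lambda item: item[1], reverse=True)


def should_roll_back(slo_p99_ms, canary_p99_ms, baseline_p99_ms):
    """Roll back when the canary breaches the SLO while the baseline,
    serving the same traffic, does not (evidence points to the rollout)."""
    slo_breached = canary_p99_ms > slo_p99_ms
    rollout_implicated = baseline_p99_ms <= slo_p99_ms
    return slo_breached and rollout_implicated


if __name__ == "__main__":
    baseline = {"preprocess": 2, "queue_wait": 5, "gpu_compute": 40, "postprocess": 3}
    canary = {"preprocess": 2, "queue_wait": 45, "gpu_compute": 42, "postprocess": 3}
    print(stage_regressions(baseline, canary))
    print(should_roll_back(slo_p99_ms=80, canary_p99_ms=95, baseline_p99_ms=50))
```

In this made-up example GPU compute is nearly unchanged and queue wait grew ninefold, which points at batch or admission settings and away from the model kernels themselves. If the baseline were also over the SLO, the sketch would not recommend a rollback, because the cause is more likely shared (traffic mix, hardware, or a dependency).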

Senior/Staff Close

I would treat inference reliability as a contract between model teams and platform teams: model artifact, runtime image, resource envelope, health semantics, rollout policy, and observability requirements. Automation should enforce that contract before production traffic sees the model.
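One way to picture "automation should enforce that contract" is a pre-deployment validator. The sketch below is a hypothetical example: the `ModelContract` fields mirror the items named above, but the specific rules (digest pinning, a 1 to 50 percent canary range, two required metrics) are assumptions chosen for illustration, not an established policy.

```python
from dataclasses import dataclass

HEX = set("0123456789abcdef")
REQUIRED_METRICS = {"latency_p99", "error_rate"}  # assumed minimum set


@dataclass(frozen=True)
class ModelContract:
    artifact_sha256: str     # immutable model artifact digest
    runtime_image: str       # container image reference
    gpu_memory_gib: float    # declared resource envelope
    health_probe_path: str   # endpoint that exercises real inference
    canary_percent: int      # rollout policy: initial traffic share
    exported_metrics: tuple  # observability requirements


def validate_contract(contract, node_gpu_memory_gib):
    """Return a list of violations; an empty list means the contract passes."""
    errors = []
    digest = contract.artifact_sha256
    if len(digest) != 64 or not set(digest) <= HEX:
        errors.append("artifact must be pinned by a sha256 digest")
    if "@sha256:" not in contract.runtime_image:
        errors.append("runtime image must be pinned by digest, not a mutable tag")
    if not 0 < contract.gpu_memory_gib <= node_gpu_memory_gib:
        errors.append("GPU memory envelope does not fit the target node")
    if not contract.health_probe_path.startswith("/"):
        errors.append("health probe path must be an absolute URL path")
    if not 1 <= contract.canary_percent <= 50:
        errors.append("canary percent must be between 1 and 50")
    missing = REQUIRED_METRICS - set(contract.exported_metrics)
    if missing:
        errors.append("missing required metrics: " + ", ".join(sorted(missing)))
    return errors


if __name__ == "__main__":
    contract = ModelContract(
        artifact_sha256="a" * 64,
        runtime_image="registry.example/runtime@sha256:" + "b" * 64,
        gpu_memory_gib=20.0,
        health_probe_path="/ready",
        canary_percent=5,
        exported_metrics=("latency_p99", "error_rate"),
    )
    print(validate_contract(contract, node_gpu_memory_gib=80.0))
```

Run as a CI or admission step, a check like this rejects a deployment before any production traffic reaches the model, which is the point of treating the contract as enforceable and not merely documented.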