Inference Operations

An inference platform turns model artifacts plus hardware capacity into low-latency, reliable predictions. Operations must keep five things true:

  • The right model version is loaded and serving.
  • Requests meet latency and availability SLOs.
  • GPU capacity is allocated efficiently.
  • Rollouts and rollbacks are controlled.
  • Failures are isolated, diagnosed, and repaired quickly.

```mermaid
flowchart LR
  Product[Product request] --> Gateway[Gateway]
  Gateway --> Service[Inference service]
  Service --> Queue[Admission and queue]
  Queue --> Runtime[Model runtime]
  Runtime --> GPU[GPU execution]
  GPU --> Response[Response]

  Registry[Model registry] --> Runtime
  Observability[Metrics logs traces] -. observes .-> Gateway
  Observability -. observes .-> Queue
  Observability -. observes .-> GPU
```

| Concept | Interview-ready explanation |
| --- | --- |
| Latency vs throughput | Inference optimizes both tail latency and tokens/images/requests per second. Batching can improve throughput but hurt p99 if not controlled. |
| Cold start | Model load, compilation, graph optimization, cache warmup, and GPU memory allocation can make new replicas slow or temporarily unavailable. |
| Dynamic batching | Groups requests to improve GPU utilization. Must respect max wait time and request shape compatibility. |
| GPU memory | Failures often come from fragmentation, model size, KV cache growth, or multiple processes contending for memory. |
| Model rollout | Needs artifact immutability, config validation, canarying, shadowing if possible, and fast rollback. |
| Multi-tenancy | Requires quotas, isolation, priority, noisy-neighbor controls, and fair scheduling. |
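
The dynamic batching trade-off above can be made concrete with a small simulation. This is a minimal sketch, not how any particular model server implements batching: `Batch`, `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical names, and times are integer milliseconds to keep the arithmetic exact. A batch is dispatched when it is full or when its oldest request has waited `max_wait_ms`, and requests with different shapes never share a batch.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    shape: tuple
    opened_at_ms: int
    request_ids: list = field(default_factory=list)
    dispatched_at_ms: int = -1  # set when the batch is sent to the GPU


def form_batches(requests, max_batch_size, max_wait_ms):
    """Group (request_id, arrival_ms, shape) tuples, sorted by arrival time."""
    open_batches = {}  # shape -> Batch that is still collecting requests
    dispatched = []

    def flush(shape, when_ms):
        batch = open_batches.pop(shape)
        batch.dispatched_at_ms = when_ms
        dispatched.append(batch)

    for request_id, arrival_ms, shape in requests:
        # Dispatch any batch whose oldest request has hit the wait deadline.
        for open_shape in list(open_batches):
            deadline_ms = open_batches[open_shape].opened_at_ms + max_wait_ms
            if arrival_ms >= deadline_ms:
                flush(open_shape, deadline_ms)
        # Only requests with the same shape may share a batch.
        if shape not in open_batches:
            open_batches[shape] = Batch(shape=shape, opened_at_ms=arrival_ms)
        batch = open_batches[shape]
        batch.request_ids.append(request_id)
        if len(batch.request_ids) >= max_batch_size:
            flush(shape, arrival_ms)  # full batch: dispatch immediately

    # Remaining batches go out when their wait deadline expires.
    for open_shape in list(open_batches):
        flush(open_shape, open_batches[open_shape].opened_at_ms + max_wait_ms)
    return dispatched


requests = [(1, 0, (1, 128)), (2, 1, (1, 128)), (3, 2, (1, 256)), (4, 20, (1, 128))]
batches = form_batches(requests, max_batch_size=2, max_wait_ms=5)
print([(b.request_ids, b.dispatched_at_ms) for b in batches])
# prints [([1, 2], 1), ([3], 7), ([4], 25)]
```

Requests 3 and 4 each pay the full 5 ms wait because no compatible request arrives in time, which is the mechanism by which a larger batch delay can raise p99 even as average GPU utilization improves.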

You do not need to pretend to have expertise you lack, but you should know the vocabulary:

  • Triton Inference Server: NVIDIA model serving system supporting multiple backends, dynamic batching, model repository patterns, metrics, and concurrent model execution.
  • TensorRT: SDK/runtime for optimizing neural network inference on NVIDIA GPUs.
  • CUDA: GPU programming platform and runtime layer.
  • NVIDIA GPU Operator: Kubernetes operator that manages GPU driver, container toolkit, device plugin, DCGM exporter, and related GPU components.
  • DCGM: Data Center GPU Manager, commonly used for GPU health and telemetry.
  • MIG: Multi-Instance GPU, hardware partitioning on supported NVIDIA GPUs.

Track metrics at these layers:

| Layer | Metrics |
| --- | --- |
| Request | RPS, error rate, p50/p95/p99 latency, timeout rate, queue time. |
| Model server | Batch size, batch wait, inference time, model load failures, version status. |
| GPU | Utilization, memory used/free, ECC errors, Xid errors, temperature, power, throttling. |
| Node | CPU, memory, disk, kernel logs, container runtime, driver health. |
| Scheduler | Pending pods, unschedulable reasons, resource fragmentation, eviction events. |
| Deployment | Canary health, rollback count, failed health checks, config validation failures. |
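
The request-layer metrics can be derived from raw request records. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions; a production system would more likely read these from histograms in its metrics backend. The record keys `latency_ms`, `queue_ms`, and `status` are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_requests(requests):
    """requests: non-empty list of dicts with latency_ms, queue_ms, and status keys."""
    latencies = [r["latency_ms"] for r in requests]
    total = len(requests)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": sum(r["status"] == "error" for r in requests) / total,
        "timeout_rate": sum(r["status"] == "timeout" for r in requests) / total,
        "mean_queue_ms": sum(r["queue_ms"] for r in requests) / total,
    }
```

Reporting queue time separately from total latency matters because it distinguishes "the GPU got slower" from "requests waited longer before reaching the GPU".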

Common failure modes:

  • Pods stay pending because GPU resource requests cannot fit due to fragmentation or taints.
  • Model loads but fails under real request shape due to memory spikes.
  • p99 latency regresses after enabling batching or changing max batch delay.
  • GPU node reports Xid/ECC errors; workloads fail or silently slow down.
  • Driver/runtime/container toolkit mismatch after node image update.
  • Autoscaler adds nodes but model warmup is too slow to absorb burst.
  • Network or storage path bottlenecks model artifact download.
  • Health checks pass while the model is loaded but not actually producing valid responses.
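
The last failure mode, where health checks pass although the model is not producing valid responses, is usually addressed by having the readiness check run a real inference. A minimal sketch, assuming a hypothetical `predict` callable that wraps the loaded model and returns a sequence of floats, and a known-good "golden" input chosen by the model team:

```python
import math


def deep_health_check(predict, golden_input, expected_len):
    """Run one real inference and validate the output; returns (healthy, reason)."""
    try:
        output = list(predict(golden_input))
    except Exception as exc:  # any failure during inference means "not ready"
        return False, f"inference raised {type(exc).__name__}"
    if len(output) != expected_len:
        return False, "unexpected output length"
    if any(not math.isfinite(v) for v in output):
        return False, "non-finite values in output"
    return True, "ok"
```

A check like this costs GPU time on every probe, so the probe interval and the size of the golden input are part of the design decision.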

Prompt: “p99 latency doubled after a rollout. What do you do?”

Answer frame:

  1. Freeze or slow the rollout and compare canary vs baseline.
  2. Check request mix, traffic volume, error rate, queue time, and timeout rate.
  3. Split server time into preprocessing, queue/batch wait, GPU compute, postprocessing.
  4. Compare GPU utilization, memory, throttling, ECC/Xid, CPU saturation, and network/storage.
  5. Inspect model version/config changes: precision, batch settings, max sequence length, runtime image, driver compatibility.
  6. Roll back if SLO impact is active and evidence points to the rollout.
  7. Preserve evidence and add prevention: load test gate, canary metric, config diff, model warmup validation.
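
Steps 1, 3, and 6 of this frame can be sketched as a per-stage comparison between baseline and canary. The stage names, the 1.2 ratio threshold, and both function names are illustrative assumptions, not values from any specific platform.

```python
STAGES = ("preprocess", "queue_wait", "gpu_compute", "postprocess")


def stage_regressions(baseline_p99_ms, canary_p99_ms, max_ratio=1.2):
    """Return (stage, ratio) pairs where the canary is slower than allowed, worst first."""
    regressions = []
    for stage in STAGES:
        base = baseline_p99_ms[stage]
        canary = canary_p99_ms[stage]
        if base > 0:
            ratio = canary / base
        else:
            # No baseline time recorded: any canary time counts as a regression.
            ratio = float("inf") if canary > 0 else 1.0
        if ratio > max_ratio:
            regressions.append((stage, ratio))
    return sorted(regressions, key=lambda item: item[1], reverse=True)


def should_roll_back(end_to_end_p99_ms, slo_p99_ms, regressions):
    """Roll back only if the SLO is breached and the canary shows a stage regression."""
    return end_to_end_p99_ms > slo_p99_ms and bool(regressions)


baseline = {"preprocess": 4, "queue_wait": 10, "gpu_compute": 40, "postprocess": 3}
canary = {"preprocess": 4, "queue_wait": 55, "gpu_compute": 42, "postprocess": 3}
regressions = stage_regressions(baseline, canary)
print(regressions)                              # prints [('queue_wait', 5.5)]
print(should_roll_back(110, 80, regressions))   # prints True
```

In this made-up example the regression is in queue wait rather than GPU compute, which points toward batch settings or admission control rather than the model itself.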

I would treat inference reliability as a contract between model teams and platform teams: model artifact, runtime image, resource envelope, health semantics, rollout policy, and observability requirements. Automation should enforce that contract before production traffic sees the model.
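
One way to picture that contract is as a record that automation validates before a rollout starts. This is a sketch under stated assumptions: every field name, the required-metric set, and the 1 to 50 percent canary range are hypothetical choices for illustration.

```python
from dataclasses import dataclass

# Hypothetical minimum set of metrics a model must export.
REQUIRED_METRICS = frozenset({"request_latency", "queue_time", "error_rate"})


@dataclass(frozen=True)
class ModelContract:
    artifact_digest: str       # immutable artifact reference, e.g. "sha256:<hex>"
    runtime_image: str         # container image pinned by digest
    gpu_memory_gib: float      # resource envelope
    golden_request: str        # input used by the deep health check
    canary_percent: int        # share of traffic for the first rollout stage
    exported_metrics: frozenset = frozenset()


def contract_violations(contract):
    """Return human-readable problems; an empty list means the contract is accepted."""
    problems = []
    if not contract.artifact_digest.startswith("sha256:"):
        problems.append("artifact must be referenced by an immutable digest")
    if "@sha256:" not in contract.runtime_image:
        problems.append("runtime image must be pinned by digest")
    if contract.gpu_memory_gib <= 0:
        problems.append("resource envelope must declare GPU memory")
    if not contract.golden_request:
        problems.append("health semantics need a golden request")
    if not 0 < contract.canary_percent <= 50:
        problems.append("canary percent must be between 1 and 50")
    missing = REQUIRED_METRICS - set(contract.exported_metrics)
    if missing:
        problems.append("missing metrics: " + ", ".join(sorted(missing)))
    return problems
```

A deployment pipeline could refuse to start a rollout whenever `contract_violations` returns a non-empty list.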