
GPU Operator Runbook

NVIDIA’s GPU Operator automates the Kubernetes GPU software stack: drivers, NVIDIA Container Toolkit, device plugin, GPU Feature Discovery, DCGM monitoring, validators, and optional MIG management.

Interview thesis:

I prefer operator-managed GPU infrastructure because the risk is not only installing drivers; it is keeping driver, kernel, runtime, device plugin, telemetry, labels, and validation in a known-good state across a fleet.

| Component | Role |
| --- | --- |
| Driver | Kernel/user-space GPU driver enabling CUDA workloads. |
| NVIDIA Container Toolkit | Makes GPUs available to containers through the runtime. |
| Device plugin | Advertises GPU resources to kubelet and allocates devices to pods. |
| GPU Feature Discovery | Labels nodes with GPU and hardware capabilities. |
| DCGM exporter | Exposes GPU telemetry for monitoring. |
| MIG manager | Manages MIG partitioning on supported GPUs. |
| Validators | Confirm stack readiness after deployment. |

flowchart TD
  GPUNode[GPU node] --> Driver[NVIDIA driver]
  Driver --> Runtime[NVIDIA container runtime]
  Runtime --> DevicePlugin[Device plugin]
  DevicePlugin --> APIServer[Kubernetes allocatable GPUs]
  GPUNode --> DCGM[DCGM exporter]
  GPUNode --> GFD[GPU Feature Discovery]
  GPUNode --> MIG[MIG manager]
  DCGM --> Prometheus[Prometheus]
  GFD --> Scheduler[Scheduler placement]
  MIG --> DevicePlugin

For a production fleet:

  1. Segment nodes by environment, SKU, kernel, OS image, and workload class.
  2. Roll out operator changes to a canary pool.
  3. Run validator pods and a representative CUDA/inference test workload.
  4. Gate on device plugin resources, DCGM metrics, workload readiness, and absence of Xid/ECC spikes.
  5. Promote pool by pool.
  6. Keep a rollback plan for the operator version, driver container, node image, and workload runtime image.
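
The gate in step 4 can be sketched as a small pure function. This is a minimal illustration, not the operator's own mechanism: the `CanarySignals` fields, the `max_xid` threshold, and the failure messages are hypothetical names, and collecting the signals (from kubelet, DCGM, and the test workload) is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass
class CanarySignals:
    """Hypothetical per-pool signals gathered after a canary rollout."""
    nodes_total: int
    nodes_with_allocatable_gpus: int  # nodes reporting nvidia.com/gpu > 0
    validators_passed: bool
    dcgm_metrics_present: bool
    test_workload_ready: bool
    xid_events: int       # new Xid events since the rollout started
    ecc_uncorrected: int  # new uncorrected ECC errors since the rollout started


def gate_failures(s: CanarySignals, max_xid: int = 0) -> list[str]:
    """Return the reasons a canary pool must not be promoted (empty if none)."""
    failures = []
    if s.nodes_with_allocatable_gpus < s.nodes_total:
        failures.append("device plugin: some nodes advertise no GPUs")
    if not s.validators_passed:
        failures.append("validators: not all validator pods succeeded")
    if not s.dcgm_metrics_present:
        failures.append("telemetry: DCGM metrics missing")
    if not s.test_workload_ready:
        failures.append("workload: representative test did not become ready")
    if s.xid_events > max_xid:
        failures.append("health: Xid events above threshold")
    if s.ecc_uncorrected > 0:
        failures.append("health: uncorrected ECC errors observed")
    return failures


def may_promote(s: CanarySignals) -> bool:
    """Promote to the next pool only when no gate check failed."""
    return not gate_failures(s)
```

Returning the list of reasons rather than a bare boolean keeps the halt decision explainable when a rollout stops.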

Node readiness checklist:

  • Node has expected GPU hardware visible at OS level.
  • Node labels identify the NVIDIA PCI vendor ID (10de) and the GPU SKU.
  • GPU Operator operands are scheduled on the node.
  • Driver pod is healthy.
  • Device plugin pod is healthy.
  • Kubelet reports nvidia.com/gpu allocatable.
  • No taints block operator operands.
  • Node is not explicitly labeled to skip GPU operands.
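
The label, allocatable, and taint items above can be checked from a node object such as the one returned by `kubectl get node <name> -o json`. The sketch below is a rough illustration only. The label keys are assumptions based on common Node Feature Discovery and GPU Operator defaults and should be verified against the deployed versions; the `tolerated_taints` default is likewise an assumption. Operand pod health (driver, device plugin) is not covered here.

```python
# Assumed label/resource names; verify against your NFD and GPU Operator versions.
PCI_LABEL = "feature.node.kubernetes.io/pci-10de.present"
PRODUCT_LABEL = "nvidia.com/gpu.product"
SKIP_LABEL = "nvidia.com/gpu.deploy.operands"
GPU_RESOURCE = "nvidia.com/gpu"


def node_findings(node: dict,
                  tolerated_taints: frozenset = frozenset({"nvidia.com/gpu"})) -> list[str]:
    """Return checklist violations for one node object (empty list if clean)."""
    labels = node.get("metadata", {}).get("labels", {})
    allocatable = node.get("status", {}).get("allocatable", {})
    taints = node.get("spec", {}).get("taints") or []

    findings = []
    if labels.get(PCI_LABEL) != "true":
        findings.append("NVIDIA PCI vendor label missing")
    if PRODUCT_LABEL not in labels:
        findings.append("GPU product (SKU) label missing")
    if labels.get(SKIP_LABEL) == "false":
        findings.append("node is labeled to skip GPU operands")
    # Kubernetes serializes extended-resource quantities as strings, e.g. "8".
    if int(allocatable.get(GPU_RESOURCE, "0")) < 1:
        findings.append("kubelet reports no allocatable nvidia.com/gpu")
    for taint in taints:
        blocking = taint.get("effect") in ("NoSchedule", "NoExecute")
        if blocking and taint.get("key") not in tolerated_taints:
            findings.append(f"taint may block operands: {taint.get('key')}")
    return findings
```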

Pod and workload checklist:

  • Pod requests nvidia.com/gpu.
  • Pod lands on GPU node.
  • Device plugin allocation succeeds.
  • Runtime class/toolkit exposes devices.
  • Container image CUDA version is compatible with driver.
  • nvidia-smi works in a debug container or equivalent.
  • Application sees CUDA device and expected memory.
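
A smoke-test pod exercises most of this checklist in one step: it requests `nvidia.com/gpu`, must land on a GPU node, and runs `nvidia-smi` inside the container. The helper below is a minimal sketch that builds such a manifest. The function name, pod name, and image reference are hypothetical placeholders, and the image is assumed to be one in which `nvidia-smi` is available once the toolkit injects the driver utilities.

```python
import json
from typing import Optional


def smoke_test_pod(name: str, image: str, gpus: int = 1,
                   runtime_class: Optional[str] = None) -> dict:
    """Build a one-shot Pod manifest that requests GPUs and runs nvidia-smi."""
    spec = {
        "restartPolicy": "Never",
        "containers": [{
            "name": "smoke",
            "image": image,
            "command": ["nvidia-smi"],
            # Extended resources are requested through limits.
            "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
        }],
    }
    if runtime_class is not None:
        # Only needed when the cluster selects the NVIDIA runtime via a RuntimeClass.
        spec["runtimeClassName"] = runtime_class
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Placeholder image reference; substitute an image your cluster can pull.
    pod = smoke_test_pod("gpu-smoke", "registry.example.com/cuda-base:placeholder")
    print(json.dumps(pod, indent=2))
```

The JSON output can be piped to `kubectl apply -f -`. A pod stuck in Pending points at scheduling or allocation, while a failing `nvidia-smi` points at the runtime or driver layer.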

Do not blur these:

  • MIG: hardware-level partitioning on supported GPUs, stronger isolation and predictable slices.
  • Time-slicing: multiple workloads share a GPU over time; useful for low-duty workloads, weaker isolation.
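  • MPS: CUDA Multi-Process Service; lets multiple processes run concurrently on one GPU, which can improve utilization for compatible CUDA workloads, with weaker fault isolation than MIG.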

Senior tradeoff:

For latency-sensitive inference, I would be careful with time-slicing unless the workload is proven compatible. MIG gives clearer capacity accounting and isolation where supported, but it introduces partition planning and can reduce flexibility if shapes are wrong.
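
To make "shapes are wrong" concrete, the sketch below does simplified capacity accounting for a single GPU under different MIG layouts. It is an illustrative model only: it considers memory alone, assumes one pod per MIG instance, and ignores placement rules and mixed layouts. The `MigLayout` type and function names are hypothetical, and the example layouts in the usage note should be checked against NVIDIA's MIG documentation for the GPU in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigLayout:
    """A uniform MIG layout: every instance on the GPU has the same shape."""
    name: str
    instances_per_gpu: int
    memory_gb_per_instance: int


def pods_per_gpu(layout: MigLayout, pod_memory_gb: int) -> int:
    """One pod per instance; a pod fits only if the instance has enough memory."""
    if pod_memory_gb <= layout.memory_gb_per_instance:
        return layout.instances_per_gpu
    return 0


def unused_memory_gb(layout: MigLayout, pod_memory_gb: int) -> int:
    """Instance memory that no pod can use under this layout."""
    total = layout.instances_per_gpu * layout.memory_gb_per_instance
    return total - pods_per_gpu(layout, pod_memory_gb) * pod_memory_gb
```

For a hypothetical 8 GB inference pod, `MigLayout("1g.5gb", 7, 5)` fits no pods, `MigLayout("2g.10gb", 3, 10)` fits three with 6 GB unused, and `MigLayout("3g.20gb", 2, 20)` fits two with 24 GB unused. The same hardware yields very different usable capacity depending on the partition plan.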

Watch:

  • Xid errors.
  • ECC corrected/uncorrected errors.
  • GPU reset events.
  • DCGM health checks.
  • Temperature/power throttling.
  • Memory utilization and fragmentation symptoms.
  • PCIe/NVLink issues where relevant.
  • Workload-level failures correlated to node/GPU ID.
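
Xid events appear in the kernel log, so a first-pass correlation to a specific GPU can be done by counting them per PCI address. The sketch below assumes log lines of the form `NVRM: Xid (PCI:<address>): <code>, ...`; the exact format can vary by driver version, so treat the regular expression as an assumption to validate against real logs.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed kernel log shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+)")


def xid_counts(lines: Iterable[str]) -> Counter:
    """Count Xid events keyed by (PCI address, Xid code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group("pci"), int(match.group("xid")))] += 1
    return counts
```

Feeding it the output of `dmesg` or the node's journal gives a per-GPU tally that can be joined with workload failures by node and GPU ID.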

Healthy lifecycle:

  1. Provision node.
  2. Install/validate GPU stack.
  3. Label and taint appropriately.
  4. Run test workload.
  5. Admit production workloads.
  6. Continuously monitor.
  7. Quarantine on suspect health.
  8. Drain safely.
  9. Repair or escalate to hardware workflow.
  10. Validate before return.
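
The lifecycle above reads naturally as a state machine in which a node can only return to service through validation. The sketch below is one possible encoding. The state names and the `transition` helper are hypothetical, and the edge from validation straight to quarantine on a failed check is an added assumption, not part of the list above.

```python
from enum import Enum


class NodeState(Enum):
    PROVISIONED = "provisioned"
    VALIDATING = "validating"    # GPU stack installed, test workload running
    IN_SERVICE = "in_service"    # admitted for production, monitored
    QUARANTINED = "quarantined"  # suspect health, no new work
    DRAINED = "drained"
    IN_REPAIR = "in_repair"      # repair or hardware escalation


ALLOWED = {
    NodeState.PROVISIONED: {NodeState.VALIDATING},
    # Assumption: a failed validation sends the node to quarantine.
    NodeState.VALIDATING: {NodeState.IN_SERVICE, NodeState.QUARANTINED},
    NodeState.IN_SERVICE: {NodeState.QUARANTINED},
    NodeState.QUARANTINED: {NodeState.DRAINED},
    NodeState.DRAINED: {NodeState.IN_REPAIR},
    # A repaired node must be validated again before returning to service.
    NodeState.IN_REPAIR: {NodeState.VALIDATING},
}


def transition(current: NodeState, target: NodeState) -> NodeState:
    """Return the new state, or raise if the lifecycle does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Encoding the allowed edges explicitly makes shortcuts, such as returning a repaired node directly to service, fail loudly instead of silently.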

Question: “After a driver update, half the GPU nodes no longer run inference pods.”

Answer:

  • Halt the rollout immediately.
  • Identify the affected node pools and their OS image, kernel, driver, and operator versions.
  • Check GPU Operator operand health and validator results.
  • Confirm device plugin is advertising GPUs.
  • Compare driver/CUDA compatibility with runtime images.
  • Inspect kubelet/container runtime logs and kernel Xid events.
  • Cordon affected nodes and shift traffic if needed.
  • Roll back driver/operator/node image depending on evidence.
  • Add a pre-promotion gate using a representative inference pod, DCGM checks, and version compatibility validation.
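
When only half the nodes fail, the fastest evidence is usually which attribute separates failing nodes from healthy ones. The sketch below computes the failure rate per value of one attribute. The node records and their field names (`kernel`, `driver`, `os_image`, `gpu_allocatable`) are hypothetical and assumed to come from an inventory or from node status collected elsewhere.

```python
from collections import defaultdict


def failure_rate_by(nodes: list[dict], key: str) -> dict:
    """Map each value of `key` to the fraction of nodes with no allocatable GPUs."""
    tallies = defaultdict(lambda: [0, 0])  # value -> [failing, total]
    for node in nodes:
        tally = tallies[node[key]]
        tally[1] += 1
        if not node["gpu_allocatable"]:
            tally[0] += 1
    return {value: failing / total for value, (failing, total) in tallies.items()}
```

If the rate is near 1.0 for one kernel or OS image and near 0.0 for the others, the rollback target is the node image or the driver build for that kernel rather than the operator as a whole.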