GPU Operator Runbook
What To Know
NVIDIA’s GPU Operator automates the Kubernetes GPU software stack: drivers, NVIDIA Container Toolkit, device plugin, GPU Feature Discovery, DCGM monitoring, validators, and optional MIG management.
Interview thesis:
I prefer operator-managed GPU infrastructure because the risk is not only installing drivers; it is keeping driver, kernel, runtime, device plugin, telemetry, labels, and validation in a known-good state across a fleet.
Core Components
| Component | Role |
|---|---|
| Driver | Kernel/user-space GPU driver enabling CUDA workloads. |
| NVIDIA Container Toolkit | Makes GPUs available to containers through the runtime. |
| Device plugin | Advertises GPU resources to kubelet and allocates devices to pods. |
| GPU Feature Discovery | Labels nodes with GPU and hardware capabilities. |
| DCGM exporter | Exposes GPU telemetry for monitoring. |
| MIG manager | Manages MIG partitioning on supported GPUs. |
| Validators | Confirm stack readiness after deployment. |
flowchart TD
  GPUNode[GPU node] --> Driver[NVIDIA driver]
  Driver --> Runtime[NVIDIA container runtime]
  Runtime --> DevicePlugin[Device plugin]
  DevicePlugin --> APIServer[Kubernetes allocatable GPUs]
  GPUNode --> DCGM[DCGM exporter]
  GPUNode --> GFD[GPU Feature Discovery]
  GPUNode --> MIG[MIG manager]
  DCGM --> Prometheus[Prometheus]
  GFD --> Scheduler[Scheduler placement]
  MIG --> DevicePlugin
Install/Rollout Strategy
For a production fleet:
- Segment nodes by environment, SKU, kernel, OS image, and workload class.
- Roll out operator changes to a canary pool.
- Run validator pods and a representative CUDA/inference test workload.
- Gate on device plugin resources, DCGM metrics, workload readiness, and absence of Xid/ECC spikes.
- Promote pool by pool.
- Keep a rollback plan for the operator version, driver container, node image, and workload runtime image.
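The gating step above can be written down as an explicit check, so that promotion is a decision on recorded evidence rather than a judgment call. This is a minimal sketch: `CanaryReport`, its fields, and the zero-tolerance thresholds are hypothetical names and values chosen for illustration, not part of the GPU Operator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryReport:
    """Observations from one canary pool (hypothetical fields for illustration)."""
    nodes_total: int
    nodes_advertising_gpus: int   # nodes with nvidia.com/gpu allocatable > 0
    nodes_with_dcgm_metrics: int  # nodes whose DCGM exporter is being scraped
    test_workload_passed: bool    # representative CUDA/inference job succeeded
    xid_events: int               # Xid errors seen during the soak window
    ecc_uncorrected: int          # uncorrected ECC errors during the soak window


def promotion_blockers(report: CanaryReport, max_xid_events: int = 0) -> List[str]:
    """Return the reasons a canary pool must not be promoted (empty list = promote)."""
    blockers = []
    if report.nodes_advertising_gpus < report.nodes_total:
        blockers.append("device plugin not advertising GPUs on every node")
    if report.nodes_with_dcgm_metrics < report.nodes_total:
        blockers.append("DCGM metrics missing on some nodes")
    if not report.test_workload_passed:
        blockers.append("representative workload failed")
    if report.xid_events > max_xid_events:
        blockers.append("Xid events above threshold")
    if report.ecc_uncorrected > 0:
        blockers.append("uncorrected ECC errors observed")
    return blockers


healthy = CanaryReport(4, 4, 4, True, 0, 0)
degraded = CanaryReport(4, 3, 4, True, 2, 0)
print(promotion_blockers(healthy))
print(promotion_blockers(degraded))
```

Returning the list of reasons, rather than a bare boolean, keeps the evidence attached to the decision when a rollout is halted.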
Debug: GPUs Not Advertised
Checklist:
- Node has expected GPU hardware visible at OS level.
- Node labels identify the NVIDIA PCI vendor (ID `10de`) and the GPU SKU.
- GPU Operator operands are scheduled on the node.
- Driver pod is healthy.
- Device plugin pod is healthy.
- Kubelet reports `nvidia.com/gpu` allocatable.
- No taints block operator operands.
- Node is not explicitly labeled to skip GPU operands (`nvidia.com/gpu.deploy.operands=false`).
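Several items in this checklist can be read straight from the node object. The sketch below assumes the input is the parsed output of `kubectl get node <name> -o json`; the function name and the sample nodes are hypothetical. A taint is reported only as something to verify against the operands' tolerations, since a taint by itself does not prove the operands are blocked.

```python
from typing import List

SKIP_LABEL = "nvidia.com/gpu.deploy.operands"  # value "false" opts the node out of operands
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_advertising_findings(node: dict) -> List[str]:
    """Inspect a parsed Node object and list reasons GPUs may not be advertised."""
    labels = node.get("metadata", {}).get("labels") or {}
    taints = node.get("spec", {}).get("taints") or []
    allocatable = node.get("status", {}).get("allocatable") or {}

    findings = []
    if labels.get(SKIP_LABEL) == "false":
        findings.append("node is labeled to skip GPU operands")
    blocking = sorted(
        t["key"] for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")
    )
    if blocking:
        findings.append("check operand tolerations for taints: " + ", ".join(blocking))
    if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
        findings.append("kubelet reports no allocatable " + GPU_RESOURCE)
    return findings


# Hypothetical node objects, trimmed to the fields the check reads.
broken_node = {
    "metadata": {"name": "gpu-node-1", "labels": {SKIP_LABEL: "false"}},
    "spec": {"taints": [{"key": "dedicated", "value": "ml", "effect": "NoSchedule"}]},
    "status": {"allocatable": {"cpu": "64"}},
}
healthy_node = {
    "metadata": {"name": "gpu-node-2", "labels": {}},
    "spec": {},
    "status": {"allocatable": {"cpu": "64", GPU_RESOURCE: "8"}},
}

for finding in gpu_advertising_findings(broken_node):
    print(finding)
```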
Debug: Workload Cannot Use GPU
Checklist:
- Pod requests `nvidia.com/gpu`.
- Pod lands on GPU node.
- Device plugin allocation succeeds.
- Runtime class/toolkit exposes devices.
- Container image CUDA version is compatible with driver.
- `nvidia-smi` works in debug container or equivalent.
- Application sees CUDA device and expected memory.
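The first item is the cheapest to rule out, because a pod that never asks for `nvidia.com/gpu` is never allocated a device. A minimal sketch, assuming the input is a parsed pod manifest (for example from `kubectl get pod <name> -o json`); the function name and the sample pod are hypothetical. It reads `limits` because extended resources such as GPUs are specified there, with the request defaulting to the limit.

```python
from typing import Dict

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_limits(pod: dict) -> Dict[str, int]:
    """Map each container name to the number of GPUs it asks for (0 if none)."""
    result = {}
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        result[container["name"]] = int(limits.get(GPU_RESOURCE, 0))
    return result


# Hypothetical pod: the sidecar asks for no GPU, the server asks for one.
sample_pod = {
    "metadata": {"name": "inference-0"},
    "spec": {
        "containers": [
            {"name": "server", "resources": {"limits": {GPU_RESOURCE: "1"}}},
            {"name": "metrics-sidecar", "resources": {}},
        ]
    },
}

print(gpu_limits(sample_pod))
```

If every value is zero, the remaining checks in the list are moot until the pod spec is fixed.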
MIG, Time-Slicing, MPS
Do not blur these:
- MIG: hardware-level partitioning on supported GPUs, stronger isolation and predictable slices.
- Time-slicing: multiple workloads share a GPU over time; useful for low-duty workloads, weaker isolation.
- MPS: CUDA Multi-Process Service; lets multiple CUDA processes run concurrently on one GPU, which can improve utilization for compatible CUDA workloads.
Senior tradeoff:
For latency-sensitive inference, I would be careful with time-slicing unless the workload is proven compatible. MIG gives clearer capacity accounting and isolation where supported, but it introduces partition planning and can reduce flexibility if shapes are wrong.
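The "shapes are wrong" failure mode can be made concrete with simple accounting. The sketch below assumes MIG instances are exposed as separately named resources (as in the mixed strategy, for example `nvidia.com/mig-1g.5gb`), so a pod asking for one profile cannot use an idle instance of another. The supply and demand figures are made up for illustration.

```python
from collections import Counter
from typing import Tuple


def mig_shape_mismatch(supply: Counter, demand: Counter) -> Tuple[Counter, Counter]:
    """Return (pending requests, idle instances) per MIG resource name."""
    # Counter subtraction keeps only positive counts, which is the behavior wanted here.
    pending = demand - supply
    idle = supply - demand
    return pending, idle


# Hypothetical pool: partitioned for large slices, but most pods want small ones.
supply = Counter({"nvidia.com/mig-3g.20gb": 4})
demand = Counter({"nvidia.com/mig-1g.5gb": 6, "nvidia.com/mig-3g.20gb": 1})

pending, idle = mig_shape_mismatch(supply, demand)
print("pending:", dict(pending))
print("idle:", dict(idle))
```

Here three large instances sit idle while six small requests stay pending, which is the capacity cost of partitioning for the wrong shape.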
GPU Health Signals
Watch:
- Xid errors.
- ECC corrected/uncorrected errors.
- GPU reset events.
- DCGM health checks.
- Temperature/power throttling.
- Memory utilization and fragmentation symptoms.
- PCIe/NVLink issues where relevant.
- Workload-level failures correlated to node/GPU ID.
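Correlating signals to a specific GPU starts with extracting a stable key from the raw events. As one illustration, the sketch below counts Xid events per PCI address from kernel log text. The sample lines are synthetic, shaped like `NVRM: Xid` kernel messages; the exact text after the Xid code varies by driver version, so the pattern deliberately matches only the PCI address and the code.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the PCI address and numeric Xid code; ignores the trailing message text.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def count_xids(lines: Iterable[str]) -> Counter:
    """Count Xid events keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Synthetic log lines for illustration only.
sample_log = [
    "[ 1042.11] NVRM: Xid (PCI:0000:3b:00): 79, pid=4121, <message text>",
    "[ 1050.92] NVRM: Xid (PCI:0000:3b:00): 79, pid=4121, <message text>",
    "[ 2210.40] NVRM: Xid (PCI:0000:af:00): 31, pid=977, <message text>",
    "[ 2300.00] usb 1-1: new high-speed USB device",
]

xid_counts = count_xids(sample_log)
for (pci, xid), n in sorted(xid_counts.items()):
    print(pci, xid, n)
```

Keying by PCI address keeps repeated errors on one device distinguishable from a fleet-wide pattern.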
Node Lifecycle
Healthy lifecycle:
- Provision node.
- Install/validate GPU stack.
- Label and taint appropriately.
- Run test workload.
- Admit production workloads.
- Continuously monitor.
- Quarantine on suspect health.
- Drain safely.
- Repair or escalate to hardware workflow.
- Validate before return.
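The lifecycle above is effectively a state machine, and encoding the allowed transitions makes the "validate before return" rule enforceable. This is a sketch under assumed state names; the states and the transition table are a hypothetical simplification of the list, not an operator feature.

```python
from enum import Enum


class NodeState(Enum):
    PROVISIONED = "provisioned"
    VALIDATING = "validating"    # GPU stack install, validators, test workload
    ADMITTED = "admitted"        # running production workloads, monitored
    QUARANTINED = "quarantined"  # suspect health, no new workloads
    DRAINING = "draining"
    IN_REPAIR = "in_repair"      # repair or hardware escalation


ALLOWED = {
    NodeState.PROVISIONED: {NodeState.VALIDATING},
    NodeState.VALIDATING: {NodeState.ADMITTED, NodeState.QUARANTINED},
    NodeState.ADMITTED: {NodeState.QUARANTINED},
    NodeState.QUARANTINED: {NodeState.DRAINING},
    NodeState.DRAINING: {NodeState.IN_REPAIR},
    # A repaired node must pass validation again before it is admitted.
    NodeState.IN_REPAIR: {NodeState.VALIDATING},
}


def transition(current: NodeState, target: NodeState) -> NodeState:
    """Return the target state if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = transition(NodeState.IN_REPAIR, NodeState.VALIDATING)
state = transition(state, NodeState.ADMITTED)
print(state.value)
```

Attempting `transition(NodeState.IN_REPAIR, NodeState.ADMITTED)` raises `ValueError`, which is the point: there is no path back to production that skips validation.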
Interview Drill
Question: “After a driver update, half the GPU nodes no longer run inference pods.”
Answer:
- Halt the rollout immediately.
- Identify affected node pools, OS image, kernel, driver, operator version.
- Check GPU Operator operand health and validator results.
- Confirm device plugin is advertising GPUs.
- Compare driver/CUDA compatibility with runtime images.
- Inspect kubelet/container runtime logs and kernel Xid events.
- Cordon affected nodes and shift traffic if needed.
- Roll back driver/operator/node image depending on evidence.
- Add a pre-promotion gate using a representative inference pod, DCGM checks, and version compatibility validation.
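When only half the nodes fail, the fastest way to decide what to roll back is to find the attribute that separates failing nodes from healthy ones. The sketch below computes a failure rate per attribute value; the node records, field names, and version strings are hypothetical inventory data for illustration.

```python
from collections import Counter
from typing import Dict, List


def failure_rate_by(nodes: List[dict], key: str) -> Dict[str, float]:
    """Fraction of nodes failing, grouped by the value of one attribute."""
    totals, failures = Counter(), Counter()
    for node in nodes:
        totals[node[key]] += 1
        if not node["gpu_ok"]:
            failures[node[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


# Hypothetical inventory after the update: same driver everywhere, two kernels.
nodes = [
    {"name": "n1", "kernel": "kernel-A", "driver": "driver-X", "gpu_ok": True},
    {"name": "n2", "kernel": "kernel-A", "driver": "driver-X", "gpu_ok": True},
    {"name": "n3", "kernel": "kernel-B", "driver": "driver-X", "gpu_ok": False},
    {"name": "n4", "kernel": "kernel-B", "driver": "driver-X", "gpu_ok": False},
]

by_kernel = failure_rate_by(nodes, "kernel")
by_driver = failure_rate_by(nodes, "driver")
print(by_kernel)
print(by_driver)
```

In this made-up data the driver version explains nothing (every node runs it and half fail), while the kernel splits the fleet cleanly, which points at a driver and kernel combination rather than the driver alone.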