GPU Operator Runbook

What To Know

NVIDIA’s GPU Operator automates the Kubernetes GPU software stack: drivers, NVIDIA Container Toolkit, device plugin, GPU Feature Discovery, DCGM monitoring, validators, and optional MIG management.

Interview thesis:

I prefer operator-managed GPU infrastructure because the risk is not only installing drivers; it is keeping driver, kernel, runtime, device plugin, telemetry, labels, and validation in a known-good state across a fleet.

Core Components

Component	Role
Driver	Kernel/user-space GPU driver enabling CUDA workloads.
NVIDIA Container Toolkit	Makes GPUs available to containers through the runtime.
Device plugin	Advertises GPU resources to kubelet and allocates devices to pods.
GPU Feature Discovery	Labels nodes with GPU and hardware capabilities.
DCGM exporter	Exposes GPU telemetry for monitoring.
MIG manager	Manages MIG partitioning on supported GPUs.
Validators	Confirm stack readiness after deployment.

flowchart TD
  GPUNode[GPU node] --> Driver[NVIDIA driver]
  Driver --> Runtime[NVIDIA container runtime]
  Runtime --> DevicePlugin[Device plugin]
  DevicePlugin --> APIServer[Kubernetes allocatable GPUs]
  GPUNode --> DCGM[DCGM exporter]
  GPUNode --> GFD[GPU Feature Discovery]
  GPUNode --> MIG[MIG manager]
  DCGM --> Prometheus[Prometheus]
  GFD --> Scheduler[Scheduler placement]
  MIG --> DevicePlugin

Install/Rollout Strategy

For a production fleet:

Segment nodes by environment, SKU, kernel, OS image, and workload class.
Roll out operator changes to a canary pool.
Run validator pods and a representative CUDA/inference test workload.
Gate on device plugin resources, DCGM metrics, workload readiness, and absence of Xid/ECC spikes.
Promote pool by pool.
Keep a rollback plan for the operator version, driver container, node image, and workload runtime image.
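The gating step above can be sketched as a small promotion check. This is a minimal illustration, not an operator feature: the CanaryResult fields, the zero-tolerance thresholds, and the function names are all hypothetical, and the inputs are assumed to have been collected already from kubelet, DCGM, and the test workload.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    """Observations from one canary node (all field names are hypothetical)."""
    node: str
    allocatable_gpus: int       # nvidia.com/gpu reported by kubelet
    expected_gpus: int          # GPUs this SKU should expose
    dcgm_metrics_present: bool  # DCGM exporter series seen for this node
    test_workload_ok: bool      # representative CUDA/inference test passed
    new_xid_events: int         # Xid events since the rollout started
    new_ecc_uncorrected: int    # uncorrected ECC errors since the rollout started


def gate_failures(result: CanaryResult) -> list[str]:
    """Return the reasons this node fails the gate; an empty list means pass."""
    failures = []
    if result.allocatable_gpus < result.expected_gpus:
        failures.append(
            f"allocatable {result.allocatable_gpus} < expected {result.expected_gpus}"
        )
    if not result.dcgm_metrics_present:
        failures.append("no DCGM metrics")
    if not result.test_workload_ok:
        failures.append("test workload failed")
    if result.new_xid_events > 0:
        failures.append(f"{result.new_xid_events} new Xid events")
    if result.new_ecc_uncorrected > 0:
        failures.append(f"{result.new_ecc_uncorrected} new uncorrected ECC errors")
    return failures


def can_promote(results: list[CanaryResult]) -> tuple[bool, dict[str, list[str]]]:
    """Promote only if every canary node passes every check."""
    blocked = {}
    for result in results:
        failures = gate_failures(result)
        if failures:
            blocked[result.node] = failures
    # An empty canary set proves nothing, so it never promotes.
    return (bool(results) and not blocked, blocked)
```

Returning the per-node reasons alongside the verdict keeps the gate useful as a debugging aid, not only as a pass/fail switch.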

Debug: GPUs Not Advertised

Checklist:

Node has expected GPU hardware visible at OS level.
Node labels identify the NVIDIA PCI vendor (ID 10de) and the GPU SKU.
GPU Operator operands are scheduled on the node.
Driver pod is healthy.
Device plugin pod is healthy.
Kubelet reports nvidia.com/gpu allocatable.
No taints block operator operands.
Node is not explicitly labeled to skip GPU operands (for example, nvidia.com/gpu.deploy.operands=false).
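Several of these checks can be run against the JSON of a single node object, such as the output of kubectl get node NAME -o json. The sketch below assumes that input shape. The two label names are assumptions to confirm against the Node Feature Discovery and GPU Operator versions in use, and the function name is hypothetical. It does not check pod health, which needs the operand pods themselves.

```python
GPU_RESOURCE = "nvidia.com/gpu"
# Assumed label names; confirm them for your NFD and GPU Operator versions.
PCI_VENDOR_LABEL = "feature.node.kubernetes.io/pci-10de.present"
SKIP_OPERANDS_LABEL = "nvidia.com/gpu.deploy.operands"


def advertise_findings(node: dict) -> list[str]:
    """Return checklist findings for one node object; empty means nothing found."""
    labels = node.get("metadata", {}).get("labels") or {}
    taints = node.get("spec", {}).get("taints") or []
    allocatable = node.get("status", {}).get("allocatable") or {}

    findings = []
    if labels.get(PCI_VENDOR_LABEL) != "true":
        findings.append("missing NVIDIA PCI vendor label")
    if labels.get(SKIP_OPERANDS_LABEL) == "false":
        findings.append("node labeled to skip GPU operands")

    # Taints are not wrong by themselves; they only matter if the operands
    # lack matching tolerations, so report them for manual comparison.
    blocking = [
        t.get("key", "")
        for t in taints
        if t.get("effect") in ("NoSchedule", "NoExecute")
    ]
    if blocking:
        findings.append(
            "taints to check against operand tolerations: "
            + ", ".join(sorted(blocking))
        )

    # Kubernetes serializes resource quantities as strings, e.g. "8".
    if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
        findings.append("kubelet reports no allocatable nvidia.com/gpu")
    return findings
```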

Debug: Workload Cannot Use GPU

Checklist:

Pod requests nvidia.com/gpu.
Pod lands on GPU node.
Device plugin allocation succeeds.
Runtime class and Container Toolkit expose the GPU devices inside the container.
Container image CUDA version is compatible with the node's driver version.
nvidia-smi works in a debug container or equivalent.
Application sees CUDA device and expected memory.
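The first two items can be checked from the pod object alone. A minimal sketch, assuming the pod is available as JSON (for example from kubectl get pod NAME -o json) and that the set of GPU node names was gathered separately. The function names are hypothetical. It reads limits because Kubernetes requires extended resources such as nvidia.com/gpu to be set there.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def requested_gpus(pod: dict) -> int:
    """Sum the nvidia.com/gpu limits across the pod's regular containers."""
    total = 0
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits") or {}
        total += int(limits.get(GPU_RESOURCE, 0))
    return total


def placement_findings(pod: dict, gpu_node_names: set[str]) -> list[str]:
    """Return findings about the GPU request and where the pod landed."""
    findings = []
    if requested_gpus(pod) == 0:
        findings.append("no container sets a nvidia.com/gpu limit")

    node_name = pod.get("spec", {}).get("nodeName")
    if node_name is None:
        findings.append("pod is not scheduled yet")
    elif node_name not in gpu_node_names:
        findings.append(f"pod landed on non-GPU node {node_name}")
    return findings
```

The remaining items (allocation, runtime exposure, CUDA compatibility, nvidia-smi) need evidence from inside the node or container and are out of scope for this check.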

MIG, Time-Slicing, MPS

Do not blur these:

MIG: hardware-level partitioning on supported GPUs, stronger isolation and predictable slices.
Time-slicing: multiple workloads share a GPU over time; useful for low-duty workloads, weaker isolation.
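MPS: CUDA Multi-Process Service; lets multiple processes submit work to one GPU concurrently, which can improve utilization for compatible CUDA workloads but gives weaker isolation than MIG.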

Senior tradeoff:

For latency-sensitive inference, I would be careful with time-slicing unless the workload is proven compatible. MIG gives clearer capacity accounting and isolation where supported, but it introduces partition planning and can reduce flexibility if shapes are wrong.
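The capacity-accounting concern with time-slicing can be made concrete with simple arithmetic. This sketch assumes the device plugin advertises one schedulable resource per replica per physical GPU, and that time-slicing enforces no per-replica memory limit. The function and its parameters are hypothetical, and per_pod_peak_gib is an estimate the workload owner has to supply.

```python
def time_slicing_report(
    physical_gpus: int,
    replicas: int,
    gpu_memory_gib: float,
    per_pod_peak_gib: float,
) -> dict:
    """Compare what the scheduler will see with what one GPU can actually hold."""
    # What the scheduler counts: replicas multiply the advertised resources.
    advertised = physical_gpus * replicas
    # Worst case on a single GPU: every replica peaks at the same time.
    worst_case_gib = replicas * per_pod_peak_gib
    return {
        "advertised_resources": advertised,
        "worst_case_memory_gib": worst_case_gib,
        "memory_oversubscribed": worst_case_gib > gpu_memory_gib,
    }
```

For example, 4 replicas of a workload peaking at 12 GiB on a 40 GiB GPU come to 48 GiB in the worst case, so the scheduler would accept a placement that the GPU cannot guarantee.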

GPU Health Signals

Watch:

Xid errors.
ECC corrected/uncorrected errors.
GPU reset events.
DCGM health checks.
Temperature/power throttling.
Memory utilization and fragmentation symptoms.
PCIe/NVLink issues where relevant.
Workload-level failures correlated to node/GPU ID.
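One way to turn these signals into an action is a small classifier per GPU. The sketch below is an assumption-heavy illustration: the signal names, the hard/soft split, and the corrected-ECC threshold are hypothetical and would need tuning against real fleet data. Xid codes in particular vary in severity, so a production version would treat them individually.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    WATCH = "watch"
    QUARANTINE = "quarantine"


# Hypothetical threshold; pick a value from your own fleet's baseline.
CORRECTED_ECC_WATCH_THRESHOLD = 100


def classify_gpu(signals: dict) -> Action:
    """Map a dict of per-GPU health signals (hypothetical keys) to an action."""
    # Hard signals: treated here as enough to pull the GPU from service.
    hard = (
        signals.get("ecc_uncorrected", 0) > 0
        or signals.get("gpu_resets", 0) > 0
        or not signals.get("dcgm_health_ok", True)
    )
    if hard:
        return Action.QUARANTINE

    # Soft signals: worth correlating with workload failures before acting.
    soft = (
        signals.get("xid_events", 0) > 0
        or signals.get("throttled", False)
        or signals.get("ecc_corrected", 0) >= CORRECTED_ECC_WATCH_THRESHOLD
    )
    return Action.WATCH if soft else Action.OK
```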

Node Lifecycle

Healthy lifecycle:

Provision node.
Install/validate GPU stack.
Label and taint appropriately.
Run test workload.
Admit production workloads.
Continuously monitor.
Quarantine on suspect health.
Drain safely.
Repair or escalate to hardware workflow.
Validate before return.
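The lifecycle above can be modeled as a small state machine so that tooling refuses shortcuts, such as admitting a repaired node without revalidation. This is a sketch: the state and event names are hypothetical, and it collapses some steps (test workload and admission) into a single transition.

```python
from enum import Enum, auto


class State(Enum):
    PROVISIONED = auto()
    VALIDATED = auto()
    LABELED = auto()
    IN_SERVICE = auto()
    QUARANTINED = auto()
    DRAINED = auto()
    IN_REPAIR = auto()


# (current state, event) -> next state. Anything not listed is illegal.
TRANSITIONS = {
    (State.PROVISIONED, "stack_validated"): State.VALIDATED,
    (State.VALIDATED, "labeled_and_tainted"): State.LABELED,
    (State.LABELED, "test_passed"): State.IN_SERVICE,
    (State.IN_SERVICE, "health_suspect"): State.QUARANTINED,
    (State.QUARANTINED, "drained"): State.DRAINED,
    (State.DRAINED, "repair_started"): State.IN_REPAIR,
    # A repaired node re-enters at the start, so it must be validated again.
    (State.IN_REPAIR, "repaired"): State.PROVISIONED,
}


def advance(state: State, event: str) -> State:
    """Apply one event, raising ValueError on a transition that is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {event!r} from {state.name}"
        ) from None
```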

Interview Drill

Question: “After a driver update, half the GPU nodes no longer run inference pods.”

Answer:

Halt the rollout immediately.
Identify the affected node pools and their OS image, kernel, driver, and operator versions.
Check GPU Operator operand health and validator results.
Confirm device plugin is advertising GPUs.
Compare driver/CUDA compatibility with runtime images.
Inspect kubelet/container runtime logs and kernel Xid events.
Cordon affected nodes and shift traffic if needed.
Roll back driver/operator/node image depending on evidence.
Add a pre-promotion gate using a representative inference pod, DCGM checks, and version compatibility validation.
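When only half the fleet fails, the fastest evidence is usually which attribute separates the failing nodes from the healthy ones. A minimal sketch, assuming a node inventory has already been collected into dicts. The keys (kernel, driver, healthy) and the function name are hypothetical.

```python
from collections import Counter


def failure_rate_by(nodes: list[dict], attribute: str) -> dict[str, float]:
    """Return the fraction of unhealthy nodes for each value of one attribute."""
    totals, failed = Counter(), Counter()
    for node in nodes:
        value = node[attribute]
        totals[value] += 1
        if not node["healthy"]:
            failed[value] += 1
    return {value: failed[value] / totals[value] for value in totals}
```

An attribute whose values split cleanly into rates near 1.0 and 0.0 (for example, one kernel version failing everywhere) points at the incompatibility to roll back. Rates that look the same across every value suggest looking at a different attribute.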