Kubernetes and GPUs

GPU workloads are not ordinary stateless pods:

  • Resources are expensive and scarce.
  • Device access depends on drivers, runtime hooks, device plugins, and node labels.
  • Failures can occur below what Kubernetes can see, for example driver faults, Xid errors, or ECC errors on the host.
  • Warmup and model load time make rescheduling costly.
  • GPU fragmentation can strand capacity.

GPU pod allocation path (Mermaid diagram source):

flowchart LR
  Pod[GPU pod requests nvidia.com/gpu] --> Scheduler[Scheduler filters nodes]
  Scheduler --> DevicePlugin[NVIDIA device plugin advertises capacity]
  DevicePlugin --> Kubelet[Kubelet allocates device]
  Kubelet --> Runtime[NVIDIA container runtime]
  Runtime --> Container[Container sees GPU libraries]
  Container --> GPU[GPU execution]

| Concept | Why it matters |
|---|---|
| Extended resources | GPUs are advertised as resources like nvidia.com/gpu; the scheduler treats them as integer resources unless a plugin exposes finer partitioning. |
| Device plugin | Reports GPU devices to kubelet and injects allocation details into containers. |
| Taints/tolerations | Keep GPU nodes reserved for compatible workloads or quarantine unhealthy nodes. |
| Node labels | Encode SKU, driver, topology, MIG mode, region, pool, and workload eligibility. |
| Affinity | Places workloads near required hardware or away from conflicting tenants. |
| Pod disruption budget | Protects serving availability during voluntary disruption. |
| Priority class | Allows critical inference workloads to preempt lower-priority jobs if policy allows. |
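
To make the concepts above concrete, here is a minimal sketch that builds a Pod manifest as a Python dict. It combines an integer nvidia.com/gpu limit, a toleration, a node selector, and a priority class. The label key example.com/gpu-pool, the pool name, the taint key, and the PriorityClass name inference-critical are all assumptions for illustration; substitute whatever your cluster actually uses.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, pool="a100-inference"):
    """Build a Pod manifest (as a dict) that requests whole GPUs.

    Extended resources such as nvidia.com/gpu are requested as integers,
    so fractional or non-positive values are rejected here.
    """
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("GPU count must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical PriorityClass name; it must exist in the cluster.
            "priorityClassName": "inference-critical",
            # Hypothetical node label identifying the GPU pool.
            "nodeSelector": {"example.com/gpu-pool": pool},
            # Assumed taint key on GPU nodes; adjust to your taint scheme.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # GPUs go under limits; the request defaults to the limit.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("llm-server", "registry.example.com/llm:latest", gpus=2)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to kubectl apply -f - in a test cluster. The image reference is a placeholder.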

Know the moving parts:

  • Driver daemonset
  • NVIDIA container toolkit
  • Kubernetes device plugin
  • DCGM exporter
  • Node feature discovery
  • MIG manager when applicable
  • Validator pods

Interview line:

If a GPU pod cannot start, I do not stop at kubectl describe pod. I verify the device plugin, runtime class, driver/toolkit compatibility, kubelet events, node labels/taints, and host-level GPU health.

Steps to debug a GPU pod stuck in Pending:

  1. Run kubectl describe pod and read the unschedulable reason in the events.
  2. Check the requested GPU count, memory, CPU, node selectors, affinity, and tolerations.
  3. Check each node's allocatable GPUs against its current allocations.
  4. Look for fragmentation: enough free GPUs in total, but not enough on any single eligible node.
  5. Validate that labels and taints match the intended pool.
  6. Check autoscaler behavior and any pending scale-up.
  7. Decide: adjust placement, add capacity, drain/defragment, or fix labels/taints.
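
The fragmentation check in steps 3 and 4 can be sketched as a small pure function. This is an illustration under assumptions: the input shape (a list of dicts with name, allocatable, and allocated GPU counts) is hypothetical and would in practice be assembled from node status and the GPU requests of running pods.

```python
def diagnose_gpu_fit(nodes, requested):
    """Classify whether a request for `requested` GPUs can be placed.

    nodes: list of dicts with 'name', 'allocatable', and 'allocated' GPU counts
           (hypothetical shape, restricted to nodes the pod is eligible for).
    Returns a verdict of 'fits', 'fragmented', or 'insufficient'.
    """
    free = {n["name"]: n["allocatable"] - n["allocated"] for n in nodes}
    # A pod's GPUs must all come from one node, so check per-node free counts.
    fitting = sorted(name for name, f in free.items() if f >= requested)
    total_free = sum(max(f, 0) for f in free.values())
    if fitting:
        verdict = "fits"
    elif total_free >= requested:
        # Enough GPUs exist across the pool, but no single node has enough.
        verdict = "fragmented"
    else:
        verdict = "insufficient"
    return {"verdict": verdict, "fitting_nodes": fitting, "total_free": total_free}


pool = [
    {"name": "gpu-a", "allocatable": 8, "allocated": 6},
    {"name": "gpu-b", "allocatable": 8, "allocated": 5},
]
print(diagnose_gpu_fit(pool, requested=4))
```

In this example the pool has five free GPUs in total but at most three on one node, so a four-GPU pod is stranded by fragmentation rather than by raw capacity.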

Pending-pod triage (Mermaid diagram source):

flowchart TD
  Pending[GPU pod pending] --> Events[kubectl describe events]
  Events --> Scarce{Insufficient GPU}
  Scarce -- yes --> Capacity[Check allocatable, fragmentation, autoscaler]
  Scarce -- no --> Constraints{Node selectors, affinity, taints}
  Constraints -- yes --> Placement[Fix placement contract]
  Constraints -- no --> Policy{Quota or admission}
  Policy -- yes --> Quota[Fix quota or policy]
  Policy -- no --> Scheduler[Inspect scheduler and device plugin health]
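
The decision tree above can be approximated by matching substrings in the event message, in the same order as the diagram. This is a rough sketch: the substrings are assumptions based on commonly seen scheduler and admission messages, the exact wording varies across Kubernetes versions, and quota rejections often appear on the owning controller's events rather than on the pod itself.

```python
def classify_pending(message):
    """Map an event message to a branch of the pending-pod triage tree."""
    msg = message.lower()
    # Branch 1: not enough GPUs on any candidate node.
    if "insufficient nvidia.com/gpu" in msg:
        return "capacity"
    # Branch 2: placement constraints (selectors, affinity, taints).
    placement_hints = ("untolerated taint", "node affinity", "node selector")
    if any(hint in msg for hint in placement_hints):
        return "placement"
    # Branch 3: quota or admission policy rejected the pod.
    if "exceeded quota" in msg or "forbidden" in msg:
        return "policy"
    # Fallthrough: inspect scheduler and device plugin health.
    return "scheduler-or-device-plugin"


print(classify_pending("0/4 nodes are available: 4 Insufficient nvidia.com/gpu."))
```

Treat the result as a hint about where to look first, not as a diagnosis. A single message can mention several reasons, and this sketch reports only the first branch that matches.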

Steps to debug a GPU pod that is scheduled but cannot use its GPU:

  1. Confirm the container sees the devices: run nvidia-smi inside the container if permitted.
  2. Check the device plugin logs and the kubelet's device allocation.
  3. Validate the NVIDIA container runtime/toolkit on the node.
  4. Check the host driver version against the CUDA version in the runtime image.
  5. Confirm the security context and runtime class do not block device access.
  6. Inspect node health: dmesg, Xid errors, ECC errors, and persistence mode if relevant.
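
For the node-health step, a small parser can count Xid events per GPU from kernel log lines. This is a sketch under assumptions: the regular expression targets the commonly seen "NVRM: Xid (PCI:<address>): <code>, ..." shape, the sample lines are illustrative rather than captured from a real host, and the set of codes treated as fatal is a policy choice you would take from NVIDIA's Xid documentation.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:3b:00): 79, ..." with or without "PCI:".
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+)")


def xid_counts(log_lines):
    """Count Xid events keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in log_lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def gpus_with_fatal_xid(counts, fatal_xids):
    """Return PCI addresses that logged any Xid code in `fatal_xids`."""
    return sorted({addr for (addr, code) in counts if code in fatal_xids})


# Illustrative sample lines, not real host output.
sample = [
    "[ 100.0] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.",
    "[ 101.0] NVRM: Xid (PCI:0000:af:00): 31, pid=4300, MMU fault",
    "[ 102.0] usb 1-1: new high-speed USB device",
]
counts = xid_counts(sample)
# The fatal set is a caller-supplied policy assumption.
print(gpus_with_fatal_xid(counts, fatal_xids={48, 79}))
```

The output of such a check is a natural input to the quarantine taint mentioned earlier: a node whose GPU logged a fatal code can be tainted and drained before it takes new work.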

For GPU fleet changes:

  • Segment by pool and hardware SKU.
  • Validate on canary nodes with representative workloads.
  • Gate on GPU telemetry and workload-level SLOs.
  • Keep a rollback path for the node image, driver, device plugin, and workload image.
  • Avoid simultaneous driver/runtime/model changes unless there is a strong reason.
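
The canary gate described above can be sketched as a comparison of canary metrics against a baseline with a per-metric regression budget. The metric names and thresholds below are hypothetical, and every metric is assumed to be "lower is better"; a real gate would read these values from your telemetry system.

```python
def canary_gate(baseline, canary, max_regression):
    """Decide whether a canary may proceed.

    baseline, canary: dicts of metric name -> value (lower is better).
    max_regression: dict of metric name -> allowed fractional increase,
                    e.g. 0.10 allows the canary to be up to 10% worse.
    Returns (ok, failures), where failures lists the violated metrics.
    """
    failures = []
    for metric, limit in max_regression.items():
        allowed = baseline[metric] * (1.0 + limit)
        if canary[metric] > allowed:
            failures.append(f"{metric}: {canary[metric]} > allowed {allowed:.2f}")
    return (not failures, failures)


# Hypothetical metrics mixing GPU telemetry and a workload-level SLO.
baseline = {"p99_latency_ms": 120.0, "xid_errors_per_hour": 0.0}
canary = {"p99_latency_ms": 150.0, "xid_errors_per_hour": 0.0}
budget = {"p99_latency_ms": 0.10, "xid_errors_per_hour": 0.0}

ok, failures = canary_gate(baseline, canary, budget)
print(ok, failures)
```

A zero baseline with a zero budget, as used for the Xid rate here, means any occurrence on the canary blocks the rollout.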

Question: “Would you pack GPU workloads densely or spread them?”

Answer:

  • Pack when utilization and cost dominate, and workloads are predictable.
  • Spread when blast-radius isolation, thermal/power risk, tenant isolation, or tail latency dominate.
  • Use policy per workload class rather than one global rule.
  • Measure stranded capacity, p99 latency, failure correlation, and reschedule cost.
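
The pack-versus-spread choice can be illustrated with a toy node picker that takes the policy as a parameter, which is one way to apply a rule per workload class. This is a simplified sketch, not the Kubernetes scheduler: it looks only at free GPU counts, and the node names are hypothetical. The two policies are loosely analogous to most-allocated and least-allocated scoring.

```python
def pick_node(free_gpus, requested, policy):
    """Pick a node for a pod requesting `requested` GPUs.

    free_gpus: dict of node name -> free GPU count.
    policy: 'pack' chooses the tightest fit (least free capacity left over);
            'spread' chooses the node with the most free GPUs.
    Returns a node name, or None if no node can hold the pod.
    """
    candidates = {n: f for n, f in free_gpus.items() if f >= requested}
    if not candidates:
        return None
    if policy == "pack":
        # Fewest free GPUs first; ties broken by name for determinism.
        return min(candidates, key=lambda n: (candidates[n], n))
    if policy == "spread":
        # Most free GPUs first; ties broken by name for determinism.
        return min(candidates, key=lambda n: (-candidates[n], n))
    raise ValueError(f"unknown policy: {policy}")


free = {"gpu-a": 1, "gpu-b": 4, "gpu-c": 8}
print(pick_node(free, 1, "pack"), pick_node(free, 1, "spread"))
```

Packing the one-GPU pod onto gpu-a keeps gpu-c whole for a future eight-GPU job, which reduces stranded capacity. Spreading it onto gpu-c lowers contention and failure correlation at the cost of fragmenting the largest node.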