Kubernetes and GPUs

What Changes With GPUs

GPU workloads are not ordinary stateless pods:

Resources are expensive and scarce.
Device access depends on drivers, runtime hooks, device plugins, and node labels.
Failures can occur below Kubernetes visibility.
Warmup and model load time make rescheduling costly.
GPU fragmentation can strand capacity.

flowchart LR
  DevicePlugin[NVIDIA device plugin advertises capacity to kubelet] --> Scheduler[Scheduler filters nodes]
  Pod[GPU pod requests nvidia.com/gpu] --> Scheduler
  Scheduler --> Kubelet[Kubelet allocates device via device plugin]
  Kubelet --> Runtime[NVIDIA container runtime]
  Runtime --> Container[Container sees GPU libraries]
  Container --> GPU[GPU execution]

Scheduling Concepts

Concept	Why it matters
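Extended resources	GPUs are advertised as extended resources such as `nvidia.com/gpu`. The scheduler counts them in whole integers and does not overcommit them; finer partitioning (MIG slices, time-slicing) still surfaces as integer-counted resources exposed by the plugin.
Device plugin	Registers GPU devices with the kubelet and, at allocation time, returns the device nodes, mounts, and environment variables the container needs.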
Taints/tolerations	Keep GPU nodes reserved for compatible workloads or quarantine unhealthy nodes.
Node labels	Encode SKU, driver, topology, MIG mode, region, pool, and workload eligibility.
Affinity	Places workloads near required hardware or away from conflicting tenants.
Pod disruption budget	Protects serving availability during voluntary disruption.
Priority class	Allows critical inference workloads to preempt lower-priority jobs if policy allows.
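
To make the table concrete, here is a minimal sketch that builds a Pod manifest touching most of these concepts: an extended-resource limit, a node selector, a toleration, and a priority class. The label key `example.com/gpu-pool`, the taint key, the pool name, the priority class name, and the image are all hypothetical placeholders; substitute whatever your cluster actually uses.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, pool="a100-pool",
                     priority_class="inference-critical"):
    """Build a Pod manifest as a plain dict.

    All names passed in here (pool, priority class, label and taint keys)
    are illustrative assumptions, not values any cluster defines by default.
    """
    if not isinstance(gpus, int) or gpus < 1:
        # Extended resources are requested in whole units only.
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Priority class: must already exist in the cluster.
            "priorityClassName": priority_class,
            # Node label (hypothetical key) that pins the pod to one pool.
            "nodeSelector": {"example.com/gpu-pool": pool},
            # Toleration for a taint (hypothetical key) reserving GPU nodes.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists",
                 "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Extended resource: set in limits, integer count.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("infer-0", "registry.example.com/infer:latest", gpus=2)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed output can be piped to `kubectl apply -f -` once the placeholder names match real objects in the cluster.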

GPU Operator Pieces

Know the moving parts:

Driver daemonset
NVIDIA container toolkit
Kubernetes device plugin
DCGM exporter
Node feature discovery
MIG manager when applicable
Validator pods

Interview line:

If a GPU pod cannot start, I do not stop at kubectl describe pod. I verify the device plugin, runtime class, driver/toolkit compatibility, kubelet events, node labels/taints, and host-level GPU health.

Debug Drill: Pod Pending

Steps:

Run `kubectl describe pod` and read the unschedulable reason in the events.
Check requested GPU count, memory, CPU, node selectors, affinity, and tolerations.
Check node allocatable GPUs and current allocations.
Look for fragmentation: enough total GPUs, not enough on any single eligible node.
Validate labels and taints match intended pool.
Check autoscaler behavior and pending scale-up.
Decide: adjust placement, add capacity, drain/defragment, or fix labels/taints.
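
The fragmentation step above is easy to get wrong by eye, so here is a small sketch of the check. It assumes you have already collected, for each eligible node, the number of free GPUs (allocatable minus already requested); the node names and counts are made up for illustration.

```python
def diagnose_gpu_fit(free_gpus_by_node, requested):
    """Classify why a pod requesting `requested` GPUs does or does not fit.

    free_gpus_by_node maps eligible node name -> free GPU count.
    A pod's GPUs must all come from one node, so total free capacity
    is not enough on its own.
    """
    if requested <= 0:
        raise ValueError("requested must be positive")
    fitting = sorted(
        node for node, free in free_gpus_by_node.items() if free >= requested
    )
    total_free = sum(free_gpus_by_node.values())
    if fitting:
        status = "fits"
    elif total_free >= requested:
        # Enough GPUs overall, but stranded across nodes.
        status = "fragmented"
    else:
        status = "insufficient"
    return {"status": status, "nodes": fitting, "total_free": total_free}


# Hypothetical pool: 6 GPUs free in total, at most 3 on any one node.
free = {"gpu-node-a": 2, "gpu-node-b": 3, "gpu-node-c": 1}
print(diagnose_gpu_fit(free, 4))  # status "fragmented"
print(diagnose_gpu_fit(free, 3))  # status "fits" on gpu-node-b
```

A "fragmented" result points toward draining or defragmenting; "insufficient" points toward adding capacity or checking the autoscaler.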

flowchart TD
  Pending[GPU pod pending] --> Events[kubectl describe events]
  Events --> Scarce{Insufficient GPU}
  Scarce -- yes --> Capacity[Check allocatable, fragmentation, autoscaler]
  Scarce -- no --> Constraints{Node selectors, affinity, taints}
  Constraints -- yes --> Placement[Fix placement contract]
  Constraints -- no --> Policy{Quota or admission}
  Policy -- yes --> Quota[Fix quota or policy]
  Policy -- no --> Scheduler[Inspect scheduler and device plugin health]
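
The decision tree above can be sketched as a function that maps an event message to the next action. The substrings it matches are assumptions based on typical scheduler and quota messages; exact wording varies across Kubernetes versions, so treat this as a starting point rather than a parser you can rely on.

```python
NEXT_STEP = {
    "capacity": "Check allocatable, fragmentation, autoscaler",
    "placement": "Fix placement contract",
    "quota": "Fix quota or policy",
    "scheduler-health": "Inspect scheduler and device plugin health",
}


def triage_pending(event_message):
    """Walk the flowchart in order: capacity, placement, quota, then health.

    Matching is done on lowercase substrings. These substrings are
    assumptions about message wording, not a stable API.
    """
    msg = event_message.lower()
    if "insufficient nvidia.com/gpu" in msg:
        return "capacity"
    if any(token in msg for token in ("affinity", "selector", "taint")):
        return "placement"
    if "quota" in msg or "admission" in msg:
        return "quota"
    return "scheduler-health"


# Illustrative messages, not copied from a real cluster.
samples = [
    "0/4 nodes are available: 4 Insufficient nvidia.com/gpu.",
    "0/4 nodes are available: 4 node(s) had untolerated taint {pool: gpu}.",
    'pods "infer-0" is forbidden: exceeded quota: gpu-quota',
    "",
]
for message in samples:
    branch = triage_pending(message)
    print(f"{branch}: {NEXT_STEP[branch]}")
```

When one message lists several reasons, the function returns the first branch in flowchart order, which matches how the drill is meant to be walked.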

Debug Drill: Pod Starts But No GPU

Steps:

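Confirm the container sees the devices: run `nvidia-smi` inside the container if permitted.
Check device plugin logs and the kubelet's device allocation for the pod.
Validate that the NVIDIA container runtime/toolkit is installed and configured on the node.
Check that the host driver version is new enough for the CUDA version in the container image.
Confirm security context and runtime class do not block devices.
Inspect node health: `dmesg` for NVIDIA Xid errors, ECC error counts, and persistence mode if relevant.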
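
For the driver-versus-image check, a minimal sketch of the comparison follows. It assumes you already have the host driver version (for example from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`) and the minimum driver version required by the image's CUDA release, which you would look up in NVIDIA's compatibility tables. The versions used in the example calls are placeholders.

```python
def parse_version(text):
    """Turn a dotted version such as '535.104.05' into (535, 104, 5).

    Comparing integer tuples avoids the pitfalls of string comparison,
    where '95' would sort after '535'.
    """
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports_image(host_driver, min_driver_for_image):
    """True if the host driver meets the image's minimum driver version."""
    return parse_version(host_driver) >= parse_version(min_driver_for_image)


# Placeholder versions; look up the real minimum for your CUDA release.
print(driver_supports_image("535.104.05", "525.60.13"))
print(driver_supports_image("470.82.01", "525.60.13"))
```

A failing check here usually means either upgrading the node driver or pinning the workload to an image built against an older CUDA release.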

Rollout Safety

For GPU fleet changes:

Segment by pool and hardware SKU.
Validate on canary nodes with representative workloads.
Gate on GPU telemetry and workload-level SLOs.
Keep a rollback path for the node image, driver, device plugin, and workload image.
Avoid simultaneous driver/runtime/model changes unless there is a strong reason.
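
The gating step can be sketched as a comparison of canary metrics against a baseline with a per-metric regression budget. The metric names and thresholds below are hypothetical, and the sketch assumes every metric is "lower is better"; a real gate would also need a soak window and enough samples to be meaningful.

```python
def canary_gate(baseline, canary, max_regression):
    """Compare canary metrics to baseline; return (ok, failures).

    max_regression maps metric name -> allowed fractional increase
    (0.10 means the canary may be up to 10% worse than baseline).
    Assumes lower values are better for every metric listed.
    """
    failures = []
    for metric, allowed in max_regression.items():
        if metric not in baseline or metric not in canary:
            # Missing telemetry fails the gate rather than passing silently.
            failures.append(f"{metric}: missing data")
            continue
        limit = baseline[metric] * (1 + allowed)
        if canary[metric] > limit:
            failures.append(f"{metric}: {canary[metric]} exceeds limit {limit:.4g}")
    return (not failures, failures)


# Hypothetical metrics mixing GPU telemetry and a workload-level SLO.
budget = {"p99_latency_ms": 0.10, "xid_errors": 0.0, "error_rate": 0.25}
baseline = {"p99_latency_ms": 120.0, "xid_errors": 0, "error_rate": 0.002}
healthy = {"p99_latency_ms": 125.0, "xid_errors": 0, "error_rate": 0.002}
unhealthy = {"p99_latency_ms": 125.0, "xid_errors": 2, "error_rate": 0.002}

print(canary_gate(baseline, healthy, budget))
print(canary_gate(baseline, unhealthy, budget))
```

Treating missing data as a failure is a deliberate choice: a canary node whose exporter is down should block the rollout, not wave it through.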

Staff-Level Tradeoff

Question: “Would you pack GPU workloads densely or spread them?”

Answer:

Pack when utilization and cost dominate, and workloads are predictable.
Spread when blast-radius isolation, thermal/power risk, tenant isolation, or tail latency dominate.
Use policy per workload class rather than one global rule.
Measure stranded capacity, p99 latency, failure correlation, and reschedule cost.
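
A toy sketch of "policy per workload class" follows: the same candidate nodes, with the choice between packing and spreading driven by the workload's class. The class names, node names, and free-GPU counts are hypothetical. kube-scheduler offers comparable behavior through the `MostAllocated` and `LeastAllocated` scoring strategies of its `NodeResourcesFit` plugin; this code only illustrates the idea and is not how the scheduler is implemented.

```python
# Hypothetical mapping from workload class to placement policy.
POLICY_BY_CLASS = {
    "batch-training": "pack",
    "latency-critical-inference": "spread",
}


def pick_node(free_gpus_by_node, requested, policy):
    """Choose a node for a pod needing `requested` GPUs.

    pack   -> node with the fewest free GPUs that still fits (less stranding)
    spread -> node with the most free GPUs (smaller blast radius per node)
    Ties are broken by node name so the result is deterministic.
    """
    candidates = {n: f for n, f in free_gpus_by_node.items() if f >= requested}
    if not candidates:
        return None
    if policy == "pack":
        return min(candidates, key=lambda n: (candidates[n], n))
    if policy == "spread":
        return min(candidates, key=lambda n: (-candidates[n], n))
    raise ValueError(f"unknown policy: {policy}")


free = {"node-a": 1, "node-b": 4, "node-c": 8}
for workload_class, policy in POLICY_BY_CLASS.items():
    print(workload_class, "->", pick_node(free, 1, policy))
```

With these inputs the batch job lands on `node-a` and fills it, while the inference pod lands on `node-c`, leaving the larger contiguous block on `node-b` untouched for a future multi-GPU request.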