Kubernetes and GPUs
What Changes With GPUs
GPU workloads are not ordinary stateless pods:
- Resources are expensive and scarce.
- Device access depends on drivers, runtime hooks, device plugins, and node labels.
- Failures can occur below what Kubernetes can see, for example driver faults, Xid errors, or ECC errors on the host.
- Warmup and model load time make rescheduling costly.
- GPU fragmentation can strand capacity.
flowchart LR
Pod[GPU pod requests nvidia.com/gpu] --> Scheduler[Scheduler filters nodes]
Scheduler --> DevicePlugin[NVIDIA device plugin advertises capacity]
DevicePlugin --> Kubelet[Kubelet allocates device]
Kubelet --> Runtime[NVIDIA container runtime]
Runtime --> Container[Container sees GPU libraries]
Container --> GPU[GPU execution]
Scheduling Concepts
| Concept | Why it matters |
|---|---|
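| Extended resources | GPUs are advertised as extended resources such as nvidia.com/gpu. The scheduler counts them in whole integers; finer partitioning (for example MIG slices or time-slicing) appears only when the device plugin advertises it, still as integer counts. |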
| Device plugin | Reports GPU devices to kubelet and injects allocation details into containers. |
| Taints/tolerations | Keep GPU nodes reserved for compatible workloads or quarantine unhealthy nodes. |
| Node labels | Encode SKU, driver, topology, MIG mode, region, pool, and workload eligibility. |
| Affinity | Places workloads near required hardware or away from conflicting tenants. |
| Pod disruption budget | Protects serving availability during voluntary disruption. |
| Priority class | Allows critical inference workloads to preempt lower-priority jobs if policy allows. |
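The way these concepts combine during the scheduler's filter phase can be sketched in a few lines. This is a simplified model, not the scheduler's actual code: the `Node` and `PodSpec` classes, the `feasible` function, and the `gpu.sku` label are hypothetical names used for illustration, and taints are reduced to bare keys (real taints also carry a value and an effect, and tolerations have operators).

```python
from dataclasses import dataclass, field

GPU = "nvidia.com/gpu"

@dataclass
class Node:
    name: str
    labels: dict
    taints: set              # simplified: taint keys only
    allocatable_gpus: int
    allocated_gpus: int = 0

@dataclass
class PodSpec:
    gpus: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)  # simplified: tolerated taint keys

def feasible(node, pod):
    """Return (ok, reason) for one node, loosely mimicking the filter phase."""
    # Node labels must satisfy every nodeSelector entry.
    for key, want in pod.node_selector.items():
        if node.labels.get(key) != want:
            return False, f"label {key} mismatch"
    # Every taint on the node must be tolerated by the pod.
    untolerated = node.taints - pod.tolerations
    if untolerated:
        return False, f"untolerated taint {sorted(untolerated)[0]}"
    # Extended resources are whole integers and cannot be overcommitted.
    free = node.allocatable_gpus - node.allocated_gpus
    if free < pod.gpus:
        return False, f"insufficient {GPU}"
    return True, "fits"

# Illustrative data only; "gpu.sku" is a made-up label key.
nodes = [
    Node("a100-1", {"gpu.sku": "a100"}, {GPU}, 8, 6),
    Node("a100-2", {"gpu.sku": "a100"}, {GPU}, 8, 2),
    Node("cpu-1", {}, set(), 0),
]
pod = PodSpec(gpus=4, node_selector={"gpu.sku": "a100"}, tolerations={GPU})
results = {n.name: feasible(n, pod) for n in nodes}
```

In this example only `a100-2` passes: `a100-1` has just two free GPUs and `cpu-1` lacks the label. Priority and preemption happen after filtering and are left out of the sketch.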
GPU Operator Pieces
Know the moving parts:
- Driver daemonset
- NVIDIA container toolkit
- Kubernetes device plugin
- DCGM exporter
- Node feature discovery
- MIG manager when applicable
- Validator pods
Interview line:
If a GPU pod cannot start, I do not stop at kubectl describe pod. I verify the device plugin, runtime class, driver/toolkit compatibility, kubelet events, node labels/taints, and host-level GPU health.
Debug Drill: Pod Pending
Steps:
- Run kubectl describe pod and read the unschedulable reason in the events.
- Check requested GPU count, memory, CPU, node selectors, affinity, tolerations.
- Check node allocatable GPUs and current allocations.
- Look for fragmentation: enough total GPUs, not enough on any single eligible node.
- Validate labels and taints match intended pool.
- Check autoscaler behavior and pending scale-up.
- Decide: adjust placement, add capacity, drain/defragment, or fix labels/taints.
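The fragmentation check in the steps above comes down to comparing the total free GPU count with the largest free block on any single eligible node. A minimal sketch, assuming you have already computed free GPUs per node (node allocatable minus the sum of GPU requests of pods bound to that node); `diagnose_capacity` is a hypothetical helper name:

```python
def diagnose_capacity(free_gpus_by_node, requested):
    """Classify GPU capacity for one pod that requests `requested` GPUs.

    free_gpus_by_node maps node name -> free GPUs, for eligible nodes only
    (labels and taints already matched).
    """
    total_free = sum(free_gpus_by_node.values())
    largest_hole = max(free_gpus_by_node.values(), default=0)
    if largest_hole >= requested:
        return "schedulable"      # some single node can host the pod
    if total_free >= requested:
        return "fragmented"       # enough GPUs overall, but split across nodes
    return "insufficient"         # the pool needs more capacity

status = diagnose_capacity({"n1": 2, "n2": 3, "n3": 3}, requested=4)
```

Here eight GPUs are free in total but no node has four, so `status` is `"fragmented"`: adding capacity or defragmenting are the options, and fixing labels would not help.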
flowchart TD
Pending[GPU pod pending] --> Events[kubectl describe events]
Events --> Scarce{Insufficient GPU}
Scarce -- yes --> Capacity[Check allocatable, fragmentation, autoscaler]
Scarce -- no --> Constraints{Node selectors, affinity, taints}
Constraints -- yes --> Placement[Fix placement contract]
Constraints -- no --> Policy{Quota or admission}
Policy -- yes --> Quota[Fix quota or policy]
Policy -- no --> Scheduler[Inspect scheduler and device plugin health]
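The decision tree can be approximated as a first-pass classifier over the event message text. This is a rough sketch: the substrings it matches are assumptions about how scheduler and quota messages are usually worded, and the exact wording varies between Kubernetes versions. The function name `triage` and its return labels are hypothetical.

```python
def triage(event_message):
    """Map an event message to the branch of the flowchart to follow first."""
    msg = event_message.lower()
    # Branch 1: not enough GPUs on any candidate node.
    if "insufficient nvidia.com/gpu" in msg:
        return "capacity"
    # Branch 2: node selectors, affinity, or taints excluded the nodes.
    if any(word in msg for word in ("taint", "affinity", "selector")):
        return "placement"
    # Branch 3: quota or admission rejected the request.
    if "exceeded quota" in msg or "forbidden" in msg:
        return "quota"
    # Nothing matched: look at scheduler and device plugin health.
    return "scheduler-or-device-plugin"

branch = triage("0/4 nodes are available: 4 Insufficient nvidia.com/gpu.")
```

Quota rejections often surface on the owning controller (ReplicaSet or Job) instead of on a Pending pod, so the message for the third branch may have to come from there.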
Debug Drill: Pod Starts But No GPU
Steps:
- Confirm container sees devices: nvidia-smi inside container if permitted.
- Check device plugin logs and kubelet allocation.
- Validate NVIDIA container runtime/toolkit.
- Check driver version vs CUDA/runtime image.
- Confirm security context and runtime class do not block devices.
- Inspect node health: dmesg, Xid, ECC, persistence mode if relevant.
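For the node health step, Xid events in the kernel log can be summarized per GPU before deciding whether to quarantine the node. A minimal sketch, assuming the driver logs lines shaped like `NVRM: Xid (PCI:<address>): <code>, ...`; the sample lines are illustrative, not captured from a real host, and `xid_counts` is a hypothetical helper:

```python
import re
from collections import Counter

# Matches the PCI address and numeric Xid code in an NVRM kernel log line.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def xid_counts(log_lines):
    """Count (pci_address, xid_code) pairs found in kernel log lines."""
    counts = Counter()
    for line in log_lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts

# Illustrative input, e.g. what `dmesg` output split into lines might contain.
sample = [
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.",
    "[1240.1] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.",
    "[1300.0] usb 1-1: new high-speed USB device",
]
counts = xid_counts(sample)
```

Repeated codes on the same PCI address point at one device, which helps separate a single bad GPU from a node-wide driver problem.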
Rollout Safety
For GPU fleet changes:
- Segment by pool and hardware SKU.
- Validate on canary nodes with representative workloads.
- Gate on GPU telemetry and workload-level SLOs.
- Keep a rollback path for the node image, driver, device plugin, and workload image.
- Avoid simultaneous driver/runtime/model changes unless there is a strong reason.
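A canary gate that combines GPU telemetry with a workload SLO could look like the sketch below. The metric names (`xid_errors`, `p99_latency_ms`), the thresholds, and the `gate` function are all assumptions for illustration; a real gate would read whatever your telemetry pipeline exports for each pool and SKU.

```python
def gate(canary, baseline, max_latency_regression=0.05, max_new_xid=0):
    """Return (proceed, reasons) for one pool/SKU canary.

    canary and baseline are dicts with hypothetical keys
    "xid_errors" and "p99_latency_ms" for the same observation window.
    """
    reasons = []
    # GPU telemetry gate: no new Xid errors beyond the allowed budget.
    if canary["xid_errors"] - baseline["xid_errors"] > max_new_xid:
        reasons.append("new Xid errors on canary")
    # Workload SLO gate: p99 latency may regress by at most the given fraction.
    limit = baseline["p99_latency_ms"] * (1 + max_latency_regression)
    if canary["p99_latency_ms"] > limit:
        reasons.append("p99 latency regression")
    return (not reasons, reasons)

baseline = {"xid_errors": 0, "p99_latency_ms": 120.0}
canary = {"xid_errors": 0, "p99_latency_ms": 131.0}
proceed, reasons = gate(canary, baseline)
```

With these numbers the canary is held back because 131 ms exceeds the 5% regression budget over 120 ms. Running the gate once per pool and SKU keeps a regression on one hardware type from being averaged away by the others.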
Staff-Level Tradeoff
Question: “Would you pack GPU workloads densely or spread them?”
Answer:
- Pack when utilization and cost dominate, and workloads are predictable.
- Spread when blast-radius isolation, thermal/power risk, tenant isolation, or tail latency dominate.
- Use policy per workload class rather than one global rule.
- Measure stranded capacity, p99 latency, failure correlation, and reschedule cost.
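The difference between the two policies can be shown with a toy node picker. This is a sketch, not scheduler code: `pick_node` is a hypothetical function, and the policies only loosely correspond to the kube-scheduler's MostAllocated (pack) and LeastAllocated (spread) scoring strategies.

```python
def pick_node(free_gpus_by_node, requested, policy):
    """Choose a node for a pod requesting `requested` GPUs.

    policy "pack"   -> tightest fit, keeps large free blocks intact
    policy "spread" -> most free GPUs, limits how much shares one node
    Ties are broken by node name so the choice is deterministic.
    """
    fits = {n: free for n, free in free_gpus_by_node.items() if free >= requested}
    if not fits:
        return None
    if policy == "pack":
        return min(fits, key=lambda n: (fits[n], n))
    if policy == "spread":
        return min(fits, key=lambda n: (-fits[n], n))
    raise ValueError(f"unknown policy: {policy}")

free = {"n1": 2, "n2": 5, "n3": 8}
packed = pick_node(free, 2, "pack")
spread = pick_node(free, 2, "spread")
```

Packing places the two-GPU pod on `n1` and leaves the eight-GPU block on `n3` available for a large job. Spreading picks `n3`, which lowers contention on each node but breaks up that block, and this is where stranded capacity comes from.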