Kubernetes and GPU Q&A

1. Why can a pod remain pending when cluster dashboards show unused GPUs?

Dashboards often show aggregate capacity. The scheduler needs a single eligible node that satisfies every constraint at once: GPU count, SKU labels, taints, affinity, topology spread, CPU, memory, ephemeral storage, PVC binding, and quota.
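
A minimal sketch of the per-node check, assuming a deliberately simplified model. The dict fields (`free_gpus`, `free_cpu`, `labels`, `node_selector`) and the `gpu.sku` label are hypothetical, not Kubernetes API fields, and the real scheduler evaluates many more constraints:

```python
def eligible_nodes(nodes, pod):
    """Return names of nodes that satisfy every requirement of the pod at once."""
    fits = []
    for node in nodes:
        if node["free_gpus"] < pod["gpus"]:
            continue
        if node["free_cpu"] < pod["cpu"]:
            continue
        # Every selector key/value must be present in the node's labels.
        if any(node["labels"].get(k) != v for k, v in pod["node_selector"].items()):
            continue
        fits.append(node["name"])
    return fits


nodes = [
    {"name": "a", "free_gpus": 2, "free_cpu": 1, "labels": {"gpu.sku": "a100"}},
    {"name": "b", "free_gpus": 2, "free_cpu": 32, "labels": {"gpu.sku": "t4"}},
]
pod = {"gpus": 2, "cpu": 8, "node_selector": {"gpu.sku": "a100"}}

total_free = sum(n["free_gpus"] for n in nodes)  # the dashboard view
print(total_free, eligible_nodes(nodes, pod))  # 4 []
```

Four GPUs are free in aggregate, yet node `a` lacks CPU and node `b` has the wrong SKU, so the pod stays pending.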

2. What is the trap in saying "Kubernetes schedules GPUs"?

Kubernetes schedules extended resources advertised by a device plugin. The actual GPU stack depends on host drivers, container runtime hooks, device plugin health, and workload/runtime compatibility.

3. A node shows GPUs with `nvidia-smi`, but Kubernetes allocatable has zero GPUs. Where do you look?

Device plugin logs, kubelet plugin registration, GPU Operator operand health, node labels/taints, driver daemon readiness, container toolkit install, and kubelet restart/plugin socket issues.

4. Why is a GPU driver update more dangerous than a normal DaemonSet update?

It crosses kernel, driver, CUDA runtime, container toolkit, device plugin, telemetry, and workload compatibility. A bad rollout can remove expensive capacity or silently regress inference latency.

5. What is the safest rollout pattern for GPU Operator changes?

Canary by node pool/SKU, validate with operator validators and representative CUDA/inference workloads, gate on DCGM and service SLOs, then promote progressively with rollback for operator, driver, and node image.

6. How do you distinguish a scheduling failure from a startup failure?

Scheduling failures appear before node assignment and show unschedulable events. Startup failures have a node and fail during image pull, volume mount, runtime setup, container start, or probes.
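
The distinction can be sketched as a small classifier over a pod object, assuming the input is a dict shaped like `kubectl get pod -o json` output. The function name `failure_stage` and the returned strings are illustrative only:

```python
def failure_stage(pod):
    """Classify where a pod is stuck: before or after node assignment."""
    status = pod.get("status", {})
    if not pod.get("spec", {}).get("nodeName"):
        # No node yet: look at the PodScheduled condition for the reason.
        for cond in status.get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                return "scheduling: " + cond.get("reason", "unknown")
        return "scheduling: not yet assigned"
    # A node is assigned: any waiting container points at a startup problem.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            return "startup: " + waiting.get("reason", "unknown")
    return "assigned, no waiting containers"


pending = {"spec": {}, "status": {"conditions": [
    {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}]}}
print(failure_stage(pending))  # scheduling: Unschedulable
```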

7. What does a device plugin actually do?

It registers vendor-specific resources with kubelet, reports device health/capacity, and participates in allocation so containers receive the correct device access details.

7A. Walk through the API server write path for a Deployment update.

A client request passes through authentication, authorization, mutating admission, object schema validation, validating admission, and etcd persistence, after which watchers are notified. Controllers observe the new revision and reconcile ReplicaSets and pods.

7B. Why can existing pods serve while the Kubernetes control plane is degraded?

The data plane can keep forwarding to already-running pods while the control plane cannot accept writes or converge new state. Existing traffic depends on nodes, networking, gateways, and app health, not every API server write.

7C. What is the scheduler doing before a pod binds?

It queues the pod, filters infeasible nodes, scores feasible nodes, reserves resources, runs permit logic if configured, then binds the pod to a node for kubelet to start.

7D. Why can preemption fail to rescue a pending GPU pod?

Even if lower-priority victims exist, the exact GPU shape, node labels, affinity, topology, PDBs, and quota can still prevent a valid fit.

7E. What does kubelet own during pod startup?

Kubelet observes the assigned pod, pulls images, mounts volumes, has the container runtime set up the pod sandbox and CNI networking, allocates devices through plugins, starts containers through CRI, runs probes, and reports pod status.

8. Why might a GPU pod fail only on one hardware SKU?

Model memory footprint, CUDA compute capability, MIG profile, driver branch, kernel/OS image, NVLink/PCIe topology, or labels causing accidental placement on unsupported devices.

9. What is GPU fragmentation in Kubernetes?

Enough GPUs exist across the fleet, but no eligible node has the exact available shape required by a pod or gang of pods.
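
A small worked example, assuming a hypothetical list of free GPU counts per node and ignoring every constraint other than GPU count:

```python
def can_place(free_gpus_per_node, gpus_per_pod):
    """True if at least one node has enough free GPUs for a single pod."""
    return any(free >= gpus_per_pod for free in free_gpus_per_node)


def placeable_pods(free_gpus_per_node, gpus_per_pod):
    """How many pods of this shape fit, counting each node separately."""
    return sum(free // gpus_per_pod for free in free_gpus_per_node)


free = [3, 3, 2]  # hypothetical free GPUs on three nodes
print(sum(free), can_place(free, 4), placeable_pods(free, 2))  # 8 False 3
```

Eight GPUs are free fleet-wide, but a 4-GPU pod cannot land anywhere, and only three 2-GPU pods fit rather than the four that the aggregate number suggests.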

10. How do you reduce GPU fragmentation?

Pool by workload shape, use placement policies, bin-pack predictable workloads, reserve capacity for large shapes, defragment through controlled drains, and model demand by SKU/profile.
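
The bin-packing idea can be sketched as a best-fit choice: place each pod on the feasible node that leaves the fewest GPUs stranded, so large free shapes stay intact. This is a simplified illustration with hypothetical node names, not the scheduler's actual scoring plugin:

```python
def best_fit(free_gpus, request):
    """Pick the feasible node with the least leftover GPUs (ties broken by name)."""
    candidates = {name: free for name, free in free_gpus.items() if free >= request}
    if not candidates:
        return None
    return min(candidates, key=lambda name: (candidates[name] - request, name))


free_gpus = {"n1": 8, "n2": 2, "n3": 4}
print(best_fit(free_gpus, 2))  # n2
```

Placing the 2-GPU pod on `n2` keeps `n1` whole for a future 8-GPU job; a spread-style policy that picked `n1` would strand that shape.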

11. Why is `kubectl top node` insufficient for GPU debugging?

It reports only CPU and memory usage, not GPU health, GPU memory, ECC errors, Xid events, power, throttling, or per-process usage. Use DCGM, driver logs, and workload metrics.

12. How do MIG and time-slicing differ operationally?

MIG provides hardware partitions with clearer isolation and accounting on supported GPUs. Time-slicing shares a GPU over time with weaker isolation and more latency/noisy-neighbor risk.

13. When would you avoid MIG?

When workload shapes change frequently, partition profiles would strand memory/compute, or latency/throughput is better with whole-GPU scheduling.

14. When would you avoid time-slicing?

For strict latency SLOs, untrusted tenants, memory-heavy workloads, or cases where interference cannot be bounded and proven through load tests.

15. A pod requests one GPU but sees two GPUs. What could be wrong?

Runtime or device visibility is bypassing Kubernetes allocation, privileged host access is leaking devices, the container toolkit configuration is wrong, or the app ignores `CUDA_VISIBLE_DEVICES`.

16. Why are startup probes important for inference pods?

Model load and warmup can take much longer than ordinary service startup. Without startup probes, liveness can kill a healthy pod before it becomes ready.
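
The startup budget is roughly `failureThreshold * periodSeconds` (plus any `initialDelaySeconds`). A minimal sketch for sizing it, assuming a measured worst-case load time; the helper name and the 1.5x safety factor are illustrative choices, not a recommendation from any tool:

```python
import math


def startup_failure_threshold(worst_case_load_s, period_s=10, safety_factor=1.5):
    """Smallest failureThreshold whose budget (failureThreshold * periodSeconds)
    covers the worst-case load time with headroom."""
    return math.ceil(worst_case_load_s * safety_factor / period_s)


print(startup_failure_threshold(240))  # 36, i.e. a 360 s startup budget
```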

17. Why can readiness pass while inference is broken?

The readiness probe may check only process health or HTTP status, not model load, GPU execution, artifact version, or representative inference correctness.

18. What should a production inference readiness check prove?

That the intended model version is loaded, dependencies are reachable, the GPU path works, and the server can satisfy at least a lightweight representative request.
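
A sketch of such a check, assuming the model server can supply its loaded version, a dependency check, and a tiny smoke inference. All four parameters are hypothetical hooks, not part of any particular serving framework:

```python
def ready(expected_version, loaded_version, dependencies_ok, run_smoke_inference):
    """Return (ok, reason) for a readiness endpoint to report."""
    if loaded_version != expected_version:
        return False, "wrong or missing model version"
    if not dependencies_ok():
        return False, "dependency unreachable"
    try:
        output = run_smoke_inference()  # tiny request that exercises the GPU path
    except Exception as exc:
        return False, f"inference failed: {exc}"
    if output is None:
        return False, "empty inference output"
    return True, "ok"


print(ready("v2", "v2", lambda: True, lambda: [0.1]))  # (True, 'ok')
```

The smoke request should stay cheap enough that running it on every probe period does not compete with real traffic.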

19. How can PDBs hurt emergency remediation?

They can block voluntary evictions during drains. That is a safety feature, so automation should treat a blocked eviction as a capacity or SLO risk signal rather than force-deleting pods.
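
A rough sketch of the arithmetic behind a blocked drain, assuming a PDB expressed as an integer `minAvailable`. The helper names are hypothetical; in a real cluster the PDB's `status.disruptionsAllowed` is the source of truth and it refreshes as replacements become healthy:

```python
def disruptions_allowed(current_healthy, min_available):
    """Evictions the budget permits right now (never negative)."""
    return max(0, current_healthy - min_available)


def plan_drain(pods_on_node, current_healthy, min_available):
    """Return (evict_now, must_wait) for the replicas on the node being drained."""
    budget = disruptions_allowed(current_healthy, min_available)
    evict_now = min(pods_on_node, budget)
    return evict_now, pods_on_node - evict_now


print(plan_drain(3, 8, 7))  # (1, 2)
print(plan_drain(2, 6, 7))  # (0, 2)
```

In the second case the service is already below its budget, so the drain stalls; that is the signal to add capacity first rather than force deletion.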

20. What is a common mistake with taints on GPU nodes?

Adding a GPU-only taint but forgetting tolerations on GPU workloads or operator operands, causing either workloads or the GPU stack itself not to land.
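
A simplified sketch of toleration matching, useful for checking both workloads and operator operands against a tainted node. The taint `nvidia.com/gpu=present:NoSchedule` is an assumed example, and the code omits `tolerationSeconds` handling:

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        return operator == "Exists"  # empty key + Exists tolerates everything
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints, tolerations):
    """Taints that would keep the pod off the node (NoSchedule/NoExecute only)."""
    return [t for t in taints
            if t["effect"] in ("NoSchedule", "NoExecute")
            and not any(tolerates(tol, t) for tol in tolerations)]


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
print(untolerated([gpu_taint], []))  # the taint itself: the pod cannot land
```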

21. What node labels matter for GPU fleets?

GPU SKU, GPU count, MIG mode/profile, driver version, CUDA compatibility, node pool, workload class, failure domain, topology, OS image, and operator readiness.

22. Why is node affinity dangerous if overused?

It can encode stale hardware assumptions, reduce scheduler flexibility, create hotspots, and make outages worse when a specific pool is degraded.

23. How do topology spread constraints help inference?

They reduce correlated failure by spreading replicas across zones, racks, pools, or nodes, but can worsen scheduling if capacity is tight.
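
The trade-off can be sketched with the skew rule: a domain is allowed only if placing one more replica there keeps its count within `maxSkew` of the least-loaded domain. The zone names and counts are hypothetical, and the sketch ignores node eligibility and `minDomains`:

```python
def allowed_domains(pods_per_domain, max_skew=1):
    """Domains where adding one replica keeps skew within max_skew."""
    floor = min(pods_per_domain.values())
    return sorted(d for d, n in pods_per_domain.items() if n + 1 - floor <= max_skew)


zones = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
print(allowed_domains(zones))  # ['zone-c']
```

If `zone-c` has no free GPUs and the constraint uses `whenUnsatisfiable: DoNotSchedule`, the replica stays pending even though other zones have capacity.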

24. What does `CrashLoopBackOff` not tell you?

It tells you the container keeps restarting, not why. The cause could be app config, model artifact, CUDA compatibility, GPU memory, missing secret, dependency, or probe failure.

25. How do you debug GPU memory OOM versus container memory OOM?

Container OOM appears in kubelet/container status and cgroup memory events. GPU memory OOM appears in framework/model logs, CUDA errors, DCGM/nvidia-smi, and may not kill the container.
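
A rough triage sketch, assuming a container status dict shaped like the pod's `status.containerStatuses[]` entry and a blob of application logs. The log substrings are assumptions: `CUDA out of memory` is the wording PyTorch commonly uses and `CUDA_ERROR_OUT_OF_MEMORY` is the CUDA driver error name, but other frameworks may log differently:

```python
def classify_oom(container_status, log_text):
    """Distinguish a cgroup OOM kill from a GPU allocation failure."""
    terminated = container_status.get("lastState", {}).get("terminated") or {}
    if terminated.get("reason") == "OOMKilled":
        return "container memory OOM (cgroup limit)"
    # GPU OOM usually surfaces only in application logs; the container may keep running.
    if "CUDA out of memory" in log_text or "CUDA_ERROR_OUT_OF_MEMORY" in log_text:
        return "GPU memory OOM"
    return "no OOM evidence"


status = {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
print(classify_oom(status, ""))  # container memory OOM (cgroup limit)
```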

26. Why can HPA be misleading for GPU inference?

HPA is commonly configured on CPU utilization, which is often not the bottleneck for GPU inference. Better signals include queue time, request latency, GPU utilization, GPU memory, in-flight requests, and model server batching metrics.
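
A sketch of the documented HPA scaling rule, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with the default 0.1 tolerance. The metric values below are made up, and the sketch omits min/max replica bounds and stabilization windows:

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply the HPA ratio rule to one metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no change
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 35, 60))  # CPU at 35% vs 60% target -> 3
print(desired_replicas(4, 30, 10))  # queue depth 30 vs target 10 -> 12
```

The same four replicas would be scaled down on the CPU signal and tripled on the queue signal, which is why the choice of metric matters more than the autoscaler itself.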

27. Why does autoscaling not immediately fix inference overload?

Node provisioning, GPU operator readiness, image pull, model artifact download, model load, and warmup can exceed the traffic spike window.

28. How do warm pools help?

They keep validated nodes and sometimes warm replicas available so traffic can shift without waiting for hardware provisioning and model load.
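
A back-of-the-envelope sketch of time to serving capacity with and without warm pools. The stage names follow the answers above, while every duration is a hypothetical placeholder to be replaced with measured values:

```python
NODE_STAGES = ("node_provision", "gpu_stack_ready")
POD_STAGES = ("image_pull", "artifact_download", "model_load", "warmup")


def time_to_serving(stage_s, warm_node=False, warm_replica=False):
    """Seconds until a new replica can take traffic."""
    if warm_replica:
        return 0  # a loaded replica is already waiting
    stages = POD_STAGES if warm_node else NODE_STAGES + POD_STAGES
    return sum(stage_s[name] for name in stages)


stage_s = {"node_provision": 180, "gpu_stack_ready": 120, "image_pull": 90,
           "artifact_download": 60, "model_load": 120, "warmup": 30}
print(time_to_serving(stage_s), time_to_serving(stage_s, warm_node=True))  # 600 300
```

With these placeholder numbers a cold scale-up takes ten minutes, so a spike shorter than that is over before the new capacity serves a single request.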

29. What is Kueue useful for in GPU environments?

Quota-aware admission and fair sharing for queued workloads, especially batch inference, training, evaluation, and jobs that should not partially schedule.

30. How is Kueue different from HPA?

Kueue decides when workloads are admitted based on quota and queue policy. HPA changes replica count for running scalable workloads.

31. What is the staff-level answer to "Karpenter or Cluster Autoscaler"?

Choose based on provider, fleet shape, consolidation needs, operational maturity, and GPU readiness path. The larger issue is end-to-end capacity time including driver validation and model warmup.

32. How can a ValidatingAdmissionWebhook take down deploys?

If it is configured with `failurePolicy: Fail` and has bad certificates, no ready endpoints, or slow responses, API writes for every matching resource are rejected or stall, potentially cluster-wide.

33. How do you make admission webhooks safer?

Tight selectors, short timeouts, HA endpoints, dashboards, alerting, staged rollout, tested cert rotation, and documented emergency bypass.
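
A minimal audit sketch over a webhook configuration, assuming a dict shaped like a `ValidatingWebhookConfiguration` object. The 5-second threshold and the finding strings are arbitrary choices for illustration; the defaults used when a field is absent (`Fail`, 10 seconds) follow the `admissionregistration.k8s.io/v1` API:

```python
def risky_webhooks(config, max_timeout_s=5):
    """Map webhook name -> list of risk findings."""
    findings = {}
    for hook in config.get("webhooks", []):
        problems = []
        # An absent or empty selector matches everything.
        unscoped = not hook.get("namespaceSelector") and not hook.get("objectSelector")
        if hook.get("failurePolicy", "Fail") == "Fail" and unscoped:
            problems.append("fail-closed with no selector")
        if hook.get("timeoutSeconds", 10) > max_timeout_s:
            problems.append("timeout above threshold")
        if problems:
            findings[hook["name"]] = problems
    return findings


config = {"webhooks": [
    {"name": "a.example.com", "failurePolicy": "Fail", "timeoutSeconds": 10},
    {"name": "b.example.com", "failurePolicy": "Ignore", "timeoutSeconds": 2},
]}
print(risky_webhooks(config))
```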

34. Why is a control-plane outage not always a serving outage?

Existing data-plane pods, services, and connections may keep serving. The outage may block scheduling, deploys, scaling, and controller reconciliation.

35. How do you respond differently to control-plane versus data-plane incidents?

Control-plane: freeze changes, preserve serving, avoid actions needing API writes. Data-plane: mitigate traffic, capacity, pod/node/network failures directly.

36. What is an EndpointSlice bug symptom?

Services route to missing, stale, or wrong endpoints; traffic fails despite healthy pods, or load balancing does not reflect readiness.

37. Why might a DaemonSet not run on a GPU node?

Node selectors, taints, tolerations, affinity, host OS mismatch, namespace policy, insufficient resources, or operator-specific skip labels.

38. What is the risk of running GPU support components as privileged pods?

They often need host access, so a compromised or misconfigured component can affect the whole node. Isolate them to system namespaces, restrict RBAC, enforce image provenance, and avoid granting similar privileges to application pods.

39. How do you validate a new GPU node before admitting production workloads?

Confirm driver/toolkit/device plugin/DCGM, run CUDA and representative inference tests, verify labels/taints, check telemetry, then uncordon or remove quarantine.

40. Why do model servers need startup and readiness tuned separately?

Startup handles long initialization without restarts. Readiness controls traffic admission after startup and during degraded states.

41. What is a bad liveness probe for inference?

One that kills pods during temporary dependency latency, model loading, or GPU warmup. Liveness should detect unrecoverable stuck states, not ordinary slowness.

42. How do you debug kubelet device allocation issues?

Inspect kubelet logs, device plugin registration socket, allocatable resources, pod events, plugin logs, and node-level runtime configuration.

43. Why can a pod be OOMKilled even when GPU memory is free?

CPU RAM and GPU memory are separate. Pre/post-processing, tokenization, artifact load, or host-side buffers may exceed container memory limits.

44. Why can a node with healthy kubelet still be bad for inference?

GPU hardware, driver, PCIe, NVLink, thermal, or DCGM health can be degraded while kubelet reports Ready.

45. What does "cordon then drain" protect?

Cordon prevents new pods from landing. Drain safely evicts existing workloads while respecting PDBs and workload disruption semantics.

46. What is the danger in force deleting pods on GPU nodes?

You can violate SLOs, lose evidence, corrupt local state, cause thundering-herd reschedules, or overload remaining GPU capacity.

47. How do you prevent GPU repair automation from thrashing?

Use state machines, cooldowns, confidence thresholds, max retries, escalation state, capacity checks, and global concurrency limits.
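
A compact sketch of those guards combined in one decision function. The class name, thresholds, and returned action strings are all hypothetical, and a real controller would persist this state rather than keep it in memory:

```python
class RepairController:
    """Decide whether to repair a node now, wait, or hand off to a human."""

    def __init__(self, max_concurrent=2, max_retries=3, cooldown_s=600):
        self.max_concurrent = max_concurrent
        self.max_retries = max_retries
        self.cooldown_s = cooldown_s
        self.in_progress = set()
        self.attempts = {}
        self.last_attempt = {}

    def decide(self, node, now, healthy_capacity_ok):
        if node in self.in_progress:
            return "wait"
        if self.attempts.get(node, 0) >= self.max_retries:
            return "escalate"  # stop retrying, page a human
        if now - self.last_attempt.get(node, float("-inf")) < self.cooldown_s:
            return "cooldown"
        if len(self.in_progress) >= self.max_concurrent or not healthy_capacity_ok:
            return "defer"  # global concurrency or capacity guard
        self.in_progress.add(node)
        self.attempts[node] = self.attempts.get(node, 0) + 1
        self.last_attempt[node] = now
        return "repair"

    def finish(self, node):
        self.in_progress.discard(node)


ctl = RepairController(max_concurrent=1, max_retries=2, cooldown_s=100)
print(ctl.decide("n1", 0, True), ctl.decide("n2", 1, True))  # repair defer
```

The ordering of checks is a design choice: escalation and cooldown are evaluated before the capacity guard so that a node that has exhausted its retries is never silently deferred forever.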

48. Why is `kubectl describe node` both useful and insufficient?

It shows allocatable resources, taints, conditions, and events, but not deep GPU health, kernel logs, runtime internals, or application-level symptoms.

49. What is the hard part of multi-tenant GPU scheduling?

Balancing utilization, fairness, SLO priority, isolation, quota, hardware shape, preemption policy, and tenant trust boundaries.

50. What answer shows you can operate GPUs at scale?

Discuss fleet segmentation, canaries, telemetry, resource envelopes, placement policy, capacity headroom, conservative automation, rollback, and post-incident prevention.