Kubernetes Drill Bank
Use these as live practice. Say your hypothesis before each command.
Drill 1: Pod Pending With GPUs Available
Prompt:
A model server pod is pending. The dashboard says the cluster has unused GPUs.
Commands:
kubectl describe pod -n inference <pod>
kubectl get nodes -L nvidia.com/gpu.product,nvidia.com/mig.strategy
kubectl describe node <node>
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
What you are checking:
- Unschedulable reason.
- GPU request shape.
- Node selector and affinity.
- Taints and tolerations.
- Per-node allocatable GPUs, not aggregate fleet GPUs.
- Fragmentation or wrong SKU.
- Quota/LimitRange.
Senior close:
Aggregate free GPU count is a misleading metric. Scheduling needs an eligible node with the exact resource shape, labels, taints, and topology constraints.
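The gap between fleet-wide and per-node capacity can be sketched in a few lines of Python. This is a simplified model, not the scheduler's real algorithm: the Node and PodRequest types, the node names, and the shortened product label values are all hypothetical, and only GPU count, node selector, and taint keys are considered.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_gpus: int                               # allocatable minus already-requested
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)     # taint keys (effect ignored here)

@dataclass
class PodRequest:
    gpus: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)  # tolerated taint keys

def eligible_nodes(pod, nodes):
    """Return names of nodes that could host the pod on their own."""
    result = []
    for node in nodes:
        fits = node.free_gpus >= pod.gpus
        selected = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
        tolerated = node.taints <= pod.tolerations
        if fits and selected and tolerated:
            result.append(node.name)
    return result

# Hypothetical fleet: 8 free GPUs in total, but fragmented or on the wrong SKU.
nodes = [
    Node("gpu-a", free_gpus=2, labels={"nvidia.com/gpu.product": "A100"}),
    Node("gpu-b", free_gpus=2, labels={"nvidia.com/gpu.product": "A100"}),
    Node("gpu-c", free_gpus=4, labels={"nvidia.com/gpu.product": "T4"}),
]
pod = PodRequest(gpus=4, node_selector={"nvidia.com/gpu.product": "A100"})

print(sum(n.free_gpus for n in nodes))  # 8
print(eligible_nodes(pod, nodes))       # []
```

The dashboard number (8) and the scheduler's answer (no eligible node) are both correct; they simply answer different questions.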
Drill 2: GPU Pod Starts But CUDA Is Unavailable
Commands:
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
kubectl describe node <node> | grep -A8 -i allocatable
kubectl exec -n inference <pod> -- nvidia-smi
kubectl get runtimeclass
What you are checking:
- Device plugin health.
- Driver daemon health.
- Container toolkit/runtime path.
- Driver/CUDA compatibility.
- Whether Kubernetes allocated a GPU to the pod.
- Whether the workload image expects a newer CUDA capability than the host driver supports.
Drill 3: CrashLoop After Runtime Image Change
Commands:
kubectl rollout history deploy/<deploy> -n inference
kubectl describe pod -n inference <pod>
kubectl logs -n inference <pod> --previous
kubectl get rs -n inference -o wide
kubectl diff -f rendered-manifest.yaml
Hypotheses:
- Bad image.
- Missing env/secret/config.
- CUDA/driver mismatch.
- Model artifact path changed.
- Startup probe too aggressive.
- GPU memory allocation failure.
Mitigation:
- Pause rollout.
- Roll back if there is user impact.
- Keep failed pod evidence.
- Add startup validation and compatibility gate.
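One way to test the "startup probe too aggressive" hypothesis is to compare the probe's time budget with how long the new image takes to become ready. The sketch below assumes the usual approximation that a startup probe allows roughly initialDelaySeconds + periodSeconds × failureThreshold before the container is restarted. The 1.5× headroom factor and the sample timings are illustrative assumptions, not recommended values.

```python
def startup_budget_seconds(initial_delay, period, failure_threshold):
    """Approximate time a container has to pass its startup probe."""
    return initial_delay + period * failure_threshold

def probe_too_aggressive(budget_seconds, observed_load_seconds, headroom=1.5):
    """True when the budget does not cover the observed load time plus headroom."""
    return budget_seconds < observed_load_seconds * headroom

# Hypothetical numbers: probe unchanged, new runtime image loads the model slower.
budget = startup_budget_seconds(initial_delay=0, period=10, failure_threshold=6)
print(budget)                              # 60
print(probe_too_aggressive(budget, 30))    # False: old image, 30 s load
print(probe_too_aggressive(budget, 95))    # True: new image, 95 s load
```

If the check flips from False to True across the image change, the probe budget is a more likely culprit than a bad image.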
Drill 4: p99 Latency Spike With No Error Spike
Commands and sources:
kubectl top pods -n inference
kubectl top nodes
kubectl get hpa -n inference
kubectl get pods -n inference --field-selector=status.phase=Pending
Also inspect:
- Model server queue time.
- Batch wait.
- GPU utilization and memory.
- Request mix.
- Recent deploys.
- Network and downstream latency.
Senior answer:
A latency-only incident often means saturation, queuing, batching, warmup, or dependency latency. I would split total request time into queue, compute, network, and downstream before changing capacity or batch settings.
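The split into queue, compute, network, and downstream time can be sketched as follows, assuming each request already carries per-component timings in milliseconds. The field names and the sample data are hypothetical; real numbers would come from model server metrics or traces.

```python
COMPONENTS = ("queue_ms", "compute_ms", "network_ms", "downstream_ms")

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) using integers
    return ordered[rank - 1]

def tail_by_component(requests):
    """Return (dominant component, p99 per component)."""
    tails = {c: p99([r[c] for r in requests]) for c in COMPONENTS}
    return max(tails, key=tails.get), tails

# Hypothetical sample: 5% of requests wait a long time in the queue.
requests = [
    {"queue_ms": 5, "compute_ms": 50, "network_ms": 2, "downstream_ms": 10}
    for _ in range(95)
]
requests += [
    {"queue_ms": 400, "compute_ms": 55, "network_ms": 2, "downstream_ms": 10}
    for _ in range(5)
]

dominant, tails = tail_by_component(requests)
print(dominant)  # queue_ms
print(tails)     # {'queue_ms': 400, 'compute_ms': 55, 'network_ms': 2, 'downstream_ms': 10}
```

Per-component p99 values do not add up to the end-to-end p99, so treat this as a pointer to where the tail lives, not as an exact budget.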
Drill 5: Node Drain Hangs
Commands:
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --dry-run=server
kubectl get pdb -A
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
kubectl describe pod -n <ns> <pod>
Causes:
- PDB blocks eviction.
- Pod uses emptyDir local storage (needs --delete-emptydir-data).
- DaemonSet pods not excluded (missing --ignore-daemonsets).
- Finalizer stuck.
- Workload has no safe replacement capacity.
Senior close:
Drain automation must treat a blocked PDB as a safety signal, not an obstacle to force through during normal operations.
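Why a PDB blocks eviction comes down to simple arithmetic. The sketch below is a minimal model that handles only integer minAvailable or maxUnavailable values (percentages are not covered), and the function name is made up for illustration.

```python
def disruptions_allowed(current_healthy, expected_pods,
                        min_available=None, max_unavailable=None):
    """Evictions a PDB would currently permit (integer settings only)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected_pods - max_unavailable
    return max(0, current_healthy - desired_healthy)

# 3 replicas with minAvailable=3: no eviction is ever allowed, so drain waits.
print(disruptions_allowed(3, 3, min_available=3))      # 0
# 3 replicas with maxUnavailable=1, all healthy: one eviction at a time.
print(disruptions_allowed(3, 3, max_unavailable=1))    # 1
# Same PDB while one replica is already unhealthy: blocked until it recovers.
print(disruptions_allowed(2, 3, max_unavailable=1))    # 0
```

A result of 0 is the safety signal: automation should wait or page, not retry with force.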
Drill 6: Gateway Route Sends Traffic To Wrong Version
Commands:
kubectl get gateway,httproute,grpcroute -A
kubectl describe httproute -n <ns> <route>
kubectl get svc,endpointslices -n <ns>
kubectl logs -n <gateway-ns> deploy/<gateway-controller>
Check:
- Route attachment.
- Hostname and path match.
- Backend refs and weights.
- Cross-namespace reference grants.
- Service selectors.
- Endpoint readiness.
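When checking backend refs and weights, it helps to turn the weights into expected traffic shares before comparing them with what you observe. This sketch assumes the Gateway API behavior that a backend ref without an explicit weight defaults to 1; the backend names are hypothetical.

```python
def traffic_shares(backend_refs):
    """Map backend name to its expected fraction of traffic."""
    weights = {ref["name"]: ref.get("weight", 1) for ref in backend_refs}
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: weight / total for name, weight in weights.items()}

# Intended 90/10 canary split.
print(traffic_shares([
    {"name": "model-v1", "weight": 90},
    {"name": "model-v2", "weight": 10},
]))

# Weight accidentally omitted on v2: it defaults to 1, giving about 1% instead of 10%.
print(traffic_shares([
    {"name": "model-v1", "weight": 90},
    {"name": "model-v2"},
]))
```

If observed traffic disagrees with the computed shares, move on to route attachment, Service selectors, and endpoint readiness.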
Drill 7: CNI Or Network Policy Drop
Commands:
kubectl get networkpolicy -A
kubectl describe networkpolicy -n <ns> <policy>
kubectl exec -n <ns> <pod> -- curl -v http://<service>:<port>/health
kubectl exec -n <ns> <pod> -- nslookup <service>.<ns>.svc.cluster.local
If Cilium/Hubble is available:
cilium status
cilium connectivity test
hubble observe --namespace <ns>
Explain:
- DNS success does not imply TCP success.
- Service success does not imply pod-to-pod policy success.
- Policy can block egress, ingress, or DNS.
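The point that DNS success does not imply TCP success can be demonstrated by testing the two layers separately. The sketch below is a generic probe using Python's standard socket module; the function name is made up, and the demo uses a local listener instead of a real Service so that it runs anywhere.

```python
import socket

def layered_check(host, port, timeout=2.0):
    """Report name resolution and TCP connect as separate results."""
    result = {"dns": False, "tcp": False}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns"] = True
    except socket.gaierror:
        return result
    for family, socktype, proto, _, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                result["tcp"] = True
                break
        except OSError:
            continue   # refused, timed out, or unreachable: try the next address
    return result

# Local demo: the address resolves either way, but TCP only works while listening.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(layered_check("127.0.0.1", port))   # resolves and connects
listener.close()
print(layered_check("127.0.0.1", port))   # resolves, connect fails
```

Inside a cluster, a dropped connection usually shows up as a timeout, not a refusal, so keep the timeout short when probing by hand.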
Drill 8: GitOps Drift
Commands:
argocd app get <app>
argocd app diff <app>
flux get kustomizations -A
flux get helmreleases -A
kubectl get events -n <app-ns> --sort-by=.lastTimestamp
Strong answer:
I would not fight the reconciler. If the live change is correct, it needs to be committed to source of truth or the controller needs an intentional, audited pause.
Drill 9: Admission Webhook Blocking Deploys
Commands:
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl describe validatingwebhookconfiguration <name>
kubectl get apiservices
kubectl get events -A --sort-by=.lastTimestamp | tail -50
Check:
- Webhook service endpoints.
- Failure policy.
- Timeout.
- Certificate expiry.
- Namespace/object selectors.
Senior note:
Admission controls protect production, but a failing webhook can become a cluster-wide deploy outage. Critical webhooks need SLOs, dashboards, timeout discipline, and emergency bypass policy.
Drill 10: GPU Node Quarantine Decision
Signals:
- Repeated Xid errors.
- Uncorrected ECC.
- Multiple workload failures on same GPU ID.
- DCGM unhealthy.
- Thermal/power throttling.
- Device plugin flapping.
Action plan:
- Mark node suspect.
- Confirm capacity headroom.
- Cordon.
- Drain respecting PDBs.
- Run diagnostics or reboot.
- Validate with GPU test workload.
- Return or escalate to hardware repair.
Senior close:
The automation should be conservative until it has enough evidence. Recommendation mode is a good first step for risky hardware actions.
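Recommendation mode can be sketched as a function that classifies signals and returns advice without touching the node. The split into hard and soft signals, the threshold of two soft signals, and the signal names are all assumptions made for illustration; a real policy would be tuned against fleet history.

```python
# Assumed classification: hard signals justify action on their own.
HARD_SIGNALS = {"uncorrected_ecc", "dcgm_unhealthy"}
SOFT_SIGNALS = {"xid_errors", "repeated_workload_failures",
                "throttling", "device_plugin_flapping"}

def recommend(signals, spare_nodes, min_soft=2):
    """Return a recommendation string; never acts on the node itself."""
    observed = set(signals)
    hard = observed & HARD_SIGNALS
    soft = observed & SOFT_SIGNALS
    if not hard and len(soft) < min_soft:
        return "watch"
    if spare_nodes < 1:
        return "escalate: suspect node but no capacity headroom"
    return "recommend cordon and drain"

print(recommend(["xid_errors"], spare_nodes=2))                # watch
print(recommend(["uncorrected_ecc"], spare_nodes=2))           # recommend cordon and drain
print(recommend(["xid_errors", "throttling"], spare_nodes=0))  # escalate: ...
```

Keeping the output as a string for a human to approve is the conservative step; the same function could later gate an automated cordon once its recommendations have proven reliable.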
Drill 11: API Server Slow But Inference Still Serves
Prompt:
Deploys are timing out, kubectl is slow, but existing inference traffic is mostly healthy.
Commands:
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'
kubectl get --raw='/metrics' | grep apiserver_request
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl get events -A --sort-by=.lastTimestamp | tail -80
What you are checking:
- Read vs write path.
- Admission webhook latency/failures.
- API priority and fairness queueing.
- Controller/watch lag.
- Whether user traffic depends on the failing control-plane path.
Senior close:
I would separate serving impact from operations impact. If existing pods and gateways serve, I freeze nonessential deploys and reduce control-plane write load before touching data-plane capacity.
Drill 12: Etcd Latency Or Quorum Risk
Prompt:
Control-plane alerts show etcd fsync latency and API server write timeouts.
Commands:
kubectl get --raw='/metrics' | grep -E 'etcd|apiserver_storage'
kubectl get leases -A | head
kubectl get events -A --sort-by=.lastTimestamp | tail -80
kubectl get nodes
If you have control-plane host access, inspect etcd member health, leader changes, disk latency, database size, compaction, and defrag history.
Do not:
- Restart every control-plane component blindly.
- Run risky fleet-wide deploys during degraded persistence.
- Assume a successful read means writes are safe.
Senior close:
Etcd incidents are about protecting quorum and write durability. I would lower write pressure, verify data-plane health, inspect disk and member health, and, if a restore becomes necessary, restore only through a tested snapshot path.
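The quorum arithmetic behind "do not restart every component blindly" is short enough to write down. This sketch uses the standard majority rule for a Raft-based cluster; the helper names are made up.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1

def failures_tolerated(members):
    """How many members can be lost while keeping quorum."""
    return members - quorum(members)

def safe_to_take_one_down(healthy, members):
    """True if quorum survives after one more healthy member goes away."""
    return healthy - 1 >= quorum(members)

for n in (1, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2

print(safe_to_take_one_down(healthy=3, members=3))  # True
print(safe_to_take_one_down(healthy=2, members=3))  # False: a restart now risks quorum
```

With one member of three already unhealthy, restarting another to "see if it helps" is exactly the action that turns a latency incident into a write outage.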
Drill 13: Admission Webhook Timeout During Rollout
Prompt:
Every new Deployment update hangs with admission timeout errors.
Commands:
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl describe validatingwebhookconfiguration <name>
kubectl get svc,endpointslices -n <webhook-ns>
kubectl logs -n <webhook-ns> deploy/<webhook>
kubectl get events -A --sort-by=.lastTimestamp | tail -80
Check:
- Webhook service endpoints.
- Timeout seconds.
- Failure policy.
- Certificate validity.
- Namespace/object selectors.
- Whether the webhook validates critical resources broadly.
Senior close:
A policy system can become a production dependency. For critical admission, I want health checks, SLOs, conservative timeouts, rollout gates, and an audited emergency bypass.
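The checklist above can be turned into a small audit over the JSON form of a webhook configuration (for example, the output of kubectl get validatingwebhookconfiguration <name> -o json). This is a sketch under assumptions: the 5-second threshold is arbitrary, the webhook names are hypothetical, and only fail-closed webhooks are flagged because those are the ones that can block deploys.

```python
def risky_webhooks(config, max_timeout=5):
    """Map webhook name to reasons it could turn into a deploy outage."""
    findings = {}
    for hook in config.get("webhooks", []):
        # admissionregistration.k8s.io/v1 defaults: failurePolicy=Fail, timeoutSeconds=10
        if hook.get("failurePolicy", "Fail") != "Fail":
            continue
        reasons = []
        if hook.get("timeoutSeconds", 10) > max_timeout:
            reasons.append("fails closed with a long timeout")
        if not hook.get("namespaceSelector") and not hook.get("objectSelector"):
            reasons.append("fails closed with no namespace or object selector")
        if reasons:
            findings[hook["name"]] = reasons
    return findings

# Hypothetical configuration with three webhooks.
config = {"webhooks": [
    {"name": "policy.example.test", "failurePolicy": "Fail", "timeoutSeconds": 30,
     "namespaceSelector": {}, "objectSelector": {}},
    {"name": "labels.example.test", "failurePolicy": "Ignore", "timeoutSeconds": 30},
    {"name": "scoped.example.test", "failurePolicy": "Fail", "timeoutSeconds": 3,
     "namespaceSelector": {"matchLabels": {"admission": "enabled"}}},
]}

for name, reasons in risky_webhooks(config).items():
    print(name, reasons)
```

Only the first webhook is reported: it fails closed, waits up to 30 seconds, and matches everything, which is the combination that makes every Deployment update hang.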
Drill 14: EndpointSlice Or Readiness Lag
Prompt:
Pods are Ready, but the Gateway or Service still sends traffic to old endpoints or returns intermittent 503s.
Commands:
kubectl get pods -n inference -o wide
kubectl get svc,endpointslices -n inference -o wide
kubectl describe endpointslice -n inference <slice>
kubectl get events -n inference --sort-by=.lastTimestamp
kubectl describe httproute -n inference <route>
Hypotheses:
- Selector mismatch.
- Readiness probe or readiness gate does not check what you assume.
- EndpointSlice controller lag.
- Route points at the wrong Service.
- Gateway cached stale backend state.
- Pods flap readiness during model warmup.
Senior close:
I would avoid saying “the pod is Ready so networking is broken.” Readiness must propagate through EndpointSlice, Service, route, gateway, and client connection behavior.
Drill 15: Terraform State Wants To Destroy Production
Prompt:
CI shows a plan to destroy and recreate production infrastructure after a module rename.
Commands:
terraform plan -out=tfplan
terraform show -json tfplan
terraform state list
terraform state show <address>
terraform providers
Check:
- Module path/address changed.
- count index churn.
- for_each key changed.
- Provider alias changed.
- Backend/workspace mismatch.
- Missing moved block.
- Imported resource config mismatch.
Senior close:
I treat unexpected destroy as a change-blocker. The fix is usually address reconciliation through moved blocks or reviewed state movement, not approving a destructive plan because “the module was only refactored.”
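A CI gate for unexpected destroys can be sketched by scanning the JSON plan (the output of terraform show -json tfplan) for any resource change whose actions include delete. The resource addresses in the sample are hypothetical, and the sample is trimmed to the fields the check reads.

```python
import json

def destructive_changes(plan):
    """Return (address, actions) for every change that deletes a resource."""
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            hits.append((rc["address"], actions))
    return hits

# Hypothetical plan after renaming module "network" to "net" without a moved block:
# Terraform sees a delete at the old address and a create at the new one.
plan = json.loads("""
{
  "resource_changes": [
    {"address": "module.network.aws_vpc.main", "change": {"actions": ["delete"]}},
    {"address": "module.net.aws_vpc.main",     "change": {"actions": ["create"]}},
    {"address": "aws_s3_bucket.logs",          "change": {"actions": ["no-op"]}},
    {"address": "aws_instance.api",            "change": {"actions": ["delete", "create"]}}
  ]
}
""")

for address, actions in destructive_changes(plan):
    print(address, actions)
# module.network.aws_vpc.main ['delete']
# aws_instance.api ['delete', 'create']
```

A non-empty result should fail the pipeline until someone confirms each address is meant to be destroyed or reconciles it with a moved block.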