Kubernetes Drill Bank

Use these as live practice. Say your hypothesis before each command.

Drill 1: Pod Pending With GPUs Available

Prompt:

A model server pod is pending. The dashboard says the cluster has unused GPUs.

Commands:

kubectl describe pod -n inference <pod>
kubectl get nodes -L nvidia.com/gpu.product,nvidia.com/mig.strategy
kubectl describe node <node>
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
kubectl get events -A --sort-by=.lastTimestamp | tail -50

What you are checking:

Unschedulable reason.
GPU request shape.
Node selector and affinity.
Taints and tolerations.
Per-node allocatable GPUs, not aggregate fleet GPUs.
Fragmentation or wrong SKU.
Quota/LimitRange.

Senior close:

Aggregate free GPU count is a misleading metric. Scheduling needs an eligible node with the exact resource shape, labels, taints, and topology constraints.
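
To make the per-node point concrete, here is a minimal Python sketch. The Node record, the field names, and the sample fleet are hypothetical, and taints are reduced to plain strings; real scheduling also weighs affinity, topology, and quota.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified node record; not the Kubernetes Node schema.
@dataclass
class Node:
    name: str
    free_gpus: int
    product: str
    taints: set = field(default_factory=set)

def eligible_nodes(nodes, gpus_needed, product, tolerations=frozenset()):
    """Return names of nodes that could host the pod on their own."""
    return [
        n.name
        for n in nodes
        if n.free_gpus >= gpus_needed       # per-node fit, not fleet total
        and n.product == product            # SKU / node selector match
        and n.taints <= set(tolerations)    # every taint must be tolerated
    ]

nodes = [
    Node("a", free_gpus=2, product="A100"),
    Node("b", free_gpus=2, product="A100"),
    Node("c", free_gpus=4, product="T4"),
    Node("d", free_gpus=4, product="A100", taints={"maintenance"}),
]
print(sum(n.free_gpus for n in nodes))     # 12 free GPUs fleet-wide
print(eligible_nodes(nodes, 4, "A100"))    # [] because no single node fits
```

Twelve GPUs are free in aggregate, yet a 4-GPU A100 pod has nowhere to land: two nodes are fragmented, one is the wrong SKU, and one is tainted.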

Drill 2: GPU Pod Starts But CUDA Is Unavailable

Commands:

kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
kubectl describe node <node> | grep -A8 -i allocatable
kubectl exec -n inference <pod> -- nvidia-smi
kubectl get runtimeclass

What you are checking:

Device plugin health.
Driver daemon health.
Container toolkit/runtime path.
Driver/CUDA compatibility.
Whether Kubernetes allocated a GPU to the pod.
Whether the workload image requires a newer CUDA version than the host driver supports.
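
One way to confirm the allocation question is to read the pod spec itself. The sketch below assumes the parsed output of kubectl get pod -o json and the nvidia.com/gpu resource name; the function name and sample pod are illustrative.

```python
def gpu_limits(pod, resource="nvidia.com/gpu"):
    """Map container name to its GPU limit, for containers that declare one."""
    found = {}
    for c in pod.get("spec", {}).get("containers", []):
        limit = c.get("resources", {}).get("limits", {}).get(resource)
        if limit is not None:
            found[c["name"]] = limit
    return found

# Hypothetical pod: the sidecar declares no GPU, the server declares one.
pod = {
    "spec": {
        "containers": [
            {"name": "server", "resources": {"limits": {"nvidia.com/gpu": "1"}}},
            {"name": "sidecar", "resources": {}},
        ]
    }
}
print(gpu_limits(pod))
```

An empty result means no container declared a GPU limit, so the device plugin never allocated a GPU to the pod, regardless of driver health.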

Drill 3: CrashLoop After Runtime Image Change

Commands:

kubectl rollout history deploy/<deploy> -n inference
kubectl describe pod -n inference <pod>
kubectl logs -n inference <pod> --previous
kubectl get rs -n inference -o wide
kubectl diff -f rendered-manifest.yaml

Hypotheses:

Bad image.
Missing env/secret/config.
CUDA/driver mismatch.
Model artifact path changed.
Startup probe too aggressive.
GPU memory allocation failure.

Mitigation:

Pause rollout.
Roll back if there is user impact.
Keep failed pod evidence.
Add startup validation and compatibility gate.
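
For the "startup probe too aggressive" hypothesis, a rough budget calculation helps. This sketch assumes the documented probe defaults (periodSeconds 10, failureThreshold 3, initialDelaySeconds 0); the 90-second model load time is a hypothetical measurement.

```python
def startup_budget_seconds(probe):
    """Approximate time a container gets before the startup probe gives up."""
    initial = probe.get("initialDelaySeconds", 0)
    period = probe.get("periodSeconds", 10)        # Kubernetes default
    failures = probe.get("failureThreshold", 3)    # Kubernetes default
    return initial + period * failures

probe = {"periodSeconds": 5, "failureThreshold": 6}
model_load_seconds = 90  # hypothetical measured cold-start time
budget = startup_budget_seconds(probe)
print(budget, "too aggressive" if budget < model_load_seconds else "ok")
```

If the budget is smaller than the observed load time for the new image, the kubelet restarts the container before it can ever become ready, which looks like a crash loop.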

Drill 4: p99 Latency Spike With No Error Spike

Commands and sources:

kubectl top pods -n inference
kubectl top nodes
kubectl get hpa -n inference
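kubectl get pods -n inference --field-selector status.phase=Pending

Also inspect:

Scheduler queue depth (scheduler_pending_pods), scraped from the kube-scheduler metrics endpoint. The API server's /metrics does not expose scheduler metrics.
Model server queue time.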
Batch wait.
GPU utilization and memory.
Request mix.
Recent deploys.
Network and downstream latency.

Senior answer:

A latency-only incident often means saturation, queuing, batching, warmup, or dependency latency. I would split total request time into queue, compute, network, and downstream before changing capacity or batch settings.
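
The split described above can be sketched in a few lines of Python. The component names and millisecond values are hypothetical and stand for one slow request's trace, since per-component p99 values do not add up to the end-to-end p99.

```python
def dominant_component(breakdown_ms):
    """Return the largest latency component and its share of the total."""
    total = sum(breakdown_ms.values())
    if total == 0:
        return None, 0.0
    name, value = max(breakdown_ms.items(), key=lambda kv: kv[1])
    return name, value / total

# Hypothetical trace of a single slow request, in milliseconds.
slow_request = {"queue": 420.0, "compute": 180.0, "network": 15.0, "downstream": 35.0}
name, share = dominant_component(slow_request)
print(name, round(share, 2))
```

Here queue time dominates, which points at saturation or batching rather than a slower model, so adding compute per replica would not be the first move.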

Drill 5: Node Drain Hangs

Commands:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --dry-run=server
kubectl get pdb -A
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
kubectl describe pod -n <ns> <pod>

Causes:

PDB blocks eviction.
Pod uses emptyDir local storage and --delete-emptydir-data is not set.
DaemonSet-managed pods are present and --ignore-daemonsets is not set.
Finalizer stuck.
Workload has no safe replacement capacity.

Senior close:

Drain automation must treat a blocked PDB as a safety signal, not an obstacle to force through during normal operations.
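
A drain tool can surface that safety signal before it starts evicting. The sketch below assumes the parsed output of kubectl get pdb -A -o json and reads status.disruptionsAllowed; the sample data is hypothetical.

```python
def blocking_pdbs(pdb_list):
    """Return (namespace, name) for PDBs that currently allow zero disruptions."""
    return [
        (p["metadata"]["namespace"], p["metadata"]["name"])
        for p in pdb_list.get("items", [])
        if p.get("status", {}).get("disruptionsAllowed", 0) == 0
    ]

# Hypothetical PDB list with one budget fully consumed.
pdbs = {
    "items": [
        {"metadata": {"namespace": "inference", "name": "model-server"},
         "status": {"disruptionsAllowed": 0}},
        {"metadata": {"namespace": "batch", "name": "workers"},
         "status": {"disruptionsAllowed": 2}},
    ]
}
print(blocking_pdbs(pdbs))
```

A PDB in this list only blocks the drain if it selects pods on the node being drained, so the result is a list of candidates to check, not a verdict.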

Drill 6: Gateway Route Sends Traffic To Wrong Version

Commands:

kubectl get gateway,httproute,grpcroute -A
kubectl describe httproute -n <ns> <route>
kubectl get svc,endpointslices -n <ns>
kubectl logs -n <gateway-ns> deploy/<gateway-controller>

Check:

Route attachment.
Hostname and path match.
Backend refs and weights.
Cross-namespace reference grants.
Service selectors.
Endpoint readiness.
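
When checking backend refs and weights, it helps to turn weights into expected traffic shares. This sketch assumes Gateway API semantics, where a backend's share is its weight divided by the sum of weights and an omitted weight defaults to 1; the backend names are hypothetical.

```python
def traffic_shares(backend_refs):
    """Convert backendRefs weights into expected fractions of traffic."""
    weights = {b["name"]: b.get("weight", 1) for b in backend_refs}
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: w / total for name, w in weights.items()}

# Hypothetical canary split taken from an HTTPRoute rule.
refs = [{"name": "model-v1", "weight": 90}, {"name": "model-v2", "weight": 10}]
print(traffic_shares(refs))
```

If observed traffic differs sharply from these shares, look next at which rule actually matched, since another rule with a more specific hostname or path may be winning.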

Drill 7: CNI Or Network Policy Drop

Commands:

kubectl get networkpolicy -A
kubectl describe networkpolicy -n <ns> <policy>
kubectl exec -n <ns> <pod> -- curl -v http://<service>:<port>/health
kubectl exec -n <ns> <pod> -- nslookup <service>.<ns>.svc.cluster.local

If Cilium/Hubble is available:

cilium status
cilium connectivity test
hubble observe --namespace <ns>

Explain:

DNS success does not imply TCP success.
Service success does not imply pod-to-pod policy success.
Policy can block egress, ingress, or DNS.
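
The first point can be demonstrated by probing name resolution and TCP connect as separate steps. This is a minimal sketch using only the Python standard library; the function name, the timeout, and the commented-out target are illustrative.

```python
import socket

def probe(host, port, timeout=2.0):
    """Check name resolution and TCP connect as two separate steps."""
    result = {"dns": False, "tcp": False}
    try:
        addrs = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns"] = True
    except socket.gaierror:
        return result  # the name did not resolve; TCP was never attempted
    family, socktype, proto, _, sockaddr = addrs[0]
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
            result["tcp"] = True
        except OSError:
            pass  # refused, timed out, or silently dropped by policy
    return result

# Hypothetical target:
# probe("my-service.my-namespace.svc.cluster.local", 8080)
```

A result of dns True and tcp False is the signature to look for: the name resolves, yet the connection is refused or dropped somewhere on the path.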

Drill 8: GitOps Drift

Commands:

argocd app get <app>
argocd app diff <app>
flux get kustomizations -A
flux get helmreleases -A
kubectl get events -n <app-ns> --sort-by=.lastTimestamp

Strong answer:

I would not fight the reconciler. If the live change is correct, it needs to be committed to source of truth or the controller needs an intentional, audited pause.

Drill 9: Admission Webhook Blocking Deploys

Commands:

kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl describe validatingwebhookconfiguration <name>
kubectl get apiservices
kubectl get events -A --sort-by=.lastTimestamp | tail -50

Check:

Webhook service endpoints.
Failure policy.
Timeout.
Certificate expiry.
Namespace/object selectors.

Senior note:

Admission controls protect production, but a failing webhook can become a cluster-wide deploy outage. Critical webhooks need SLOs, dashboards, timeout discipline, and emergency bypass policy.
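
A quick audit of failure policy, timeout, and selectors can be sketched as below. It assumes one parsed ValidatingWebhookConfiguration or MutatingWebhookConfiguration object from kubectl get -o json and the admissionregistration.k8s.io/v1 defaults (failurePolicy Fail, timeoutSeconds 10). The 5-second threshold and the webhook names are hypothetical.

```python
def risky_webhooks(config, max_timeout=5):
    """Flag webhooks that fail closed and are either unscoped or slow to time out."""
    flagged = []
    for w in config.get("webhooks", []):
        fail_closed = w.get("failurePolicy", "Fail") == "Fail"   # v1 default
        timeout = w.get("timeoutSeconds", 10)                    # v1 default
        # An empty or missing selector matches everything.
        unscoped = not w.get("namespaceSelector") and not w.get("objectSelector")
        if fail_closed and (unscoped or timeout > max_timeout):
            flagged.append((w["name"], timeout, unscoped))
    return flagged

# Hypothetical configuration with three webhooks.
config = {
    "webhooks": [
        {"name": "a.example.com", "failurePolicy": "Fail",
         "timeoutSeconds": 30, "namespaceSelector": {}},
        {"name": "b.example.com", "failurePolicy": "Ignore"},
        {"name": "c.example.com", "failurePolicy": "Fail", "timeoutSeconds": 3,
         "namespaceSelector": {"matchLabels": {"policy": "enforced"}}},
    ]
}
print(risky_webhooks(config))
```

Flagged entries are the ones most likely to turn a webhook outage into a cluster-wide deploy outage, so they are the first candidates for SLOs and a bypass policy.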

Drill 10: GPU Node Quarantine Decision

Signals:

Repeated Xid errors.
Uncorrected ECC.
Multiple workload failures on same GPU ID.
DCGM unhealthy.
Thermal/power throttling.
Device plugin flapping.

Action plan:

Mark node suspect.
Confirm capacity headroom.
Cordon.
Drain respecting PDBs.
Run diagnostics or reboot.
Validate with GPU test workload.
Return or escalate to hardware repair.

Senior close:

The automation should be conservative until it has enough evidence. Recommendation mode is a good first step for risky hardware actions.
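
Recommendation mode could look like the sketch below: the code only returns a suggested action and never cordons anything itself. The signal names, the split into hard and soft signals, and the two-signal threshold are all hypothetical policy choices.

```python
# Hypothetical policy: these signals alone are enough to recommend a cordon.
HARD_SIGNALS = {"uncorrected_ecc", "dcgm_unhealthy"}

def recommend(signals, min_signals=2):
    """Return a suggested action for a node; a human or later stage acts on it."""
    active = {name for name, seen in signals.items() if seen}
    if active & HARD_SIGNALS or len(active) >= min_signals:
        return "recommend-cordon"
    if active:
        return "mark-suspect"
    return "healthy"

print(recommend({"xid_errors": True, "thermal_throttle": False}))
print(recommend({"xid_errors": True, "device_plugin_flapping": True}))
```

A single soft signal only marks the node suspect; a hard signal or several soft ones produce a cordon recommendation, which still has to pass the capacity headroom check in the action plan.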

Drill 11: API Server Slow But Inference Still Serves

Prompt:

Deploys are timing out, kubectl is slow, but existing inference traffic is mostly healthy.

Commands:

kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'
kubectl get --raw='/metrics' | grep apiserver_request
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl get events -A --sort-by=.lastTimestamp | tail -80

What you are checking:

Read vs write path.
Admission webhook latency/failures.
API priority and fairness queueing.
Controller/watch lag.
Whether user traffic depends on the failing control-plane path.

Senior close:

I would separate serving impact from operations impact. If existing pods and gateways are still serving, I freeze nonessential deploys and reduce control-plane write load before touching data-plane capacity.
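
The verbose readyz output marks each check with [+] or [-], so failed checks can be pulled out mechanically. The sketch below parses text in that shape; the sample output is illustrative, not captured from a real cluster.

```python
def failed_checks(verbose_output):
    """Return the names of checks marked [-] in /readyz?verbose style output."""
    failed = []
    for line in verbose_output.splitlines():
        if line.startswith("[-]"):
            failed.append(line[3:].split(" ", 1)[0])
    return failed

# Illustrative sample in the verbose health check format.
sample = """[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/start-apiextensions-controllers ok
readyz check failed"""
print(failed_checks(sample))
```

A failing etcd check shifts the drill toward storage and persistence, while a failing post-start hook or informer sync points back at the API server itself.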

Drill 12: Etcd Latency Or Quorum Risk

Prompt:

Control-plane alerts show etcd fsync latency and API server write timeouts.

Commands:

kubectl get --raw='/metrics' | grep -E 'etcd|apiserver_storage'
kubectl get leases -A | head
kubectl get events -A --sort-by=.lastTimestamp | tail -80
kubectl get nodes

If you have control-plane host access, inspect etcd member health, leader changes, disk latency, database size, compaction, and defrag history.

Do not:

Restart every control-plane component blindly.
Run risky fleet-wide deploys during degraded persistence.
Assume a successful read means writes are safe.

Senior close:

Etcd incidents are about protecting quorum and write durability. I would lower write pressure, verify data-plane health, and inspect disk and member health. Restore is a last resort, and only from a tested snapshot path.
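
The quorum arithmetic behind that answer is worth having at hand. This sketch uses the majority rule for a cluster of n voting members; the function names are illustrative.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1

def failures_tolerated(members):
    """How many members can be lost while a majority remains."""
    return members - quorum(members)

for n in (1, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why restarting a second member while one is already unhealthy in a 3-member cluster risks losing writes entirely.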

Drill 13: Admission Webhook Timeout During Rollout

Prompt:

Every new Deployment update hangs with admission timeout errors.

Commands:

kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl describe validatingwebhookconfiguration <name>
kubectl get svc,endpointslices -n <webhook-ns>
kubectl logs -n <webhook-ns> deploy/<webhook>
kubectl get events -A --sort-by=.lastTimestamp | tail -80

Check:

Webhook service endpoints.
Timeout seconds.
Failure policy.
Certificate validity.
Namespace/object selectors.
Whether the webhook validates critical resources broadly.

Senior close:

A policy system can become a production dependency. For critical admission, I want health checks, SLOs, conservative timeouts, rollout gates, and an audited emergency bypass.

Drill 14: EndpointSlice Or Readiness Lag

Prompt:

Pods are Ready, but the Gateway or Service still sends traffic to old endpoints or returns intermittent 503s.

Commands:

kubectl get pods -n inference -o wide
kubectl get svc,endpointslices -n inference -o wide
kubectl describe endpointslice -n inference <slice>
kubectl get events -n inference --sort-by=.lastTimestamp
kubectl describe httproute -n inference <route>

Hypotheses:

Selector mismatch.
Readiness probe or readiness gate does not check what you assume it checks.
EndpointSlice controller lag.
Route points at the wrong Service.
Gateway cached stale backend state.
Pods flap readiness during model warmup.

Senior close:

I would avoid saying “the pod is Ready so networking is broken.” Readiness must propagate through EndpointSlice, Service, route, gateway, and client connection behavior.
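
The first hop of that propagation chain can be checked by comparing Ready pod IPs against ready EndpointSlice addresses. The sketch assumes parsed output of kubectl get pods -o json and kubectl get endpointslices -o json, both already filtered to one Service; the function names and sample IPs are hypothetical.

```python
def ready_pod_ips(pod_list):
    """IPs of pods whose Ready condition is True."""
    ips = set()
    for p in pod_list.get("items", []):
        status = p.get("status", {})
        conds = status.get("conditions", [])
        is_ready = any(c["type"] == "Ready" and c["status"] == "True" for c in conds)
        if is_ready and status.get("podIP"):
            ips.add(status["podIP"])
    return ips

def ready_endpoint_ips(slice_list):
    """Addresses of endpoints not marked unready (an unset value counts as ready)."""
    ips = set()
    for s in slice_list.get("items", []):
        for ep in s.get("endpoints") or []:
            if ep.get("conditions", {}).get("ready") is not False:
                ips.update(ep.get("addresses", []))
    return ips

def propagation_gap(pod_list, slice_list):
    """Show which Ready pods are missing from slices and which slice entries are stale."""
    pods, eps = ready_pod_ips(pod_list), ready_endpoint_ips(slice_list)
    return {"missing_from_slices": pods - eps, "stale_in_slices": eps - pods}
```

Two empty sets mean the EndpointSlice hop is consistent, so the investigation moves on to the route, the gateway's cached backend state, and client connection reuse.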

Drill 15: Terraform State Wants To Destroy Production

Prompt:

CI shows a plan to destroy and recreate production infrastructure after a module rename.

Commands:

terraform plan -out=tfplan
terraform show -json tfplan
terraform state list
terraform state show <address>
terraform providers

Check:

Module path/address changed.
count index churn.
for_each key changed.
Provider alias changed.
Backend/workspace mismatch.
Missing moved block.
Imported resource config mismatch.

Senior close:

I treat unexpected destroy as a change-blocker. The fix is usually address reconciliation through moved blocks or reviewed state movement, not approving a destructive plan because “the module was only refactored.”
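
A CI gate for that change-blocker can be sketched by reading the output of terraform show -json tfplan. The code assumes the plan JSON has a resource_changes list whose entries carry an address and change.actions; the sample plan and resource addresses are hypothetical.

```python
import json

def destructive_changes(plan_json):
    """Return (address, actions) for every planned change that includes a delete."""
    plan = json.loads(plan_json)
    return [
        (rc["address"], rc["change"]["actions"])
        for rc in plan.get("resource_changes", [])
        if "delete" in rc["change"]["actions"]
    ]

# Hypothetical plan after a module rename without a moved block.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "module.old.aws_instance.api", "change": {"actions": ["delete"]}},
        {"address": "module.new.aws_instance.api", "change": {"actions": ["create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    ]
})
for address, actions in destructive_changes(sample_plan):
    print(address, actions)
```

A delete at the old address paired with a create at the new one is the typical footprint of a rename that is missing address reconciliation, and the gate should fail the pipeline until a reviewer has accounted for every entry.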