Kubernetes Drill Bank

Use these as live practice. Say your hypothesis before each command.

Drill 1: Pod Pending While GPUs Appear Free

Prompt:

A model server pod is pending. The dashboard says the cluster has unused GPUs.

Commands:

kubectl describe pod -n inference <pod>
kubectl get nodes -L nvidia.com/gpu.product,nvidia.com/mig.strategy
kubectl describe node <node>
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
kubectl get events -A --sort-by=.lastTimestamp | tail -50

What you are checking:

  • Unschedulable reason.
  • GPU request shape.
  • Node selector and affinity.
  • Taints and tolerations.
  • Per-node allocatable GPUs, not aggregate fleet GPUs.
  • Fragmentation or wrong SKU.
  • Quota/LimitRange.

Senior close:

Aggregate free GPU count is a misleading metric. Scheduling needs an eligible node with the exact resource shape, labels, taints, and topology constraints.
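
The fragmentation point can be made concrete with a small sketch. It assumes you have already produced one line per node in the form `<node> <free_gpus>` (allocatable minus the GPUs requested by pods on that node). The function name `gpu_fit_report` and that input format are hypothetical, not part of kubectl or any other tool.

```shell
# Hypothetical helper: compare aggregate free GPUs with per-node fit.
# stdin: one line per node, "<node> <free_gpus>" (assumed input format).
# $1: number of GPUs the pending pod requests.
gpu_fit_report() {
  awk -v want="$1" '
    $2 ~ /^[0-9]+$/ {
      total += $2                       # what a fleet dashboard would show
      if ($2 + 0 >= want + 0) fit++     # nodes that can host the pod alone
    }
    END { printf "aggregate_free=%d eligible_nodes=%d\n", total, fit }
  '
}

# Three nodes with 2, 2, and 3 free GPUs; the pod asks for 4.
printf 'node-a 2\nnode-b 2\nnode-c 3\n' | gpu_fit_report 4
```

The example reports seven free GPUs in aggregate and zero eligible nodes, which is the situation the dashboard hides. It ignores labels, taints, and SKU, so treat it as a first filter only.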

Drill 2: GPU Pod Starts But CUDA Is Unavailable

Commands:

kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
kubectl describe node <node> | grep -A8 -i allocatable
kubectl exec -n inference <pod> -- nvidia-smi
kubectl get runtimeclass

What you are checking:

  • Device plugin health.
  • Driver daemon health.
  • Container toolkit/runtime path.
  • Driver/CUDA compatibility.
  • Whether Kubernetes allocated a GPU to the pod.
  • Whether the workload image expects a newer CUDA capability than the host driver supports.
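
One way to order the checks above is a three-way split: was a GPU allocated at all, is the device visible inside the container, and only then is it a version problem. The sketch below is an illustration under that assumption. `classify_cuda_failure` is a hypothetical helper, and the commented kubectl lines assume a single-container pod.

```shell
# Hypothetical triage helper; the three-way split is an illustration only.
# $1: GPU limit found in the pod spec (empty or 0 if none)
# $2: exit code of nvidia-smi run inside the container
classify_cuda_failure() {
  gpu_limit="${1:-0}"
  smi_rc="${2:-1}"
  if [ "$gpu_limit" -eq 0 ] 2>/dev/null; then
    echo "no-gpu-allocated: the pod spec does not request nvidia.com/gpu"
  elif [ "$smi_rc" -ne 0 ]; then
    echo "runtime-path: check device plugin, driver daemon, toolkit, RuntimeClass"
  else
    echo "device-visible: suspect a CUDA/driver version mismatch in the image"
  fi
}

# Gathering the inputs (assumes a single-container pod):
#   limit=$(kubectl get pod -n inference <pod> \
#     -o jsonpath='{.spec.containers[0].resources.limits.nvidia\.com/gpu}')
#   kubectl exec -n inference <pod> -- nvidia-smi >/dev/null 2>&1; rc=$?
#   classify_cuda_failure "$limit" "$rc"
```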

Drill 3: CrashLoop After Runtime Image Change

Commands:

kubectl rollout history deploy/<deploy> -n inference
kubectl describe pod -n inference <pod>
kubectl logs -n inference <pod> --previous
kubectl get rs -n inference -o wide
kubectl diff -f rendered-manifest.yaml

Hypotheses:

  • Bad image.
  • Missing env/secret/config.
  • CUDA/driver mismatch.
  • Model artifact path changed.
  • Startup probe too aggressive.
  • GPU memory allocation failure.

Mitigation:

  • Pause rollout.
  • Roll back if users are impacted.
  • Keep failed pod evidence.
  • Add startup validation and compatibility gate.
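
The first three mitigation steps can be sketched as a plan printer that emits commands for review instead of running them. `plan_rollout_mitigation` and its arguments are hypothetical names chosen for this example.

```shell
# Prints a mitigation plan instead of executing it, so it can be reviewed first.
# Arguments (hypothetical): deployment, namespace, failing pod, "yes"/"no" for user impact.
plan_rollout_mitigation() {
  deploy="$1"; ns="$2"; pod="$3"; user_impact="$4"
  # Capture evidence before the failed pod is replaced.
  echo "kubectl logs -n $ns $pod --previous > $pod.previous.log"
  echo "kubectl describe pod -n $ns $pod > $pod.describe.txt"
  if [ "$user_impact" = "yes" ]; then
    echo "kubectl rollout undo deploy/$deploy -n $ns"
  else
    echo "kubectl rollout pause deploy/$deploy -n $ns"
  fi
}

plan_rollout_mitigation model-server inference model-server-abc12 no
```

Pause and undo are kept in separate branches because, as far as I know, kubectl refuses to roll back a paused Deployment until it is resumed.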

Drill 4: p99 Latency Spike With No Error Spike

Commands and sources:

kubectl top pods -n inference
kubectl top nodes
kubectl get hpa -n inference
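kubectl get pods -n inference --field-selector status.phase=Pending
# scheduler_pending_pods comes from kube-scheduler's own /metrics endpoint, not the API server's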

Also inspect:

  • Model server queue time.
  • Batch wait.
  • GPU utilization and memory.
  • Request mix.
  • Recent deploys.
  • Network and downstream latency.

Senior answer:

A latency-only incident often means saturation, queuing, batching, warmup, or dependency latency. I would split total request time into queue, compute, network, and downstream before changing capacity or batch settings.
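
A minimal sketch of that split, assuming you can pull the four per-request timings (in milliseconds) from tracing or model server metrics. `latency_breakdown` is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: which component dominates total request time?
# Arguments: queue, compute, network, downstream time in milliseconds.
latency_breakdown() {
  awk -v q="$1" -v c="$2" -v n="$3" -v d="$4" 'BEGIN {
    v["queue"] = q; v["compute"] = c; v["network"] = n; v["downstream"] = d
    total = q + c + n + d
    if (total <= 0) { print "no data"; exit 1 }
    best = "queue"
    for (k in v) if (v[k] + 0 > v[best] + 0) best = k
    printf "dominant=%s share=%.0f%%\n", best, 100 * v[best] / total
  }'
}

latency_breakdown 120 50 10 20
```

If queue time dominates, adding capacity or shortening batch wait is plausible. If downstream dominates, neither change helps.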

Drill 5: Node Drain Is Blocked

Commands:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --dry-run=server
kubectl get pdb -A
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
kubectl describe pod -n <ns> <pod>

Causes:

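  • PDB blocks eviction (zero disruptions allowed).
  • Pod uses emptyDir local storage and --delete-emptydir-data was not passed.
  • DaemonSet-managed pods are present and --ignore-daemonsets was not passed.
  • Finalizer stuck on a terminating pod.
  • Workload has no safe replacement capacity.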

Senior close:

Drain automation must treat a blocked PDB as a safety signal, not an obstacle to force through during normal operations.
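
As a sketch of treating the PDB as a signal, the filter below lists budgets that currently allow zero disruptions. `blocking_pdbs` is a hypothetical helper. The three-column input is an assumption that matches the custom-columns call shown in the comment.

```shell
# stdin: "<namespace> <name> <disruptionsAllowed>" per line, for example from
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed
# Prints namespace/name for every PDB that allows no disruptions right now.
blocking_pdbs() {
  awk '$3 ~ /^[0-9]+$/ && $3 + 0 == 0 { print $1 "/" $2 }'
}

printf 'inference model-pdb 0\nkube-system dns-pdb 1\n' | blocking_pdbs
```

A listed PDB only blocks the drain if it covers a pod on the node being drained, so cross-check against the pods on that node before drawing conclusions.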

Drill 6: Gateway Route Sends Traffic To Wrong Version

Commands:

kubectl get gateway,httproute,grpcroute -A
kubectl describe httproute -n <ns> <route>
kubectl get svc,endpointslices -n <ns>
kubectl logs -n <gateway-ns> deploy/<gateway-controller>

Check:

  • Route attachment.
  • Hostname and path match.
  • Backend refs and weights.
  • Cross-namespace reference grants.
  • Service selectors.
  • Endpoint readiness.
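
For the weights check, it helps to turn backendRefs into expected traffic percentages before looking at real traffic. The sketch assumes you copy `<backend> <weight>` pairs from the route by hand. `route_shares` is a hypothetical helper.

```shell
# stdin: "<backend> <weight>" per line, copied from the route's backendRefs.
# Prints the share of traffic each backend should receive within one rule.
route_shares() {
  awk '
    NF >= 1 {
      n++
      name[n] = $1
      w[n] = (NF >= 2 ? $2 : 1)   # Gateway API defaults an omitted weight to 1
      sum += w[n]
    }
    END {
      if (sum == 0) { print "no backend receives traffic"; exit }
      for (i = 1; i <= n; i++) printf "%s %.1f%%\n", name[i], 100 * w[i] / sum
    }
  '
}

printf 'model-v1 90\nmodel-v2 10\n' | route_shares
```

If observed traffic differs from these shares, suspect a second route or rule matching the same hostname and path before suspecting the gateway itself.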

Drill 7: NetworkPolicy Blocks Service Traffic

Commands:

kubectl get networkpolicy -A
kubectl describe networkpolicy -n <ns> <policy>
kubectl exec -n <ns> <pod> -- curl -v http://<service>:<port>/health
kubectl exec -n <ns> <pod> -- nslookup <service>.<namespace>.svc.cluster.local

If Cilium/Hubble is available:

cilium status
cilium connectivity test
hubble observe --namespace <ns>

Explain:

  • DNS success does not imply TCP success.
  • Service success does not imply pod-to-pod policy success.
  • Policy can block egress, ingress, or DNS.
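
The three points above suggest probing one layer at a time and reporting the first one that fails. A minimal sketch, assuming you have already run a DNS lookup, a TCP connect, and an HTTP health request from the client pod and kept their exit codes. `first_failing_layer` is a hypothetical helper.

```shell
# Arguments are exit codes from three probes run from the client pod
# (0 = success): DNS lookup, TCP connect, HTTP health request.
first_failing_layer() {
  dns_rc="$1"; tcp_rc="$2"; http_rc="$3"
  if [ "$dns_rc" -ne 0 ]; then
    echo "dns: check egress policy to the cluster DNS service (port 53)"
  elif [ "$tcp_rc" -ne 0 ]; then
    echo "tcp: name resolves but the connection is blocked; check ingress and egress policy"
  elif [ "$http_rc" -ne 0 ]; then
    echo "http: connection opens but the request fails; check the application and backends"
  else
    echo "ok: all three layers passed from this pod"
  fi
}

first_failing_layer 0 1 1
```

A pass from one pod says nothing about another pod with different labels, since policy selects on labels and namespaces.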

Drill 8: GitOps Controller Reverts A Live Change

Commands:

argocd app get <app>
argocd app diff <app>
flux get kustomizations -A
flux get helmreleases -A
kubectl get events -n <app-ns> --sort-by=.lastTimestamp

Strong answer:

I would not fight the reconciler. If the live change is correct, it needs to be committed to source of truth or the controller needs an intentional, audited pause.

Drill 9: Admission Webhook Blocking Deploys

Commands:

kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl describe validatingwebhookconfiguration <name>
kubectl get apiservices
kubectl get events -A --sort-by=.lastTimestamp | tail -50

Check:

  • Webhook service endpoints.
  • Failure policy.
  • Timeout.
  • Certificate expiry.
  • Namespace/object selectors.

Senior note:

Admission controls protect production, but a failing webhook can become a cluster-wide deploy outage. Critical webhooks need SLOs, dashboards, timeout discipline, and emergency bypass policy.
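
Timeout discipline can be checked mechanically. The sketch below flags fail-closed webhooks whose timeout exceeds a budget you choose. The three-column input format and the name `risky_webhooks` are assumptions for this example; the values come from each webhook's `failurePolicy` and `timeoutSeconds` fields.

```shell
# stdin: "<webhook> <failurePolicy> <timeoutSeconds>" per line (assumed format).
# $1: largest timeout, in seconds, you accept for a fail-closed webhook.
risky_webhooks() {
  awk -v max="$1" '
    $2 == "Fail" && $3 + 0 > max + 0 {
      print $1 " failurePolicy=Fail timeout=" $3 "s"
    }
  '
}

printf 'policy.example.com Fail 30\nmutate.example.com Ignore 30\nquota.example.com Fail 5\n' \
  | risky_webhooks 10
```

A webhook with `failurePolicy: Fail` and a long timeout turns every slow response into a slow or rejected API write, which is how a policy outage becomes a deploy outage.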

Drill 10: GPU Node Shows Hardware Fault Signals

Signals:

  • Repeated Xid errors.
  • Uncorrected ECC.
  • Multiple workload failures on same GPU ID.
  • DCGM unhealthy.
  • Thermal/power throttling.
  • Device plugin flapping.

Action plan:

  1. Mark node suspect.
  2. Confirm capacity headroom.
  3. Cordon.
  4. Drain respecting PDBs.
  5. Run diagnostics or reboot.
  6. Validate with GPU test workload.
  7. Return or escalate to hardware repair.

Senior close:

The automation should be conservative until it has enough evidence. Recommendation mode is a good first step for risky hardware actions.
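
Recommendation mode can be as simple as printing the action plan once enough independent signals agree, and doing nothing otherwise. In this sketch, `recommend_gpu_action` and the signal threshold are hypothetical. Nothing is executed against the cluster.

```shell
# $1: node name
# $2: number of independent unhealthy signals observed
# $3: signals required before recommending action (hypothetical threshold)
recommend_gpu_action() {
  node="$1"; signals="$2"; threshold="$3"
  if [ "$signals" -lt "$threshold" ]; then
    echo "observe: $node has $signals signal(s), below threshold $threshold"
    return 0
  fi
  echo "recommend: confirm capacity headroom before removing $node"
  echo "recommend: kubectl cordon $node"
  echo "recommend: kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
  echo "recommend: run GPU diagnostics, then kubectl uncordon $node or escalate"
}

recommend_gpu_action gpu-node-7 3 3
```

A human or a later, better-tested automation stage decides whether to run the printed commands.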

Drill 11: API Server Slow But Inference Still Serves

Prompt:

Deploys are timing out, kubectl is slow, but existing inference traffic is mostly healthy.

Commands:

kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'
kubectl get --raw='/metrics' | grep apiserver_request
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl get events -A --sort-by=.lastTimestamp | tail -80

What you are checking:

  • Read vs write path.
  • Admission webhook latency/failures.
  • API priority and fairness queueing.
  • Controller/watch lag.
  • Whether user traffic depends on the failing control-plane path.

Senior close:

I would separate serving impact from operations impact. If existing pods and gateways serve, I freeze nonessential deploys and reduce control-plane write load before touching data-plane capacity.

Drill 12: Etcd Latency And API Write Timeouts

Prompt:

Control-plane alerts show etcd fsync latency and API server write timeouts.

Commands:

kubectl get --raw='/metrics' | grep -E 'etcd|apiserver_storage'
kubectl get leases -A | head
kubectl get events -A --sort-by=.lastTimestamp | tail -80
kubectl get nodes

If you have control-plane host access, inspect etcd member health, leader changes, disk latency, database size, compaction, and defrag history.

Do not:

  • Restart every control-plane component blindly.
  • Run risky fleet-wide deploys during degraded persistence.
  • Assume a successful read means writes are safe.

Senior close:

Etcd incidents are about protecting quorum and write durability. I would lower write pressure, verify data-plane health, inspect disk and member health, and treat restore as a last resort that uses only a tested snapshot path.

Drill 13: Admission Webhook Timeout During Rollout

Prompt:

Every new Deployment update hangs with admission timeout errors.

Commands:

kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration
kubectl describe validatingwebhookconfiguration <name>
kubectl get svc,endpointslices -n <webhook-ns>
kubectl logs -n <webhook-ns> deploy/<webhook>
kubectl get events -A --sort-by=.lastTimestamp | tail -80

Check:

  • Webhook service endpoints.
  • Timeout seconds.
  • Failure policy.
  • Certificate validity.
  • Namespace/object selectors.
  • Whether the webhook validates critical resources broadly.

Senior close:

A policy system can become a production dependency. For critical admission, I want health checks, SLOs, conservative timeouts, rollout gates, and an audited emergency bypass.

Drill 14: Pods Ready But Traffic Hits Stale Endpoints

Prompt:

Pods are Ready, but the Gateway or Service still sends traffic to old endpoints or returns intermittent 503s.

Commands:

kubectl get pods -n inference -o wide
kubectl get svc,endpointslices -n inference -o wide
kubectl describe endpointslice -n inference <slice>
kubectl get events -n inference --sort-by=.lastTimestamp
kubectl describe httproute -n inference <route>

Hypotheses:

  • Selector mismatch.
  • Readiness gate or probe checks something other than what you assume.
  • EndpointSlice controller lag.
  • Route points at the wrong Service.
  • Gateway cached stale backend state.
  • Pods flap readiness during model warmup.

Senior close:

I would avoid saying “the pod is Ready so networking is broken.” Readiness must propagate through EndpointSlice, Service, route, gateway, and client connection behavior.
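
The first hypothesis, selector mismatch, is cheap to rule out. The sketch below checks that every `key=value` pair in a Service selector appears in a pod's labels. `selector_matches` is a hypothetical helper, and both arguments are comma-separated strings you copy from the Service and pod.

```shell
# $1: Service selector, e.g. "app=model,version=v2"
# $2: pod labels,       e.g. "app=model,version=v1,tier=gpu"
# Prints "match" or the first selector pair the pod lacks.
selector_matches() {
  selector="$1"; labels=",$2,"
  old_ifs="$IFS"; IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;
      *) IFS="$old_ifs"; echo "missing: $pair"; return 1 ;;
    esac
  done
  IFS="$old_ifs"
  echo "match"
}

selector_matches 'app=model,version=v2' 'app=model,version=v1,tier=gpu' || true
```

A match only clears the first link in the chain. EndpointSlice contents, route backend, and gateway state still need their own checks.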

Drill 15: Terraform State Wants To Destroy Production

Prompt:

CI shows a plan to destroy and recreate production infrastructure after a module rename.

Commands:

terraform plan -out=tfplan
terraform show -json tfplan
terraform state list
terraform state show <address>
terraform providers

Check:

  • Module path/address changed.
  • count index churn.
  • for_each key changed.
  • Provider alias changed.
  • Backend/workspace mismatch.
  • Missing moved block.
  • Imported resource config mismatch.

Senior close:

I treat unexpected destroy as a change-blocker. The fix is usually address reconciliation through moved blocks or reviewed state movement, not approving a destructive plan because “the module was only refactored.”
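
A change-blocker can be enforced in CI with a small gate over the human-readable plan. This sketch assumes the plan text marks affected resources with the phrases "will be destroyed" and "must be replaced". `block_on_destroy` is a hypothetical helper, and the phrase matching is an assumption about Terraform's plan output rather than a stable interface.

```shell
# stdin: human-readable plan text, e.g. from: terraform show -no-color tfplan
# Returns non-zero and lists offending lines if the plan destroys or replaces anything.
block_on_destroy() {
  hits=$(grep -E 'will be destroyed|must be replaced' || true)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    echo "BLOCK: destructive actions in plan"
    return 1
  fi
  echo "OK: no destroy or replace actions found"
}

printf '  # module.new.aws_instance.web will be created\n' | block_on_destroy
```

For a sturdier gate, parse the `terraform show -json tfplan` output instead of matching text. In both cases the gate only stops the pipeline. The fix is still a moved block or a reviewed state move.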