Capacity Incident Runbook
Scenario
Prompt:
Traffic surges, p99 latency climbs, Triton queue time rises, autoscaling starts, but new GPU capacity is not helping fast enough.
Staff-level framing:
Capacity incidents are not solved by “scale up” as a reflex. For GPU inference, the capacity loop includes admission, traffic routing, queueing, batching, node provisioning, GPU stack readiness, model artifact load, warmup, and SLO verification.
Incident Flow
flowchart TD
Page[Capacity page] --> Impact{SLO burn active}
Impact -- no --> Watch[Watch, ticket, or tune alert]
Impact -- yes --> Stabilize[Stabilize users]
Stabilize --> Classify{Bottleneck}
Classify -- Queue --> Admission[Reduce admission or shed low priority]
Classify -- GPU --> Capacity[Use warm capacity or add nodes]
Classify -- Artifact --> Cache[Use cached artifact or freeze scale-out]
Classify -- Network --> Route[Shift route or reduce retries]
Classify -- Bad rollout --> Rollback[Rollback model/runtime/config]
Admission --> Verify[Verify p99 and errors]
Capacity --> Verify
Cache --> Verify
Route --> Verify
Rollback --> Verify
First Five Minutes
Goal: reduce user impact while preserving enough evidence to avoid a blind capacity scramble.
- Declare incident owner and freeze risky deploys.
- Confirm scope: service/model/version/tenant/region/GPU pool.
- Confirm user impact: p99, errors, timeouts, queue time, token latency, SLO burn.
- Identify bottleneck class: queue, GPU compute, GPU memory, CPU preprocess, artifact load, network, scheduler, autoscaler, downstream.
- Stabilize with the fastest reversible action.
Fast stabilizers:
- Shift traffic to known-good warm capacity.
- Reduce traffic from low-priority or batch tenants.
- Tighten admission so stale requests do not consume GPU after client deadlines.
- Roll back a recent batching/model/runtime/config change if correlated.
- Disable or isolate a noisy co-tenant.
- Reduce retry amplification at gateway/client.
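
The admission stabilizer above can be sketched as a deadline-aware check. This is a minimal illustration, not a Triton or gateway API: the `Request` type, the `admit` function, the priority convention (0 is highest), and the queue/compute estimates are all hypothetical names assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    deadline_s: float  # absolute client deadline, in seconds on the caller's clock
    priority: int      # hypothetical convention: 0 is the highest priority


def admit(req: Request, now_s: float, est_queue_s: float,
          est_compute_s: float, max_priority: int) -> bool:
    """Return True if the request should be queued for GPU execution."""
    # Shed low-priority work first while the incident is active.
    if req.priority > max_priority:
        return False
    # Reject work that is expected to finish after the client has given up,
    # so stale requests never reach the GPU.
    expected_finish_s = now_s + est_queue_s + est_compute_s
    return expected_finish_s <= req.deadline_s
```

Raising or lowering `max_priority` is the reversible knob here: it can be tightened during SLO burn and relaxed once p99 recovers, without touching the model server.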
Do not:
- Blindly add cold GPU pods if model warmup is the bottleneck.
- Increase batch size during active p99 burn without evidence.
- Drain nodes during capacity shortage unless a node is actively harming traffic.
- Let autoscaler actions hide the original saturation signal.
Capacity Layer Checklist
| Layer | Question | Evidence |
|---|---|---|
| Traffic | Did request rate, prompt tokens, payload size, or tenant mix change? | Gateway metrics, route weights, tenant labels, request shape histograms. |
| Queue | Are requests waiting before compute? | Triton queue duration, batch wait, admission backlog. |
| GPU compute | Is execution saturated? | GPU SM utilization, compute duration, model instance metrics. |
| GPU memory | Is memory/KV/cache the limit? | GPU memory, OOM, allocation failures, KV cache utilization. |
| CPU/preprocess | Is GPU low but latency high? | CPU saturation, Python backend spans, tokenizer/preprocess duration. |
| Scheduler | Are pods pending or fragmented? | Unschedulable events, node labels, taints, allocatable GPUs. |
| Autoscaler | Are nodes provisioning but not usable? | Karpenter/CA events, node readiness, GPU Operator validators. |
| Artifact path | Are pods waiting on model/image download? | Image pull time, artifact download time, checksum/init logs. |
| Network/gateway | Are retries or connection limits amplifying load? | Retry rate, timeout rate, connection pools, 503/504 split. |
| Recent change | Did a rollout change capacity per replica? | GitOps diff, deployment markers, model config, batching config. |
Command Drill
Kubernetes scope:
kubectl get pods -n inference -o wide
kubectl get events -n inference --sort-by=.lastTimestamp
kubectl get hpa -n inference
kubectl describe pod -n inference <pending-or-slow-pod>
kubectl get nodes -L nvidia.com/gpu.product,nvidia.com/mig.strategy
Rollout and route scope:
kubectl rollout status deploy/<model-server> -n inference
kubectl rollout history deploy/<model-server> -n inference
kubectl get gateway,httproute,grpcroute -A
kubectl get svc,endpointslices -n inference
GPU/operator scope:
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter
kubectl describe node <gpu-node>
The exact commands vary by environment, but the answer structure should not.
Capacity Math
Use rough math under pressure:
required_replicas =
peak_goodput_needed_per_second / safe_goodput_per_replica_per_second
Then adjust for:
- Warmup time.
- Canary/rollout max surge.
- Zone or pool failure reserve.
- Tenant priority.
- GPU SKU differences.
- Batch/online split.
- Model artifact cache hit rate.
- Retry amplification.
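
A rough worked version of the formula, with two of the adjustments applied, might look like the sketch below. The function name, the way retry amplification is modeled as a simple multiplier on demand, and the way the failure reserve is modeled as a fraction of replicas that may be lost are assumptions made for illustration; the other adjustments in the list are not modeled.

```python
import math


def required_replicas(peak_goodput_rps: float,
                      safe_goodput_per_replica_rps: float,
                      retry_amplification: float = 1.0,
                      failure_reserve_fraction: float = 0.0) -> int:
    """Replicas needed to serve peak demand at SLO-safe per-replica goodput."""
    if safe_goodput_per_replica_rps <= 0:
        raise ValueError("safe_goodput_per_replica_rps must be positive")
    if not 0.0 <= failure_reserve_fraction < 1.0:
        raise ValueError("failure_reserve_fraction must be in [0, 1)")
    # Assumption: retries inflate offered load by a constant factor.
    effective_demand = peak_goodput_rps * retry_amplification
    base = effective_demand / safe_goodput_per_replica_rps
    # Size the fleet so that losing the reserve fraction still leaves `base`.
    return math.ceil(base / (1.0 - failure_reserve_fraction))


# 1200 req/s at 50 req/s per replica: 24 replicas with no adjustments.
print(required_replicas(1200, 50))
# Same demand with 1.25x retry amplification and a 25% failure reserve: 40.
print(required_replicas(1200, 50, retry_amplification=1.25,
                        failure_reserve_fraction=0.25))
```

The gap between 24 and 40 is the point of the adjustment list: the unadjusted number is a floor, not a plan.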
Interview phrase:
I distinguish theoretical capacity, ready capacity, warm capacity, and SLO-safe capacity. During an incident only SLO-safe warm capacity helps immediately.
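
One way to make the four capacity tiers concrete is to compute them from per-replica state. The sketch below assumes a hypothetical `Replica` record with a benchmarked peak rate and a lower SLO-safe rate; the field names and the rule that "warm" implies "Ready" are illustrative assumptions, not something a particular scheduler or model server reports.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Replica:
    name: str
    ready: bool             # passes its readiness probe
    warm: bool              # model loaded and warmup requests completed
    theoretical_rps: float  # benchmarked peak throughput
    slo_safe_rps: float     # throughput that still meets the p99 target


def capacity_tiers(replicas: List[Replica]) -> Dict[str, float]:
    """Sum capacity at each tier; each tier is a subset of the previous one."""
    usable = [r for r in replicas if r.ready and r.warm]
    return {
        "theoretical": sum(r.theoretical_rps for r in replicas),
        "ready": sum(r.theoretical_rps for r in replicas if r.ready),
        "warm": sum(r.theoretical_rps for r in usable),
        "slo_safe": sum(r.slo_safe_rps for r in usable),
    }


fleet = [
    Replica("a", ready=True, warm=True, theoretical_rps=80, slo_safe_rps=50),
    Replica("b", ready=True, warm=False, theoretical_rps=80, slo_safe_rps=50),
    Replica("c", ready=False, warm=False, theoretical_rps=80, slo_safe_rps=50),
]
print(capacity_tiers(fleet))
```

In this example the fleet looks like 240 req/s on paper, but only 50 req/s is available to absorb the surge right now.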
Decision Points
| If you see | Prefer |
|---|---|
| Queue high, GPU low | Increase model instances, inspect CPU/preprocess, tune batching carefully. |
| Queue high, GPU high | Shed load or tighten admission, shift to warm capacity, scale if warmup can land in time. |
| GPU memory critical | Reduce batch/instance count, isolate model, enforce token limits. |
| Pending GPU pods | Fix scheduling constraints, add correct SKU, reduce fragmentation. |
| Nodes Ready but GPUs not allocatable | Validate GPU Operator, device plugin, and driver before assuming capacity. |
| New pods Ready slowly | Check artifact/image pull, model load, startup probes, warmup. |
| p99 high with retry storm | Reduce retries/deadlines before adding more backend pressure. |
| Capacity issue follows rollout | Roll back config/model/runtime before expanding bad shape. |
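
Part of the table above can be encoded as a small triage function, which is useful for tabletop drills. The `Signals` fields and the order in which the checks run are assumptions for this sketch: the table itself does not rank the rows, and the boolean inputs stand in for whatever thresholds a team defines on its own dashboards.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    queue_high: bool = False
    gpu_util_high: bool = False
    gpu_memory_critical: bool = False
    pending_gpu_pods: bool = False
    retry_storm: bool = False
    follows_rollout: bool = False


def preferred_action(s: Signals) -> str:
    """Map observed signals to a first action (check order is an assumption)."""
    if s.follows_rollout:
        return "roll back config/model/runtime"
    if s.retry_storm:
        return "reduce retries and enforce deadlines"
    if s.gpu_memory_critical:
        return "reduce batch/instance count and enforce token limits"
    if s.pending_gpu_pods:
        return "fix scheduling constraints or add the correct SKU"
    if s.queue_high and s.gpu_util_high:
        return "shed load, shift to warm capacity, scale if warmup lands in time"
    if s.queue_high:
        return "inspect CPU/preprocess, add model instances, tune batching carefully"
    return "keep observing"


print(preferred_action(Signals(queue_high=True, gpu_util_high=True)))
```

Rollback and retry reduction are checked first here because they remove load or risk without needing new capacity; a team may reasonably choose a different order.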
Example Answer
Prompt:
“Traffic doubled in five minutes, autoscaler added nodes, but p99 kept rising. What do you do?”
Answer:
- Confirm impact and freeze deploys.
- Check whether latency comes from queueing, compute, network, or a downstream dependency.
- Verify whether new nodes are actually usable: Ready, GPU allocatable, operator validators healthy, model artifact cached, model warmed.
- Shift to existing warm capacity or shed low-priority traffic.
- Reduce retry amplification and enforce request deadlines.
- If queue and GPU are saturated, add capacity only where warmup can help inside the incident window.
- After recovery, add predictive scaling, warm pools, artifact prefetch, queue-based scaling, and load-test gates.
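
The "only where warmup can help inside the incident window" step reduces to a simple comparison, sketched below. The stage names, the example durations, and the 0.8 safety margin are illustrative assumptions; real values would come from measured provisioning and load times for the specific pool and model.

```python
def time_to_useful_capacity_s(node_provision_s: float,
                              gpu_stack_ready_s: float,
                              artifact_load_s: float,
                              warmup_s: float) -> float:
    """Total delay before a cold replica can serve SLO-safe traffic."""
    return node_provision_s + gpu_stack_ready_s + artifact_load_s + warmup_s


def cold_scale_out_helps(time_to_useful_s: float,
                         remaining_window_s: float,
                         safety_margin: float = 0.8) -> bool:
    """True if new capacity is expected to land inside the incident window."""
    # The margin leaves headroom for estimates that turn out optimistic.
    return time_to_useful_s <= remaining_window_s * safety_margin


# Uncached artifact: 180 + 90 + 240 + 60 = 570 s against a 600 s window.
cold = time_to_useful_capacity_s(180, 90, 240, 60)
# Cached artifact: 180 + 90 + 30 + 60 = 360 s against the same window.
cached = time_to_useful_capacity_s(180, 90, 30, 60)
print(cold_scale_out_helps(cold, 600), cold_scale_out_helps(cached, 600))
```

With these assumed numbers, the uncached path misses the window and the cached path fits, which is the argument for artifact prefetch in the prevention list.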
Prevention Checklist
- Warm pools by GPU SKU and model class.
- Model artifact prefetch with checksum validation.
- Queue-based and token-aware scaling, not CPU-only scaling.
- Admission control by tenant/workload class.
- Retry budgets and deadline propagation.
- Capacity dashboards showing theoretical, Ready, warm, and SLO-safe capacity.
- Load tests with realistic request shapes and prompt/token distributions.
- Canary gates that include capacity per replica and warmup time.
- Runbooks for autoscaler, GPU Operator, artifact store, and routing.
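
As an illustration of the retry-budget item, the sketch below caps retries at a fixed ratio of observed requests. It is a deliberately simplified counter with no time window or decay, and the `RetryBudget` class and its methods are hypothetical names rather than the API of any gateway or client library; production implementations typically track the ratio over a sliding window.

```python
class RetryBudget:
    """Allow retries only up to `ratio` times the number of original requests."""

    def __init__(self, ratio: float) -> None:
        if ratio < 0:
            raise ValueError("ratio must be non-negative")
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        # Call once per original (non-retry) request.
        self.requests += 1

    def try_retry(self) -> bool:
        """Consume one retry from the budget if available."""
        allowed = self.ratio * self.requests
        if self.retries + 1 <= allowed:
            self.retries += 1
            return True
        return False


budget = RetryBudget(ratio=0.25)
for _ in range(8):
    budget.record_request()
# 8 requests at a 0.25 ratio permit 2 retries; the third is refused.
print([budget.try_retry() for _ in range(3)])
```

Bounding retries this way keeps a backend slowdown from multiplying offered load, which is what turns a capacity shortfall into a retry storm.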
Runbook Close
Say this:
I would stabilize with warm capacity, admission control, traffic shift, or rollback before asking cold capacity to save the incident. Then I would close the loop by making capacity visible as a contract: demand, ready supply, warm supply, and SLO-safe goodput.