Capacity Incident Runbook

Scenario

Prompt:

Traffic surges, p99 latency climbs, Triton queue time rises, autoscaling starts, but new GPU capacity is not helping fast enough.

Staff-level framing:

Capacity incidents are not solved by “scale up” as a reflex. For GPU inference, the capacity loop includes admission, traffic routing, queueing, batching, node provisioning, GPU stack readiness, model artifact load, warmup, and SLO verification.

Incident Flow

flowchart TD
  Page[Capacity page] --> Impact{SLO burn active}
  Impact -- no --> Watch[Watch, ticket, or tune alert]
  Impact -- yes --> Stabilize[Stabilize users]
  Stabilize --> Classify{Bottleneck}
  Classify -- Queue --> Admission[Reduce admission or shed low priority]
  Classify -- GPU --> Capacity[Use warm capacity or add nodes]
  Classify -- Artifact --> Cache[Use cached artifact or freeze scale-out]
  Classify -- Network --> Route[Shift route or reduce retries]
  Classify -- Bad rollout --> Rollback[Rollback model/runtime/config]
  Admission --> Verify[Verify p99 and errors]
  Capacity --> Verify
  Cache --> Verify
  Route --> Verify
  Rollback --> Verify

First Five Minutes

Goal: reduce user impact while preserving enough evidence to avoid a blind capacity scramble.

Declare incident owner and freeze risky deploys.
Confirm scope: service/model/version/tenant/region/GPU pool.
Confirm user impact: p99, errors, timeouts, queue time, token latency, SLO burn.
Identify bottleneck class: queue, GPU compute, GPU memory, CPU preprocess, artifact load, network, scheduler, autoscaler, downstream.
Stabilize with the fastest reversible action.
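
The "SLO burn" check can be reduced to one number: the observed bad-request fraction divided by the error budget. A minimal sketch, assuming a request-based SLO; the function name and the sample figures are illustrative and not tied to any particular monitoring stack.

```python
def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget lasts exactly the SLO window;
    higher values mean it runs out proportionally sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 0.0
    error_budget = 1.0 - slo_target            # allowed bad fraction
    bad_fraction = bad_requests / total_requests
    return bad_fraction / error_budget


# Hypothetical five-minute window: 1,200 of 40,000 requests failed or
# missed the latency threshold, measured against a 99% SLO.
rate = burn_rate(bad_requests=1_200, total_requests=40_000, slo_target=0.99)
print(f"burn rate: {rate:.1f}x")
```

A burn rate well above 1 on a short window is the "yes" branch of the flow: stabilize first, diagnose second.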

Fast stabilizers:

Shift traffic to known-good warm capacity.
Reduce low-priority traffic or throttle batch-workload tenants.
Tighten admission so stale requests do not consume GPU after client deadlines.
Roll back a recent batching/model/runtime/config change if correlated.
Disable or isolate a noisy co-tenant.
Reduce retry amplification at gateway/client.
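
Tightened admission can be sketched as a deadline check at the front of the queue: reject work that cannot finish before the client gives up. This assumes the gateway propagates an absolute deadline and that queue and compute estimates are available; `Request`, `admit`, and the priority convention are hypothetical names for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    request_id: str
    deadline_s: float      # absolute client deadline on the monotonic clock
    priority: int = 0      # assumed convention: higher means more important


def admit(req: Request, est_queue_s: float, est_compute_s: float,
          min_priority: int = 0, now: Optional[float] = None) -> bool:
    """Admit only requests that are important enough and can still finish in time."""
    now = time.monotonic() if now is None else now
    if req.priority < min_priority:
        return False                         # shed low-priority work first
    remaining_s = req.deadline_s - now
    # A request that would expire while queued or executing only wastes GPU time.
    return remaining_s >= est_queue_s + est_compute_s


# 0.5 s left, 0.4 s of expected queue + compute: admit.
print(admit(Request("a", deadline_s=100.5), 0.3, 0.1, now=100.0))
# 0.25 s left for the same expected work: reject early.
print(admit(Request("b", deadline_s=100.25), 0.3, 0.1, now=100.0))
```

Raising `min_priority` during the incident is the reversible lever; the deadline check can stay on permanently.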

Do not:

Blindly add cold GPU pods if model warmup is the bottleneck.
Increase batch size during active p99 burn without evidence.
Drain nodes during capacity shortage unless a node is actively harming traffic.
Let autoscaler actions hide the original saturation signal.

Capacity Layer Checklist

Layer	Question	Evidence
Traffic	Did request rate, prompt tokens, payload size, or tenant mix change?	Gateway metrics, route weights, tenant labels, request shape histograms.
Queue	Are requests waiting before compute?	Triton queue duration, batch wait, admission backlog.
GPU compute	Is execution saturated?	GPU SM utilization, compute duration, model instance metrics.
GPU memory	Is GPU memory or KV cache the limit?	GPU memory, OOM, allocation failures, KV cache utilization.
CPU/preprocess	Is GPU utilization low while latency is high?	CPU saturation, Python backend spans, tokenizer/preprocess duration.
Scheduler	Are pods pending, or is GPU capacity fragmented across nodes?	Unschedulable events, node labels, taints, allocatable GPUs.
Autoscaler	Are nodes provisioning but not usable?	Karpenter/CA events, node readiness, GPU Operator validators.
Artifact path	Are pods waiting on model/image download?	Image pull time, artifact download time, checksum/init logs.
Network/gateway	Are retries or connection limits amplifying load?	Retry rate, timeout rate, connection pools, 503/504 split.
Recent change	Did a rollout change capacity per replica?	GitOps diff, deployment markers, model config, batching config.

Command Drill

Kubernetes scope:

kubectl get pods -n inference -o wide
kubectl get events -n inference --sort-by=.lastTimestamp
kubectl get hpa -n inference
kubectl describe pod -n inference <pending-or-slow-pod>
kubectl get nodes -L nvidia.com/gpu.product,nvidia.com/mig.strategy

Rollout and route scope:

kubectl rollout status deploy/<model-server> -n inference
kubectl rollout history deploy/<model-server> -n inference
kubectl get gateway,httproute,grpcroute -A
kubectl get svc,endpointslices -n inference

GPU/operator scope:

kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=100
kubectl logs -n gpu-operator -l app=nvidia-dcgm-exporter --tail=100
kubectl describe node <gpu-node>

The exact commands vary by environment, but the answer structure should not.
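
One check worth scripting is "Ready but no allocatable GPUs". The sketch below works on the parsed output of `kubectl get nodes -o json`. It assumes the node list has already been restricted to the GPU pool (for example with a label selector), since CPU-only nodes would otherwise be reported too; the function name and sample data are made up.

```python
from typing import List


def ready_nodes_without_gpus(node_list: dict, resource: str = "nvidia.com/gpu") -> List[str]:
    """Names of nodes that report Ready but advertise zero allocatable GPUs."""
    stuck = []
    for node in node_list.get("items", []):
        status = node.get("status", {})
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        # Allocatable quantities are strings; a missing key means no GPUs advertised.
        gpus = int(status.get("allocatable", {}).get(resource, "0"))
        if ready and gpus == 0:
            stuck.append(node["metadata"]["name"])
    return stuck


# Trimmed, made-up sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}],
                    "allocatable": {"cpu": "32", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "gpu-node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}],
                    "allocatable": {"cpu": "32"}}},
    ]
}
print(ready_nodes_without_gpus(sample))  # ['gpu-node-b']
```

Against a live cluster, the same function can take `json.load(sys.stdin)` with the kubectl output piped in.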

Capacity Math

Use rough math under pressure:

required_replicas =
  peak_goodput_needed_per_second / safe_goodput_per_replica_per_second

Then adjust for:

Warmup time.
Canary/rollout max surge.
Zone or pool failure reserve.
Tenant priority.
GPU SKU differences.
Batch/online split.
Model artifact cache hit rate.
Retry amplification.
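
A worked version of the formula with two of the adjustments applied. This is a rough sketch: treating retry amplification as a multiplier on demand and the failure reserve as fractional headroom are simplifying assumptions, and all numbers are hypothetical.

```python
import math


def required_replicas(peak_goodput_rps: float,
                      safe_goodput_per_replica_rps: float,
                      retry_amplification: float = 1.0,
                      failure_reserve_fraction: float = 0.0) -> int:
    """Replicas needed to serve peak demand at SLO-safe throughput."""
    if safe_goodput_per_replica_rps <= 0:
        raise ValueError("safe_goodput_per_replica_rps must be positive")
    offered_load = peak_goodput_rps * retry_amplification   # retries inflate demand
    base = offered_load / safe_goodput_per_replica_rps
    return math.ceil(base * (1.0 + failure_reserve_fraction))


def shortfall(required: int, warm_replicas: int) -> int:
    """Replicas still missing; only warm replicas count during the incident."""
    return max(0, required - warm_replicas)


# Hypothetical: 900 req/s needed, 40 req/s per replica inside the SLO.
print(required_replicas(900, 40))                  # 23
# Same demand with 1.2x retry amplification and a 25% failure reserve.
needed = required_replicas(900, 40, 1.2, 0.25)
print(needed)                                      # 34
print(shortfall(needed, warm_replicas=26))         # 8
```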

Interview phrase:

I distinguish theoretical capacity, ready capacity, warm capacity, and SLO-safe capacity. During an incident only SLO-safe warm capacity helps immediately.
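
One possible reading of the four tiers, expressed as successive filters over the replica set. The field names, the use of the desired replica count as "theoretical", and the per-replica p99 test for "SLO-safe" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Replica:
    name: str
    ready: bool       # scheduled on a GPU node and passing readiness
    warmed: bool      # model loaded and warmup requests completed
    p99_ms: float     # recent p99 latency observed on this replica


def capacity_tiers(replicas: List[Replica], desired: int, slo_p99_ms: float) -> Dict[str, int]:
    """Count replicas at each tier; each tier is a subset of the previous one."""
    ready = [r for r in replicas if r.ready]
    warm = [r for r in ready if r.warmed]
    slo_safe = [r for r in warm if r.p99_ms <= slo_p99_ms]
    return {
        "theoretical": desired,       # what the autoscaler has asked for
        "ready": len(ready),
        "warm": len(warm),
        "slo_safe": len(slo_safe),
    }


fleet = [
    Replica("r1", ready=True, warmed=True, p99_ms=180.0),
    Replica("r2", ready=True, warmed=True, p99_ms=420.0),   # warm but over SLO
    Replica("r3", ready=True, warmed=False, p99_ms=0.0),    # still loading the model
    Replica("r4", ready=False, warmed=False, p99_ms=0.0),   # pod pending
]
print(capacity_tiers(fleet, desired=6, slo_p99_ms=250.0))
```

The gap between "theoretical" and "slo_safe" is the number an incident dashboard should make obvious.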

Decision Points

If you see	Prefer
Queue high, GPU low	Increase model instances, inspect CPU/preprocess, tune batching carefully.
Queue high, GPU high	Shed load or tighten admission, shift to warm capacity, scale only if warmup can land in time.
GPU memory critical	Reduce batch size or instance count, isolate the model, enforce token limits.
Pending GPU pods	Fix scheduling constraints, add nodes of the correct GPU SKU, reduce fragmentation.
Nodes Ready but GPUs not allocatable	Validate GPU Operator, device plugin, and driver before counting the nodes as capacity.
New pods Ready slowly	Check artifact/image pull, model load, startup probes, warmup.
p99 high with retry storm	Cut retries and enforce deadlines before adding backend capacity; retries multiply backend load.
Capacity issue follows rollout	Roll back config/model/runtime before scaling out the bad configuration.
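
The table can be drilled as a small rule function. This is a sketch, not an automation proposal: the signal names are hypothetical booleans an on-call engineer would fill in by hand, and the evaluation order (undo recent changes, remove amplified load, then address supply) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    queue_high: bool = False
    gpu_util_high: bool = False
    gpu_mem_critical: bool = False
    pending_gpu_pods: bool = False
    gpus_not_allocatable: bool = False
    slow_pod_startup: bool = False
    retry_storm: bool = False
    follows_rollout: bool = False


def recommend(s: Signals) -> List[str]:
    """Return every matching table row, most reversible actions first."""
    actions: List[str] = []
    if s.follows_rollout:
        actions.append("roll back config/model/runtime")
    if s.retry_storm:
        actions.append("reduce retries and enforce deadlines")
    if s.gpu_mem_critical:
        actions.append("reduce batch size or instance count, enforce token limits")
    if s.queue_high and s.gpu_util_high:
        actions.append("shed load, shift to warm capacity, scale only if warmup lands in time")
    elif s.queue_high:
        actions.append("increase model instances, inspect CPU/preprocess, tune batching")
    if s.gpus_not_allocatable:
        actions.append("validate GPU Operator, device plugin, and driver")
    if s.pending_gpu_pods:
        actions.append("fix scheduling constraints, add correct SKU, reduce fragmentation")
    if s.slow_pod_startup:
        actions.append("check image/artifact pull, model load, startup probes, warmup")
    return actions or ["no row matched; keep classifying"]


for action in recommend(Signals(queue_high=True, gpu_util_high=True, retry_storm=True)):
    print("-", action)
```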

Example Answer

Prompt:

“Traffic doubled in five minutes, autoscaler added nodes, but p99 kept rising. What do you do?”

Answer:

Confirm impact and freeze deploys.
Check whether the added latency comes from queueing, compute, network, or a downstream dependency.
Verify whether new nodes are actually usable: Ready, GPU allocatable, operator validators healthy, model artifact cached, model warmed.
Shift to existing warm capacity or shed low-priority traffic.
Reduce retry amplification and enforce request deadlines.
If queue and GPU are saturated, add capacity only where warmup can help inside the incident window.
After recovery, add predictive scaling, warm pools, artifact prefetch, queue-based scaling, and load-test gates.
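
The "can warmup help inside the incident window" test is simple addition over the cold-start stages. A sketch with hypothetical stage names and durations; real values would come from measured provisioning and startup timings.

```python
from typing import Dict


def time_to_useful_capacity(stage_seconds: Dict[str, float]) -> float:
    """Total seconds from scale-out request to a warm, serving replica."""
    return sum(stage_seconds.values())


def lands_in_time(stage_seconds: Dict[str, float], incident_window_s: float) -> bool:
    """True if cold capacity becomes useful before the window closes."""
    return time_to_useful_capacity(stage_seconds) <= incident_window_s


# Hypothetical cold path: nothing cached on the new node.
cold = {
    "node_provision": 180,
    "gpu_stack_validation": 120,
    "image_pull": 90,
    "artifact_download": 240,
    "model_load": 60,
    "warmup": 45,
}
# Same path with the image and model artifact already cached on the node.
cached = dict(cold, image_pull=10, artifact_download=0)

print(time_to_useful_capacity(cold), lands_in_time(cold, 600))      # 735 False
print(time_to_useful_capacity(cached), lands_in_time(cached, 600))  # 415 True
```

With these numbers, cold scale-out misses a ten-minute window while the cached path fits, which is the argument for artifact prefetch and warm pools.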

Prevention Checklist

Warm pools by GPU SKU and model class.
Model artifact prefetch with checksum validation.
Queue-based and token-aware scaling, not CPU-only scaling.
Admission control by tenant/workload class.
Retry budgets and deadline propagation.
Capacity dashboards showing theoretical, Ready, warm, and SLO-safe capacity.
Load tests with realistic request shapes and prompt/token distributions.
Canary gates that include capacity per replica and warmup time.
Runbooks for autoscaler, GPU Operator, artifact store, and routing.
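
A retry budget from the list above can be sketched as a counter that lets retries through only while they stay under a fixed share of first-attempt requests. The class is hypothetical and cumulative for brevity; a production version would normally use a sliding or decaying window.

```python
class RetryBudget:
    """Permit retries up to `percent` of first-attempt requests seen so far."""

    def __init__(self, percent: int = 10):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a first attempt; each one earns a fraction of a retry."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Spend budget for one retry if available; otherwise fail fast."""
        # Integer comparison avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 <= self.percent * self.requests:
            self.retries += 1
            return True
        return False


budget = RetryBudget(percent=10)
for _ in range(100):
    budget.record_request()
allowed = sum(budget.try_retry() for _ in range(15))
print(allowed)  # 10 of the 15 attempted retries are permitted
```

Capping retries this way bounds amplification at roughly 1.1x under the 10% setting, regardless of how badly the backend is failing.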

Runbook Close

Say this:

I would stabilize with warm capacity, admission control, traffic shift, or rollback before asking cold capacity to save the incident. Then I would close the loop by making capacity visible as a contract: demand, ready supply, warm supply, and SLO-safe goodput.