GPU Profiling, OOM, and Utilization Runbook

Interview Frame

The senior answer is:

I treat GPU incidents as capacity, correctness, and hardware-signal problems at the same time. I first stabilize traffic, preserve evidence, and identify whether the failure is host (container) memory, GPU memory, a driver or GPU reset, or node health. Then I tune for SLO-safe goodput rather than raw utilization.

Do not say only “check nvidia-smi.” That is a starting point, not an operating model.

flowchart TD
  Alert[GPU OOM or crash alert] --> Stabilize[Stabilize traffic]
  Stabilize --> Evidence[Capture evidence before restart loops erase it]
  Evidence --> Classify{Failure class}
  Classify -- CUDA OOM --> Envelope[Reduce request, batch, instance, or KV cache envelope]
  Classify -- Container OOM --> Cgroup[Inspect CPU RAM, tmpfs, dataloader, tokenizer, logs]
  Classify -- Xid or reset --> Node[Quarantine node or GPU and inspect hardware signals]
  Classify -- Model load OOM --> Artifact[Validate model, engine, precision, and repository change]
  Envelope --> Validate[Replay representative traffic]
  Cgroup --> Validate
  Node --> Validate
  Artifact --> Validate
  Validate --> Prevent[Add guardrails, dashboards, and canary gates]

What GPU Usage Means

“GPU usage” is not one metric. Track enough signals to know whether the accelerator is doing useful work, waiting on the host, starved by IO, memory-bound, throttled, or failing.

Signal	Why it matters	Common source
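SM utilization	Whether the SMs are executing work. `nvidia-smi` GPU-Util only reports the share of time any kernel was running, so it can read high while most SMs sit idle.	DCGM, `nvidia-smi`, Nsight.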
Tensor core utilization	Whether matrix units are used efficiently for FP16/BF16/INT8 workloads.	Nsight, framework profiler, hardware counters.
GPU memory used/free	Risk for model load, batching, KV cache, fragmentation, and OOM.	DCGM, `nvidia-smi`, Triton Model Analyzer.
HBM bandwidth	Whether workload is memory-bandwidth-bound.	DCGM fields, Nsight Compute.
PCIe/NVLink throughput	Whether host-device or GPU-GPU transfer is the bottleneck.	DCGM, Nsight Systems.
Power, clocks, thermals	Detect throttling, power caps, cooling, and boost behavior.	DCGM, `nvidia-smi dmon`.
ECC and Xid errors	Separate software overload from hardware or driver instability.	DCGM, kernel logs, GPU Operator health.
Per-process memory	Find the model, backend, or tenant consuming memory.	`nvidia-smi pmon`, Kubernetes pod mapping.
Queue time	Distinguish server saturation from batching or admission problems.	Triton metrics, gateway metrics.
Request duration and goodput	Tie GPU work to customer-facing SLO.	Triton, service metrics, traces.

Senior framing:

High GPU utilization with rising queue time and p99 usually means genuine saturation: the GPU is doing useful work but demand exceeds capacity. High utilization with errors means overload or bad admission. Low utilization with high latency means the GPU may be waiting on CPU, network, serialization, tokenizer, storage, or batch formation.
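
The framing above can be sketched as a small triage function. This is a minimal illustration, not a production rule set: the function name `interpret` and the thresholds (`busy`, `idle`, `max_error_rate`) are assumptions chosen for the example and should be tuned per workload.

```python
def interpret(gpu_util, queue_rising, p99_over_slo, error_rate,
              busy=0.8, idle=0.4, max_error_rate=0.01):
    """Map a coarse signal snapshot to a first hypothesis.

    gpu_util and error_rate are fractions in [0, 1]. All thresholds are
    illustrative assumptions, not recommended values.
    """
    if gpu_util >= busy and error_rate > max_error_rate:
        return "overload or bad admission"
    if gpu_util >= busy and queue_rising and p99_over_slo:
        return "saturation: demand exceeds GPU capacity"
    if gpu_util <= idle and p99_over_slo:
        return "GPU waiting on host: CPU, network, serialization, tokenizer, storage, batching"
    return "inconclusive: collect more signals"
```

The order of the checks matters: errors are tested first so that a busy GPU with failing requests is never reported as healthy saturation.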

Tool Stack

Use layered tools. Each answers a different question.

Layer	Tools	Questions answered
Fleet health	DCGM exporter, Prometheus, Grafana, GPU Operator validators.	Are GPUs healthy, visible, allocated, and free of Xid/ECC/throttle symptoms?
Node triage	`nvidia-smi`, `nvidia-smi dmon`, `nvidia-smi pmon`, `dmesg`, kubelet logs.	Which GPU/process failed, and is this software, runtime, or hardware?
Inference server	Triton `/metrics`, logs, traces, perf_analyzer, Model Analyzer.	Is time spent in queue, input, compute, output, load/unload, or memory pressure?
Kernel timeline	Nsight Systems.	Are there host-device gaps, CPU stalls, copies, synchronization points, or idle GPU windows?
Kernel detail	Nsight Compute.	Is a kernel occupancy-bound, memory-bound, divergent, uncoalesced, or missing tensor cores?
Framework	PyTorch profiler, CUDA memory summary, allocator stats, TensorRT profiler.	Is the model leaking, fragmenting, using unexpected precision, or issuing inefficient kernels?

OOM Failure Classes

Classify the OOM before fixing it.

Failure	Symptoms	First checks	Fast mitigations
CUDA allocation OOM	App logs show CUDA OOM; pod may stay alive or restart.	Request shape, batch size, instance count, KV cache, allocator stats.	Lower batch, max tokens, concurrency, model instances, or route large requests separately.
Model load OOM	Triton/model server fails during load or reload.	Artifact diff, precision, engine plan, config, GPU SKU/MIG size.	Roll back artifact/config, use lower precision or a smaller engine, deploy to larger slice.
KV cache OOM	LLM works at short prompts but fails at long context or many sequences.	Prompt length, generated tokens, inflight batch, cache sizing.	Cap context/tokens, reduce concurrent sequences, separate long-context pool.
Fragmentation	Free memory appears available but allocation fails after churn.	Allocator behavior, model reloads, mixed shapes, long uptime.	Restart after evidence, reduce churn, tune allocator, isolate shape classes.
Container memory OOM	Kubernetes reports `OOMKilled`; GPU may be fine.	`kubectl describe pod`, previous logs, CPU RAM, tokenizer, pinned memory.	Increase memory limit, fix host memory growth, separate preprocessing.
Xid/reset	GPU disappears, driver reset, pods fail together on one GPU/node.	DCGM health, `dmesg`, Xid code, ECC, thermals, driver version.	Cordon/drain/quarantine node, shift traffic, escalate hardware/driver path.
MIG slice too small	Workload fails only on smaller MIG profile.	Node labels, allocated resource name, model memory envelope.	Schedule to correct profile or split workload class.
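
For the KV cache row, a back-of-the-envelope estimate helps explain why short prompts pass and long contexts fail. The sketch below uses the standard per-token cost (keys plus values, per layer, per KV head). The model dimensions in the usage note are hypothetical, and real servers add overhead for paging blocks, activations, and weights that this ignores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, tokens, sequences,
                   bytes_per_element=2):
    """Approximate KV cache size. The leading 2 covers keys and values.

    bytes_per_element=2 assumes FP16/BF16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_element * tokens * sequences)


def max_sequences(budget_bytes, num_layers, num_kv_heads, head_dim, tokens,
                  bytes_per_element=2):
    """How many sequences of a given length fit in a KV cache budget."""
    per_sequence = kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                                  tokens, 1, bytes_per_element)
    return budget_bytes // per_sequence
```

With a hypothetical 32-layer model, 32 KV heads, and head dimension 128 in FP16, one token costs 524,288 bytes (0.5 MiB), so a 4,096-token sequence needs 2 GiB and a 20 GiB cache budget holds 10 such sequences. The same budget holds far more short prompts, which is why the failure only appears at long context or high concurrency.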

flowchart LR
  OOM[OOM observed] --> K8s{Kubernetes OOMKilled?}
  K8s -- yes --> HostMem[Host memory path: tokenizer, dataloader, logs, pinned memory]
  K8s -- no --> Xid{Xid/reset/ECC?}
  Xid -- yes --> Hardware[Node/GPU health path]
  Xid -- no --> Load{Fails during model load?}
  Load -- yes --> ModelLoad[Artifact/config/SKU path]
  Load -- no --> Runtime{Fails under traffic?}
  Runtime -- yes --> Envelope[Batch, shape, concurrency, KV cache path]
  Runtime -- no --> Leak[Leak, fragmentation, reload churn path]
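
The decision tree above translates directly into code, which is useful for tagging incidents consistently in tooling. This is a minimal sketch: the function name and the returned labels are illustrative, and the boolean inputs are assumed to come from evidence already collected.

```python
def classify_oom(oom_killed, xid_or_ecc, fails_on_load, fails_under_traffic):
    """Follow the OOM decision tree in order and return the triage path."""
    if oom_killed:
        return "host-memory"            # tokenizer, dataloader, logs, pinned memory
    if xid_or_ecc:
        return "node-health"            # quarantine and inspect hardware signals
    if fails_on_load:
        return "model-load"             # artifact, config, SKU or MIG size
    if fails_under_traffic:
        return "memory-envelope"        # batch, shape, concurrency, KV cache
    return "leak-or-fragmentation"      # leak, fragmentation, reload churn
```

The ordering encodes priority: a Kubernetes `OOMKilled` or an Xid is checked before any application tuning path, so a hardware problem is not misread as a batch-size problem.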

Immediate OOM Runbook

1. Stop the blast radius: pause rollout, disable autoscaler expansion of the bad version, and route traffic to a known-good pool if available.
2. Preserve evidence: collect previous container logs, Triton logs, kube events, DCGM metrics, Xid/ECC signals, request shape histograms, config diffs, and model artifact version.
3. Decide whether the node is trustworthy: Xid, ECC, reset, or thermal anomalies push the response toward quarantine instead of pure application tuning.
4. Reduce the memory envelope: lower dynamic batch size, preferred batch sizes, model instance count, max sequence length, max output tokens, inflight sequences, or admission limits.
5. Separate traffic classes: route long-context, large-image, or high-resolution requests to a pool sized for them.
6. Validate with replay or synthetic load that matches the failing request shape.
7. Add a prevention gate so the same model/config cannot promote without memory headroom and worst-case shape coverage.
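
The prevention gate in the last step could look something like the following. It is a sketch under stated assumptions: the function name `promotion_gate`, its inputs, and the 15% headroom default are all hypothetical, and the peak memory figure is assumed to come from a replay of worst-case request shapes.

```python
def promotion_gate(peak_used_mib, total_mib, worst_case_shapes_replayed,
                   xid_or_ecc_events, min_headroom=0.15):
    """Return (ok, reasons) for a canary promotion decision.

    min_headroom=0.15 is an illustrative default, not a recommendation.
    """
    reasons = []
    headroom = 1.0 - peak_used_mib / total_mib
    if headroom < min_headroom:
        reasons.append(f"headroom {headroom:.0%} below {min_headroom:.0%}")
    if not worst_case_shapes_replayed:
        reasons.append("worst-case request shapes not replayed")
    if xid_or_ecc_events > 0:
        reasons.append(f"{xid_or_ecc_events} Xid/ECC events during canary")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict keeps the gate explainable: a blocked promotion says which condition failed instead of just turning red.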

What to avoid:

Blindly restarting until the pod looks green.
Scaling replicas when every replica loads the same oversized model and fails.
Increasing CPU memory limits for a CUDA OOM.
Chasing utilization before confirming errors, retries, and p99.
Treating MIG, time-slicing, and full-GPU allocation as equivalent capacity.

Keeping The GPU Busy

The goal is not “100% GPU.” The goal is the highest SLO-safe goodput per dollar.

flowchart TD
  Workload[Workload and SLO] --> Baseline[Baseline queue, latency, GPU, memory]
  Baseline --> Bottleneck{Primary bottleneck}
  Bottleneck -- GPU idle, queue low --> Host[Fix client, CPU, tokenizer, IO, serialization]
  Bottleneck -- GPU idle, queue high --> Batch[Improve batching, queue delay, concurrency]
  Bottleneck -- GPU busy, p99 ok --> ScaleGood[Track headroom and cost efficiency]
  Bottleneck -- GPU busy, p99 bad --> Split[Admission, scale out, isolate workload classes]
  Bottleneck -- memory high --> Memory[Reduce envelope or choose different GPU/MIG profile]
  Host --> Measure[Measure one change]
  Batch --> Measure
  ScaleGood --> Measure
  Split --> Measure
  Memory --> Measure
  Measure --> Gate{Goodput improved?}
  Gate -- yes --> Keep[Keep and document]
  Gate -- no --> Revert[Revert and try next hypothesis]

Levers:

Dynamic batching: improves throughput when requests can wait briefly. Watch p99 and fairness.
Preferred batch sizes: align with TensorRT engine profiles or known efficient kernel sizes.
max_queue_delay_microseconds: gives the scheduler time to form batches; too high burns tail latency.
Model instance count: improves concurrency until memory, context switching, or contention dominates.
Inflight batching for LLMs: keeps decode steps packed; control KV cache and fairness.
TensorRT engines: improve kernel efficiency when shapes and precision are understood.
FP16/BF16/INT8: reduce memory and improve throughput only after accuracy and compatibility validation.
gRPC and shared memory: reduce transport and copy overhead for large tensors.
CPU preprocessing scale-out: prevents tokenizer/image decode/serialization from starving the GPU.
Warmup and model residency: avoid cold-load stalls and first-request compilation paths.
Separate deployments: isolate latency-sensitive, batch, long-context, and large-shape workloads.
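
To make the batching and queue delay trade-off concrete, here is a toy batch former. It is a deliberately simplified model, not Triton's scheduler: it assumes the GPU is always free, ignores execution time, and dispatches a batch when it is full or when the next arrival comes after the oldest queued request has already waited longer than the delay budget.

```python
def form_batches(arrivals_us, max_batch_size, max_queue_delay_us):
    """Group sorted arrival timestamps (microseconds) into batches."""
    batches, current = [], []
    for t in arrivals_us:
        # Flush if the oldest queued request exceeded its delay budget
        # before this request arrived.
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

With arrivals at `[0, 100, 200, 1500, 1600, 5000]` and a batch cap of 4, a 500 microsecond delay forms three batches, while a delay of 0 sends every request alone. Larger delays raise average batch size, but each request in a partially filled batch can wait up to the full delay, which is the tail-latency cost the levers above warn about.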

Senior caveat:

I would rather run at 65% GPU with predictable p99 and room for retries than at 95% with queue collapse. High utilization is healthy only when queue time, errors, and latency remain controlled.
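
Goodput per dollar can be computed from request logs and a GPU price. The sketch below is illustrative: the function names are hypothetical, "good" is assumed to mean a successful response within the latency SLO, and the hourly price is whatever the platform actually pays.

```python
def goodput_rps(latencies_ms, succeeded, slo_ms, window_s):
    """Requests per second that both succeeded and met the latency SLO."""
    good = sum(1 for latency, ok in zip(latencies_ms, succeeded)
               if ok and latency <= slo_ms)
    return good / window_s


def cost_per_good_request(gpu_hourly_usd, gpu_count, goodput):
    """Dollar cost of one good request at the measured goodput."""
    if goodput <= 0:
        return float("inf")
    return gpu_hourly_usd * gpu_count / (goodput * 3600.0)
```

A request that returns late or fails still consumed GPU time, so it raises cost per good request without raising goodput. That is why a 95% utilized fleet with queue collapse can be more expensive per useful answer than a 65% utilized one.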

Profiling Workflow

Use this sequence in an interview or real incident review:

1. Define the user-facing objective: p99, throughput, token rate, error budget, cost per successful request.
2. Capture a baseline: request rate, p50/p95/p99, queue time, compute time, GPU SM, GPU memory, HBM, PCIe/NVLink, power, clocks, CPU, and network.
3. Reproduce with realistic shapes: prompt length, image size, batch distribution, concurrency, deadlines, and tenant mix.
4. Check server decomposition: queue, compute input, inference compute, compute output, load/unload.
5. If GPU idle windows exist, use Nsight Systems to find CPU, copy, synchronization, or network stalls.
6. If kernels dominate and the GPU is busy, use Nsight Compute or backend profiler to inspect occupancy, memory bandwidth, tensor core use, and kernel fusion.
7. Change one lever: batch, queue delay, instance count, precision, engine profile, concurrency, CPU workers, transport, or admission.
8. Gate the change with SLO, goodput, memory headroom, error rate, and hardware health.
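
The server decomposition step can be approximated from two scrapes of Triton's Prometheus `/metrics` endpoint. The sketch below assumes the cumulative counters `nv_inference_request_success` and `nv_inference_{queue,compute_input,compute_infer,compute_output}_duration_us` are exposed with a `model` label. Dividing the duration deltas by the request delta is an approximation of per-request averages, not an exact per-request trace.

```python
import re

STAGES = ("queue", "compute_input", "compute_infer", "compute_output")
SAMPLE = re.compile(r'^(nv_inference_\w+)\{([^}]*)\}\s+(\S+)$')


def counters(metrics_text, model):
    """Sum matching counters for one model across versions and GPUs."""
    totals = {}
    for line in metrics_text.splitlines():
        m = SAMPLE.match(line.strip())
        if m and f'model="{model}"' in m.group(2):
            totals[m.group(1)] = totals.get(m.group(1), 0.0) + float(m.group(3))
    return totals


def stage_ms_per_request(before_text, after_text, model):
    """Average milliseconds per request in each stage between two scrapes."""
    before, after = counters(before_text, model), counters(after_text, model)

    def delta(name):
        return after.get(name, 0.0) - before.get(name, 0.0)

    requests = delta("nv_inference_request_success")
    if requests <= 0:
        return {}
    return {stage: delta(f"nv_inference_{stage}_duration_us") / requests / 1000.0
            for stage in STAGES}
```

If queue time dominates, look at batching, instance count, and admission. If compute input or output dominates, look at copies and serialization. If compute infer dominates, move on to kernel-level profiling.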

sequenceDiagram
  participant Client
  participant Gateway
  participant Triton
  participant CPU as CPU preprocess
  participant GPU
  participant Metrics

  Client->>Gateway: Realistic request shape
  Gateway->>Triton: Deadline and tenant context
  Triton->>Metrics: Queue duration
  Triton->>CPU: Decode/tokenize/prepare tensor
  CPU->>Metrics: Host time and saturation
  CPU->>GPU: H2D copy and launch
  GPU->>Metrics: SM, memory, power, bandwidth
  GPU-->>Triton: Result
  Triton->>Metrics: Compute and output duration
  Triton-->>Client: Response within or outside SLO

Dashboards To Build

Have separate dashboards for service, server, GPU, and node health.

Dashboard	Must show
Service SLO	RPS, success rate, p50/p95/p99, retries, deadline exceeded, tenant/workload class.
Triton/model	Queue duration, input/compute/output duration, model/version, load status, batch distribution.
GPU utilization	SM utilization, memory used, HBM bandwidth, PCIe/NVLink, power, clocks, thermals.
GPU health	Xid, ECC corrected/uncorrected, resets, DCGM health, node/GPU ID, driver/runtime version.
Capacity	Allocated GPUs, MIG profiles, pod placement, pending pods, headroom, cost per good request.

Alert on symptoms, not vanity metrics:

GPU memory above threshold plus rising allocation failures.
Xid or uncorrected ECC on a production node.
Queue time growth with stable input traffic.
GPU idle with rising p99.
Power/thermal throttling correlated with latency.
Triton model reload failures.
Goodput drop even if raw throughput is high.
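
As one example of a symptom-based rule, "queue time growth with stable input traffic" can be expressed as a comparison of two halves of a window. This is a minimal sketch: the function name, the 10% traffic tolerance, and the 1.5x growth factor are assumptions. In practice the same logic would live in the alerting system's own rule language.

```python
from statistics import mean


def queue_growth_alert(rps, queue_ms, traffic_tolerance=0.10, growth_factor=1.5):
    """Fire when queue time grows while request rate stays roughly flat.

    rps and queue_ms are equal-length series over the same window.
    Thresholds are illustrative assumptions.
    """
    if len(rps) != len(queue_ms) or len(rps) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    half = len(rps) // 2
    rps_old, rps_new = mean(rps[:half]), mean(rps[half:])
    q_old, q_new = mean(queue_ms[:half]), mean(queue_ms[half:])
    traffic_stable = rps_old > 0 and abs(rps_new - rps_old) / rps_old <= traffic_tolerance
    queue_grew = q_new > growth_factor * q_old
    return traffic_stable and queue_grew
```

Requiring stable traffic is what separates this from a plain load alert: if queue time rises without more requests, something in the serving path got slower.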

Kubernetes Triage Commands

Use commands that connect the failed request to a pod, GPU, node, and driver signal.

kubectl -n inference get pod -o wide
kubectl -n inference describe pod <pod>
kubectl -n inference logs <pod> --previous
kubectl describe node <gpu-node>
kubectl get events --all-namespaces --sort-by=.lastTimestamp

On the node or in an approved debug workflow:

nvidia-smi
nvidia-smi dmon -s pucvmet
nvidia-smi pmon
dmesg | grep -i xid
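
The `dmesg` output can be parsed so that Xid events are attached to a specific PCI device instead of being read by eye. The sketch below assumes kernel log lines of the usual form `NVRM: Xid (PCI:0000:3b:00): 79, ...`. The `HARDWARE_SUSPECT` set is a small illustrative sample, not a complete classification, and should be replaced with the team's own Xid policy.

```python
import re

XID = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")

# Illustrative only: 48 is a double-bit ECC error, 79 is "GPU has fallen
# off the bus". Many other codes exist and some are application-triggered.
HARDWARE_SUSPECT = {48, 79}


def parse_xids(log_text):
    """Return (pci_bus_id, xid_code) tuples found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        m = XID.search(line)
        if m:
            events.append((m.group(1), int(m.group(2))))
    return events


def should_quarantine(events):
    """True if any event is in the hardware-suspect set."""
    return any(code in HARDWARE_SUSPECT for _, code in events)
```

Keeping the PCI bus ID with each event matters on multi-GPU nodes, where one failing device should not condemn the healthy ones.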

In production, prefer controlled debug access and auditability. The important skill is mapping:

request -> model/version -> pod -> container -> GPU UUID/MIG device -> node -> driver/runtime
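
One way to walk the middle of that chain (GPU UUID to process to pod) is sketched below. It assumes host-level access, output from `nvidia-smi --query-compute-apps=pid,gpu_uuid,used_memory --format=csv,noheader,nounits`, and a pod UID embedded in `/proc/<pid>/cgroup`. Cgroup path layouts vary by runtime and cgroup driver, so the regex is an assumption covering the common systemd and cgroupfs forms.

```python
import re

POD_UID = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")


def parse_compute_apps(csv_text):
    """Parse 'pid, gpu_uuid, used_memory' rows from nvidia-smi CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        pid, gpu_uuid, used = [field.strip() for field in line.split(",")]
        rows.append({
            "pid": int(pid),
            "gpu_uuid": gpu_uuid,
            # Memory can be reported as [N/A] in some configurations.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def pod_uid_from_cgroup(cgroup_text):
    """Extract a Kubernetes pod UID from the contents of /proc/<pid>/cgroup."""
    m = POD_UID.search(cgroup_text)
    return m.group(1).replace("_", "-") if m else None
```

The returned UID can then be matched against `metadata.uid` in `kubectl get pods -A -o json` to recover the pod name, namespace, and model version labels. PIDs seen inside a container are namespaced, so this only works from the host or a privileged debug workflow.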

Interview Answer: GPU OOM

Question: “A GPU inference service starts crashing with out-of-memory errors. What do you do?”

Answer:

I would first stop the rollout or shift traffic so the incident does not expand. Then I would preserve evidence: previous logs, Kubernetes events, Triton metrics, DCGM, Xid/ECC, recent model/config changes, and request shapes. I would classify the failure: Kubernetes OOMKilled is host memory, CUDA OOM is GPU memory, model load OOM is artifact/config/SKU, and Xid/reset is node health. For an immediate mitigation I would reduce the memory envelope: batch size, instance count, max tokens or context, inflight sequences, or route large requests to a separate pool. I would validate with the failing shapes and add canary gates for GPU memory headroom, worst-case request shape, and no Xid/ECC anomalies before promoting again.

Interview Answer: Keeping GPUs Busy

Question: “How do you keep GPUs as busy as possible?”

Answer:

I optimize for goodput within SLO, not utilization alone. I would measure queue time, compute time, SM utilization, memory bandwidth, GPU memory, H2D/D2H copies, CPU preprocessing, and p99. If the GPU is idle while latency is high, I look for CPU, tokenizer, serialization, network, or batching gaps. If the GPU is busy and p99 is healthy, I track headroom and cost. If the GPU is busy and p99 is bad, I scale, isolate workload classes, or tighten admission. The levers are dynamic batching, queue delay, preferred batch sizes, model instance count, TensorRT profiles, precision, inflight batching for LLMs, shared memory/gRPC, and separate pools for different shapes.

Staff-Level Close

The differentiator is tying accelerator metrics to customer outcomes:

I do not trust a GPU dashboard unless I can connect it to model version, pod, tenant class, request shape, p99, error rate, and cost per successful inference. Otherwise I only know the GPU is hot, not whether the platform is healthy.