Inference Systems Q&A

51. Why can average latency improve while p99 gets worse?

Batching or caching may help common requests while rare large requests, cold paths, queueing, or noisy neighbors worsen tail latency.
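
A small numeric illustration of this effect, using made-up latencies (the values and the nearest-rank percentile helper are assumptions for the example, not measurements):

```python
import statistics

def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = max(1, -(-99 * len(ordered) // 100))  # ceil(0.99 * n)
    return ordered[rank - 1]

# Hypothetical latencies in ms: 980 common requests, 20 rare large ones.
before = [100] * 980 + [400] * 20
# After a change that helps the common path but lets rare requests queue longer.
after = [60] * 980 + [900] * 20

mean_before, mean_after = statistics.mean(before), statistics.mean(after)
p99_before, p99_after = p99(before), p99(after)

print(f"mean: {mean_before} -> {mean_after} ms")
print(f"p99:  {p99_before} -> {p99_after} ms")
```

The mean drops because 98% of requests got faster, while p99 lands inside the slow 2% and gets worse.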

52. What is the main operational risk of dynamic batching?

It improves throughput by waiting for compatible requests, but can add queue delay and worsen p99 if max wait and batch size are mis-tuned.
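
The trade-off can be sketched with a toy simulation. This assumes a batcher that dispatches when the batch is full or when its oldest request has waited `max_wait_ms`; the function names and arrival times are hypothetical and not tied to any particular model server.

```python
def form_batches(arrivals_ms, max_batch, max_wait_ms):
    """Return a list of (dispatch_time_ms, [arrival times in the batch])."""
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # The open batch timed out before this request arrived: dispatch it.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:
        # A trailing partial batch waits out the full timeout.
        batches.append((current[0] + max_wait_ms, current))
    return batches

def worst_queue_delay(batches):
    """Longest time any request spent waiting for its batch to dispatch."""
    return max(dispatch - min(batch) for dispatch, batch in batches)

arrivals = [0, 1, 2, 3, 50]  # hypothetical arrival times in ms
tight = form_batches(arrivals, max_batch=4, max_wait_ms=5)
loose = form_batches(arrivals, max_batch=8, max_wait_ms=40)
print(worst_queue_delay(tight), worst_queue_delay(loose))
```

With the looser settings no batch ever fills, so every request pays the full wait, which is the mis-tuning case described above.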

53. How do you debug p99 regression after changing batch size?

Compare queue wait, batch wait, inference compute time, request shapes, GPU utilization, memory, and error rate between the old and new configs under the same traffic mix.

54. Why can GPU utilization be high but throughput low?

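Reported utilization usually means a kernel was running, not that the device was doing useful work at full rate. Inefficient kernels, memory bandwidth bottlenecks, serialization, small batches, CPU preprocessing, data transfer overhead, or model server queueing can keep the GPU busy while wasting compute.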

55. Why can GPU utilization be low but latency high?

The bottleneck may be CPU preprocessing, network, storage, locks, model load, queue admission, downstream calls, or a request shape mismatch that prevents batching.

56. What is a model resource envelope?

A declared contract for GPU memory, CPU, host memory, batch settings, max sequence/input size, expected latency/throughput, runtime image, and hardware compatibility.

57. Why is a resource envelope useful at admission time?

It prevents models from reaching production without enough information to schedule, validate, autoscale, alert, and roll back safely.
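
One way to picture an envelope and an admission check is a record whose fields must all be declared before a model is accepted. This is a minimal sketch; every field name here is illustrative and not a standard schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class ResourceEnvelope:
    # Illustrative field names only; a real platform would define its own.
    gpu_memory_gib: Optional[float] = None
    cpu_cores: Optional[float] = None
    host_memory_gib: Optional[float] = None
    max_batch_size: Optional[int] = None
    max_input_tokens: Optional[int] = None
    p99_latency_ms: Optional[float] = None
    runtime_image: Optional[str] = None
    gpu_sku: Optional[str] = None

def admission_errors(envelope: ResourceEnvelope) -> List[str]:
    """Return the names of envelope fields that are still undeclared."""
    return [f.name for f in fields(envelope) if getattr(envelope, f.name) is None]

incomplete = ResourceEnvelope(gpu_memory_gib=24.0, cpu_cores=4.0)
print(admission_errors(incomplete))  # names of the missing fields
```

An empty list means the envelope is complete enough to schedule against; anything else is a reason to block the rollout before it reaches production.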

58. What is a dangerous assumption about model artifacts?

That artifact download is cheap and reliable. Large artifacts can dominate scale-out time and overload storage during bursts.

59. How do you make model artifacts production-safe?

Immutable versions, checksums, provenance, regional caching, access control, prefetch, validation, and explicit rollback to known-good versions.
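
The checksum part of this list can be sketched with the standard library. The function names are hypothetical, and the expected digest is assumed to come from a trusted registry or manifest.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_sha256):
    """Raise if the downloaded artifact does not match the recorded digest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return actual
```

Running the check after download and before load means a truncated or swapped artifact fails fast instead of surfacing later as a confusing inference error.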

60. Why can a model rollout pass staging but fail production?

Different traffic mix, request sizes, tenant concurrency, GPU SKU, model cache state, artifact path, data distribution, or production-only dependencies.

61. What should a canary for inference measure?

Error rate, p95/p99 latency, queue time, batch wait, GPU memory, GPU utilization, correctness checks, timeout rate, and model-specific business signals.

62. Why is "model loaded" not enough for readiness?

The model can load but fail inference due to GPU runtime errors, bad tokenizer/config, incompatible request shapes, or broken preprocessing/postprocessing.
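
A readiness check that exercises the full path might look like the sketch below. `infer`, `probe_input`, and `check_output` are placeholders for whatever the serving stack exposes; nothing here refers to a specific framework.

```python
def is_ready(infer, probe_input, check_output):
    """Ready only if a representative request completes and looks sane."""
    try:
        output = infer(probe_input)
    except Exception:
        # Any failure in preprocessing, GPU execution, or postprocessing
        # means the replica should not receive traffic yet.
        return False
    return bool(check_output(output))

# Hypothetical stand-ins for a working and a broken model pipeline.
def working_model(text):
    return text.upper()

def broken_model(text):
    raise RuntimeError("simulated GPU runtime error")

print(is_ready(working_model, "ping", lambda out: out == "PING"))
print(is_ready(broken_model, "ping", lambda out: out == "PING"))
```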

63. What is shadow traffic useful for?

Testing new versions against production-like inputs without serving their responses, useful for correctness and latency comparison if privacy and cost allow.

64. What is the trap with retries in inference services?

Retries can amplify load, worsen queueing, duplicate expensive GPU work, and turn a partial slowdown into a wider outage.

65. How do you make retries safer?

Use deadlines, retry budgets, idempotency where relevant, jitter, and a limited set of retryable error classes, and do not retry once the backend is clearly saturated.
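
A minimal sketch of a retry budget with jittered backoff follows. The class, the 10% ratio, and the backoff constants are assumptions chosen for illustration, not values from any specific client library.

```python
import random

class RetryBudget:
    """Permit retries only while they stay under `ratio` of observed requests."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        if self.retries + 1 > self.ratio * self.requests:
            return False  # budget exhausted: fail fast instead of adding load
        self.retries += 1
        return True

def backoff_with_jitter(attempt, base_s=0.05, cap_s=1.0):
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))

def should_retry(budget, attempt, remaining_s, retryable, max_attempts=2):
    """Retry only retryable errors, within the deadline, attempts, and budget."""
    if not retryable or attempt >= max_attempts or remaining_s <= 0:
        return False
    return budget.try_spend()
```

The budget is what stops a partial slowdown from turning into a retry storm: once retries exceed the allowed fraction of traffic, callers fail fast.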

66. What is load shedding?

Rejecting or deprioritizing lower-value requests under overload to preserve SLO for critical traffic and prevent total collapse.

67. How do you decide what traffic to shed?

By tenant priority, request class, deadline, quota, model criticality, and business impact, with explicit policy agreed before incidents.
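
As a sketch of an explicit policy, the function below keeps the most critical requests up to a capacity limit and sheds the rest. The `priority` field and its meaning (lower number is more critical) are assumptions for the example; a real policy would also weigh deadlines and quotas.

```python
def shed(requests, capacity):
    """Split requests into (kept, shed) by priority.

    Lower priority number means more critical. Python's sort is stable,
    so requests with equal priority keep their arrival order.
    """
    ranked = sorted(requests, key=lambda r: r["priority"])
    return ranked[:capacity], ranked[capacity:]

incoming = [
    {"id": "a", "priority": 2},
    {"id": "b", "priority": 0},
    {"id": "c", "priority": 1},
    {"id": "d", "priority": 2},
]
kept, dropped = shed(incoming, capacity=2)
print([r["id"] for r in kept], [r["id"] for r in dropped])
```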

68. Why is autoscaling by RPS insufficient?

Requests have different shapes and costs. One large prompt/image/batch can consume far more compute than many small requests.

69. What is a better scaling signal for LLM-style serving?

Queue time, in-flight tokens, prefill/decode utilization, KV cache pressure, GPU memory, and p99 latency by request class.

70. Why does KV cache matter operationally?

It can dominate GPU memory during generation, limiting concurrency and causing latency spikes or OOM under long-context workloads.
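
A back-of-the-envelope estimate shows why. The sketch assumes a standard transformer KV cache that stores one key and one value vector per layer, per KV head, per token; the model dimensions are hypothetical, and real servers add paging and fragmentation overhead on top.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: 2 (K and V) x layers x heads x dim x dtype x tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * tokens

def max_concurrent_sequences(budget_bytes, tokens_per_seq, **model):
    """How many sequences of a given length fit in the cache budget."""
    return budget_bytes // kv_cache_bytes(tokens_per_seq, **model)

# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
model = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_elem=2)
GIB = 1024 ** 3

per_token = kv_cache_bytes(1, **model)      # 524288 bytes = 0.5 MiB
per_4k_seq = kv_cache_bytes(4096, **model)  # 2 GiB
print(per_token, per_4k_seq / GIB, max_concurrent_sequences(16 * GIB, 4096, **model))
```

Under these assumptions a 16 GiB cache budget holds only eight 4096-token sequences, so a shift toward long contexts cuts concurrency sharply.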

71. How can admission control protect GPU memory?

Reject or queue requests that exceed token/input limits, tenant quota, concurrency limits, or estimated memory envelope before they overload serving.
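
A minimal token-based admission gate, as a sketch: the class name, the limits, and the three outcomes are assumptions for illustration, and a production gate would also track tenant quotas and estimated memory.

```python
class TokenAdmission:
    """Admit, queue, or reject requests based on token counts."""

    def __init__(self, max_request_tokens, max_inflight_tokens):
        self.max_request_tokens = max_request_tokens
        self.max_inflight_tokens = max_inflight_tokens
        self.inflight_tokens = 0

    def decide(self, request_tokens):
        if request_tokens > self.max_request_tokens:
            return "reject"  # can never be served within the envelope
        if self.inflight_tokens + request_tokens > self.max_inflight_tokens:
            return "queue"   # fits the envelope, but not right now
        self.inflight_tokens += request_tokens
        return "admit"

    def release(self, request_tokens):
        """Call when a request finishes so its tokens stop counting."""
        self.inflight_tokens = max(0, self.inflight_tokens - request_tokens)

gate = TokenAdmission(max_request_tokens=4000, max_inflight_tokens=10000)
print(gate.decide(5000), gate.decide(4000), gate.decide(4000), gate.decide(3000))
```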

72. Why separate prefill and decode capacity?

They have different compute/memory characteristics. Separating or scheduling them carefully can improve throughput and tail latency for LLM serving.

73. What is a model server cold start?

Time to start process, load artifact, initialize runtime, allocate GPU memory, compile/optimize kernels, warm caches, and pass readiness.

74. How do you reduce cold start impact?

Warm pools, artifact prefetch, image pre-pull, model warmup jobs, startup probes, predictive scaling, and admitting traffic only after a representative readiness check passes.

75. What is a rollback trap for model deployments?

Rolling back manifests without also rolling back the model artifact, config, and a compatible runtime can leave the deployment in a mixed or invalid state.

76. What should be versioned together?

Model artifact, tokenizer/config, runtime image, serving config, resource envelope, and validation results.

77. What should not always be changed together?

Model version, serving runtime, GPU driver, Kubernetes version, and network policy. Coupling them obscures root cause and increases rollback risk.

78. Why is correctness observability harder than availability?

A service can return HTTP 200 with wrong or degraded outputs. You need validation samples, semantic checks, business metrics, or downstream quality signals.

79. How do you detect silent model degradation?

Golden test sets, canary comparisons, shadow evaluation, output distribution monitoring, user/business signals, and version-specific anomaly alerts.

80. Why can batching hurt fairness?

Large or high-volume tenants can dominate batch formation and queue slots, increasing latency for smaller tenants unless queues and quotas are tenant-aware.
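
One tenant-aware approach is to form batches round-robin across per-tenant queues. The sketch below assumes in-memory queues keyed by tenant name; the function name and data are hypothetical.

```python
from collections import deque

def fair_batch(queues, max_batch):
    """Build a batch by taking one request per tenant in turn.

    queues: dict mapping tenant name -> deque of pending requests.
    Returns a list of (tenant, request) pairs and drains the queues in place.
    """
    batch = []
    while len(batch) < max_batch and any(queues.values()):
        for tenant in list(queues):
            if queues[tenant] and len(batch) < max_batch:
                batch.append((tenant, queues[tenant].popleft()))
    return batch

queues = {"big": deque(range(10)), "small": deque(["a"])}
print(fair_batch(queues, max_batch=4))
```

The small tenant's single request lands in the first batch instead of waiting behind ten requests from the large tenant.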

81. What is the difference between throughput and goodput?

Throughput is total processed work. Goodput is useful successful work within SLO and correctness constraints.
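
The distinction is easy to compute from request logs. This sketch assumes each record carries `ok`, `valid`, and `latency_ms` fields, which are hypothetical names for the example.

```python
def throughput_and_goodput(results, slo_ms, window_s):
    """Return (requests/s processed, requests/s that were useful)."""
    total = len(results)
    good = sum(
        1
        for r in results
        if r["ok"] and r["valid"] and r["latency_ms"] <= slo_ms
    )
    return total / window_s, good / window_s

# Hypothetical two-second window with four requests.
sample = [
    {"ok": True, "valid": True, "latency_ms": 120},    # useful
    {"ok": True, "valid": True, "latency_ms": 2500},   # too slow
    {"ok": False, "valid": False, "latency_ms": 90},   # failed
    {"ok": True, "valid": False, "latency_ms": 110},   # wrong output
]
print(throughput_and_goodput(sample, slo_ms=500, window_s=2))
```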

82. Why track goodput for inference?

High raw throughput is meaningless if requests time out, miss the p99 latency target, or produce invalid results.

83. What is the senior answer to "maximize GPU utilization"?

Maximize useful utilization subject to SLO, isolation, correctness, and failure-domain constraints. 100% utilization can mean no headroom and bad p99.

84. How do you capacity plan for inference?

Use traffic shape, model cost, SLO, concurrency, warmup time, hardware SKU, failure headroom, tenant growth, rollout overhead, and burst patterns.

85. What is N+1 capacity for GPU pools?

Enough spare capacity to survive losing a node, rack, zone, or pool segment without violating critical SLOs, depending on the failure domain chosen.
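
The arithmetic can be sketched as follows, assuming load and per-node capacity are expressed in the same unit (for example tokens per second) and that the chosen failure domain removes a known number of nodes. The numbers are made up.

```python
import math

def nodes_needed(peak_load, per_node_capacity, nodes_lost_in_failure=1):
    """Nodes required to carry peak load after losing one failure domain."""
    return math.ceil(peak_load / per_node_capacity) + nodes_lost_in_failure

# Hypothetical pool: peak of 1000 units, each node sustains 300 units.
print(nodes_needed(1000, 300))                           # survive one node
print(nodes_needed(1000, 300, nodes_lost_in_failure=4))  # survive a 4-node rack
```

Choosing a larger failure domain raises the spare capacity requirement, which is why the domain has to be decided explicitly.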

86. Why can model artifact caching be a consistency risk?

Bad cache invalidation can serve stale or mixed versions. Use immutable artifact IDs and checksums rather than mutable names.

87. What is the problem with mutable `latest` model tags?

They destroy reproducibility, rollback clarity, auditability, and canary comparison.

88. Why does inference need request deadlines?

Without deadlines, stale work consumes GPU capacity even after clients no longer need the result.

89. What is a timeout hierarchy?

Client, gateway, service, model server, and downstream timeouts aligned so work stops before callers give up and retries do not amplify overload.
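
A simple lint for this alignment is sketched below: each inner layer should time out before the layer that calls it. The layer names and values are hypothetical.

```python
def check_timeout_hierarchy(timeouts_ms):
    """timeouts_ms: list of (layer, timeout_ms) ordered from outermost caller inward.

    Returns a list of human-readable problems; empty means the hierarchy is aligned.
    """
    problems = []
    for (outer, t_out), (inner, t_in) in zip(timeouts_ms, timeouts_ms[1:]):
        if t_in >= t_out:
            problems.append(
                f"{inner} ({t_in} ms) should time out before {outer} ({t_out} ms)"
            )
    return problems

aligned = [("client", 1000), ("gateway", 900), ("service", 800), ("model_server", 700)]
misaligned = [("client", 1000), ("gateway", 900), ("service", 1200), ("model_server", 700)]
print(check_timeout_hierarchy(aligned))
print(check_timeout_hierarchy(misaligned))
```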

90. What is an inference-specific health dashboard?

A dashboard layered by request SLOs, model/version, queue/batch time, GPU health, node/pool, rollout events, and artifact/runtime status.

91. How do you debug "only one tenant is slow"?

Check tenant routing, quotas, request shape, concurrency, queue partition, model version, cache hits, regional placement, and noisy-neighbor effects.

92. How do you debug "one model version is slow"?

Compare artifact/config/runtime diffs, batch settings, model size, precision, tokenizer, request mix, GPU memory, and kernel/runtime metrics.

93. What is the trap in using only synthetic probes?

They may not match real request shape, tenant mix, input sizes, caches, or downstream behavior.

94. How do you design better synthetic probes?

Use representative input classes, version-specific expected outputs, real GPU execution, and latency thresholds, and run the probes both during rollouts and continuously.

95. Why can CPU be the inference bottleneck?

Tokenization, image preprocessing, compression, JSON serialization, TLS, logging, or Python overhead can saturate CPU before GPU.

96. How do you reduce CPU bottlenecks?

Profile, optimize preprocessing, use native libraries, batch CPU work, adjust thread pools, avoid excessive logging, and size CPU alongside the GPU in the resource envelope.

97. What is tail amplification?

Small delays in multiple stages combine into much worse end-to-end p99, especially when queues and retries interact.
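
A simplified calculation shows how quickly this compounds. It assumes each stage is independently slow for 1% of requests, which ignores the correlation that queues and retries add in practice, so real systems can be worse.

```python
def prob_any_slow(stages, per_stage_slow_prob=0.01):
    """Chance that a request hits at least one slow stage, assuming independence."""
    return 1 - (1 - per_stage_slow_prob) ** stages

for n in (1, 5, 10):
    print(n, round(prob_any_slow(n), 4))
```

With five stages, roughly 4.9% of requests hit at least one stage-level outlier, so a per-stage p99 event shows up well below the end-to-end p99.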

98. How do you control tail amplification?

Deadlines, queue limits, admission control, circuit breakers, bounded retries, isolation by tenant/model, and measuring per-stage latency.

99. What is the best rollback signal?

User-facing SLO regression tied to the canary/version, confirmed by error/latency/correctness metrics and not explained by unrelated traffic or dependency changes.

100. What is the senior/staff close for inference operations?

Inference reliability is a contract: artifact, runtime, resource envelope, health semantics, rollout policy, telemetry, and ownership must be explicit before production traffic.