Noisy Alert Runbook

Scenario

Prompt:

An inference platform alert fires repeatedly. On-call engineers say it is noisy, but nobody is confident enough to disable it.

The staff-level framing is:

A noisy alert is not harmless. It consumes attention, trains people to ignore pages, and can hide a real incident. The fix is not “delete the alert.” The fix is to prove what user-impacting condition it is supposed to detect, tune it, route it correctly, or replace it.

Triage Flow

flowchart TD
  Page[Alert fires] --> Impact{User impact or imminent risk}
  Impact -- yes --> Incident[Run incident response]
  Impact -- no --> Evidence[Collect alert evidence]
  Evidence --> Owner{Clear owner and action}
  Owner -- no --> Route[Fix ownership or route to ticket]
  Owner -- yes --> Actionable{Action known and useful}
  Actionable -- no --> Redesign[Redesign alert or convert to signal]
  Actionable -- yes --> Precision{High precision}
  Precision -- no --> Tune[Tune threshold, window, labels, inhibition]
  Precision -- yes --> Keep[Keep page and improve runbook]
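
As a rough sketch, the flow above can be written as a small decision function. The `AlertFacts` fields and the returned strings are hypothetical names chosen to mirror the flowchart nodes; they do not come from any alerting tool.

```python
from dataclasses import dataclass


@dataclass
class AlertFacts:
    """Answers to the four decision nodes in the flowchart (hypothetical field names)."""
    user_impact_or_imminent_risk: bool
    has_clear_owner: bool
    action_known_and_useful: bool
    high_precision: bool


def triage(facts: AlertFacts) -> str:
    """Walk the decision nodes in flowchart order and return the matching leaf."""
    if facts.user_impact_or_imminent_risk:
        return "run incident response"
    # No impact: alert evidence is collected before the remaining questions are answered.
    if not facts.has_clear_owner:
        return "fix ownership or route to ticket"
    if not facts.action_known_and_useful:
        return "redesign alert or convert to signal"
    if not facts.high_precision:
        return "tune threshold, window, labels, inhibition"
    return "keep page and improve runbook"
```

The ordering matters: ownership and actionability are settled before precision, so nobody tunes thresholds on an alert that has no owner or no useful action.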

First Five Minutes

Goal: decide whether this is an incident, a bad alert, or an informational signal routed incorrectly.

Confirm current user impact: SLO burn, error rate, p99 latency, admission failures, model unavailable, tenant impact.
Identify exact alert labels: service, model, namespace, cluster, GPU pool, route, tenant, severity.
Check last deploy/change markers: model, runtime image, GPU driver, node image, route weights, HPA/KEDA config, Terraform/GitOps change.
Compare alert signal with user-facing symptoms.
Assign a temporary disposition: incident, ticket, silence with owner, or alert-bug.

Do not:

Silence globally without owner and expiry.
Rename the alert to make dashboards quieter.
Page humans for a condition with no action.
Convert a real SLO alert into a low-priority ticket because it is politically annoying.
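
One way to enforce the first rule is a pre-check on silence requests. The sketch below is illustrative only: `SilenceRequest` is a hypothetical record rather than the schema of any specific alerting tool, and the 7-day maximum and the "more than just `alertname`" scoping rule are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SilenceRequest:
    """Hypothetical silence record, not tied to a specific tool."""
    matchers: dict = field(default_factory=dict)  # e.g. {"alertname": "...", "model": "..."}
    owner: Optional[str] = None
    expires_at: Optional[datetime] = None


def silence_problems(req, now, max_duration=timedelta(days=7)):
    """Return reasons to reject the silence; an empty list means it is acceptable."""
    problems = []
    if not req.owner:
        problems.append("missing owner")
    if req.expires_at is None:
        problems.append("missing expiry")
    elif req.expires_at <= now:
        problems.append("expiry is in the past")
    elif req.expires_at - now > max_duration:
        problems.append("expiry too far out")
    # A silence that matches only on alertname (or on nothing) is effectively global.
    if not set(req.matchers) - {"alertname"}:
        problems.append("scope is global: add service/model/pool matchers")
    return problems
```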

Evidence Checklist

Question	Evidence
What condition fired?	Alert expression, threshold, evaluation window, labels, annotations.
What user impact happened?	SLO burn, synthetic probes, external checks, customer-facing latency/error dashboards.
How often does it fire?	Alert history, repeats per week, duration distribution, time-of-day correlation.
Is there a known action?	Runbook step, rollback, capacity add, traffic shift, policy fix, dependency escalation.
Is the scope right?	Model/version/tenant/pool labels, cluster labels, route labels, ownership labels.
Is it duplicated?	Related alerts, inhibition rules, parent dependency alerts, multi-window alerts.
Is the threshold wrong?	Baseline distribution, seasonality, burst behavior, canary vs prod comparison.
Is the alert late or early?	Detection time vs actual incident timeline and mitigation time.
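
The "how often does it fire" row can be answered with a short script over exported alert history. This is a minimal sketch assuming a hypothetical `Firing` record; the `had_user_impact` flag is assumed to be filled in after the fact from SLO burn or probe data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Firing:
    """One alert firing from history (hypothetical record shape)."""
    started: datetime
    duration: timedelta
    had_user_impact: bool  # judged afterwards from SLO burn or synthetic probes


def summarize(firings, weeks):
    """Fires per week, median duration, precision, and an hour-of-day histogram."""
    if not firings:
        return {"fires_per_week": 0.0, "median_minutes": None,
                "precision": None, "by_hour": {}}
    minutes = [f.duration.total_seconds() / 60 for f in firings]
    impactful = sum(1 for f in firings if f.had_user_impact)
    hours = sorted({f.started.hour for f in firings})
    return {
        "fires_per_week": len(firings) / weeks,
        "median_minutes": median(minutes),
        # Precision: share of firings that coincided with real user impact.
        "precision": impactful / len(firings),
        # Clustering in a few hours suggests correlation with traffic peaks.
        "by_hour": {h: sum(1 for f in firings if f.started.hour == h) for h in hours},
    }
```

A low precision combined with firings clustered at the same hour each day is the typical signature of a threshold set below the normal daily peak.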

Alert Decision Matrix

Alert type	Page?	Example	Fix
User-impacting SLO burn	Yes	p99 latency or error-budget burn for a production inference route.	Keep page, improve context and runbook.
Imminent capacity risk	Yes	GPU pool has no warm headroom while queue time is rising.	Page if action exists: shift traffic, shed load, add warm pool.
Single transient blip	No	One scrape miss, one pod restart, short-lived queue bump.	Increase window, use burn-rate logic, route to dashboard.
No owner/no action	No	“GPU utilization high” with no SLO impact or threshold rationale.	Convert to ticket or metric until action and owner are defined.
Dependency-wide failure	Maybe	Object store degraded causes many model-load alerts.	Inhibit child alerts behind dependency alert.
Chronic known issue	Maybe	Repeated low-priority tenant burst exceeding quota.	Ticket the product/platform owner; page only if the SLO or quota is at risk.
Missing context	Maybe	Alert says “high latency” without model/version/pool labels.	Fix labels/annotations before trusting severity.

Query Patterns

Use the platform’s actual metric names, but speak in patterns:

# Multi-window burn rate shape
(
  rate(errors_total[5m]) / rate(requests_total[5m]) > short_window_threshold
)
and
(
  rate(errors_total[1h]) / rate(requests_total[1h]) > long_window_threshold
)

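# Queue time plus saturation, not queue time alone.
# Aggregate buckets by le before taking the quantile, and join on a shared label,
# because the two metrics carry different label sets ("model" is an assumed label name).
# slo_queue_budget is a placeholder for a numeric budget in seconds.
histogram_quantile(0.99, sum by (le, model) (rate(inference_queue_seconds_bucket[5m])))
  > slo_queue_budget
and on (model)
max by (model) (gpu_memory_utilization_ratio)
  > 0.90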

# Missing metrics should be scoped
absent(up{job="triton", namespace="inference"} == 1)

Interview caveat:

I avoid alerting on raw utilization unless the utilization means user impact, capacity exhaustion, or a known failure mode. High GPU utilization can be good. High queue time with SLO burn is a problem.

Tuning Levers

Threshold: Move from arbitrary threshold to SLO, capacity budget, or baseline percentile.
Window: Use longer windows for noisy metrics, shorter windows for fast-burn user impact.
Conjunction: Page when multiple signals agree, such as queue time plus error burn.
Inhibition: Suppress child alerts when a parent dependency alert explains them.
Grouping: Group by model/service/pool rather than by pod when pod-level noise creates page storms.
Routing: Page for immediate action; ticket for cleanup; dashboard for context.
Annotations: Include owner, dashboard, runbook, recent deploy link, and first command/check.
Severity: Separate “wake someone up” from “needs investigation this week.”
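
The inhibition and grouping levers can be illustrated in a few lines. This sketch is loosely modeled on the source/target/equal idea found in Alertmanager inhibit rules, but it is not that tool's implementation; the alert dictionaries, alert names, and label names are hypothetical.

```python
from collections import defaultdict


def inhibit(alerts, parent_name, child_names, equal=("cluster",)):
    """Drop child alerts when a firing parent shares the `equal` label values."""
    parent_keys = {
        tuple(a["labels"].get(k) for k in equal)
        for a in alerts if a["name"] == parent_name
    }
    return [
        a for a in alerts
        if not (a["name"] in child_names
                and tuple(a["labels"].get(k) for k in equal) in parent_keys)
    ]


def group(alerts, by=("model", "pool")):
    """Collapse per-pod alerts into one notification per (name, model, pool)."""
    groups = defaultdict(list)
    for a in alerts:
        key = (a["name"],) + tuple(a["labels"].get(k) for k in by)
        groups[key].append(a)
    return dict(groups)
```

Inhibition runs first so that children explained by a dependency failure never reach grouping; grouping then turns the remaining per-pod firings into one notification per model and pool.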

Example: Noisy Triton Queue Alert

Symptoms:

Alert fires during normal daily traffic peaks.
p99 user latency remains inside SLO.
GPU utilization is high but stable.
On-call silences it repeatedly.

Better investigation:

Compare queue duration to end-to-end p99 and error budget burn.
Break down by model/version/tenant, not only pod.
Inspect batch size distribution and request mix.
Check if traffic class is online, batch, or low-priority.
Decide if the alert should page only when queue time consumes the SLO budget.

Better alert:

Page when p99 queue duration exceeds its budget and SLO burn is active for the same model/version/workload class. Ticket if queue is elevated without SLO impact.
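
The rule above can be sketched as a per-workload decision. The key shape `(model, version, workload_class)`, the function name, and the sample budget are assumptions for illustration.

```python
def dispositions(queue_p99_s, slo_burn_active, queue_budget_s):
    """Map each (model, version, workload_class) key to 'page', 'ticket', or 'none'.

    queue_p99_s: key -> p99 queue duration in seconds
    slo_burn_active: key -> True when SLO burn is active for that same key
    """
    out = {}
    for key, p99 in queue_p99_s.items():
        elevated = p99 > queue_budget_s
        burning = slo_burn_active.get(key, False)
        if elevated and burning:
            out[key] = "page"    # queue over budget AND users are affected
        elif elevated:
            out[key] = "ticket"  # queue over budget, no SLO impact yet
        else:
            out[key] = "none"
    return out
```

Keying both inputs on the same tuple is the point: SLO burn on one model must not turn an elevated queue on a different model or workload class into a page.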

Example: Noisy GPU Memory Alert

Symptoms:

GPU memory is above 90 percent for hours.
Inference is healthy.
Model is expected to occupy most memory.

Senior answer:

High GPU memory usage is not automatically bad. I care about headroom relative to batch growth, KV cache, model load, instance count, and OOM risk. If the workload is designed to reserve memory, alert on failed allocations, OOM events, Xid errors, model load failures, or headroom below a tested envelope.
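
A minimal sketch of that idea, assuming the failure counters and a tested minimum headroom are available from elsewhere. The function name, parameters, and sample figures are hypothetical.

```python
def gpu_memory_alert_reason(total_gib, used_gib, tested_min_headroom_gib,
                            oom_events=0, failed_allocations=0):
    """Return a reason string when an alert is warranted, else None.

    Utilization alone never triggers; only failure signals or headroom
    below the envelope the workload was tested with.
    """
    if oom_events > 0 or failed_allocations > 0:
        return "OOM or failed allocation observed"
    headroom_gib = total_gib - used_gib
    if headroom_gib < tested_min_headroom_gib:
        return "headroom below tested envelope"
    return None
```

For example, an 80 GiB device with 74 GiB in use is above 90 percent utilization, yet with a tested minimum headroom of 4 GiB the function returns `None`, which matches the "inference is healthy" symptom above.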

Runbook Close

Say this in the interview:

I would not let noisy alerts rot. I would classify each one as incident, ticket, dashboard, or delete. For pages, I require user-impact rationale, owner, action, severity, labels, and a runbook. For non-pages, I still preserve the signal if it helps debugging.