Observability and Incidents
Observability Contract
Every production workflow should answer:
- Is the user impacted?
- Which service/model/version/pool is affected?
- What changed?
- Which layer is failing?
- What is the safest mitigation?
- How do we know it is fixed?
flowchart LR
Signal[Alert signal] --> Context[Dashboards and traces]
Context --> Change[Recent change timeline]
Change --> Hypothesis[Layered hypothesis]
Hypothesis --> Mitigation[Mitigation]
Mitigation --> Verification[Verify SLO recovery]
Verification --> Postmortem[Postmortem and prevention]
Golden Signals For Inference
- Availability: success rate, 5xx, model errors, admission errors.
- Latency: p50/p95/p99, queue time, batch wait, model compute time.
- Traffic: request rate, token rate, request shape, tenant mix.
- Saturation: GPU utilization, GPU memory, CPU, network, disk, scheduler pending.
- Correctness: bad responses, validation errors, model version mismatch.
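As a rough illustration of the availability and latency signals above, the following sketch summarizes a batch of request records. The `Request` fields and the nearest-rank percentile method are assumptions made for this example; a real system would normally read these values from its metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the response was successful
    queue_ms: float    # time spent waiting before compute (hypothetical field)
    compute_ms: float  # model compute time (hypothetical field)


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests):
    """Return availability plus latency percentiles, with queue time split out."""
    total = [r.queue_ms + r.compute_ms for r in requests]
    return {
        "success_rate": sum(r.ok for r in requests) / len(requests),
        "p50_ms": percentile(total, 50),
        "p99_ms": percentile(total, 99),
        "queue_p99_ms": percentile([r.queue_ms for r in requests], 99),
    }
```

Reporting queue time separately from total latency helps show whether a p99 spike comes from waiting or from model compute.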
Alert Quality
Good alerts:
- Page on user impact or imminent impact.
- Include scope and likely owner.
- Link to a runbook and dashboard.
- Do not page on noisy symptoms that have no clear action.
- Have inhibition rules for dependency-wide failures.
Bad alerts:
- Page on high CPU without customer impact.
- Send many duplicate pages for one root cause.
- Lack service/model/pool labels.
- Fire after the incident is already obvious from user reports.
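Several of the properties above can be checked mechanically before an alert rule ships. The sketch below lints a rule represented as a plain dict. The dict layout, the `runbook_url` and `dashboard_url` annotation names, and the `user_impact` flag are all hypothetical and would need to be mapped onto whatever alerting system is in use.

```python
REQUIRED_LABELS = {"service", "model", "pool"}           # scope labels named in the list above
REQUIRED_ANNOTATIONS = {"runbook_url", "dashboard_url"}  # hypothetical annotation names


def lint_alert(rule):
    """Return a list of problems found in one alert rule dict (empty means clean)."""
    problems = []
    missing_labels = REQUIRED_LABELS - set(rule.get("labels", {}))
    if missing_labels:
        problems.append(f"missing labels: {sorted(missing_labels)}")
    missing_notes = REQUIRED_ANNOTATIONS - set(rule.get("annotations", {}))
    if missing_notes:
        problems.append(f"missing annotations: {sorted(missing_notes)}")
    # A paging alert should be tied to user impact or imminent impact.
    if rule.get("severity") == "page" and not rule.get("user_impact"):
        problems.append("pages without user impact or imminent impact")
    return problems
```

A check like this could run in CI against the alert rule repository, so that a rule without scope labels or a runbook link never reaches the pager.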
Incident Command Script
Use this in behavioral and technical loops:
- Declare severity and owner. “I will take incident command unless someone else is already assigned.”
- Stabilize. Freeze risky deploys, protect evidence, reduce blast radius.
- Assess impact. Affected tenants, models, regions, SLO, error budget burn.
- Form hypotheses by layer. Traffic, app, model server, scheduler, node, GPU, network, storage, dependency.
- Mitigate before perfect root cause. Rollback, fail over, disable bad pool, scale known-good capacity.
- Communicate. Timestamped updates, next action, ETA for next update.
- Verify recovery. User-facing metrics first, then internal health.
- Follow through. Postmortem, action items with owners, prevention automation.
sequenceDiagram
participant IC as Incident commander
participant Ops as Operator
participant SRE as Platform engineer
participant Comms as Stakeholders
IC->>Ops: stabilize customer impact
IC->>SRE: isolate failing layer
IC->>Comms: impact and next update time
SRE-->>IC: hypothesis and evidence
Ops-->>IC: mitigation result
IC->>Comms: recovery status
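The "error budget burn" in the impact step can be made concrete with a small calculation. This is a minimal sketch, assuming burn rate is defined as the observed error ratio divided by the error ratio the SLO allows, and that a page requires both a long and a short window to exceed a threshold. The function names and the two-window rule are illustrative choices, not something the script above prescribes.

```python
def burn_rate(errors, requests, slo_target):
    """Error-budget burn rate: observed error ratio divided by the allowed ratio."""
    if requests == 0:
        return 0.0
    allowed = 1.0 - slo_target
    return (errors / requests) / allowed


def should_page(long_window, short_window, slo_target, threshold):
    """Page only when both (errors, requests) windows burn faster than the threshold."""
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

For example, 20 errors in 1000 requests against a 99.9% SLO is a burn rate of about 20, meaning the budget is being spent roughly 20 times faster than the SLO allows. Requiring the short window to agree keeps the page from firing long after the errors have stopped.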
Postmortem Quality
Strong postmortems include:
- Timeline with detection, mitigation, and recovery.
- Customer impact.
- What changed.
- What signals were missing or misleading.
- Why safeguards did not catch it.
- Action items that remove a class of failure, not just one symptom.
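The timeline item lends itself to a small calculation. The sketch below derives minutes-to-detect, minutes-to-mitigate, and minutes-to-recover from ISO 8601 timestamps. The event names (`started`, `detected`, `mitigated`, `recovered`) are assumptions for this example, not a standard schema.

```python
from datetime import datetime


def timeline_durations(events):
    """Minutes from incident start to each milestone in the postmortem timeline."""
    start = datetime.fromisoformat(events["started"])
    return {
        name: (datetime.fromisoformat(events[name]) - start).total_seconds() / 60
        for name in ("detected", "mitigated", "recovered")
    }
```

With made-up timestamps, an incident that started at 10:00 and was detected at 10:07, mitigated at 10:25, and recovered at 10:40 yields 7, 25, and 40 minutes. A long gap between start and detection usually points at the "missing or misleading signals" item.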
Example Incident Answer
Prompt: “A GPU pool is failing health checks after a node image update.”
Answer:
I would halt rollout, identify the affected pool and image version, and compare canary nodes with healthy nodes. I would check driver, kernel, container toolkit, device plugin, kubelet events, and DCGM exporter. If production is impacted, I would cordon affected nodes and shift traffic or roll back the image. After recovery, I would add a pre-promotion validator that starts a representative GPU workload, verifies telemetry, and blocks promotion on driver/runtime mismatch.
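The pre-promotion validator from the answer could take roughly the following shape. This is a sketch under stated assumptions: `NodeFacts` and its fields are hypothetical, and how the facts are collected from a canary node (driver query, workload run, metrics scrape) is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeFacts:
    driver: str            # GPU driver version reported by the canary node
    runtime: str           # container toolkit/runtime version
    workload_passed: bool  # representative GPU workload finished successfully
    telemetry_seen: bool   # GPU metrics were observed from the node


def promotion_blockers(canary, expected_driver, expected_runtime):
    """Return reasons to block image promotion; an empty list means promote."""
    blockers = []
    if canary.driver != expected_driver:
        blockers.append(f"driver {canary.driver} != {expected_driver}")
    if canary.runtime != expected_runtime:
        blockers.append(f"runtime {canary.runtime} != {expected_runtime}")
    if not canary.workload_passed:
        blockers.append("representative GPU workload failed")
    if not canary.telemetry_seen:
        blockers.append("no GPU telemetry from canary")
    return blockers
```

Returning every reason, instead of failing on the first one, gives the on-call engineer the full picture of a bad image in a single run.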
Incident Drill Matrix
| Incident | First stabilizer | Evidence to split layers | Prevention |
|---|---|---|---|
| API server slow, traffic healthy | Freeze nonessential deploys and reduce write load. | API server latency, admission errors, APF queues, controller lag, data-plane SLO. | Control-plane SLOs, admission dashboards, write-load budgets. |
| Etcd write timeouts | Protect quorum and stop risky automation. | Etcd member health, fsync latency, leader changes, DB size, API storage latency. | Tested snapshot restore, compaction/defrag hygiene, disk latency alerts. |
| Admission webhook outage | Pause rollouts or use approved bypass. | Webhook endpoints, failure policy, timeout, certs, selector scope, admission audit logs. | Webhook SLO, fail-open/closed policy by risk, staged webhook rollout. |
| Terraform state lock stuck | Confirm no active apply before force-unlock. | CI runs, backend lock owner, timestamps, plan/apply logs, state serial. | Run ownership metadata, timeout cleanup, backend-specific lock runbook. |
| Terraform unexpected destroy | Block apply. | Plan JSON, state addresses, module path changes, provider aliases, count/for_each key changes. | Plan policy for destroy/replace, moved blocks, refactor review checklist. |
| EndpointSlice stale during rollout | Shift traffic to known-good route or pause rollout. | Pod readiness transitions, EndpointSlice contents, Service selectors, Gateway route backend refs. | Synthetic endpoint propagation checks and rollout gates. |
| Model artifact store slows scale-out | Shift to warm capacity or shed lower-priority load. | Image pull time, artifact download time, object-store errors, node disk/network saturation, model load. | Regional cache, prefetch, checksum validation, warm pools. |
| Retry storm after p99 spike | Cut retry budget or shed load at edge. | Gateway retries, client deadlines, request concurrency, Triton queue time, downstream saturation. | Retry budgets, deadline propagation, overload tests. |
| GPU device plugin flapping | Cordon suspect nodes or pool. | Device plugin logs, kubelet allocatable changes, DCGM, Xid/ECC, driver/runtime events. | Operator canary, node quarantine automation, driver compatibility gates. |
| GitOps reverts emergency hotfix | Pause sync intentionally or commit the fix to Git. | Controller events, drift diff, incident timeline, live vs desired manifests. | Break-glass process and time-bounded sync pause policy. |
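The "Plan policy for destroy/replace" prevention in the Terraform row can be sketched as a check over the plan's JSON representation, as produced by `terraform show -json` on a saved plan file. The sketch relies only on the `resource_changes` list and each entry's `change.actions`. The function name and the choice to label any delete paired with a create as a replace are illustrative assumptions.

```python
import json


def destructive_changes(plan_json):
    """List (address, kind) for resources whose planned actions include a delete."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A delete paired with a create is a replacement; a lone delete is a destroy.
            kind = "replace" if "create" in actions else "destroy"
            flagged.append((rc["address"], kind))
    return flagged
```

A CI job could fail the pipeline, or require an explicit approval, whenever this list is non-empty. That matches the "Block apply" stabilizer for the unexpected-destroy incident.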
flowchart TD
Page[Page fires] --> Impact{User impact active}
Impact -- yes --> Stabilize[Mitigate first]
Impact -- no --> Preserve[Preserve evidence and scope]
Stabilize --> Scope[Scope by layer]
Preserve --> Scope
Scope --> Change{Recent change}
Change -- yes --> Rollback[Rollback or pause change]
Change -- no --> Saturation{Saturation or dependency}
Saturation -- yes --> Capacity[Reduce load, add warm capacity, or shed]
Saturation -- no --> DeepDive[Debug internals and missing signal]
Rollback --> Verify[Verify SLO recovery]
Capacity --> Verify
DeepDive --> Verify
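The flowchart can also be read as a small decision function. The sketch below mirrors its branches one for one. The function name and boolean parameters are made up for illustration, and the returned strings are the node labels from the diagram.

```python
def triage(user_impact, recent_change, saturation_or_dependency):
    """Return the ordered steps the flowchart prescribes for one page."""
    steps = ["Mitigate first" if user_impact else "Preserve evidence and scope"]
    steps.append("Scope by layer")
    if recent_change:
        steps.append("Rollback or pause change")
    elif saturation_or_dependency:
        steps.append("Reduce load, add warm capacity, or shed")
    else:
        steps.append("Debug internals and missing signal")
    steps.append("Verify SLO recovery")
    return steps
```

Every path ends in "Verify SLO recovery", which is the point of the diagram: no branch counts as finished until user-facing recovery is confirmed.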
Metrics To Mention
- MTTD, MTTA, MTTR.
- Alert precision and recall.
- Error budget burn.
- Toil hours per week.
- Change failure rate.
- Rollback frequency.
- Automation success and manual override rate.
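A few of these metrics reduce to simple ratios. The sketch below shows one plausible way to compute them, assuming pages and incidents have already been classified by hand. The function names and input counts are illustrative, and the inputs are assumed to be non-empty and non-zero.

```python
def alert_precision_recall(true_pages, false_pages, missed_incidents):
    """Precision: share of pages that were real. Recall: share of incidents that paged."""
    precision = true_pages / (true_pages + false_pages)
    recall = true_pages / (true_pages + missed_incidents)
    return precision, recall


def change_failure_rate(failed_changes, total_changes):
    """Fraction of changes that caused a failure needing remediation."""
    return failed_changes / total_changes


def mean_minutes(durations):
    """Mean of per-incident durations, e.g. MTTD or MTTR in minutes."""
    return sum(durations) / len(durations)
```

For instance, 8 real pages, 2 false pages, and 2 incidents that never paged give a precision and recall of 0.8 each. Low precision suggests noisy alerts, and low recall suggests missing signals.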