Observability and Incidents

Every production workflow should answer:

  • Is the user impacted?
  • Which service/model/version/pool is affected?
  • What changed?
  • Which layer is failing?
  • What is the safest mitigation?
  • How do we know it is fixed?

flowchart LR
  Signal[Alert signal] --> Context[Dashboards and traces]
  Context --> Change[Recent change timeline]
  Change --> Hypothesis[Layered hypothesis]
  Hypothesis --> Mitigation[Mitigation]
  Mitigation --> Verification[Verify SLO recovery]
  Verification --> Postmortem[Postmortem and prevention]

  • Availability: success rate, 5xx, model errors, admission errors.
  • Latency: p50/p95/p99, queue time, batch wait, model compute time.
  • Traffic: request rate, token rate, request shape, tenant mix.
  • Saturation: GPU utilization, GPU memory, CPU, network, disk, scheduler pending.
  • Correctness: bad responses, validation errors, model version mismatch.
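The availability and latency signals above can be computed from raw request samples. The following is a minimal sketch, assuming each sample carries an HTTP status and an end-to-end latency; the `RequestSample` type, the `summarize` helper, and the choice of nearest-rank percentiles are illustrative assumptions, not something this page prescribes.

```python
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # end-to-end latency seen by the caller


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile needs at least one sample")
    ordered = sorted(values)
    # Ceiling of pct/100 * n, done in integer math to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(samples):
    """Return success rate and latency percentiles for one time window."""
    latencies = [s.latency_ms for s in samples]
    server_errors = sum(1 for s in samples if s.status >= 500)
    return {
        "success_rate": (len(samples) - server_errors) / len(samples),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

In practice these values usually come from histogram metrics rather than raw samples; the sketch only shows what the numbers mean.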

Good alerts:

  • Page on user impact or imminent impact.
  • Include scope and likely owner.
  • Link to a runbook and dashboard.
  • Avoid paging on noisy symptoms that have no associated action.
  • Have inhibition rules for dependency-wide failures.

Bad alerts:

  • Page on high CPU alone, without customer impact.
  • Send many duplicate pages for one root cause.
  • Lack service/model/pool labels.
  • Fire after the incident is already obvious from user reports.
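Inhibition rules for dependency-wide failures can be sketched as a filter over the currently firing alerts. This is an illustration only: the `Alert` and `InhibitRule` types and the alert names are hypothetical. The source/target/equal-labels shape is loosely modeled on how alerting tools commonly express inhibition, not on any specific tool's API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    labels: dict


@dataclass
class InhibitRule:
    source: str   # alert that represents the dependency-wide failure
    target: str   # downstream symptom alert to suppress
    equal: tuple  # label names that must match between source and target


def pages_to_send(firing, rules):
    """Drop target alerts when a matching source alert is also firing."""
    kept = []
    for alert in firing:
        inhibited = any(
            rule.target == alert.name
            and any(
                src.name == rule.source
                # A label missing on both sides compares equal (None == None).
                and all(src.labels.get(k) == alert.labels.get(k) for k in rule.equal)
                for src in firing
            )
            for rule in rules
        )
        if not inhibited:
            kept.append(alert)
    return kept
```

With a rule such as `InhibitRule("ObjectStoreDown", "ModelLoadSlow", ("region",))`, a regional object-store outage pages once instead of once per affected model pool, while the same symptom in a healthy region still pages.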

Use this sequence in behavioral and technical interview loops:

  1. Declare severity and owner. “I will take incident command unless someone else is already assigned.”
  2. Stabilize. Freeze risky deploys, protect evidence, reduce blast radius.
  3. Assess impact. Affected tenants, models, regions, SLO, error budget burn.
  4. Form hypotheses by layer. Traffic, app, model server, scheduler, node, GPU, network, storage, dependency.
  5. Mitigate before perfect root cause. Rollback, fail over, disable bad pool, scale known-good capacity.
  6. Communicate. Timestamped updates, next action, ETA for next update.
  7. Verify recovery. User-facing metrics first, then internal health.
  8. Follow through. Postmortem, action items with owners, prevention automation.
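For step 3, error budget burn can be expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a request-based availability SLO and a 30-day (720-hour) budget window; both are assumptions for illustration, as are the function names.

```python
def burn_rate(failed, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    higher values spend it proportionally faster.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def hours_until_exhausted(rate, window_hours=720.0):
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate
```

For example, 50 failures in 10,000 requests against a 99.9% target gives a burn rate of about 5, which would spend a full 30-day budget in roughly 144 hours if nothing changed. That number is useful in the impact assessment and in the first status update.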

sequenceDiagram
  participant IC as Incident commander
  participant Ops as Operator
  participant SRE as Platform engineer
  participant Comms as Stakeholders
  IC->>Ops: stabilize customer impact
  IC->>SRE: isolate failing layer
  IC->>Comms: impact and next update time
  SRE-->>IC: hypothesis and evidence
  Ops-->>IC: mitigation result
  IC->>Comms: recovery status

Strong postmortems include:

  • Timeline with detection, mitigation, and recovery.
  • Customer impact.
  • What changed.
  • What signals were missing or misleading.
  • Why safeguards did not catch it.
  • Action items that remove a class of failure, not just one symptom.

Prompt: “A GPU pool is failing health checks after a node image update.”

Answer:

I would halt rollout, identify the affected pool and image version, and compare canary nodes with healthy nodes. I would check driver, kernel, container toolkit, device plugin, kubelet events, and DCGM exporter. If production is impacted, I would cordon affected nodes and shift traffic or roll back the image. After recovery, I would add a pre-promotion validator that starts a representative GPU workload, verifies telemetry, and blocks promotion on driver/runtime mismatch.
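The pre-promotion validator in that answer could look roughly like the sketch below. Everything here is hypothetical: the `CanaryReport` fields, the `compatible` table, and the version strings are placeholders, and a real gate would collect these facts from the canary node rather than receive them as arguments.

```python
from dataclasses import dataclass


@dataclass
class CanaryReport:
    driver: str                # driver version reported by the canary node
    container_toolkit: str     # container toolkit version on the node image
    gpu_workload_passed: bool  # representative GPU workload started and finished
    telemetry_seen: bool       # GPU metrics arrived from the node's exporter


def promotion_blockers(report, compatible):
    """Return reasons to block promotion; an empty list means promote.

    `compatible` maps a driver version to the set of toolkit versions
    that have been validated with it (a hypothetical, team-owned table).
    """
    blockers = []
    if report.container_toolkit not in compatible.get(report.driver, set()):
        blockers.append("driver/runtime mismatch")
    if not report.gpu_workload_passed:
        blockers.append("representative GPU workload failed")
    if not report.telemetry_seen:
        blockers.append("GPU telemetry missing")
    return blockers
```

Returning the list of reasons, rather than a single boolean, gives the rollout log enough detail to explain why an image was held back.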

| Incident | First stabilizer | Evidence to split layers | Prevention |
| --- | --- | --- | --- |
| API server slow, traffic healthy | Freeze nonessential deploys and reduce write load. | API server latency, admission errors, APF queues, controller lag, data-plane SLO. | Control-plane SLOs, admission dashboards, write-load budgets. |
| Etcd write timeouts | Protect quorum and stop risky automation. | Etcd member health, fsync latency, leader changes, DB size, API storage latency. | Tested snapshot restore, compaction/defrag hygiene, disk latency alerts. |
| Admission webhook outage | Pause rollouts or use approved bypass. | Webhook endpoints, failure policy, timeout, certs, selector scope, admission audit logs. | Webhook SLO, fail-open/closed policy by risk, staged webhook rollout. |
| Terraform state lock stuck | Confirm no active apply before force-unlock. | CI runs, backend lock owner, timestamps, plan/apply logs, state serial. | Run ownership metadata, timeout cleanup, backend-specific lock runbook. |
| Terraform unexpected destroy | Block apply. | Plan JSON, state addresses, module path changes, provider aliases, count/for_each key changes. | Plan policy for destroy/replace, moved blocks, refactor review checklist. |
| EndpointSlice stale during rollout | Shift traffic to known-good route or pause rollout. | Pod readiness transitions, EndpointSlice contents, Service selectors, Gateway route backend refs. | Synthetic endpoint propagation checks and rollout gates. |
| Model artifact store slows scale-out | Shift to warm capacity or shed lower-priority load. | Image pull time, artifact download time, object-store errors, node disk/network saturation, model load. | Regional cache, prefetch, checksum validation, warm pools. |
| Retry storm after p99 spike | Cut retry budget or shed load at edge. | Gateway retries, client deadlines, request concurrency, Triton queue time, downstream saturation. | Retry budgets, deadline propagation, overload tests. |
| GPU device plugin flapping | Cordon suspect nodes or pool. | Device plugin logs, kubelet allocatable changes, DCGM, Xid/ECC, driver/runtime events. | Operator canary, node quarantine automation, driver compatibility gates. |
| GitOps reverts emergency hotfix | Pause sync intentionally or commit fix. | Controller events, drift diff, incident timeline, live vs desired manifests. | Break-glass process and time-bounded sync pause policy. |
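As one example of the prevention column, a plan policy for destroy/replace can be sketched as a check over Terraform's JSON plan output (as produced by `terraform show -json` on a saved plan). The field names used here (`resource_changes`, `address`, `change.actions`) follow my understanding of that JSON representation and should be verified against the Terraform version in use; the function name and the resource addresses in the usage note are hypothetical.

```python
import json


def destructive_changes(plan_json_text):
    """List (address, kind) for every planned destroy or replace."""
    plan = json.loads(plan_json_text)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A replace appears as delete and create together, in either order.
            kind = "replace" if "create" in actions else "destroy"
            flagged.append((rc.get("address", "<unknown>"), kind))
    return flagged
```

A CI step could fail the pipeline whenever this list is non-empty unless the change carries an explicit approval, so that a refactor that changes `count`/`for_each` keys, as in the table row, is caught before apply. For a plan that updates `example_pool.a`, deletes `example_pool.b`, and replaces `example_pool.c`, the function returns the last two addresses tagged `destroy` and `replace`.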

flowchart TD
  Page[Page fires] --> Impact{User impact active}
  Impact -- yes --> Stabilize[Mitigate first]
  Impact -- no --> Preserve[Preserve evidence and scope]
  Stabilize --> Scope[Scope by layer]
  Preserve --> Scope
  Scope --> Change{Recent change}
  Change -- yes --> Rollback[Rollback or pause change]
  Change -- no --> Saturation{Saturation or dependency}
  Saturation -- yes --> Capacity[Reduce load, add warm capacity, or shed]
  Saturation -- no --> DeepDive[Debug internals and missing signal]
  Rollback --> Verify[Verify SLO recovery]
  Capacity --> Verify
  DeepDive --> Verify

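The triage flow above can also be written as a small function, which makes the ordering of the decisions explicit. This is a sketch with hypothetical parameter names; the returned step labels are taken from the diagram.

```python
def triage(user_impact, recent_change, saturation_or_dependency):
    """Return the ordered steps the flowchart prescribes for one page."""
    steps = ["Mitigate first" if user_impact else "Preserve evidence and scope"]
    steps.append("Scope by layer")
    # A recent change is checked before saturation, as in the diagram.
    if recent_change:
        steps.append("Rollback or pause change")
    elif saturation_or_dependency:
        steps.append("Reduce load, add warm capacity, or shed")
    else:
        steps.append("Debug internals and missing signal")
    steps.append("Verify SLO recovery")
    return steps
```

Note that every branch ends in the same verification step, so no mitigation is considered complete until SLO recovery is confirmed.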
  • MTTD, MTTA, MTTR.
  • Alert precision and recall.
  • Error budget burn.
  • Toil hours per week.
  • Change failure rate.
  • Rollback frequency.
  • Automation success and manual override rate.