Break-Fix Automation

Design automation that detects unhealthy GPU infrastructure, quarantines it, repairs it, and safely returns it to service.

Goals:

  • Reduce MTTR.
  • Reduce manual toil.
  • Avoid making incidents worse.
  • Preserve auditability.
  • Avoid draining too much capacity.

Signals:

  • GPU telemetry: Xid errors, ECC errors, memory faults, temperature, utilization anomalies.
  • Node telemetry: kubelet, container runtime, CPU, memory, disk, kernel logs.
  • Workload telemetry: crash loops, failed inference requests, latency spikes, readiness failures.
  • Scheduler state: pending pods, evictions, unschedulable reasons.
  • Change events: deploys, node image updates, driver changes.

Classify states:

  • Healthy
  • Suspect
  • Quarantine
  • Repairing
  • Needs human/hardware escalation
  • Validating
  • Return to service

Use confidence levels. A single noisy metric should not reboot production hardware unless policy explicitly allows it.
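
One way to make "confidence levels" concrete is to combine evidence per signal source and require corroboration before quarantine. The sketch below is illustrative only: the `Signal` shape, the noisy-OR combination, the thresholds, and the `min_sources` rule are all assumptions, not a prescribed policy, and it covers only the first three states.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    QUARANTINE = "quarantine"


@dataclass(frozen=True)
class Signal:
    source: str        # assumed categories: "gpu", "node", "workload", ...
    name: str          # e.g. "xid", "readiness_failure"
    confidence: float  # 0.0 to 1.0: how strongly this signal suggests a fault


def combined_confidence(signals):
    """Noisy-OR over the strongest signal per source.

    Taking the max per source means one chatty metric cannot vote many times.
    """
    best = {}
    for s in signals:
        best[s.source] = max(best.get(s.source, 0.0), s.confidence)
    miss = 1.0
    for c in best.values():
        miss *= 1.0 - c
    return 1.0 - miss


def classify(signals, suspect_at=0.3, quarantine_at=0.8, min_sources=2):
    """Return (state, confidence). Thresholds are placeholder values."""
    conf = combined_confidence(signals)
    sources = {s.source for s in signals if s.confidence > 0}
    # Quarantine needs both high confidence and independent corroboration.
    if conf >= quarantine_at and len(sources) >= min_sources:
        return NodeState.QUARANTINE, conf
    if conf >= suspect_at:
        return NodeState.SUSPECT, conf
    return NodeState.HEALTHY, conf
```

With these placeholder thresholds, a single high-confidence GPU signal marks the node Suspect rather than Quarantine; a second source, such as workload readiness failures, is needed before the node is pulled.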

Decision flow:

flowchart TD
  Alert[Alert or anomaly] --> Classify[Classify failure domain]
  Classify --> Evidence[Collect bounded evidence]
  Evidence --> Confidence{Confidence high}
  Confidence -- no --> Human[Human review]
  Confidence -- yes --> Risk{Blast radius acceptable}
  Risk -- no --> Human
  Risk -- yes --> Action[Execute approved repair]
  Action --> Verify[Verify recovery]
  Verify --> Audit[Audit log and learning item]
  Human --> Audit

Remediation workflow:

  1. Detect and correlate.
  2. Check current fleet health and spare capacity.
  3. Cordon suspect node.
  4. Drain if safe and PDBs allow.
  5. Run bounded repair action.
  6. Validate with device plugin, DCGM, test workload, and service-level signals.
  7. Uncordon or escalate.
  8. Record action, reason, and result.
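
The eight steps above can be sketched as a single function with early exits. This is a minimal sketch, not a definitive implementation: `remediate`, the `ops` interface, and `DryRunOps` are hypothetical names. In a real cluster the `ops` methods would wrap things like cordon, drain, and validation jobs; here they only log calls so the control flow can be exercised safely.

```python
def remediate(node, ops, audit):
    """Run the workflow for one node; every exit path writes an audit record."""
    def record(action, result):               # 8. record action and result
        audit.append({"node": node, "action": action, "result": result})
        return result

    if not ops.correlated(node):              # 1. detect and correlate
        return record("detect", "no_action")
    if not ops.has_spare_capacity(node):      # 2. fleet health and spare capacity
        return record("capacity_check", "escalated")
    ops.cordon(node)                          # 3. cordon suspect node
    record("cordon", "ok")
    if not ops.drain_if_safe(node):           # 4. drain only if safe and PDBs allow
        return record("drain", "escalated")
    record("drain", "ok")
    ops.repair(node)                          # 5. bounded repair action
    record("repair", "ok")
    if ops.validate(node):                    # 6. validate
        ops.uncordon(node)                    # 7. uncordon ...
        return record("uncordon", "returned_to_service")
    return record("validate", "escalated")    # 7. ... or escalate


class DryRunOps:
    """Hypothetical stand-in that logs calls instead of touching a cluster."""

    def __init__(self, spare=True, pdb_allows=True, healthy_after_repair=True):
        self.spare = spare
        self.pdb_allows = pdb_allows
        self.healthy_after_repair = healthy_after_repair
        self.calls = []

    def correlated(self, node):
        return True

    def has_spare_capacity(self, node):
        return self.spare

    def cordon(self, node):
        self.calls.append("cordon")

    def drain_if_safe(self, node):
        if self.pdb_allows:
            self.calls.append("drain")
        return self.pdb_allows

    def repair(self, node):
        self.calls.append("repair")

    def validate(self, node):
        self.calls.append("validate")
        return self.healthy_after_repair

    def uncordon(self, node):
        self.calls.append("uncordon")
```

Note the design choice in this sketch: a node that fails validation or cannot be drained stays cordoned and is escalated, so automation never returns an unverified node to service.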

Workflow states:

stateDiagram-v2
  [*] --> Alert
  Alert --> Evidence
  Evidence --> Plan
  Plan --> Approval
  Approval --> Execute: safe
  Approval --> Escalate: unsafe or unknown
  Execute --> Verify
  Verify --> Close: recovered
  Verify --> Escalate: not recovered
  Escalate --> Close
  Close --> [*]
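
The state diagram above maps naturally onto an explicit transition table, which makes illegal jumps (for example, executing without approval) fail loudly. In this sketch the event names for the unlabeled edges (`collected`, `planned`, `submitted`, `done`, `resolved`) are assumptions added for illustration; the labeled ones follow the diagram.

```python
TRANSITIONS = {
    ("Alert", "collected"): "Evidence",
    ("Evidence", "planned"): "Plan",
    ("Plan", "submitted"): "Approval",
    ("Approval", "safe"): "Execute",
    ("Approval", "unsafe"): "Escalate",
    ("Approval", "unknown"): "Escalate",
    ("Execute", "done"): "Verify",
    ("Verify", "recovered"): "Close",
    ("Verify", "not_recovered"): "Escalate",
    ("Escalate", "resolved"): "Close",
}


def advance(state, event):
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition from {state!r} on {event!r}") from None


def run(events, start="Alert"):
    """Replay a sequence of events and return the list of states visited."""
    state = start
    path = [state]
    for event in events:
        state = advance(state, event)
        path.append(state)
    return path
```

Because every edge is listed in one place, the table can double as the audit vocabulary: each logged event is a key that either exists in `TRANSITIONS` or is rejected.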

Guardrails:

  • At most one node per rack, pool, or SKU in repair at a time, unless emergency mode is enabled.
  • Capacity headroom check before drain.
  • Exclusion labels for critical or experimental pools.
  • Manual approval for destructive repair.
  • No infinite retry loops.
  • Kill switch and audit log.
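
The limits listed above can be checked in one pre-flight function that returns every violated rule rather than stopping at the first, which keeps the audit trail informative. This is a sketch under stated assumptions: the `Node` and `Policy` shapes, the `repair-excluded` label, the 0.9 schedulable fraction, and the retry budget of 3 are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    pool: str
    sku: str
    labels: frozenset = frozenset()


@dataclass(frozen=True)
class Policy:
    min_schedulable_fraction: float = 0.9   # assumed headroom target
    max_attempts: int = 3                   # assumed retry budget
    excluded_labels: frozenset = frozenset({"repair-excluded"})
    emergency_mode: bool = False
    kill_switch: bool = False


def check_guardrails(node, in_repair, schedulable, total, attempts,
                     destructive, approved, policy=Policy()):
    """Return the list of violated limits; an empty list means proceed."""
    violations = []
    if policy.kill_switch:
        violations.append("kill switch engaged")
    if node.labels & policy.excluded_labels:
        violations.append("node carries an exclusion label")
    if not policy.emergency_mode and any(
        other.rack == node.rack or other.pool == node.pool or other.sku == node.sku
        for other in in_repair
    ):
        violations.append("another node in the same rack, pool, or SKU is in repair")
    # Headroom is evaluated as if this node were already drained.
    if total <= 0 or (schedulable - 1) / total < policy.min_schedulable_fraction:
        violations.append("insufficient capacity headroom")
    if attempts >= policy.max_attempts:
        violations.append("retry budget exhausted")
    if destructive and not approved:
        violations.append("destructive repair requires manual approval")
    return violations
```

In this sketch emergency mode relaxes only the one-node-at-a-time rule; the kill switch, capacity check, retry budget, and approval requirement still apply.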

API:

POST /repair-plans
  input: pool, node selectors, failure class, dry_run
  output: affected nodes, planned actions, risk, capacity impact

POST /repair-plans/{id}/execute
  input: approval token, concurrency, rollback policy
  output: workflow id

GET /workflows/{id}
  output: current step, events, metrics, errors
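
To illustrate the plan-creation endpoint above, the following sketch models its input and output as dataclasses and builds a plan from an in-memory inventory. Field names follow the endpoint description; everything else (the `ACTIONS` mapping, the inventory shape, the 25% risk threshold, and `create_repair_plan` itself) is a hypothetical stand-in, not a real service.

```python
from dataclasses import dataclass


@dataclass
class RepairPlanRequest:
    pool: str
    node_selectors: dict
    failure_class: str
    dry_run: bool = True   # carried on the request; this sketch never executes anything


@dataclass
class RepairPlan:
    id: str
    affected_nodes: list
    planned_actions: list
    risk: str
    capacity_impact: float  # fraction of the pool the plan would remove


# Hypothetical mapping from failure class to an ordered action list.
ACTIONS = {
    "xid": ["cordon", "drain", "gpu_reset", "validate"],
    "kubelet": ["cordon", "restart_kubelet", "validate"],
}


def create_repair_plan(req, inventory, plan_id="plan-1"):
    """Build a plan without side effects. `inventory` is a list of dicts
    with `name`, `pool`, and `labels` keys (an assumed shape)."""
    pool_nodes = [n for n in inventory if n["pool"] == req.pool]
    affected = [
        n["name"] for n in pool_nodes
        if all(n["labels"].get(k) == v for k, v in req.node_selectors.items())
    ]
    # Unknown failure classes are never auto-repaired.
    actions = ACTIONS.get(req.failure_class, ["escalate_to_human"])
    impact = len(affected) / len(pool_nodes) if pool_nodes else 0.0
    risk = "high" if impact > 0.25 or actions == ["escalate_to_human"] else "low"
    return RepairPlan(plan_id, affected, actions, risk, impact)
```

Keeping plan creation free of side effects is what lets the execute endpoint require a separate approval token: the plan can be reviewed, diffed, and audited before anything touches a node.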

Metrics:

  • Automation success rate.
  • Manual override rate.
  • False-positive quarantine rate.
  • Mean repair time.
  • Nodes by state.
  • Capacity removed by automation.
  • Repeat offender nodes.
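
Several of the metrics above can be derived from the workflow audit records. The sketch below assumes a hypothetical record schema (`node`, `outcome`, `overridden`, `false_positive`, `repair_minutes`) and treats a node with two or more workflows as a repeat offender; both are illustrative choices.

```python
from collections import Counter


def summarize(workflows, repeat_threshold=2):
    """Compute fleet-level rates from a list of workflow records (dicts)."""
    n = len(workflows)
    if n == 0:
        return {}
    per_node = Counter(w["node"] for w in workflows)
    # Mean repair time only counts workflows that actually recovered the node.
    durations = [w["repair_minutes"] for w in workflows if w["outcome"] == "recovered"]
    return {
        "automation_success_rate": sum(w["outcome"] == "recovered" for w in workflows) / n,
        "manual_override_rate": sum(bool(w["overridden"]) for w in workflows) / n,
        "false_positive_quarantine_rate": sum(bool(w["false_positive"]) for w in workflows) / n,
        "mean_repair_minutes": sum(durations) / len(durations) if durations else None,
        "repeat_offenders": sorted(
            node for node, count in per_node.items() if count >= repeat_threshold
        ),
    }
```

Tracking the false-positive rate per failure class, rather than only fleet-wide, is what would justify graduating a specific class from recommendation mode to automatic execution.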

The hard part is not detecting unhealthy nodes. The hard part is deciding when automation is allowed to act. Your answer should emphasize policy, confidence, capacity, and customer impact.

Closing:

I would rather begin with recommendation-mode automation that produces high-quality repair plans, then graduate specific low-risk classes to automatic execution once precision is proven.