Break-Fix Automation
Prompt
Design automation that detects unhealthy GPU infrastructure, quarantines it, repairs it, and safely returns it to service.
- Reduce MTTR.
- Reduce manual toil.
- Avoid making incidents worse.
- Preserve auditability.
- Avoid draining too much capacity.
Inputs
- GPU telemetry: Xid, ECC, memory, temperature, utilization anomalies.
- Node telemetry: kubelet, container runtime, CPU, memory, disk, kernel logs.
- Workload telemetry: crash loops, failed inference, latency spike, readiness failures.
- Scheduler state: pending pods, evictions, unschedulable reasons.
- Change events: deploys, node image updates, driver changes.
Decision Engine
Classify each node into one of these states:
- Healthy
- Suspect
- Quarantine
- Repairing
- Needs human/hardware escalation
- Validating
- Return to service
Use confidence levels. A single noisy metric should not reboot production hardware unless policy explicitly allows it.
flowchart TD
Alert[Alert or anomaly] --> Classify[Classify failure domain]
Classify --> Evidence[Collect bounded evidence]
Evidence --> Confidence{Confidence high}
Confidence -- no --> Human[Human review]
Confidence -- yes --> Risk{Blast radius acceptable}
Risk -- no --> Human
Risk -- yes --> Action[Execute approved repair]
Action --> Verify[Verify recovery]
Verify --> Audit[Audit log and learning item]
Human --> Audit
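The confidence and blast-radius gates in the flowchart can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Signal` fields, the per-signal weights, the 0.9 threshold, the two-source minimum, and the noisy-OR combination rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    HUMAN_REVIEW = "human_review"
    EXECUTE_REPAIR = "execute_repair"


@dataclass(frozen=True)
class Signal:
    source: str    # hypothetical source names, e.g. "gpu", "node", "workload"
    name: str      # e.g. "xid", "readiness_failure"
    weight: float  # assumed confidence contribution in [0, 1]


def confidence(signals: list[Signal]) -> float:
    """Noisy-OR over the strongest signal per source (an assumed rule)."""
    strongest: dict[str, float] = {}
    for s in signals:
        strongest[s.source] = max(strongest.get(s.source, 0.0), s.weight)
    miss = 1.0
    for w in strongest.values():
        miss *= 1.0 - w
    return 1.0 - miss


def decide(
    signals: list[Signal],
    nodes_affected: int,
    max_nodes: int,
    threshold: float = 0.9,
    min_sources: int = 2,
) -> Decision:
    """Act automatically only with corroborated evidence and bounded impact."""
    sources = {s.source for s in signals}
    # One noisy metric never qualifies on its own, however strong it looks.
    if len(sources) < min_sources or confidence(signals) < threshold:
        return Decision.HUMAN_REVIEW
    # Blast radius gate: too many nodes touched means a human decides.
    if nodes_affected > max_nodes:
        return Decision.HUMAN_REVIEW
    return Decision.EXECUTE_REPAIR
```

Taking only the strongest signal per source keeps ten copies of the same flapping metric from adding up to high confidence.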
Workflow
- Detect and correlate.
- Check current fleet health and spare capacity.
- Cordon suspect node.
- Drain if safe and PDBs allow.
- Run bounded repair action.
- Validate with device plugin, DCGM, test workload, and service-level signals.
- Uncordon or escalate.
- Record action, reason, and result.
stateDiagram-v2
[*] --> Alert
Alert --> Evidence
Evidence --> Plan
Plan --> Approval
Approval --> Execute: safe
Approval --> Escalate: unsafe or unknown
Execute --> Verify
Verify --> Close: recovered
Verify --> Escalate: not recovered
Escalate --> Close
Close --> [*]
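One way to enforce the state diagram in code is an explicit transition table that rejects any step the diagram does not allow. The `Workflow` class and its `history` list are hypothetical names for this sketch; the history doubles as the audit trail called for in the last workflow step.

```python
from enum import Enum


class Step(Enum):
    ALERT = "Alert"
    EVIDENCE = "Evidence"
    PLAN = "Plan"
    APPROVAL = "Approval"
    EXECUTE = "Execute"
    VERIFY = "Verify"
    ESCALATE = "Escalate"
    CLOSE = "Close"


# Allowed transitions, copied from the state diagram.
ALLOWED: dict[Step, set[Step]] = {
    Step.ALERT: {Step.EVIDENCE},
    Step.EVIDENCE: {Step.PLAN},
    Step.PLAN: {Step.APPROVAL},
    Step.APPROVAL: {Step.EXECUTE, Step.ESCALATE},
    Step.EXECUTE: {Step.VERIFY},
    Step.VERIFY: {Step.CLOSE, Step.ESCALATE},
    Step.ESCALATE: {Step.CLOSE},
    Step.CLOSE: set(),
}


class Workflow:
    """Tracks one node's repair and records every transition with a reason."""

    def __init__(self, node: str) -> None:
        self.node = node
        self.step = Step.ALERT
        self.history: list[tuple[Step, str]] = [(Step.ALERT, "created")]

    def advance(self, target: Step, reason: str) -> None:
        if target not in ALLOWED[self.step]:
            raise ValueError(
                f"{self.node}: {self.step.value} -> {target.value} is not allowed"
            )
        self.step = target
        self.history.append((target, reason))
```

Because `Execute` is reachable only from `Approval`, a bug elsewhere cannot skip the approval gate without raising.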
Guardrails
- At most one node per rack, pool, or SKU in repair at a time, unless emergency mode is enabled.
- Capacity headroom check before drain.
- Exclusion labels for critical or experimental pools.
- Manual approval for destructive repair.
- No infinite retry loops.
- Kill switch and audit log.
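The guardrails above can be expressed as a single pre-flight check that returns a decision plus a reason suitable for the audit log. Everything named here is an assumption for illustration: the `Node` and `FleetState` shapes, the `repair.example.com/exclude` label key, the retry limit of 3, and the premise that the node's GPUs still count toward healthy capacity until it is drained.

```python
from dataclasses import dataclass, field

EXCLUDE_LABEL = "repair.example.com/exclude"  # hypothetical label key
MAX_ATTEMPTS = 3  # assumed retry budget per node


@dataclass
class Node:
    name: str
    rack: str
    pool: str
    sku: str
    gpus: int
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class FleetState:
    in_repair: list[Node]
    healthy_gpus: int   # schedulable GPUs right now (assumed to include the candidate)
    required_gpus: int  # demand plus headroom that must remain schedulable
    kill_switch: bool = False


def may_repair(
    node: Node, attempts: int, fleet: FleetState, emergency: bool = False
) -> tuple[bool, str]:
    """Return (allowed, reason); the reason is meant for the audit log."""
    if fleet.kill_switch:
        return False, "kill switch engaged"
    if node.labels.get(EXCLUDE_LABEL) == "true":
        return False, "node excluded by label"
    if attempts >= MAX_ATTEMPTS:
        return False, "retry budget exhausted, escalate"
    if fleet.healthy_gpus - node.gpus < fleet.required_gpus:
        return False, "insufficient capacity headroom"
    if not emergency:
        for other in fleet.in_repair:
            if (other.rack, other.pool, other.sku) != (node.rack, node.pool, node.sku) and not (
                other.rack == node.rack or other.pool == node.pool or other.sku == node.sku
            ):
                continue
            return False, f"{other.name} already in repair in same rack/pool/SKU"
    return True, "ok"
```

Only the one-node-at-a-time rule is relaxed in emergency mode; the kill switch, exclusion labels, retry budget, and headroom check still apply.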
API Shape
POST /repair-plans
input: pool, node selectors, failure class, dry_run
output: affected nodes, planned actions, risk, capacity impact
POST /repair-plans/{id}/execute
input: approval token, concurrency, rollback policy
output: workflow id
GET /workflows/{id}
output: current step, events, metrics, errors
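To make the request and response shapes concrete, here is a sketch of what a handler behind `POST /repair-plans` might compute. The dataclass field names follow the inputs and outputs listed above; the `PLAYBOOK` action names, the 5% low-risk cutoff, and the label-matching selector semantics are assumptions for the example.

```python
import uuid
from dataclasses import dataclass

# Hypothetical mapping from failure class to an ordered list of actions.
PLAYBOOK: dict[str, list[str]] = {
    "xid_error": ["cordon", "drain", "reset_gpu", "validate"],
    "kubelet_unresponsive": ["cordon", "restart_kubelet", "validate"],
}


@dataclass
class RepairPlanRequest:
    pool: str
    node_selectors: dict[str, str]
    failure_class: str
    dry_run: bool = True  # planning never mutates anything in this sketch


@dataclass
class RepairPlan:
    id: str
    affected_nodes: list[str]
    planned_actions: list[str]
    risk: str               # "low" or "high" (assumed two-level scale)
    capacity_impact: float  # fraction of the pool's nodes taken out


def build_plan(
    req: RepairPlanRequest,
    inventory: dict[str, dict[str, str]],  # node name -> labels
    pool_size: int,
) -> RepairPlan:
    """Select matching nodes and describe what would be done to them."""
    affected = sorted(
        name
        for name, labels in inventory.items()
        if all(labels.get(k) == v for k, v in req.node_selectors.items())
    )
    # Unknown failure classes are never auto-repaired.
    actions = PLAYBOOK.get(req.failure_class, ["escalate_to_human"])
    impact = len(affected) / pool_size if pool_size else 1.0
    risk = "low" if impact <= 0.05 else "high"
    return RepairPlan(str(uuid.uuid4()), affected, actions, risk, impact)
```

Keeping plan creation separate from execution lets a reviewer inspect the affected nodes, risk, and capacity impact before any approval token is issued.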
Observability
- Automation success rate.
- Manual override rate.
- False-positive quarantine rate.
- Mean repair time.
- Nodes by state.
- Capacity removed by automation.
- Repeat offender nodes.
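Several of these metrics can be derived from the same audit records. The sketch below assumes a hypothetical `RepairRecord` shape and a repeat-offender threshold of two repairs; a real system would likely export these values to a metrics backend instead of computing them in batch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RepairRecord:
    node: str
    automated: bool       # action was taken without human approval
    succeeded: bool       # node passed validation afterwards
    overridden: bool      # a human overrode the automation's decision
    false_positive: bool  # node was later judged to have been healthy
    repair_seconds: float


def _ratio(part: int, whole: int) -> float:
    return part / whole if whole else 0.0


def summarize(records: list[RepairRecord], repeat_threshold: int = 2) -> dict:
    """Compute fleet-level automation metrics from audit records."""
    automated = [r for r in records if r.automated]
    succeeded = [r for r in records if r.succeeded]
    per_node = Counter(r.node for r in records)
    return {
        "automation_success_rate": _ratio(
            sum(r.succeeded for r in automated), len(automated)
        ),
        "manual_override_rate": _ratio(
            sum(r.overridden for r in records), len(records)
        ),
        "false_positive_rate": _ratio(
            sum(r.false_positive for r in records), len(records)
        ),
        # Mean over successful repairs only; failed ones end in escalation.
        "mean_repair_seconds": (
            sum(r.repair_seconds for r in succeeded) / len(succeeded)
            if succeeded
            else 0.0
        ),
        "repeat_offenders": sorted(
            n for n, count in per_node.items() if count >= repeat_threshold
        ),
    }
```

Repeat offenders are worth surfacing separately because a node that keeps returning to repair usually needs hardware escalation, not another automated attempt.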
Senior/Staff Discussion
The hard part is not detecting unhealthy nodes. The hard part is deciding when automation is allowed to act. Your answer should emphasize policy, confidence, capacity, and customer impact.
Closing:
I would rather begin with recommendation-mode automation that produces high-quality repair plans, then graduate specific low-risk classes to automatic execution once precision is proven.