Python and Reliability Code
Likely Coding Style
Expect practical automation rather than algorithm trivia:
- Parse logs or metrics.
- Write a health checker.
- Build an idempotent reconciliation loop.
- Retry safely.
- Rate-limit calls.
- Handle partial failure.
- Design CLI flags and dry-run behavior.
- Test edge cases.
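As one illustration of the "rate-limit calls" item, here is a minimal token-bucket sketch. The class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for this example, not a prescribed interface. Injecting the clock keeps the limiter testable without real sleeps.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow roughly `rate` calls per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so tests can control time
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available. Never blocks."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller that gets `False` can skip, queue, or back off; the limiter itself makes no decision about what happens to rejected work.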
Senior Coding Principles
- Make side effects explicit.
- Separate discovery, planning, and execution.
- Return structured results.
- Use timeouts.
- Use exponential backoff with jitter.
- Make retries conditional on safe error classes.
- Log enough to reconstruct decisions.
- Treat external API responses as untrusted.
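The backoff and conditional-retry principles above can be sketched as follows. This is one possible shape, not a canonical implementation: `TransientError`, the `retry` signature, and the default delays are all assumptions chosen for illustration. It uses the "full jitter" variant, where each sleep is drawn uniformly between zero and an exponentially growing ceiling.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors the caller considers safe to retry."""


def retry(
    op: Callable[[], T],
    *,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError, TimeoutError),
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> T:
    """Call `op` until it succeeds, retrying only on `retry_on` errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep in [0, min(max_delay, base_delay * 2^(attempt-1))].
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng(0.0, ceiling))
    raise AssertionError("unreachable")
```

Errors outside `retry_on` propagate immediately, which is the point: only wrap an operation in `retry` if repeating it is safe, and keep the retryable classes narrow.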
Reconciliation Pattern
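    def reconcile(desired, observed, *, dry_run):
        # Planning: a pure comparison of desired and observed state.
        plan = diff(desired, observed)
        # Guardrails run before any side effect.
        validate(plan)
        if dry_run:
            # Dry run returns the same plan that a real run would execute.
            return plan
        for step in plan.steps:
            execute_with_timeout(step)  # bounded execution
            verify(step)  # confirm the change actually took effect
        return summarize(plan)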
Explain:
- diff is pure and testable.
- validate enforces guardrails.
- execute is bounded and observable.
- verify prevents false success.
- dry_run is not an afterthought.
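To make the "pure and testable" point concrete, here is a hedged sketch of what a diff function could look like. It assumes desired and observed state are both mappings from resource name to spec; the `Step` and `Plan` types are hypothetical, introduced only so the result has a `steps` attribute like the plan in the pattern.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class Step:
    action: str  # "create", "update", or "delete"
    name: str


@dataclass(frozen=True)
class Plan:
    steps: Tuple[Step, ...]


def diff(desired: Mapping[str, Any], observed: Mapping[str, Any]) -> Plan:
    """Compute the steps that move `observed` toward `desired`. No I/O."""
    steps = []
    # Sorting makes the plan deterministic, so dry-run output is stable.
    for name in sorted(desired.keys() - observed.keys()):
        steps.append(Step("create", name))
    for name in sorted(desired.keys() & observed.keys()):
        if desired[name] != observed[name]:
            steps.append(Step("update", name))
    for name in sorted(observed.keys() - desired.keys()):
        steps.append(Step("delete", name))
    return Plan(tuple(steps))
```

When the two states already match, the plan is empty, which is what makes rerunning the loop idempotent.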
Example Prompt
“Write a tool that drains unhealthy GPU nodes.”
Clarify:
- What defines unhealthy?
- Can we drain nodes with running production inference?
- What capacity headroom is required?
- What is the maximum number of concurrent drains?
- What is the rollback procedure?
- Where do audit logs go?
Plan:
- List candidate nodes from inventory and health signals.
- Exclude protected pools.
- Check capacity and PodDisruptionBudget (PDB) constraints.
- Create drain plan.
- Execute with concurrency limit.
- Verify pods rescheduled and node state updated.
- Emit metrics and audit trail.
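The planning half of the steps above can be sketched as a pure function. Everything here is an assumption for illustration: the `Node` record, the `spare_gpus` headroom model (free GPU capacity on healthy nodes, supplied by the caller), and the skip reasons. PDB checks and the actual drain calls are deliberately left out, since they depend on the cluster API in use.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Node:
    # Hypothetical inventory record; real fields depend on the inventory source.
    name: str
    pool: str
    healthy: bool
    gpus: int


@dataclass(frozen=True)
class DrainPlan:
    to_drain: Tuple[str, ...]
    skipped: Tuple[Tuple[str, str], ...]  # (node name, reason)


def plan_drain(
    nodes: Iterable[Node],
    *,
    protected_pools: FrozenSet[str],
    spare_gpus: int,
    max_concurrent: int,
) -> DrainPlan:
    """Choose unhealthy nodes to drain without exceeding the guardrails."""
    to_drain, skipped = [], []
    remaining = spare_gpus
    for node in sorted(nodes, key=lambda n: n.name):  # deterministic order
        if node.healthy:
            continue
        if node.pool in protected_pools:
            skipped.append((node.name, "protected pool"))
        elif len(to_drain) >= max_concurrent:
            skipped.append((node.name, "concurrency limit"))
        elif node.gpus > remaining:
            skipped.append((node.name, "insufficient headroom"))
        else:
            to_drain.append(node.name)
            remaining -= node.gpus
    return DrainPlan(tuple(to_drain), tuple(skipped))
```

Recording a reason for every skipped node gives the dry-run output and the audit trail the same source of truth.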
Testing Checklist
- Empty input.
- No eligible nodes.
- One unhealthy node.
- Many unhealthy nodes over concurrency limit.
- Protected node.
- PDB blocks drain.
- API timeout.
- Partial success.
- Verification failure.
- Dry run output.
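A few of these cases (empty input, API timeout, partial success) can be exercised with a fake executor instead of a real cluster. The sketch below is self-contained and hypothetical: `drain_all` and `RunResult` are stand-ins for whatever execution function and result type the tool actually has.

```python
import unittest
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class RunResult:
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # name -> error text


def drain_all(names: Iterable[str], drain: Callable[[str], None]) -> RunResult:
    """Drain each node, recording failures instead of aborting the batch."""
    result = RunResult()
    for name in names:
        try:
            drain(name)
        except Exception as exc:
            # Not swallowed: the error is kept in the structured result.
            result.failed[name] = f"{type(exc).__name__}: {exc}"
        else:
            result.succeeded.append(name)
    return result


class DrainAllTest(unittest.TestCase):
    def test_empty_input(self):
        result = drain_all([], drain=lambda name: None)
        self.assertEqual(result, RunResult())

    def test_partial_success_on_timeout(self):
        def fake_drain(name: str) -> None:
            if name == "n2":
                raise TimeoutError("drain API timed out")

        result = drain_all(["n1", "n2", "n3"], fake_drain)
        self.assertEqual(result.succeeded, ["n1", "n3"])
        self.assertEqual(result.failed, {"n2": "TimeoutError: drain API timed out"})
```

Saved as a module, this runs with `python -m unittest`. Passing the executor in as a parameter is what makes the failure cases cheap to simulate.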
Code Review Signals
When reviewing automation code, look for:
- Missing timeout.
- Unbounded concurrency.
- Retry on non-idempotent operation.
- Swallowed exception.
- No audit record.
- String matching on fragile API output.
- No validation between plan and execution.
- No ownership for generated state.
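As an example of the string-matching signal, compare a substring check such as `"ready" in raw` with parsing and validating the response. The sketch below assumes a hypothetical JSON status payload of the form `{"name": "n1", "ready": true}`; the field names and the `MalformedResponse` error are illustrative, not taken from any real API.

```python
import json
from typing import Any


class MalformedResponse(ValueError):
    """Raised when a response does not match the expected shape."""


def node_is_ready(raw: str) -> bool:
    """Return the boolean 'ready' field, rejecting anything unexpected."""
    try:
        payload: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedResponse(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise MalformedResponse("expected a JSON object")
    ready = payload.get("ready")
    # Require a real boolean: the string "true" or a missing key is an error.
    if not isinstance(ready, bool):
        raise MalformedResponse("missing or non-boolean 'ready' field")
    return ready
```

A substring check would report a node named `not-ready-pool-1` as ready; the parsed version reads only the field it cares about and fails loudly on anything else.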