Python and Reliability Code

Expect practical automation rather than algorithm trivia:

  • Parse logs or metrics.
  • Write a health checker.
  • Build an idempotent reconciliation loop.
  • Retry safely.
  • Rate-limit calls.
  • Handle partial failure.
  • Design CLI flags and dry-run behavior.
  • Test edge cases.
  • Make side effects explicit.
  • Separate discovery, planning, and execution.
  • Return structured results.
  • Use timeouts.
  • Use exponential backoff with jitter.
  • Make retries conditional on safe error classes.
  • Log enough to reconstruct decisions.
  • Treat external API responses as untrusted.
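
Several of these items (retry safely, exponential backoff with jitter, retries conditional on safe error classes) can be shown in one small helper. The following is a minimal sketch, not a definitive implementation: `TransientError` and the parameter names are hypothetical, and the per-call timeout is assumed to be enforced inside `call` itself.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error class for failures that are safe to retry."""


def retry(call, *, attempts=4, base_delay=0.5, max_delay=8.0,
          retry_on=(TransientError, TimeoutError), sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` is exhausted.

    Only exceptions listed in `retry_on` are retried; anything else
    propagates immediately, so unsafe failures are never repeated.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of budget: surface the last error
            # Full jitter: sleep a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, and the explicit `retry_on` tuple forces the caller to decide which error classes are safe.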
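
Reconciliation loop skeleton:

def reconcile(desired, observed, *, dry_run):
    # Planning is pure: nothing is changed until the plan has been validated.
    plan = diff(desired, observed)
    validate(plan)  # guardrails run before any side effect
    if dry_run:
        return plan  # report what would change without executing it
    for step in plan.steps:
        execute_with_timeout(step)  # bounded side effect
        verify(step)  # confirm the change actually took effect
    return summarize(plan)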

Explain:

  • diff is pure and testable.
  • validate enforces guardrails.
  • execute_with_timeout is bounded and observable.
  • verify prevents false success.
  • dry_run is not an afterthought.
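
To make the first point concrete, here is one possible shape for a pure `diff`. It is a sketch under assumptions: the `Step` and `Plan` types and the dict-based state model are illustrative and do not come from the skeleton above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # "create", "update", or "delete"
    name: str


@dataclass(frozen=True)
class Plan:
    steps: tuple[Step, ...]


def diff(desired: dict, observed: dict) -> Plan:
    """Compute the steps that move `observed` toward `desired`.

    Pure: reads its arguments, returns a new Plan, touches nothing else.
    """
    steps = []
    for name in sorted(desired.keys() | observed.keys()):  # deterministic order
        if name not in observed:
            steps.append(Step("create", name))
        elif name not in desired:
            steps.append(Step("delete", name))
        elif desired[name] != observed[name]:
            steps.append(Step("update", name))
    return Plan(tuple(steps))
```

Because the function takes plain data and returns plain data, tests need no mocks. A converged system yields an empty plan, which is also what makes repeated runs idempotent.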

“Write a tool that drains unhealthy GPU nodes.”

Clarify:

  • What defines unhealthy?
  • Can we drain nodes that are running production inference?
  • What capacity headroom is required?
  • How many nodes may drain concurrently?
  • What is the rollback procedure?
  • Where do audit logs go?
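
One way to keep the answers from getting lost is to record them as explicit, validated configuration. The sketch below is hypothetical throughout: every field name and default is an assumption chosen for illustration, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainPolicy:
    """Hypothetical policy object; names and defaults are illustrative."""
    failed_checks_for_unhealthy: int = 3      # what defines unhealthy
    allow_production_inference: bool = False  # may we drain nodes serving prod?
    min_healthy_fraction: float = 0.8         # required capacity headroom
    max_concurrent_drains: int = 1            # concurrency limit
    uncordon_on_failure: bool = True          # rollback behavior
    audit_log_path: str = "drain-audit.jsonl" # where audit records go

    def __post_init__(self):
        # Reject nonsensical answers early, before any plan is built.
        if self.failed_checks_for_unhealthy < 1:
            raise ValueError("failed_checks_for_unhealthy must be >= 1")
        if not 0.0 < self.min_healthy_fraction <= 1.0:
            raise ValueError("min_healthy_fraction must be in (0, 1]")
        if self.max_concurrent_drains < 1:
            raise ValueError("max_concurrent_drains must be >= 1")
```

Conservative defaults (one drain at a time, production excluded) mean that forgetting to answer a question fails safe.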

Plan:

  1. List candidate nodes from inventory and health signals.
  2. Exclude protected pools.
  3. Check capacity and PodDisruptionBudget (PDB) constraints.
  4. Create drain plan.
  5. Execute with concurrency limit.
  6. Verify pods rescheduled and node state updated.
  7. Emit metrics and audit trail.
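
Steps 1 through 4 can be sketched as a pure planning function. This is a simplified illustration: the `Node` type, the `min_schedulable` parameter, and the reason strings are assumptions, every node is assumed to be schedulable before the run, and the PDB check is left out because it needs a live cluster query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    pool: str
    healthy: bool


def plan_drains(nodes, *, protected_pools, min_schedulable, max_concurrent):
    """Return (to_drain, skipped) without touching any node.

    `skipped` maps node name to the reason it was left alone, so the
    decision can be reconstructed later from logs.
    """
    skipped = {}
    candidates = []
    for node in sorted(nodes, key=lambda n: n.name):  # deterministic order
        if node.healthy:
            continue
        if node.pool in protected_pools:
            skipped[node.name] = "protected pool"
        else:
            candidates.append(node)
    # Capacity guardrail: draining must leave at least min_schedulable nodes.
    budget = max(len(nodes) - min_schedulable, 0)
    limit = min(max_concurrent, budget)
    for node in candidates[limit:]:
        skipped[node.name] = "deferred: capacity or concurrency limit"
    return candidates[:limit], skipped
```

Returning the skipped nodes with reasons, rather than silently dropping them, gives the dry-run output and the audit trail something concrete to print.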

Test:

  • Empty input.
  • No eligible nodes.
  • One unhealthy node.
  • Many unhealthy nodes over concurrency limit.
  • Protected node.
  • PDB blocks drain.
  • API timeout.
  • Partial success.
  • Verification failure.
  • Dry run output.
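
Several of these cases (empty input, API timeout, partial success) target the execution step. A minimal sketch of a bounded executor that returns structured results follows; `Result`, `drain_all`, and the injected `drain_one` callable are hypothetical names, and the real drain call is assumed to enforce its own timeout.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    node: str
    ok: bool
    error: str = ""


def drain_all(node_names, drain_one, *, max_concurrent=2):
    """Run `drain_one(name)` for each node, at most `max_concurrent` at once.

    One node's failure does not stop the others; every outcome is recorded.
    """
    def attempt(name):
        try:
            drain_one(name)
            return Result(name, ok=True)
        except Exception as exc:  # recorded in the result, not swallowed
            return Result(name, ok=False, error=f"{type(exc).__name__}: {exc}")

    if not node_names:
        return []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(attempt, node_names))  # keeps input order
```

Injecting `drain_one` lets a test substitute a fake that raises `TimeoutError` for one node and then assert that the remaining nodes still report success.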

When reviewing automation code, look for:

  • Missing timeout.
  • Unbounded concurrency.
  • Retries on non-idempotent operations.
  • Swallowed exception.
  • No audit record.
  • String matching on fragile API output.
  • No validation between plan and execution.
  • No ownership for generated state.
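
As an illustration of the string-matching item, the usual fix is to parse the response into a typed value and reject anything unexpected. The payload shape below (`{"name": str, "ready": bool}`) is a made-up example, not a real API, and `NodeStatus` is a hypothetical type.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    name: str
    ready: bool


def parse_node_status(raw: str) -> NodeStatus:
    """Parse and validate a JSON payload instead of string-matching it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    name, ready = data.get("name"), data.get("ready")
    if not isinstance(name, str) or not name:
        raise ValueError("missing or invalid 'name'")
    if not isinstance(ready, bool):  # the string "false" is rejected, not coerced
        raise ValueError("missing or invalid 'ready'")
    return NodeStatus(name=name, ready=ready)
```

A check such as `"ready" in raw` would accept `"ready": false` as healthy; validating the type makes that class of bug fail loudly instead.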