30-60-90 Plan

Use this when asked “How would you ramp?” or “What would you do first?”

flowchart LR
  D30[Days 1-30 learn systems] --> D60[Days 31-60 remove repeat failure]
  D60 --> D90[Days 61-90 platform primitive]
  D30 --> Evidence[Map owners, SLOs, deploy path, incidents]
  D60 --> Fix[Ship guarded automation or rollout gate]
  D90 --> Scale[Adopted paved road with metrics]

Days 1-30 goals:

  • Map the inference production architecture: traffic path, deployment path, control planes, hardware pools, ownership boundaries.
  • Understand top incidents, toil sources, paging patterns, and current SLOs.
  • Shadow incident response and deploys.
  • Read runbooks, dashboards, IaC modules, CI/CD pipelines, and fleet inventory systems.
  • Ship one low-risk improvement: dashboard cleanup, runbook correction, validation script, alert tuning, or test gap.

Questions:

  • What are the top three repeat operational failures?
  • Which manual actions are frequent, risky, and automatable?
  • Where do we lack confidence during deploys or repair?
  • Which metrics are trusted, and which are decorative?
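The first question can often be answered from an incident export rather than from memory. The sketch below is illustrative only: the `INCIDENTS` records, the failure-class labels, and the `top_repeat_failures` helper are all hypothetical, and a real export would come from whatever incident tracker the team uses.

```python
from collections import Counter

# Hypothetical incident export: (incident_id, failure_class) pairs.
# The labels are made up for illustration.
INCIDENTS = [
    ("INC-101", "gpu_xid_error"),
    ("INC-102", "bad_rollout"),
    ("INC-103", "gpu_xid_error"),
    ("INC-104", "nccl_timeout"),
    ("INC-105", "disk_full"),
    ("INC-106", "gpu_xid_error"),
    ("INC-107", "bad_rollout"),
    ("INC-108", "nccl_timeout"),
    ("INC-109", "gpu_xid_error"),
    ("INC-110", "bad_rollout"),
]


def top_repeat_failures(incidents, n=3):
    """Return up to n (failure_class, count) pairs that occurred more than once."""
    counts = Counter(failure_class for _, failure_class in incidents)
    # most_common() sorts by count, descending; one-off classes are dropped
    # because they are not repeat failures.
    repeats = [(cls, count) for cls, count in counts.most_common() if count > 1]
    return repeats[:n]
```

Calling `top_repeat_failures(INCIDENTS)` on this sample ranks `gpu_xid_error` first with four occurrences, which is the kind of evidence that justifies the Days 31-60 target.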

Days 31-60 goals:

  • Pick one painful workflow with clear ownership and measurable impact.
  • Design automation with prechecks, dry run, change record, limited blast radius, and rollback.
  • Add missing signals around the workflow.
  • Partner with the teams who own the affected systems.
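A minimal sketch of the guardrails named above (prechecks, dry run, change record, limited blast radius, rollback) might look like the following. Everything here is an assumption for illustration: `ChangeRecord`, `run_guarded`, and the `max_targets` limit are hypothetical names, and the `apply` and `rollback` callables stand in for whatever the real workflow does to a node or service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangeRecord:
    """Hypothetical audit trail for one automation run."""
    action: str
    targets: List[str]
    dry_run: bool
    events: List[str] = field(default_factory=list)


def run_guarded(
    action: str,
    targets: List[str],
    apply: Callable[[str], None],
    rollback: Callable[[str], None],
    prechecks: List[Callable[[List[str]], bool]],
    max_targets: int = 2,
    dry_run: bool = True,
) -> ChangeRecord:
    record = ChangeRecord(action, list(targets), dry_run)

    # Limited blast radius: refuse to touch more targets than allowed.
    if len(targets) > max_targets:
        record.events.append(
            f"refused: {len(targets)} targets exceeds limit {max_targets}"
        )
        return record

    # Prechecks: every check must pass before anything changes.
    for check in prechecks:
        if not check(targets):
            record.events.append(f"precheck failed: {check.__name__}")
            return record

    # Dry run is the default: report what would happen and stop.
    if dry_run:
        record.events.append(f"dry run: would apply to {', '.join(targets)}")
        return record

    done: List[str] = []
    try:
        for target in targets:
            apply(target)
            done.append(target)
            record.events.append(f"applied: {target}")
    except Exception as exc:
        record.events.append(f"error: {exc}")
        # Roll back only the targets that were fully applied, newest first.
        for target in reversed(done):
            rollback(target)
            record.events.append(f"rolled back: {target}")
    return record
```

Making dry run the default means an operator has to opt in to a real change, and the returned record can be attached to a change ticket either way.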

Example targets:

  • Node quarantine and recovery for unhealthy GPU nodes.
  • Canary validator for model server rollouts.
  • Capacity and utilization report for GPU pools.
  • Automated drift detection for IaC/config.
  • Incident classification and runbook launcher.
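To make one of these targets concrete, here is a rough sketch of the decision logic behind a canary validator for model server rollouts. It is not a real tool: `RolloutStats`, `canary_verdict`, and all thresholds are hypothetical, and production values would come from the team's SLOs and metrics backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutStats:
    """Hypothetical per-version metrics over the same observation window."""
    requests: int
    errors: int
    p99_latency_ms: float


def canary_verdict(
    baseline: RolloutStats,
    canary: RolloutStats,
    min_requests: int = 500,
    max_error_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
) -> str:
    """Return "wait", "promote", or "rollback" for a canary rollout."""
    # Not enough traffic on either side to compare: keep observing.
    if canary.requests < min_requests or baseline.requests == 0:
        return "wait"

    baseline_error_rate = baseline.errors / baseline.requests
    canary_error_rate = canary.errors / canary.requests

    # Gate 1: the canary error rate may exceed baseline by a small margin only.
    if canary_error_rate - baseline_error_rate > max_error_delta:
        return "rollback"

    # Gate 2: the canary p99 latency may not regress beyond the allowed ratio.
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"

    return "promote"
```

Comparing against a live baseline rather than a fixed threshold keeps the gate meaningful when overall traffic or latency shifts during the rollout window.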

Days 61-90 goals:

  • Move from script to supported workflow.
  • Add tests, documentation, ownership, dashboards, and escalation policy.
  • Build an adoption plan for adjacent teams.
  • Propose a roadmap based on data: top toil, top incident classes, capacity risk, reliability gaps.

Senior/staff signal:

I would avoid starting with a grand rewrite. I would first learn the failure distribution, remove one repeat class with a well-instrumented workflow, then use that credibility and data to shape the larger platform roadmap.

Success measures:

  • MTTR reduction for a targeted incident class.
  • Fewer manual repair steps.
  • Lower false-positive alert rate.
  • Better rollout confidence and faster rollback.
  • Higher GPU utilization without hurting SLOs.
  • Clearer ownership and fewer ambiguous escalations.
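Two of the measures above, MTTR reduction and false-positive alert rate, reduce to simple arithmetic once the raw data exists. The sketch below assumes recovery times are recorded in minutes and each alert is labeled as actionable or not; the function names and sample numbers are hypothetical.

```python
from statistics import mean
from typing import List


def mttr_minutes(recovery_minutes: List[float]) -> float:
    """Mean time to recovery for one incident class, in minutes."""
    return mean(recovery_minutes)


def relative_reduction(before: float, after: float) -> float:
    """Fractional improvement, e.g. 0.5 means the value was halved."""
    return (before - after) / before


def false_positive_rate(actionable_flags: List[bool]) -> float:
    """Share of alerts that were not actionable (False entries)."""
    if not actionable_flags:
        return 0.0
    return actionable_flags.count(False) / len(actionable_flags)


# Made-up sample data for one targeted incident class.
before = mttr_minutes([90, 60, 120])   # before the automation shipped
after = mttr_minutes([45, 30, 60])     # after the automation shipped
improvement = relative_reduction(before, after)
```

Reporting the reduction for one named incident class, rather than fleet-wide MTTR, keeps the number attributable to the workflow that was actually changed.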