30-60-90 Plan
Use this when asked “How would you ramp?” or “What would you do first?”
flowchart LR
  D30[Days 1-30 learn systems] --> D60[Days 31-60 remove repeat failure]
  D60 --> D90[Days 61-90 platform primitive]
  D30 --> Evidence[Map owners, SLOs, deploy path, incidents]
  D60 --> Fix[Ship guarded automation or rollout gate]
  D90 --> Scale[Adopted paved road with metrics]
First 30 Days: Learn The Operating System
Goals:
- Map the inference production architecture: traffic path, deployment path, control planes, hardware pools, ownership boundaries.
- Understand top incidents, toil sources, paging patterns, and current SLOs.
- Shadow incident response and deploys.
- Read runbooks, dashboards, IaC modules, CI/CD pipelines, and fleet inventory systems.
- Ship one low-risk improvement: dashboard cleanup, runbook correction, validation script, alert tuning, or test gap.
Questions:
- What are the top three repeat operational failures?
- Which manual actions are frequent, risky, and automatable?
- Where do we lack confidence during deploys or repairs?
- Which metrics are trusted, and which are decorative?
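One way to answer the first question with data rather than anecdote is to count incidents by failure class. This is a minimal sketch; the record shape `(incident_id, failure_class)`, the class names, and the `top_repeat_failures` helper are all hypothetical stand-ins for whatever the incident tracker actually exports.

```python
from collections import Counter

# Hypothetical incident export: (incident_id, failure_class) pairs.
incidents = [
    ("INC-1", "gpu_node_unhealthy"),
    ("INC-2", "bad_rollout"),
    ("INC-3", "gpu_node_unhealthy"),
    ("INC-4", "config_drift"),
    ("INC-5", "bad_rollout"),
    ("INC-6", "gpu_node_unhealthy"),
    ("INC-7", "disk_full"),
]


def top_repeat_failures(records, n=3):
    """Return up to n failure classes that occurred more than once, most frequent first."""
    counts = Counter(failure_class for _, failure_class in records)
    repeats = [(cls, count) for cls, count in counts.most_common() if count > 1]
    return repeats[:n]


for cls, count in top_repeat_failures(incidents):
    print(f"{cls}: {count} incidents")
```

One-off classes are filtered out on purpose: the goal in the first 30 days is to find failures that repeat, since those are the ones worth automating away.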
Days 31-60: Remove A Repeat Failure Mode
Goals:
- Pick one painful workflow with clear ownership and measurable impact.
- Design automation with prechecks, dry run, change record, limited blast radius, and rollback.
- Add missing signals around the workflow.
- Partner with the teams who own the affected systems.
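The guardrails listed above (prechecks, dry run, change record, limited blast radius, rollback) can be sketched as a small wrapper around any repair action. The `GuardedAction` class, its field names, and the `max_targets` default are illustrative assumptions, not an existing tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GuardedAction:
    """Hypothetical wrapper that adds guardrails to a per-target action."""
    name: str
    targets: List[str]
    prechecks: List[Callable[[str], bool]]
    apply: Callable[[str], None]
    rollback: Callable[[str], None]
    max_targets: int = 2  # blast-radius cap (assumed default)
    change_log: List[str] = field(default_factory=list)  # in-memory change record

    def run(self, dry_run: bool = True) -> List[str]:
        """Apply the action to each target; return the targets actually changed."""
        if len(self.targets) > self.max_targets:
            raise ValueError(
                f"{len(self.targets)} targets exceeds blast-radius cap {self.max_targets}"
            )
        done: List[str] = []
        for target in self.targets:
            if not all(check(target) for check in self.prechecks):
                self.change_log.append(f"skip {target}: precheck failed")
                continue
            if dry_run:
                self.change_log.append(f"dry-run {target}: would apply {self.name}")
                continue
            try:
                self.apply(target)
                done.append(target)
                self.change_log.append(f"applied {self.name} to {target}")
            except Exception as exc:
                # Undo the failing target and everything applied before it, then stop.
                self.change_log.append(f"error on {target}: {exc}")
                for undo in reversed(done + [target]):
                    self.rollback(undo)
                    self.change_log.append(f"rolled back {undo}")
                return []
        return done
```

Dry run is the default so that the first invocation of a new workflow never mutates anything; a real version would write the change record to a durable system rather than a list.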
Example targets:
- Node quarantine and recovery for unhealthy GPU nodes.
- Canary validator for model server rollouts.
- Capacity and utilization report for GPU pools.
- Automated drift detection for IaC/config.
- Incident classification and runbook launcher.
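As an illustration of the canary validator target, the sketch below compares a canary window against the baseline and returns a verdict. The `WindowStats` fields and every threshold (`max_error_delta`, `max_latency_ratio`, `min_requests`) are assumed values for the example; real gates would be derived from the service's SLOs.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical metrics for one observation window."""
    requests: int
    errors: int
    p99_latency_ms: float


def canary_verdict(
    baseline: WindowStats,
    canary: WindowStats,
    max_error_delta: float = 0.005,   # assumed: allow +0.5 percentage points of errors
    max_latency_ratio: float = 1.2,   # assumed: allow p99 up to 20% above baseline
    min_requests: int = 500,          # assumed: minimum canary traffic for a decision
) -> str:
    """Return 'wait', 'rollback', or 'promote' for a canary rollout."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge either way
    baseline_error_rate = baseline.errors / max(baseline.requests, 1)
    canary_error_rate = canary.errors / canary.requests
    if canary_error_rate - baseline_error_rate > max_error_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = WindowStats(requests=10_000, errors=20, p99_latency_ms=800.0)
print(canary_verdict(baseline, WindowStats(requests=1_000, errors=2, p99_latency_ms=820.0)))
```

Returning "wait" for low-traffic windows keeps the gate from promoting or rolling back on noise.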
Days 61-90: Make It A Platform Primitive
Goals:
- Move from script to supported workflow.
- Add tests, documentation, ownership, dashboards, and escalation policy.
- Build an adoption plan for adjacent teams.
- Propose a roadmap based on data: top toil, top incident classes, capacity risk, reliability gaps.
Senior/staff signal:
I would avoid starting with a grand rewrite. I would first learn the failure distribution, remove one repeat class with a well-instrumented workflow, then use that credibility and data to shape the larger platform roadmap.
Success Metrics
- MTTR reduction for a targeted incident class.
- Fewer manual repair steps.
- Lower false-positive alert rate.
- Better rollout confidence and faster rollback.
- Higher GPU utilization without hurting SLOs.
- Clearer ownership and fewer ambiguous escalations.
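To make the first metric concrete, MTTR for a targeted incident class can be computed from detection and resolution timestamps and compared before and after the automation ships. This is a sketch under assumptions: the tuple layout `(failure_class, detected_at, resolved_at)` and the sample values are invented for illustration.

```python
from datetime import datetime
from statistics import mean


def mttr_minutes(incidents, failure_class):
    """Mean time to resolve, in minutes, for one failure class; None if no incidents."""
    durations = [
        (resolved - detected).total_seconds() / 60
        for cls, detected, resolved in incidents
        if cls == failure_class
    ]
    return mean(durations) if durations else None


# Hypothetical records: (failure_class, detected_at, resolved_at).
before = [
    ("gpu_node_unhealthy", datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 11, 0)),
    ("gpu_node_unhealthy", datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
    ("bad_rollout", datetime(2024, 1, 12, 9, 0), datetime(2024, 1, 12, 9, 20)),
]
after = [
    ("gpu_node_unhealthy", datetime(2024, 3, 2, 8, 0), datetime(2024, 3, 2, 8, 30)),
]

print(mttr_minutes(before, "gpu_node_unhealthy"))  # 75.0
print(mttr_minutes(after, "gpu_node_unhealthy"))   # 30.0
```

Scoping the metric to one incident class keeps the comparison honest: a fleet-wide MTTR can move for reasons unrelated to the workflow being changed.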