30-60-90 Plan

Use this when asked “How would you ramp?” or “What would you do first?”

flowchart LR
  D30[Days 1-30 learn systems] --> D60[Days 31-60 remove repeat failure]
  D60 --> D90[Days 61-90 platform primitive]
  D30 --> Evidence[Map owners, SLOs, deploy path, incidents]
  D60 --> Fix[Ship guarded automation or rollout gate]
  D90 --> Scale[Adopted paved road with metrics]

Days 1-30: Learn The Operating System

Goals:

Map the inference production architecture: traffic path, deployment path, control planes, hardware pools, ownership boundaries.
Understand top incidents, toil sources, paging patterns, and current SLOs.
Shadow incident response and deploys.
Read runbooks, dashboards, IaC modules, CI/CD pipelines, and fleet inventory systems.
Ship one low-risk improvement: dashboard cleanup, runbook correction, validation script, alert tuning, or test gap.

Questions:

What are the top three repeat operational failures?
Which manual actions are frequent, risky, and automatable?
Where do we lack confidence during deploys or repairs?
Which metrics are trusted, and which are decorative?
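
One way to ground the first question in data is to count incidents by class from whatever tracker the team uses. The sketch below assumes each incident already carries a class label; the function name and the example labels are hypothetical, not taken from any real system.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_repeat_failures(
    incident_classes: Iterable[str], n: int = 3
) -> List[Tuple[str, int]]:
    """Return the n most frequent incident classes that occurred more than once."""
    counts = Counter(incident_classes)
    # A class seen only once is not a repeat failure, so drop it.
    repeats = Counter({cls: c for cls, c in counts.items() if c > 1})
    return repeats.most_common(n)


# Hypothetical labels exported from an incident tracker.
history = ["gpu-xid", "oom", "gpu-xid", "bad-rollout", "oom", "gpu-xid", "dns"]
print(top_repeat_failures(history))
```

Filtering out single occurrences keeps the list focused on failure modes that are worth automating away rather than one-off events.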

Days 31-60: Remove A Repeat Failure Mode

Goals:

Pick one painful workflow with clear ownership and measurable impact.
Design automation with prechecks, dry run, change record, limited blast radius, and rollback.
Add missing signals around the workflow.
Partner with the teams who own the affected systems.
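
The guardrails named above (prechecks, dry run, change record, limited blast radius, rollback) can be sketched as a small wrapper around any repair step. This is a minimal illustration under stated assumptions: `GuardedAction`, its fields, and the in-memory change log are all hypothetical, and a real workflow would write the change record to a durable system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GuardedAction:
    """Hypothetical wrapper that adds guardrails to one automated step."""
    name: str
    prechecks: List[Callable[[str], bool]]  # each returns True if the target is safe to change
    apply: Callable[[str], None]            # performs the change on one target
    rollback: Callable[[str], None]         # undoes the change on one target
    max_targets: int = 3                    # blast-radius cap per run
    change_log: List[str] = field(default_factory=list)

    def run(self, targets: List[str], dry_run: bool = True) -> List[str]:
        # Enforce the blast-radius cap before touching anything.
        if len(targets) > self.max_targets:
            raise ValueError(
                f"{len(targets)} targets exceeds blast-radius cap of {self.max_targets}"
            )
        done: List[str] = []
        for target in targets:
            if not all(check(target) for check in self.prechecks):
                self.change_log.append(f"SKIP {self.name} {target}: precheck failed")
                continue
            if dry_run:
                # Record what would happen without changing anything.
                self.change_log.append(f"DRY-RUN {self.name} {target}")
                continue
            try:
                self.apply(target)
                done.append(target)
                self.change_log.append(f"APPLIED {self.name} {target}")
            except Exception as exc:
                # Undo targets already changed in this run, newest first, then stop.
                self.change_log.append(f"FAILED {self.name} {target}: {exc}")
                for applied in reversed(done):
                    self.rollback(applied)
                    self.change_log.append(f"ROLLED-BACK {self.name} {applied}")
                return []
        return done
```

Defaulting `dry_run` to `True` means a caller has to opt in to making changes, which is the safer failure mode for a new automation.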

Example targets:

Node quarantine and recovery for unhealthy GPU nodes.
Canary validator for model server rollouts.
Capacity and utilization report for GPU pools.
Automated drift detection for IaC/config.
Incident classification and runbook launcher.
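
As an illustration of the canary validator target, the sketch below compares canary metrics against the baseline fleet and returns a verdict. The thresholds, field names, and the three verdict strings are assumptions chosen for the example; real values would come from the service's SLOs.

```python
from dataclasses import dataclass


@dataclass
class RolloutStats:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: RolloutStats,
    canary: RolloutStats,
    min_requests: int = 500,              # assumed minimum sample size
    max_error_rate_delta: float = 0.005,  # assumed allowed absolute increase
    max_latency_ratio: float = 1.2,       # assumed allowed p99 regression
) -> str:
    """Return 'promote', 'rollback', or 'wait'."""
    # Not enough canary traffic yet to judge either way.
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Returning "wait" on low traffic avoids promoting or rolling back on noise, at the cost of a slower rollout for low-volume models.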

Days 61-90: Make It A Platform Primitive

Goals:

Move from script to supported workflow.
Add tests, documentation, ownership, dashboards, and escalation policy.
Build an adoption plan for adjacent teams.
Propose a roadmap based on data: top toil, top incident classes, capacity risk, reliability gaps.

Senior/staff signal:

I would avoid starting with a grand rewrite. I would first learn the failure distribution, remove one repeat class with a well-instrumented workflow, then use that credibility and data to shape the larger platform roadmap.

Success Metrics

MTTR reduction for a targeted incident class.
Fewer manual repair steps.
Lower false-positive alert rate.
Higher rollout confidence (fewer failed or reverted deploys) and faster rollback.
Higher GPU utilization without degrading SLOs.
Clearer ownership and fewer ambiguous escalations.
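
To make the MTTR metric concrete, a before-and-after comparison needs a per-class baseline. The sketch below assumes incidents are available as (class, detected_at, resolved_at) tuples and measures repair time from detection to resolution; both the record shape and that definition are assumptions, since teams measure MTTR differently.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (incident_class, detected_at, resolved_at).
Incident = Tuple[str, datetime, datetime]


def mttr_minutes_by_class(incidents: Iterable[Incident]) -> Dict[str, float]:
    """Mean minutes from detection to resolution, grouped by incident class."""
    durations = defaultdict(list)
    for cls, detected, resolved in incidents:
        durations[cls].append((resolved - detected).total_seconds() / 60)
    return {cls: mean(vals) for cls, vals in durations.items()}
```

Running this over the incidents before and after the automation ships gives the MTTR delta for the targeted class without mixing in unrelated incident types.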