- Design a GPU inference platform for multiple internal teams.
- Design automation to detect and repair unhealthy GPU nodes.
- Design a safe deployment pipeline for model server changes.
- Design observability for an inference service with strict p99 latency.
- Design capacity management for a global GPU fleet.
- Design a model artifact promotion and rollback system.
- Design incident response tooling for AI infrastructure.
- Requirements and SLOs.
- Users and tenants.
- Workload shape: model size, request rate, latency, batchability, GPU type.
- Architecture: control plane, data plane, storage, network, scheduler.
- Deployment and rollback.
- Observability.
- Failure modes.
- Security and isolation.
- Capacity and cost.
- Operational model and roadmap.
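The workload shape and capacity items above can be tied together with a back-of-envelope sizing calculation. The following is a minimal sketch, not a sizing method taken from this outline: `gpus_needed`, `per_gpu_rps`, `target_utilization`, and `redundancy` are hypothetical names, and the per-GPU throughput is assumed to have been measured at the target latency for the model and GPU type in question.

```python
import math


def gpus_needed(
    peak_rps: float,
    per_gpu_rps: float,
    target_utilization: float = 0.5,
    redundancy: int = 1,
) -> int:
    """Estimate the GPU count for one model on one GPU type.

    peak_rps:           expected peak request rate for the model.
    per_gpu_rps:        assumed throughput of one GPU at the latency target.
    target_utilization: fraction of per-GPU throughput we plan to use,
                        leaving headroom for bursts and p99 latency.
    redundancy:         extra GPUs kept to absorb node failures (N+k).
    """
    if peak_rps < 0:
        raise ValueError("peak_rps must be non-negative")
    if per_gpu_rps <= 0:
        raise ValueError("per_gpu_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if redundancy < 0:
        raise ValueError("redundancy must be non-negative")

    usable_rps_per_gpu = per_gpu_rps * target_utilization
    base = math.ceil(peak_rps / usable_rps_per_gpu)
    return base + redundancy
```

In an interview, the point of a calculation like this is to state the assumptions out loud (measured throughput, headroom, failure budget) rather than to defend the exact number.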
- Define contracts between teams.
- Use progressive delivery.
- Separate control plane failure from data plane serving.
- Preserve debuggability during automation.
- Design for mixed hardware pools.
- Make rollback faster than rollout.
- Include ownership and migration strategy.
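Two of the points above, progressive delivery and making rollback faster than rollout, can be illustrated with a small promotion gate. This is a sketch under stated assumptions: the `Health` fields, the thresholds, and the traffic stages are hypothetical, and a real pipeline would read these signals from its own metrics system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Health:
    p99_ms: float       # observed p99 latency in milliseconds
    error_rate: float   # fraction of failed requests, 0.0 to 1.0


def canary_ok(
    baseline: Health,
    canary: Health,
    max_p99_ratio: float = 1.10,
    max_error_delta: float = 0.001,
) -> bool:
    """Return True if the canary is within tolerance of the baseline."""
    return (
        canary.p99_ms <= baseline.p99_ms * max_p99_ratio
        and canary.error_rate <= baseline.error_rate + max_error_delta
    )


def next_traffic_percent(
    current: int,
    baseline: Health,
    canary: Health,
    stages: tuple = (1, 10, 50, 100),
) -> int:
    """Advance one stage when healthy; drop to 0 in a single step when not.

    Rollout moves through several stages, while rollback is one step,
    which is what keeps rollback faster than rollout.
    """
    if not canary_ok(baseline, canary):
        return 0
    for stage in stages:
        if stage > current:
            return stage
    return current  # already at the final stage
```

For example, a canary at 10% traffic that stays within tolerance moves to 50%, while one that breaches the latency threshold goes straight to 0% regardless of its current stage.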
- Is this online inference, batch inference, or both?
- What are p99 and availability targets?
- Are tenants internal, external, or both?
- Which GPU types, clouds, and data centers are in scope?
- Is multi-region failover required?
- Are model artifacts immutable and versioned?
- Can we reject or shed load?
- What compliance or security boundaries apply?
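If the answer to the load-shedding question above is yes, one common shape for it is an admission check in front of the model server. The sketch below is illustrative only: `AdmissionController`, `max_in_flight`, and `reserved_for_high` are hypothetical names, and the class is not thread-safe, so a real server would guard the counter with a lock or use its framework's concurrency limiter.

```python
class AdmissionController:
    """Bound in-flight requests and reserve some slots for high priority."""

    def __init__(self, max_in_flight: int, reserved_for_high: int = 0) -> None:
        if max_in_flight <= 0:
            raise ValueError("max_in_flight must be positive")
        if not 0 <= reserved_for_high <= max_in_flight:
            raise ValueError("reserved_for_high must be in [0, max_in_flight]")
        self.max_in_flight = max_in_flight
        self.reserved_for_high = reserved_for_high
        self.in_flight = 0

    def try_admit(self, high_priority: bool = False) -> bool:
        """Admit the request if capacity allows; otherwise shed it."""
        if high_priority:
            limit = self.max_in_flight
        else:
            # Low-priority traffic may not use the reserved slots.
            limit = self.max_in_flight - self.reserved_for_high
        if self.in_flight >= limit:
            return False  # caller should reject quickly, e.g. with a 429 or 503
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

Rejecting early keeps queueing delay out of the p99 for the requests that are admitted, which is usually the reason to ask whether shedding is allowed at all.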