
Design Menu

Likely Prompts

Design a GPU inference platform for multiple internal teams.
Design automation to detect and repair unhealthy GPU nodes.
Design a safe deployment pipeline for model server changes.
Design observability for an inference service with strict p99 latency.
Design capacity management for a global GPU fleet.
Design a model artifact promotion and rollback system.
Design incident response tooling for AI infrastructure.

Universal Structure

Requirements and SLOs.
Users and tenants.
Workload shape: model size, request rate, latency, batchability, GPU type.
Architecture: control plane, data plane, storage, network, scheduler.
Deployment and rollback.
Observability.
Failure modes.
Security and isolation.
Capacity and cost.
Operational model and roadmap.
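
The workload-shape and capacity items above can be tied together with a back-of-envelope estimate. The sketch below is only illustrative: it assumes a steady peak request rate, a fixed batch size, and a measured per-batch latency. The function name gpus_needed, the 0.6 default utilization, and the example numbers are assumptions made for this example, not figures from any real fleet.

```python
import math


def gpus_needed(peak_rps: float, batch_size: int, batch_latency_s: float,
                target_utilization: float = 0.6) -> int:
    """Estimate the GPU count for a peak request rate (hypothetical helper)."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # One GPU completes batch_size requests every batch_latency_s seconds.
    per_gpu_rps = batch_size / batch_latency_s
    # Plan below full throughput so queues stay short and p99 stays bounded.
    usable_rps = per_gpu_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


# Assumed example: 1200 req/s peak, batches of 8 that take 50 ms each.
print(gpus_needed(peak_rps=1200, batch_size=8, batch_latency_s=0.05))
```

A real plan would add redundancy for zone or region failure and account for each GPU type in a mixed pool separately; this sketch covers only the steady-state arithmetic.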

Staff-Level Differentiators

Define contracts between teams.
Use progressive delivery.
Keep the data plane serving when the control plane fails.
Preserve debuggability during automation.
Design for mixed hardware pools.
Make rollback faster than rollout.
Include ownership and migration strategy.
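
Two of the differentiators above, progressive delivery and rollback that is faster than rollout, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real deployment tool: set_traffic and healthy are hypothetical callbacks that a platform would supply, and the stage percentages are arbitrary.

```python
from typing import Callable, Sequence


def progressive_rollout(
    set_traffic: Callable[[int], None],
    healthy: Callable[[], bool],
    stages: Sequence[int] = (1, 5, 25, 50, 100),
) -> bool:
    """Shift traffic to the new version one stage at a time.

    Returns True if the final stage was reached, False if it rolled back.
    """
    for percent in stages:
        set_traffic(percent)
        # In practice healthy() would wait out a bake period and compare
        # error rate and p99 latency against the previous version.
        if not healthy():
            # Rollback is one step to 0%, so it is faster than the rollout.
            set_traffic(0)
            return False
    return True


# Usage with stand-in callbacks: the third health check fails.
calls: list[int] = []
checks = iter([True, True, False])
ok = progressive_rollout(calls.append, lambda: next(checks))
print(ok, calls)
```

The rollout takes one step per stage, while rollback is always a single call, which is the asymmetry the list above asks for.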

Questions To Ask

Is this online inference, batch inference, or both?
What are the p99 latency and availability targets?
Are tenants internal, external, or both?
Which GPU types, clouds, and data centers are in scope?
Is multi-region failover required?
Are model artifacts immutable and versioned?
Can we reject or shed load?
What compliance or security boundaries apply?
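
If the answer to the load-shedding question is yes, one simple form of admission control is a cap on in-flight requests. The sketch below assumes a single-process, multi-threaded server; the class name AdmissionController and the limit of 2 are illustrative only.

```python
import threading


class AdmissionController:
    """Reject new requests once max_in_flight are already being served."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Return True and reserve a slot, or False if the server is full."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # caller sheds the request, e.g. HTTP 429 or 503
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot when a request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


# Usage: with a limit of 2, the third concurrent request is shed.
controller = AdmissionController(max_in_flight=2)
print(controller.try_admit(), controller.try_admit(), controller.try_admit())
```

Rejecting early keeps queueing delay out of the p99 for the requests that are admitted, at the cost of visible errors for the ones that are not.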