Design Menu

  • Design a GPU inference platform for multiple internal teams.
  • Design automation to detect and repair unhealthy GPU nodes.
  • Design a safe deployment pipeline for model server changes.
  • Design observability for an inference service with strict p99 latency.
  • Design capacity management for a global GPU fleet.
  • Design a model artifact promotion and rollback system.
  • Design incident response tooling for AI infrastructure.

Answer Structure

  1. Requirements and SLOs.
  2. Users and tenants.
  3. Workload shape: model size, request rate, latency, batchability, GPU type.
  4. Architecture: control plane, data plane, storage, network, scheduler.
  5. Deployment and rollback.
  6. Observability.
  7. Failure modes.
  8. Security and isolation.
  9. Capacity and cost.
  10. Operational model and roadmap.
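
Steps 3 and 9 of the outline above usually reduce to simple arithmetic early in a design discussion. The following is a minimal sketch, assuming a single-GPU model and a fixed batch size; `WorkloadShape`, `gpus_required`, and the 0.6 utilization target are hypothetical names and values chosen for illustration, not part of any real tool.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkloadShape:
    peak_rps: float                  # peak requests per second across all tenants
    batch_size: int                  # requests served per forward pass
    batch_latency_s: float           # seconds for one batched forward pass on one GPU
    target_utilization: float = 0.6  # assumed headroom for bursts and failover


def gpus_required(shape: WorkloadShape) -> int:
    """Rough GPU count from throughput alone.

    Ignores queueing delay, multi-GPU models, and per-tenant isolation,
    so treat the result as a lower bound for discussion.
    """
    per_gpu_rps = shape.batch_size / shape.batch_latency_s
    usable_rps = per_gpu_rps * shape.target_utilization
    return math.ceil(shape.peak_rps / usable_rps)


if __name__ == "__main__":
    shape = WorkloadShape(peak_rps=2000, batch_size=8, batch_latency_s=0.05)
    print(gpus_required(shape))  # 160 rps per GPU, 96 usable, so 21 GPUs
```

A throughput-only estimate like this says nothing about p99 latency; larger batches raise per-GPU throughput but also raise per-request latency, which is why batchability and latency targets appear together in step 3.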

Design Principles

  • Define contracts between teams.
  • Use progressive delivery.
  • Separate control plane failure from data plane serving.
  • Preserve debuggability during automation.
  • Design for mixed hardware pools.
  • Make rollback faster than rollout.
  • Include ownership and migration strategy.
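
Two of the principles above, progressive delivery and rollback being faster than rollout, can be sketched as a small promotion gate. This is an illustrative sketch only: the stage percentages, the thresholds, and the names `CanaryMetrics` and `next_action` are assumptions, not taken from any specific deployment tool.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate: float  # fraction of failed requests
    p99_ms: float      # p99 latency in milliseconds


# Hypothetical traffic percentages for each rollout stage.
STAGES = [1, 5, 25, 100]


def next_action(stage_index, canary, baseline,
                max_error_rate=0.01, max_p99_regression=1.10):
    """Return (action, traffic_percent) for the new model server version.

    Rollout advances one stage at a time; rollback goes straight to 0%,
    so undoing a change takes one step while shipping it takes several.
    """
    healthy = (
        canary.error_rate <= max_error_rate
        and canary.p99_ms <= baseline.p99_ms * max_p99_regression
    )
    if not healthy:
        return ("rollback", 0)
    if stage_index + 1 < len(STAGES):
        return ("promote", STAGES[stage_index + 1])
    return ("done", 100)


if __name__ == "__main__":
    baseline = CanaryMetrics(error_rate=0.001, p99_ms=100.0)
    print(next_action(0, CanaryMetrics(0.001, 105.0), baseline))
    print(next_action(1, CanaryMetrics(0.001, 150.0), baseline))
```

Comparing the canary against a live baseline rather than a fixed number keeps the gate meaningful across mixed hardware pools, where absolute latency differs by GPU type.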

Clarifying Questions

  • Is this online inference, batch inference, or both?
  • What are the p99 latency and availability targets?
  • Are tenants internal, external, or both?
  • Which GPU types, and which clouds or data centers, are in scope?
  • Is multi-region failover required?
  • Are model artifacts immutable and versioned?
  • Can we reject or shed load?
  • What compliance or security boundaries exist?
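
If the answer to the load-shedding question above is yes, one simple option is to cap concurrent requests per model server and reject the excess early. The sketch below assumes an in-flight limit is an acceptable policy; `AdmissionController` and its methods are hypothetical names, and a real service might instead shed by queue depth, deadline, or tenant priority.

```python
import threading


class AdmissionController:
    """Reject new requests once a fixed number are already in flight."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        # Returns False when the server is at capacity; the caller is
        # expected to respond with an overload error instead of queueing.
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        # Call once per admitted request, after it completes or fails.
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


if __name__ == "__main__":
    controller = AdmissionController(max_in_flight=2)
    print(controller.try_admit())  # admitted
    print(controller.try_admit())  # admitted
    print(controller.try_admit())  # shed: limit reached
    controller.release()
    print(controller.try_admit())  # admitted again after a slot frees up
```

Rejecting at admission keeps queue time bounded, which protects p99 latency for the requests that are accepted.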