Scheduling and Autoscaling
The Core Problem
GPU capacity is lumpy, scarce, expensive, and slow to warm. Good scheduling is not just “find a node”; it is matching workload shape, SLO, tenant priority, hardware SKU, topology, and rollout risk.
flowchart LR
Workload[Workload demand] --> Queue[Kueue or admission queue]
Queue --> Quota[Quota and cohort borrowing]
Quota --> Scheduler[Kubernetes scheduler]
Scheduler --> Capacity{GPU capacity available}
Capacity -- yes --> Bind[Bind pod]
Capacity -- no --> Autoscaler[Karpenter or cluster autoscaler]
Autoscaler --> Node[Provision GPU node]
Node --> Bind
Scheduling Tools
| Tool/API | Use |
|---|---|
| Native scheduler | Handles resource fit, constraints, affinity, taints, priorities. |
| Topology spread | Avoids correlated failure or hot spots. |
| Priority/preemption | Protects critical inference workloads. |
| Kueue | Kubernetes-native quota and admission for queued/batch workloads. |
| Volcano/YuniKorn | Batch/HPC schedulers seen in AI/HPC environments. |
| Karpenter | Fast node provisioning for unschedulable pods; support depends on the cloud provider. |
| Cluster Autoscaler | Node-group based autoscaling, common and stable. |
| HPA/KEDA | Replica scaling from metrics/events. |
| VPA | Resource recommendations; use carefully with production serving, since applying them can restart pods. |
Kueue For AI/ML
Kueue is relevant because GPU clusters often need quota-aware admission, fair sharing, and preemption. It decides when workloads should wait, start, or be preempted based on quota and queue policy.
Frame the answer in Kueue terms for:
- Batch inference.
- Training or evaluation jobs sharing GPU pools.
- Tenant quotas.
- Avoiding partial admission for gang-like workloads.
- Workload priority that is set independently of pod priority.
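
The admission ideas above (tenant quota, cohort borrowing, no partial admission) can be sketched as a toy model. This is not Kueue's API: `TenantQueue` and `try_admit` are hypothetical names, and the quota numbers in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TenantQueue:
    """Hypothetical stand-in for one tenant's GPU quota (not a Kueue object)."""
    name: str
    nominal_gpus: int   # quota the tenant owns
    used_gpus: int = 0  # GPUs held by this tenant's admitted workloads


def try_admit(queue: TenantQueue, cohort: list[TenantQueue], gpus_needed: int) -> bool:
    """Admit the whole workload or none of it, borrowing idle cohort quota."""
    # Free capacity is everything the cohort owns minus everything in use,
    # so quota that one tenant leaves idle can be borrowed by another.
    cohort_free = sum(q.nominal_gpus for q in cohort) - sum(q.used_gpus for q in cohort)
    if gpus_needed > cohort_free:
        # The workload waits. Nothing is reserved, so a gang-like job
        # never ends up holding only part of the GPUs it needs.
        return False
    queue.used_gpus += gpus_needed
    return True
```

With two tenants that each own 4 GPUs, a 6-GPU job from the first tenant is admitted by borrowing 2 idle GPUs, and a later 4-GPU job from the second tenant then has to wait. The sketch leaves out preemption, which is how a real queueing system would let the second tenant reclaim its own quota.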
Karpenter vs Cluster Autoscaler
Karpenter:
- Watches unschedulable pods.
- Provisions nodes that fit the pods.
- Can improve flexibility and consolidation in cloud environments.
- Strong for heterogeneous instance selection.
Cluster Autoscaler:
- Scales node groups.
- Mature and widely used.
- Easier mental model in fixed pools.
Staff answer:
For GPU inference, the autoscaler choice is less important than the full capacity loop: pending signal, node provisioning latency, driver/operator readiness, model load/warmup time, and traffic admission. If warmup dominates, autoscaling alone will not save p99 during a burst.
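
One way to make the capacity loop concrete is to add up its stages and compare the total with the length of the burst. The sketch below is a minimal illustration: the stage names follow the list in the answer, while the durations in `STAGES` are assumed numbers, not measurements from any real cluster.

```python
# Illustrative stage durations in seconds. These values are assumptions
# for the sketch, not measurements.
STAGES = {
    "pending_signal": 30.0,
    "node_provisioning": 180.0,
    "driver_operator_ready": 120.0,
    "model_load_and_warmup": 300.0,
    "traffic_admission": 15.0,
}


def time_to_serving(stages: dict[str, float]) -> float:
    """Total seconds from a pending pod to a replica that takes traffic."""
    return sum(stages.values())


def dominant_stage(stages: dict[str, float]) -> str:
    """Name of the slowest stage, which is where tuning pays off first."""
    return max(stages, key=stages.get)


def lands_in_time(stages: dict[str, float], burst_window_s: float) -> bool:
    """True if new capacity can serve before the burst window closes."""
    return time_to_serving(stages) <= burst_window_s
```

With these assumed numbers the loop takes 645 seconds and model load and warmup is the dominant stage, so `lands_in_time(STAGES, 300.0)` is false for a five-minute burst regardless of which autoscaler provisions the node.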
Scaling Signals
Scale on more than CPU utilization:
- Request queue depth.
- Queue wait time.
- p95/p99 latency.
- In-flight requests.
- GPU utilization.
- GPU memory headroom.
- Model server batch wait.
- Token generation rate for LLM-style serving.
- Pending pods and unschedulable reasons.
flowchart TD
Signal[Scaling decision] --> Demand[Request rate and concurrency]
Signal --> Queue[Queue delay and backlog]
Signal --> GPU[GPU utilization and memory]
Signal --> Warmup[Model load and warmup time]
Signal --> Cost[Cost and quota]
Demand --> Decision[Scale, shed, or route]
Queue --> Decision
GPU --> Decision
Warmup --> Decision
Cost --> Decision
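
As a rough sketch of how several signals can feed one replica count, the code below applies the proportional rule that the Kubernetes HPA documents (`ceil(currentReplicas * currentMetric / targetMetric)`) to each signal and takes the largest proposal. The function names, the signal pairs, and the replica bounds are assumptions chosen for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Proportional scaling: grow or shrink replicas by the metric ratio."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


def scale_on_signals(
    current_replicas: int,
    signals: dict[str, tuple[float, float]],
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Take the largest proposal across signals, then clamp to the bounds.

    `signals` maps a signal name to (current value, target value), both
    expressed per replica, for example in-flight requests per replica.
    """
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in signals.values()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))
```

For example, with 4 replicas, 24 in-flight requests per replica against a target of 8, and a queue wait of 100 ms against a target of 200 ms, the in-flight signal proposes 12 replicas and the queue-wait signal proposes 2, so the result is 12. Taking the maximum means one saturated signal is enough to trigger a scale-up.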
Warm Pools
For inference:
- Keep spare GPU nodes validated and ready.
- Keep model artifacts cached.
- Keep a minimum number of warm replicas.
- Use predictive scaling for known events.
- Allow admission control or load shedding during extreme spikes.
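
A coarse way to size the minimum number of warm replicas is to ask whether cold capacity can land before a burst peaks. The rule below is a deliberately simple assumption, not an established formula: if a cold start takes longer than the burst takes to ramp, the peak has to be covered by replicas that are already warm. All parameter names are hypothetical.

```python
import math


def min_warm_replicas(
    baseline_rps: float,
    burst_multiplier: float,
    per_replica_rps: float,
    cold_start_s: float,
    burst_ramp_s: float,
) -> int:
    """Replicas to keep warm so a burst is served until cold capacity lands."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    if cold_start_s <= burst_ramp_s:
        # New replicas can arrive while traffic is still ramping, so only
        # the baseline load needs to be warm.
        return math.ceil(baseline_rps / per_replica_rps)
    # New replicas arrive after the peak, so the peak must already be warm.
    return math.ceil(baseline_rps * burst_multiplier / per_replica_rps)
```

With 400 requests per second at baseline, a 2x burst, and 50 requests per second per replica, a 645-second cold start against a 300-second ramp calls for 16 warm replicas, while a 120-second cold start calls for only 8. Any gap between the two is what predictive scaling or load shedding has to absorb.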
Capacity Model
Represent capacity by:
- GPU SKU.
- GPU count per node.
- MIG profile if used.
- Memory per GPU or MIG slice.
- Network/storage topology.
- Tenant quota.
- Failure domain.
- Driver/runtime compatibility.
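
The dimensions above map naturally onto a small record type. The sketch below is one possible shape, with hypothetical field names and a helper that totals GPUs per failure domain for one SKU; the pool values in the usage note are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuPool:
    """One homogeneous pool of GPU nodes (field names are illustrative)."""
    gpu_sku: str
    gpus_per_node: int
    node_count: int
    memory_gib_per_device: int   # per GPU, or per MIG slice when MIG is used
    mig_profile: Optional[str]   # None when MIG is not used
    topology: str                # network/storage topology label
    tenant: str                  # tenant whose quota this pool counts against
    failure_domain: str
    driver_version: str          # driver/runtime compatibility marker


def gpus_by_failure_domain(pools: list[GpuPool], gpu_sku: str) -> dict[str, int]:
    """Total GPU count per failure domain for a single SKU."""
    totals: dict[str, int] = defaultdict(int)
    for pool in pools:
        if pool.gpu_sku == gpu_sku:
            totals[pool.failure_domain] += pool.gpus_per_node * pool.node_count
    return dict(totals)
```

Given a 4-node and a 2-node pool of the same 8-GPU SKU in two different zones, the helper returns 32 and 16 GPUs for those zones, which makes it easy to see how much of a SKU is lost if one failure domain goes away.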
Interview Drill
Question: “Traffic doubles in five minutes and p99 explodes.”
Answer:
- Confirm whether the bottleneck is queue wait, GPU compute, CPU, network, or a downstream dependency.
- Check autoscaling state: replicas, pending pods, node provisioning, warmup.
- Shift traffic to warm capacity or shed low-priority traffic.
- Tune batching only if it improves throughput without violating p99.
- Add capacity if node and model warmup can land in time.
- After the incident, add predictive scaling, warm pools, queue-based scaling, and load-test gates.