Scheduling and Autoscaling

The Core Problem

GPU capacity is lumpy, scarce, expensive, and slow to warm. Good scheduling is not just “find a node”; it is matching workload shape, SLO, tenant priority, hardware SKU, topology, and rollout risk.

flowchart LR
  Workload[Workload demand] --> Queue[Kueue or admission queue]
  Queue --> Quota[Quota and cohort borrowing]
  Quota --> Scheduler[Kubernetes scheduler]
  Scheduler --> Capacity{GPU capacity available}
  Capacity -- yes --> Bind[Bind pod]
  Capacity -- no --> Autoscaler[Karpenter or cluster autoscaler]
  Autoscaler --> Node[Provision GPU node]
  Node --> Bind

Scheduling Tools

Tool/API	Use
Native scheduler	Handles resource fit, constraints, affinity, taints, priorities.
Topology spread	Avoids correlated failure or hot spots.
Priority/preemption	Protects critical inference workloads.
Kueue	Kubernetes-native quota and admission for queued/batch workloads.
Volcano/YuniKorn	Batch/HPC schedulers seen in AI/HPC environments.
Karpenter	Fast node provisioning for unschedulable pods, cloud dependent.
Cluster Autoscaler	Node-group based autoscaling, common and stable.
HPA/KEDA	Replica scaling from metrics/events.
VPA	Resource recommendations; applying them can restart pods, so use carefully with production serving.

Kueue For AI/ML

Kueue is relevant because GPU clusters often need quota-aware admission, fair sharing, and preemption. It decides when workloads should wait, start, or be preempted based on quota and queue policy.

Bring up Kueue when discussing:

Batch inference.
Training or evaluation jobs sharing GPU pools.
Tenant quotas.
Avoiding partial admission for gang-like workloads.
Workload priority (WorkloadPriorityClass) that is independent of pod priority.
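
The quota and all-or-nothing admission ideas above can be sketched in a few lines. This is a simplified toy model, not Kueue's actual algorithm or API: `TenantQueue`, `try_admit`, and the single-resource GPU accounting are hypothetical names and assumptions made for illustration, and preemption to reclaim lent quota is not modeled.

```python
from dataclasses import dataclass


@dataclass
class TenantQueue:
    """Hypothetical stand-in for a per-tenant queue with a GPU quota."""
    name: str
    nominal_gpus: int   # quota guaranteed to this tenant
    used_gpus: int = 0  # GPUs held by already admitted workloads


def try_admit(queue: TenantQueue, cohort: list[TenantQueue],
              gpus_needed: int, allow_borrowing: bool = True) -> bool:
    """Admit the whole workload or nothing, so gang-like jobs never start partially.

    `cohort` must include `queue` itself. Unused quota anywhere in the
    cohort may be borrowed when `allow_borrowing` is True.
    """
    own_free = queue.nominal_gpus - queue.used_gpus
    cohort_free = (sum(q.nominal_gpus for q in cohort)
                   - sum(q.used_gpus for q in cohort))
    # Without borrowing, stay inside both the tenant quota and what is
    # physically unclaimed in the cohort.
    limit = cohort_free if allow_borrowing else min(own_free, cohort_free)
    if gpus_needed > limit:
        return False  # workload keeps waiting; nothing is reserved
    queue.used_gpus += gpus_needed
    return True
```

For example, with two tenants that each have a nominal quota of 8 GPUs and one of them using 2, the idle tenant can be admitted for a 12-GPU job by borrowing, after which the other tenant's 4-GPU job has to wait because only 2 GPUs remain in the cohort.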

Karpenter vs Cluster Autoscaler

Karpenter:

Watches unschedulable pods.
Provisions nodes that fit the pods.
Can improve flexibility and consolidation in cloud environments.
Strong for heterogeneous instance selection.

Cluster Autoscaler:

Scales node groups.
Mature and widely used.
Easier mental model in fixed pools.

Staff answer:

For GPU inference, the autoscaler choice is less important than the full capacity loop: pending signal, node provisioning latency, driver/operator readiness, model load/warmup time, and traffic admission. If warmup dominates, autoscaling alone will not save p99 during a burst.
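
One way to make the capacity loop concrete is to add up its stages and compare the total with how fast a burst ramps. The sketch below assumes the stages run strictly one after another and uses hypothetical field names; the numbers in any real system have to be measured.

```python
from dataclasses import dataclass


@dataclass
class CapacityLoop:
    """Stage latencies, in seconds, from 'pod pending' to 'serving traffic'."""
    pending_detection_s: float  # until pending pods trigger the autoscaler
    node_provision_s: float     # instance launch and node join
    driver_ready_s: float       # GPU driver/operator readiness
    model_load_s: float         # fetch weights and load them onto the GPU
    warmup_s: float             # warmup requests before traffic admission

    def time_to_serve_s(self) -> float:
        # Assumes the stages are sequential with no overlap.
        return (self.pending_detection_s + self.node_provision_s
                + self.driver_ready_s + self.model_load_s + self.warmup_s)

    def lands_in_time(self, burst_ramp_s: float) -> bool:
        """True if new capacity is serving before the burst has fully ramped."""
        return self.time_to_serve_s() <= burst_ramp_s


loop = CapacityLoop(pending_detection_s=30, node_provision_s=120,
                    driver_ready_s=90, model_load_s=180, warmup_s=60)
print(loop.time_to_serve_s(), loop.lands_in_time(burst_ramp_s=300))
```

With these illustrative numbers the loop takes 480 seconds, so a burst that ramps in five minutes arrives before any cold capacity can serve it, regardless of which autoscaler provisions the node.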

Scaling Signals

Use more than CPU:

Request queue depth.
Queue wait time.
p95/p99 latency.
In-flight requests.
GPU utilization.
GPU memory headroom.
Model server batch wait.
Token generation rate for LLM-style serving.
Pending pods and unschedulable reasons.
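
A minimal sketch of combining several of these signals: each signal proposes a replica count using the same proportional rule the HPA documents (current replicas times observed over target, rounded up), and the largest proposal wins. The function name, signal names, and targets are hypothetical, and HPA details such as tolerance and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas: int,
                     signals: dict[str, tuple[float, float]],
                     max_replicas: int) -> int:
    """Return a replica count from several (observed, target) signal pairs.

    Every signal is assumed to grow with load, so observed > target
    means "scale out". The most demanding signal decides.
    """
    proposals = [
        math.ceil(current_replicas * observed / target)
        for observed, target in signals.values()
    ]
    # Never go below one replica or above the configured ceiling.
    return min(max(proposals + [1]), max_replicas)


replicas = desired_replicas(
    current_replicas=4,
    signals={
        "queue_wait_s": (2.0, 0.5),             # waiting 4x longer than target
        "in_flight_per_replica": (12.0, 8.0),   # 1.5x target concurrency
    },
    max_replicas=20,
)
print(replicas)
```

Taking the maximum means a healthy-looking signal such as GPU utilization cannot mask a queue that is already backing up.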

flowchart TD
  Signal[Scaling decision] --> Demand[Request rate and concurrency]
  Signal --> Queue[Queue delay and backlog]
  Signal --> GPU[GPU utilization and memory]
  Signal --> Warmup[Model load and warmup time]
  Signal --> Cost[Cost and quota]
  Demand --> Decision[Scale, shed, or route]
  Queue --> Decision
  GPU --> Decision
  Warmup --> Decision
  Cost --> Decision

Warm Pools

For inference:

Keep spare GPU nodes validated and ready.
Keep model artifacts cached.
Keep a minimum number of warm replicas.
Use predictive scaling for known events.
Allow admission control or load shedding during extreme spikes.
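
As a rough sizing aid for the minimum number of warm replicas, the sketch below assumes that cold capacity cannot arrive during the burst, so warm replicas must carry the whole peak plus some headroom. The function and its parameters are hypothetical, and per-replica throughput is assumed to have been measured under the latency SLO.

```python
import math


def warm_replicas_needed(baseline_rps: float, burst_multiplier: float,
                         rps_per_replica: float, headroom: float = 0.25) -> int:
    """Replicas that must already be warm to absorb a burst.

    headroom is extra capacity as a fraction of peak load, e.g. 0.25
    keeps 25% spare above the expected peak.
    """
    peak_rps = baseline_rps * burst_multiplier
    return math.ceil(peak_rps * (1 + headroom) / rps_per_replica)


# 100 rps baseline, traffic doubles, each replica sustains 20 rps.
print(warm_replicas_needed(100, 2, 20))
```

The gap between this number and the steady-state replica count is the cost of the warm pool; traffic beyond it is what admission control or load shedding has to handle.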

Capacity Model

Represent capacity by:

GPU SKU.
GPU count per node.
MIG profile if used.
Memory per GPU or MIG slice.
Network/storage topology.
Tenant quota.
Failure domain.
Driver/runtime compatibility.
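
These dimensions map naturally onto a small record type. The sketch below is one hypothetical shape for it; the field names, the example values, and the prefix-based driver check are illustrative assumptions rather than any scheduler's real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuCapacity:
    gpu_sku: str                 # e.g. "a100-80gb" (illustrative label)
    gpus_per_node: int
    mig_profile: Optional[str]   # None when MIG is not used
    memory_gib_per_unit: int     # per GPU, or per MIG slice if MIG is used
    topology: str                # network/storage placement label
    tenant_quota_gpus: int
    failure_domain: str          # e.g. zone or rack
    driver_version: str


def fits(cap: GpuCapacity, gpus_needed: int, memory_gib_needed: int,
         driver_prefix: str) -> bool:
    """Check one pod-sized request against one capacity record."""
    return (gpus_needed <= cap.gpus_per_node
            and gpus_needed <= cap.tenant_quota_gpus
            and memory_gib_needed <= cap.memory_gib_per_unit
            and cap.driver_version.startswith(driver_prefix))


pool = GpuCapacity("a100-80gb", 8, None, 80, "rack-a", 16, "zone-1", "535.104")
print(fits(pool, gpus_needed=4, memory_gib_needed=40, driver_prefix="535."))
```

Keeping the SKU, slice size, and driver version in one record makes it harder to treat "8 GPUs free" as interchangeable capacity when the workload only fits a specific shape.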

Interview Drill

Question: “Traffic doubles in five minutes and p99 explodes.”

Answer:

Confirm whether the bottleneck is queue wait, GPU compute, CPU, network, or a downstream dependency.
Check autoscaling state: replicas, pending pods, node provisioning, warmup.
Shift traffic to warm capacity or shed low-priority traffic.
Tune batching only if it improves throughput without violating p99.
Add capacity only if node provisioning and model warmup can complete in time.
After the incident, add predictive scaling, warm pools, queue-based scaling, and load-test gates.