
Scheduling and Autoscaling

GPU capacity is lumpy, scarce, expensive, and slow to warm. Good scheduling is not just “find a node”; it is matching workload shape, SLO, tenant priority, hardware SKU, topology, and rollout risk.

flowchart LR
  Workload[Workload demand] --> Queue[Kueue or admission queue]
  Queue --> Quota[Quota and cohort borrowing]
  Quota --> Scheduler[Kubernetes scheduler]
  Scheduler --> Capacity{GPU capacity available}
  Capacity -- yes --> Bind[Bind pod]
  Capacity -- no --> Autoscaler[Karpenter or cluster autoscaler]
  Autoscaler --> Node[Provision GPU node]
  Node --> Bind

| Tool/API | Use |
| --- | --- |
| Native scheduler | Handles resource fit, constraints, affinity, taints, priorities. |
| Topology spread | Avoids correlated failure or hot spots. |
| Priority/preemption | Protects critical inference workloads. |
| Kueue | Kubernetes-native quota and admission for queued/batch workloads. |
| Volcano/YuniKorn | Batch/HPC schedulers seen in AI/HPC environments. |
| Karpenter | Fast node provisioning for unschedulable pods, cloud dependent. |
| Cluster Autoscaler | Node-group based autoscaling, common and stable. |
| HPA/KEDA | Replica scaling from metrics/events. |
| VPA | Resource recommendation; use carefully with production serving. |

Kueue is relevant because GPU clusters often need quota-aware admission, fair sharing, and preemption. It decides when workloads should wait, start, or be preempted based on quota and queue policy.

Bring up Kueue when the discussion involves:

  • Batch inference.
  • Training or evaluation jobs sharing GPU pools.
  • Tenant quotas.
  • Avoiding partial admission for gang-like workloads.
  • Workload priority (WorkloadPriorityClass) that is independent of pod priority.
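The wait-or-start decision described above can be illustrated with a toy model of quota-aware, all-or-nothing admission with cohort borrowing. This is a sketch only, not Kueue's implementation or API: `TenantQueue`, `cohort_free`, and `admit` are hypothetical names, and real Kueue also handles resource flavors, preemption to reclaim lent quota, and fair-sharing weights.

```python
from dataclasses import dataclass


@dataclass
class TenantQueue:
    """Hypothetical stand-in for one tenant's queue in a cohort (not a Kueue type)."""
    name: str
    nominal_gpus: int   # quota this tenant owns
    used_gpus: int = 0  # GPUs held by already admitted workloads


def cohort_free(cohort):
    """Unused GPUs across every queue in the cohort."""
    return sum(q.nominal_gpus for q in cohort) - sum(q.used_gpus for q in cohort)


def admit(queue, cohort, gpus_needed, allow_borrow=True):
    """Admit the whole workload or none of it, so a gang never starts partially."""
    own_free = queue.nominal_gpus - queue.used_gpus
    if gpus_needed <= own_free:
        queue.used_gpus += gpus_needed
        return "admitted"
    # Borrowing: the tenant may exceed its own quota if the cohort has idle GPUs.
    if allow_borrow and gpus_needed <= cohort_free(cohort):
        queue.used_gpus += gpus_needed
        return "admitted-borrowing"
    return "waiting"
```

In this toy model a tenant that lent its quota simply waits; a production setup would normally let the owner reclaim it through preemption.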

Karpenter:

  • Watches unschedulable pods.
  • Provisions nodes that fit the pods.
  • Can improve flexibility and consolidation in cloud environments.
  • Strong for heterogeneous instance selection.
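As a rough illustration of "provision nodes that fit the pods", the sketch below picks the cheapest instance type that satisfies a pending pod's GPU count and per-GPU memory. It is not Karpenter code or its API; `InstanceType`, `PendingPod`, `pick_instance`, and any names or prices passed to them are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str           # made-up name, not a real cloud SKU
    gpus: int
    gpu_mem_gib: int    # memory per GPU
    hourly_cost: float  # made-up price


@dataclass(frozen=True)
class PendingPod:
    gpus: int
    min_gpu_mem_gib: int


def pick_instance(pod, catalog) -> Optional[InstanceType]:
    """Return the cheapest instance type that fits the pod, or None if nothing fits."""
    fits = [
        i for i in catalog
        if i.gpus >= pod.gpus and i.gpu_mem_gib >= pod.min_gpu_mem_gib
    ]
    return min(fits, key=lambda i: i.hourly_cost, default=None)
```

A node-group autoscaler answers a narrower question (which existing group to grow), which is part of why it is easier to reason about in fixed pools.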

Cluster Autoscaler:

  • Scales node groups.
  • Mature and widely used.
  • Easier mental model in fixed pools.

Staff answer:

For GPU inference, the autoscaler choice is less important than the full capacity loop: pending signal, node provisioning latency, driver/operator readiness, model load/warmup time, and traffic admission. If warmup dominates, autoscaling alone will not save p99 during a burst.
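A quick way to reason about the capacity loop is to add up its stages and compare the total with how fast the burst ramps. The sketch below does that arithmetic; `CapacityLoop`, `can_absorb_burst`, and every number a caller passes in are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class CapacityLoop:
    """Seconds spent in each stage between 'need capacity' and 'serving traffic'."""
    pending_signal_s: float
    node_provision_s: float
    driver_ready_s: float
    model_warmup_s: float
    traffic_admission_s: float

    def time_to_serve_s(self) -> float:
        """Total delay before a brand-new replica takes traffic."""
        return sum(vars(self).values())

    def dominant_stage(self) -> str:
        """Name of the slowest stage, which is where optimization pays off first."""
        stages = vars(self)
        return max(stages, key=stages.get)


def can_absorb_burst(loop: CapacityLoop, burst_ramp_s: float) -> bool:
    """True only if cold capacity can land before the burst has fully arrived."""
    return loop.time_to_serve_s() <= burst_ramp_s
```

For example, `CapacityLoop(30, 180, 120, 420, 15)` totals 765 seconds with `model_warmup_s` dominant, so a burst that ramps in 300 seconds has to be served from warm capacity regardless of which autoscaler is in use.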

Scale on more than CPU utilization:

  • Request queue depth.
  • Queue wait time.
  • p95/p99 latency.
  • In-flight requests.
  • GPU utilization.
  • GPU memory headroom.
  • Model server batch wait.
  • Token generation rate for LLM-style serving.
  • Pending pods and unschedulable reasons.
flowchart TD
  Signal[Scaling decision] --> Demand[Request rate and concurrency]
  Signal --> Queue[Queue delay and backlog]
  Signal --> GPU[GPU utilization and memory]
  Signal --> Warmup[Model load and warmup time]
  Signal --> Cost[Cost and quota]
  Demand --> Decision[Scale, shed, or route]
  Queue --> Decision
  GPU --> Decision
  Warmup --> Decision
  Cost --> Decision
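The decision flow above can be sketched as a small function that turns a few of these signals into one action. Everything here is an assumption for illustration: the `Signals` fields, the queue-based replica formula, and the ordering of the checks. A real controller would usually combine actions (for example, scale and shed at the same time) rather than return only one.

```python
import math
from dataclasses import dataclass


@dataclass
class Signals:
    queue_depth: int              # requests waiting
    in_flight: int                # requests currently being served
    per_replica_concurrency: int  # requests one replica can serve within the SLO
    replicas: int                 # replicas serving now
    warmup_s: float               # time for a new replica to become useful
    queue_wait_budget_s: float    # how long requests may wait before the SLO breaks
    spare_warm_capacity: bool     # another pool or region has warm headroom
    quota_replicas: int           # ceiling imposed by cost or quota


def desired_replicas(s: Signals) -> int:
    """Queue-based sizing: total outstanding work divided by per-replica capacity."""
    demand = s.queue_depth + s.in_flight
    return max(1, math.ceil(demand / s.per_replica_concurrency))


def decide(s: Signals) -> str:
    """Return 'hold', 'scale', 'route', or 'shed' for the current signals."""
    want = desired_replicas(s)
    if want <= s.replicas:
        return "hold"
    # Scaling only helps if new replicas warm up in time and quota allows them.
    if s.warmup_s <= s.queue_wait_budget_s and want <= s.quota_replicas:
        return "scale"
    if s.spare_warm_capacity:
        return "route"
    return "shed"
```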

For inference:

  • Keep spare GPU nodes validated and ready.
  • Keep model artifacts cached.
  • Keep a minimum number of warm replicas.
  • Use predictive scaling for known events.
  • Allow admission control or load shedding during extreme spikes.
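One way to size the minimum warm replica count is to ask how much traffic can arrive before cold capacity is ready, then cover that with replicas that are already warm. The sketch below assumes a simple model (a burst expressed as a multiple of baseline traffic and a fixed per-replica throughput); `warm_replica_floor` and its parameters are hypothetical.

```python
import math


def warm_replica_floor(baseline_rps, burst_multiplier, rps_per_replica,
                       target_utilization=0.7):
    """Warm replicas needed to serve the largest burst that can land before
    cold capacity is ready, while staying at or below target utilization."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    burst_rps = baseline_rps * burst_multiplier
    return math.ceil(burst_rps / (rps_per_replica * target_utilization))
```

With a baseline of 100 requests per second, a 2x burst, 25 requests per second per replica, and a 0.8 utilization target, the floor comes out at 10 replicas. Predictive scaling for known events amounts to raising this floor ahead of time.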

Represent capacity by:

  • GPU SKU.
  • GPU count per node.
  • MIG profile if used.
  • Memory per GPU or MIG slice.
  • Network/storage topology.
  • Tenant quota.
  • Failure domain.
  • Driver/runtime compatibility.
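These dimensions can be captured in a small record per pool so that capacity is counted in comparable units rather than as one undifferentiated GPU total. The sketch below is one possible shape; `GpuPool`, its field names, and `units_by_domain` are hypothetical and not taken from any scheduler's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuPool:
    gpu_sku: str
    gpus_per_node: int
    mig_profile: Optional[str]  # None when GPUs are not partitioned
    mem_gib_per_unit: int       # memory per GPU or per MIG slice
    topology: str               # network/storage topology label
    tenant: str                 # quota owner
    failure_domain: str
    driver_version: str         # driver/runtime compatibility key
    nodes: int
    slices_per_gpu: int = 1     # more than 1 only when MIG is in use

    def schedulable_units(self) -> int:
        """Whole GPUs, or MIG slices when the pool is partitioned."""
        return self.nodes * self.gpus_per_node * self.slices_per_gpu


def units_by_domain(pools):
    """Sum schedulable units per (SKU, MIG profile, failure domain).

    Whole GPUs and MIG slices are kept in separate buckets because they are
    not interchangeable from a workload's point of view.
    """
    totals = defaultdict(int)
    for p in pools:
        totals[(p.gpu_sku, p.mig_profile, p.failure_domain)] += p.schedulable_units()
    return dict(totals)
```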

Question: “Traffic doubles in five minutes and p99 explodes.”

Answer:

  1. Confirm whether the bottleneck is queue wait, GPU compute, CPU, network, or a downstream dependency.
  2. Check autoscaling state: replicas, pending pods, node provisioning, warmup.
  3. Shift traffic to warm capacity or shed low-priority traffic.
  4. Tune batching only if it improves throughput without violating p99.
  5. Add capacity if node and model warmup can land in time.
  6. After the incident, add predictive scaling, warm pools, queue-based scaling, and load-test gates.
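Step 1 can be made mechanical with a per-stage latency breakdown. The sketch below ranks stages by their p99 contribution; `dominant_bottleneck` and the stage names are hypothetical, and because per-stage p99 values do not add up to the end-to-end p99, the share it reports is only a rough ranking aid.

```python
def dominant_bottleneck(stage_p99_ms):
    """Return (stage, share) for the stage with the largest p99.

    `stage_p99_ms` maps a stage name to its p99 latency in milliseconds.
    The share is relative to the sum of the per-stage values.
    """
    total = sum(stage_p99_ms.values())
    if total <= 0:
        raise ValueError("need at least one stage with positive latency")
    stage = max(stage_p99_ms, key=stage_p99_ms.get)
    return stage, stage_p99_ms[stage] / total
```

If queue wait dominates, the remaining steps point toward shifting or shedding traffic and adding warm capacity; if GPU compute dominates, batching and model-level tuning are the more likely levers.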