Triton Batching and Scheduling

Batching is the core Triton tradeoff: increase accelerator utilization and throughput by grouping requests, while risking queue delay and p99 regressions.

The senior answer is never “turn on dynamic batching.” It is:

What request shapes can batch safely, what p99 can we afford, what queue policy protects overload, and how do we validate goodput instead of raw throughput?

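Dynamic batching combines compatible requests into a batch at runtime. Requests are compatible when they target the same model and their input shapes match, and the model must allow batching (max_batch_size greater than 0).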

sequenceDiagram
  participant C1 as Client A
  participant C2 as Client B
  participant Q as Triton queue
  participant B as Dynamic batcher
  participant I as Model instance
  C1->>Q: request shape X
  C2->>Q: compatible request shape X
  Q->>B: wait within max queue delay
  B->>I: batch size 2
  I-->>B: batched result
  B-->>C1: response A
  B-->>C2: response B

Minimal conceptual config:

dynamic_batching { }

More realistic:

dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 2000
}

Operational meaning:

  • preferred_batch_size lists the batch sizes the batcher tries to form.
  • max_queue_delay_microseconds caps how long a request may wait in the queue for a batch to fill.
  • The “right” values depend on model, GPU, traffic, SLO, and request cost distribution.
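To make the interaction between the two settings concrete, here is a minimal Python sketch of batch formation. It is a toy model, not Triton's scheduler: it assumes the model instance is always free, and it dispatches as soon as the queue reaches any preferred size or the oldest queued request has waited the full delay. The function name `form_batches` and its parameters are hypothetical.

```python
from typing import List, Tuple


def form_batches(arrivals_us: List[int], preferred: List[int],
                 max_delay_us: int) -> List[Tuple[int, int]]:
    """Return (dispatch_time_us, batch_size) pairs for a toy batcher.

    Assumptions (not Triton's real logic): the instance is always idle,
    and a batch leaves when it hits a preferred size or when its oldest
    request has waited max_delay_us.
    """
    preferred_set = set(preferred)
    batches = []
    pending = []  # arrival times of requests still in the queue
    for t in sorted(arrivals_us):
        # The oldest queued request ran out of delay budget before t arrived.
        if pending and pending[0] + max_delay_us <= t:
            batches.append((pending[0] + max_delay_us, len(pending)))
            pending = []
        pending.append(t)
        # A preferred size was reached, so dispatch immediately.
        if len(pending) in preferred_set:
            batches.append((t, len(pending)))
            pending = []
    if pending:
        # Leftover requests are dispatched when the delay expires.
        batches.append((pending[0] + max_delay_us, len(pending)))
    return batches


if __name__ == "__main__":
    # Four requests arrive close together, one straggler arrives much later.
    print(form_batches([0, 100, 200, 300, 5000], [4, 8], 2000))
```

The straggler in the example waits the full 2000 microseconds and still runs alone, which is the cost the queue delay setting trades against larger batches.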

Total request latency is roughly the sum of:

gateway wait
+ Triton queue delay (waiting for a free model instance)
+ batch wait (waiting for the batch to fill)
+ backend execution
+ response serialization

If p99 worsens after batching, split the latency:

  • Did queue time increase?
  • Did GPU execution time per request decrease?
  • Did batch size distribution change?
  • Did request mix change?
  • Did timeout/retry behavior amplify load?
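One way to answer the first two questions is to diff two snapshots of cumulative per-stage counters and compute an average per request. The sketch below assumes each snapshot is a plain dict of stage name to `{"count", "ns"}` pairs, modeled loosely on what Triton's statistics extension reports. Treat the exact field names and the helper `per_request_breakdown_us` as assumptions to check against your server version.

```python
STAGES = ("queue", "compute_input", "compute_infer", "compute_output")


def per_request_breakdown_us(before: dict, after: dict) -> dict:
    """Average microseconds per request for each stage between two snapshots.

    Each snapshot maps a stage name to cumulative {"count": int, "ns": int}.
    The field names are assumptions for illustration.
    """
    result = {}
    for stage in STAGES:
        d_count = after[stage]["count"] - before[stage]["count"]
        d_ns = after[stage]["ns"] - before[stage]["ns"]
        # Avoid dividing by zero when no requests completed in the window.
        result[stage] = (d_ns / d_count / 1000.0) if d_count > 0 else 0.0
    return result


if __name__ == "__main__":
    before = {s: {"count": 100, "ns": 1_000_000} for s in STAGES}
    after = {s: {"count": 200, "ns": 2_000_000} for s in STAGES}
    after["queue"] = {"count": 200, "ns": 3_000_000}
    print(per_request_breakdown_us(before, after))
```

Comparing this breakdown before and after enabling batching shows whether the regression sits in the queue or in compute. Averages hide the tail, so pair it with latency percentiles measured at the client.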

Preferred batch sizes are not magic. If traffic does not fill one of those sizes within the queue delay budget, Triton either dispatches a smaller batch when the delay expires or, with a generous delay, holds requests long enough to hurt tail latency.

Strong answer:

I would inspect actual batch size histograms and queue delay before changing preferred batch sizes.
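As a rough illustration of that inspection, the sketch below summarizes a batch size histogram. It assumes you have already collected a mapping from batch size to number of executions (from server statistics or your own logging). The function name `batch_size_summary` and the input shape are hypothetical.

```python
def batch_size_summary(executions: dict) -> dict:
    """Summarize a histogram of {batch_size: number_of_executions}.

    Returns the mean batch size per execution and the share of requests
    served at each batch size.
    """
    total_exec = sum(executions.values())
    total_req = sum(size * n for size, n in executions.items())
    if total_exec == 0 or total_req == 0:
        return {"mean_batch_size": 0.0, "request_share": {}}
    return {
        "mean_batch_size": total_req / total_exec,
        "request_share": {
            size: size * n / total_req for size, n in executions.items()
        },
    }


if __name__ == "__main__":
    # 50 executions ran at batch size 1 and 25 ran at batch size 4.
    print(batch_size_summary({1: 50, 4: 25}))
```

If most requests are served at batch size 1 even though larger sizes are preferred, traffic is not filling batches within the delay budget, and raising the preferred sizes will not help.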

Preserving response order (the preserve_ordering option of the dynamic batcher) can matter for some clients, but it can also constrain throughput. Use it only when the client contract needs it.

Production systems should consider:

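  • Max queue size (max_queue_size).
  • Queue timeout (default_timeout_microseconds).
  • Reject vs delay under overload (timeout_action).
  • Priority levels (priority_levels, default_priority_level).
  • Tenant or workload class policy.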

Senior stance:

Rejecting early can be better than letting stale requests consume GPU after client deadlines have passed.
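The idea can be sketched as a bounded queue that rejects when full and drops requests whose deadline has already passed before they reach the GPU. This is an illustrative Python model, not Triton's queue policy implementation. The class `AdmissionQueue` and its methods are hypothetical names.

```python
from collections import deque


class AdmissionQueue:
    """Bounded FIFO that rejects when full and sheds stale requests."""

    def __init__(self, max_size: int, timeout_s: float):
        self.max_size = max_size
        self.timeout_s = timeout_s
        self._q = deque()   # entries are (request_id, deadline_s)
        self.expired = []   # request ids dropped after their deadline

    def offer(self, request_id, now_s: float) -> bool:
        """Enqueue a request, or return False to reject it immediately."""
        if len(self._q) >= self.max_size:
            return False
        self._q.append((request_id, now_s + self.timeout_s))
        return True

    def take(self, now_s: float):
        """Return the next request that is still within its deadline."""
        while self._q:
            request_id, deadline_s = self._q.popleft()
            if deadline_s > now_s:
                return request_id
            # The client has given up, so do not spend GPU time on it.
            self.expired.append(request_id)
        return None


if __name__ == "__main__":
    q = AdmissionQueue(max_size=2, timeout_s=1.0)
    print(q.offer("a", 0.0), q.offer("b", 0.5), q.offer("c", 0.6))
    print(q.take(1.2), q.expired)
```

Rejecting at `offer` gives the client a fast failure it can retry elsewhere, while shedding at `take` keeps expired work off the accelerator.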

Sequence batching is for stateful models, where all requests in a sequence must reach the same model instance in order so that state carries from one request to the next, as in streaming workloads.

Key concepts:

  • Sequence ID.
  • Start/end signals.
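  • Correlation of requests to a sequence (the sequence ID is also called the correlation ID).
  • Scheduler keeps each sequence pinned to one instance and batch slot until it ends.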

Trap:

Sequence batching is not a generic performance knob. It is for models with sequence semantics. Misusing it can reduce concurrency and complicate failures.
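A small sketch shows why pinning sequences limits concurrency. It models a fixed number of batch slots, where a start signal claims a slot, later requests for the same sequence ID reuse it, and an end signal frees it. This is a simplified illustration rather than Triton's sequence batcher. `SequenceSlots` and `route` are hypothetical names.

```python
class SequenceSlots:
    """Toy router that pins each sequence to one slot until it ends."""

    def __init__(self, num_slots: int):
        self.free = list(range(num_slots))
        self.assigned = {}  # sequence_id -> slot index

    def route(self, sequence_id, start: bool = False, end: bool = False):
        """Return the slot for this request, or None if a new sequence must wait."""
        if start:
            if sequence_id in self.assigned:
                raise ValueError("sequence already started")
            if not self.free:
                return None  # no capacity: the new sequence has to queue
            self.assigned[sequence_id] = self.free.pop(0)
        elif sequence_id not in self.assigned:
            raise KeyError("request for a sequence that never started")
        slot = self.assigned[sequence_id]
        if end:
            # The end signal releases the slot for the next sequence.
            del self.assigned[sequence_id]
            self.free.append(slot)
        return slot


if __name__ == "__main__":
    slots = SequenceSlots(num_slots=1)
    print(slots.route(7, start=True))   # sequence 7 claims the only slot
    print(slots.route(8, start=True))   # sequence 8 must wait
    print(slots.route(7, end=True))     # sequence 7 finishes
    print(slots.route(8, start=True))   # now sequence 8 gets the slot
```

A slot stays occupied for the whole life of a sequence, even while the client is idle between requests, which is why this scheduler suits stateful models and not general throughput tuning.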

Ensembles chain models or preprocessing/postprocessing steps inside Triton.

Use cases:

  • Preprocess -> model -> postprocess.
  • Model A -> model B.
  • Tokenization or feature transforms.

Tradeoffs:

  • Fewer client round trips.
  • Centralized pipeline config.
  • Harder per-stage rollout if not versioned carefully.
  • One slow stage can dominate p99.

Multiple models or instances can execute concurrently, but concurrency is bounded by:

  • GPU memory.
  • Compute saturation.
  • Backend thread pools.
  • Model instance count.
  • CPU preprocessing.
  • Interconnect/topology.
  • CUDA stream behavior.

Interview trap:

“Concurrent” does not mean “free.” Co-location can improve utilization or destroy p99 depending on workload shape.

Increasing instance count can help if:

  • Single instance underutilizes GPU.
  • Requests have CPU/IO gaps.
  • Model is small enough.
  • Memory headroom exists.

It can hurt if:

  • GPU is already saturated.
  • Memory becomes tight.
  • Context switching/contention grows.
  • Batching efficiency falls.

Traditional dynamic batching is not the whole story for LLMs. LLM serving often needs:

  • Continuous (in-flight) batching.
  • KV cache management.
  • Paged attention.
  • Separate prefill/decode considerations.
  • Token-level throughput metrics.

Use TensorRT-LLM backend vocabulary when discussing modern NVIDIA LLM serving.

flowchart TD
  Incoming[Incoming prompts] --> Admission{Token and KV budget available}
  Admission -- no --> Reject[Queue, shed, or downgrade]
  Admission -- yes --> Prefill[Prefill batch]
  Prefill --> Decode[Decode iteration]
  Decode --> Emit[Stream tokens]
  Emit --> Done{Sequence complete}
  Done -- no --> Decode
  Done -- yes --> Release[Release KV cache]
  Release --> Capacity[Capacity returns]
  Capacity --> Admission
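The loop in the diagram can be sketched in a few lines of Python. This is a toy model under loud assumptions: each sequence reserves its worst-case KV footprint (prompt plus output tokens) at admission, every running sequence emits exactly one token per decode iteration, and prefill cost is ignored. It is not how TensorRT-LLM or any specific backend schedules work. The function `run_decode_loop` is hypothetical.

```python
def run_decode_loop(prompts, kv_budget_tokens: int, max_steps: int = 1000):
    """Simulate admission, decode iterations, and KV release.

    prompts: list of (name, prompt_tokens, output_tokens).
    Returns a list of (name, step_at_which_it_finished).
    """
    waiting = list(prompts)
    running = {}  # name -> [remaining_output_tokens, reserved_kv_tokens]
    free = kv_budget_tokens
    finished = []
    for step in range(max_steps):
        # Admission: start any waiting sequence whose reservation fits.
        still_waiting = []
        for name, prompt_tokens, output_tokens in waiting:
            need = prompt_tokens + output_tokens
            if need <= free:
                free -= need
                running[name] = [output_tokens, need]
            else:
                still_waiting.append((name, prompt_tokens, output_tokens))
        waiting = still_waiting
        if not running:
            break  # nothing can run; leftovers would be queued or shed
        # One decode iteration: every running sequence emits one token.
        for name in list(running):
            running[name][0] -= 1
            if running[name][0] <= 0:
                free += running[name][1]  # release KV so capacity returns
                finished.append((name, step))
                del running[name]
    return finished


if __name__ == "__main__":
    print(run_decode_loop([("a", 4, 2), ("b", 4, 3), ("c", 2, 1)], 12))
```

In the example, the short sequence joins the running batch right away while the larger one waits for KV capacity to be released, which is the behavior that separates continuous batching from request-level dynamic batching.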

Prompt: “After enabling dynamic batching, throughput increased 30%, but p99 doubled.”

Answer:

  1. Confirm same traffic mix and model version.
  2. Split latency into queue, compute, network, and client retry.
  3. Inspect actual batch sizes and queue delay.
  4. Check timeout and retry rates.
  5. Compare GPU utilization, memory, and CPU preprocessing.
  6. Reduce max queue delay or change preferred sizes.
  7. Consider separate low-latency and throughput-oriented deployments.
  8. Roll back if SLO impact exceeds budget.

For mixed tenants:

  • Put low-latency traffic on a tighter queue delay.
  • Put batch/offline traffic on larger batches.
  • Use priority and queue limits.
  • Measure goodput per tenant.
  • Keep noisy-neighbor controls explicit.
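Per-tenant goodput can be computed from request records once each tenant has its own latency SLO. The sketch below counts a request as good only if it succeeded and met its tenant's SLO. The record layout and the name `goodput_by_tenant` are assumptions for illustration.

```python
from collections import defaultdict


def goodput_by_tenant(records, slo_ms: dict, window_s: float) -> dict:
    """Compute throughput and goodput per tenant.

    records: iterable of (tenant, latency_ms, succeeded).
    slo_ms: per-tenant latency SLO in milliseconds.
    window_s: length of the measurement window in seconds.
    """
    totals = defaultdict(lambda: {"completed": 0, "good": 0})
    for tenant, latency_ms, succeeded in records:
        totals[tenant]["completed"] += 1
        # Only successful responses inside the SLO count toward goodput.
        if succeeded and latency_ms <= slo_ms[tenant]:
            totals[tenant]["good"] += 1
    return {
        tenant: {
            "throughput_rps": counts["completed"] / window_s,
            "goodput_rps": counts["good"] / window_s,
        }
        for tenant, counts in totals.items()
    }


if __name__ == "__main__":
    records = [
        ("online", 10, True),
        ("online", 120, True),    # succeeded but missed the 50 ms SLO
        ("offline", 900, True),
        ("offline", 100, False),  # fast but failed
    ]
    print(goodput_by_tenant(records, {"online": 50, "offline": 1000}, 2.0))
```

A batching change that raises throughput while lowering goodput for the online tenant is a regression for that tenant, even if aggregate numbers look better.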

I tune Triton batching by workload class. For strict online p99, I keep queue delay tight and may sacrifice raw throughput. For batch or lower-priority workloads, I allow larger batches. I would prove the tradeoff with perf_analyzer, production-like traffic, and SLO/goodput metrics.