Continuous Batching: 23x Throughput, Better Latency Across Percentiles

A single configuration change—enabling continuous batching—can transform your LLM serving from GPU-starved to saturated throughput. One production system saw 23x throughput improvement by switching from static to dynamic batching, while simultaneously reducing p95 latency by 40%. This guide breaks down how continuous batching works, compares vLLM, Triton, and TorchServe implementations, and provides production-ready configurations.

Traditional static batching forces all requests to wait for the longest sequence in the batch to complete, leaving GPUs idle between steps. Continuous batching solves this by treating the batch as a fluid pool where requests can enter and exit at each generation step. For production systems serving variable-length requests, this translates directly to cost savings and improved user experience.

The business impact is measurable: a system serving 1,000 requests/minute with static batching might need 8 A100s. With continuous batching, the same workload often runs on 2-3 A100s—a 60-75% infrastructure cost reduction. At current cloud GPU rates ($3-4/hour per A100), that’s $15,000-$30,000 monthly savings for mid-scale deployments.

Continuous batching operates on a simple principle: maximize GPU utilization by eliminating idle time between generation steps. Unlike static batching where all requests must complete all steps together, continuous batching uses dynamic scheduling to maintain constant compute.

At each generation step, the continuous batching manager:

  1. Evaluates which requests have completed their current token generation
  2. Removes finished requests from the active batch
  3. Adds new requests waiting in the queue
  4. Reorders the remaining requests for optimal memory access
  5. Executes the next forward pass for all active requests

This process happens every ~10-50ms, creating a “conveyor belt” effect where requests flow through the system continuously rather than in discrete batches.
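
To make the loop concrete, here is a minimal, illustrative simulation of iteration-level scheduling in Python. It is not vLLM's scheduler; the Request class, the fixed sequence budget, and the "one token per step" completion logic are simplifying assumptions for the sketch.

import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    max_new_tokens: int
    generated: int = 0

    def finished(self) -> bool:
        # Real engines also stop on EOS or stop strings; only the token limit is modeled here.
        return self.generated >= self.max_new_tokens

def simulate(queue: deque, max_num_seqs: int = 8, max_steps: int = 10_000) -> int:
    active: list[Request] = []
    for step in range(1, max_steps + 1):
        # Steps 1-2: drop requests that finished on the previous iteration.
        active = [r for r in active if not r.finished()]
        # Step 3: admit waiting requests up to the sequence budget.
        while queue and len(active) < max_num_seqs:
            active.append(queue.popleft())
        if not active:
            return step - 1  # queue drained, nothing left to run
        # Steps 4-5: one forward pass emits one token per active request
        # (memory-layout reordering is omitted in this toy model).
        for r in active:
            r.generated += 1
    return max_steps

queue = deque(Request(max_new_tokens=random.randint(8, 128)) for _ in range(32))
print("generation steps used:", simulate(queue))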

Continuous batching’s efficiency depends critically on PagedAttention, which manages KV cache in fixed-size blocks (typically 16 tokens per block). This eliminates memory fragmentation and enables flexible memory sharing across requests—critical for dynamic batch composition.
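
A quick back-of-the-envelope sketch of why block-based allocation matters: with 16-token blocks, a request can only waste part of its final block, whereas a contiguous allocator must reserve space for the worst-case sequence length up front. The prompt length, output length, and 4096-token reservation below are illustrative assumptions, not measurements.

import math

BLOCK_SIZE = 16        # tokens per KV-cache block (vLLM default)
MAX_MODEL_LEN = 4096   # what a contiguous allocator would reserve per request (assumed)

def blocks_needed(prompt_tokens: int, generated_tokens: int) -> int:
    return math.ceil((prompt_tokens + generated_tokens) / BLOCK_SIZE)

prompt, generated = 150, 87
used = prompt + generated
paged_alloc = blocks_needed(prompt, generated) * BLOCK_SIZE  # 237 tokens -> 15 blocks -> 240
static_alloc = MAX_MODEL_LEN                                 # worst-case reservation

print(f"tokens actually used:     {used}")
print(f"paged allocation:         {paged_alloc} tokens ({paged_alloc - used} wasted, last block only)")
print(f"contiguous preallocation: {static_alloc} tokens ({static_alloc - used} wasted)")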

Production Implementations: vLLM vs Triton vs TorchServe

vLLM has become the de facto standard for continuous batching, offering native support with minimal configuration.

Key Features:

  • Automatic PagedAttention integration
  • Chunked prefill for latency optimization
  • CUDA graph support for reduced overhead
  • Tensor parallelism for large models

Complete configuration examples for vLLM and Triton appear after the cost discussion below.

The business case for continuous batching centers on infrastructure efficiency and user experience. With API pricing such as OpenAI’s gpt-4o at $5.00/$15.00 per 1M input/output tokens and Anthropic’s claude-3-5-sonnet at $3.00/$15.00 per 1M input/output tokens, serving your own models becomes cost-competitive only when GPU utilization is maximized (see openai.com/pricing and docs.anthropic.com/en/docs/about-claude/models).

For self-hosted deployments, continuous batching directly impacts the bottom line. A system serving variable-length requests with static batching might achieve 40-60% GPU utilization. Continuous batching typically pushes this to 85-95%, effectively halving your required GPU count for the same throughput.
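
A rough way to translate the utilization gain into GPU count and cost. The numbers here (45% baseline utilization, 90% target, $3.50/hour per A100, 730 hours/month) are illustrative assumptions; plug in your own measurements.

import math

def required_gpus(current_gpus: int, current_util: float, target_util: float) -> int:
    # Same amount of useful work spread over better-utilized GPUs.
    return max(1, math.ceil(current_gpus * current_util / target_util))

current_gpus, current_util, target_util = 8, 0.45, 0.90
hourly_rate, hours_per_month = 3.50, 730

new_gpus = required_gpus(current_gpus, current_util, target_util)
monthly_savings = (current_gpus - new_gpus) * hourly_rate * hours_per_month
print(f"{current_gpus} GPUs @ {current_util:.0%} -> {new_gpus} GPUs @ {target_util:.0%}")
print(f"estimated monthly savings: ${monthly_savings:,.0f}")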

vLLM provides the most straightforward continuous batching implementation. The key parameters are:

  • max_num_batched_tokens: Controls the maximum tokens processed per batch. Higher values improve throughput but increase latency variance.
  • enable_chunked_prefill: Essential for mixed workloads with variable prompt lengths. Prevents long prompts from blocking decode operations.
  • gpu_memory_utilization: Target GPU memory usage. Set to 0.9 for production, but reduce if you see memory pressure warnings.

Triton’s dynamic batching complements vLLM’s continuous batching. Configure both layers:

# In config.pbtxt
dynamic_batching {
  max_queue_delay_microseconds: 100
  preferred_batch_size: [ 8, 16, 32 ]
}

This allows Triton to group incoming requests before passing them to vLLM, reducing scheduling overhead.

Full vLLM configuration example:

from vllm import LLM, SamplingParams

# Configure for production throughput; LLM forwards these keyword
# arguments to the underlying engine.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    max_num_seqs=256,
    block_size=16,
)

# Optimize sampling for your workload
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
    repetition_penalty=1.1,
)

# Process a batch of prompts; the engine schedules them with
# continuous batching internally.
prompts = [
    "The future of AI is",
    "Quantum computing will",
    "Sustainable energy means",
]
outputs = llm.generate(prompts, sampling_params)

# Monitor logs for preemption:
# WARNING: Sequence group is preempted by PreemptionMode.SWAP
# Indicates memory pressure - solutions:
# 1. Increase gpu_memory_utilization
# 2. Decrease max_num_batched_tokens
# 3. Increase tensor_parallel_size

Triton model configuration:

# model_repository/vllm_model/config.pbtxt
name: "vllm_model"
backend: "python"
max_batch_size: 0  # vLLM handles batching internally
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
parameters [
  {
    key: "enable_chunked_prefill"
    value: { string_value: "true" }
  },
  {
    key: "max_num_batched_tokens"
    value: { string_value: "2048" }
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
  preferred_batch_size: [ 8, 16, 32 ]
}
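
For completeness, a hedged sketch of how a client might call this model over Triton's HTTP endpoint with the tritonclient package. The tensor names and dtypes mirror the config above; the URL and prompt are placeholders, and you should adapt shapes to your actual backend implementation.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# max_batch_size is 0, so shapes carry no leading batch dimension.
text = np.array(["Explain continuous batching in one sentence."], dtype=object)
infer_input = httpclient.InferInput("text_input", [1], "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(model_name="vllm_model", inputs=[infer_input])
print(result.as_numpy("text_output"))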

Based on production experience, avoid these critical mistakes:

  1. Not enabling PagedAttention: Continuous batching without PagedAttention yields suboptimal memory management and fragmentation.

  2. Setting max_batch_tokens too low: Underutilizes GPU compute capacity. Start with 2048 and tune based on your sequence length distribution.

  3. Ignoring preemption warnings: “Sequence group is preempted” indicates memory pressure. Solutions:

    • Increase gpu_memory_utilization (max 0.95)
    • Decrease max_num_batched_tokens or max_num_seqs
    • Increase tensor_parallel_size for large models
  4. Large block sizes with short sequences: Block sizes greater than 128 tokens with sequences less than 64 tokens cause excessive internal fragmentation. Use default 16-token blocks.

  5. Missing tensor parallelism: Large models require tensor_parallel_size greater than 1. Failing to configure this causes OOM despite continuous batching.

  6. No chunked prefill with variable prompts: Long prompts block decode operations. Always enable enable_chunked_prefill for mixed workloads.

  7. Incorrect tokenizer padding: Use padding_side='left' for batch processing to align with causal masking requirements (see the left-padding sketch after this list).

  8. Multiple Triton instances without unique SHM: Running multiple instances requires shm-region-prefix-name to avoid shared memory conflicts.

  9. Overloading the batch: Setting max_num_batched_tokens higher than the KV cache can hold in available GPU memory causes preemption and latency spikes. Monitor with nvidia-smi during load tests.

  10. Static batching mindset: Treating continuous batching as “set and forget.” Requires ongoing tuning as request patterns evolve.
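
For pitfall 7, the left-padding requirement applies when you batch decoder-only generation directly with Hugging Face transformers (vLLM tokenizes internally). A minimal sketch, assuming a Llama-style model that ships without a dedicated pad token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    padding_side="left",  # pad on the left so generation continues right after each prompt
)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default

batch = tokenizer(
    ["The future of AI is", "A much longer prompt about what sustainable energy means"],
    padding=True,
    return_tensors="pt",
)
# batch["input_ids"] and batch["attention_mask"] are now left-padded and
# safe to pass to model.generate(**batch, max_new_tokens=100).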

Tuning quick reference:

Parameter               | Throughput Focus | Latency Focus | Default
max_num_batched_tokens  | 4096+            | 1024-2048     | 2048
enable_chunked_prefill  | Required         | Required      | True
gpu_memory_utilization  | 0.95             | 0.85          | 0.9
block_size              | 32               | 16            | 16
max_num_seqs            | 512              | 128           | 256
Triton max_queue_delay  | 200µs            | 50µs          | 100µs
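
Expressed as code, the two columns above map to two starting-point engine profiles. This is a sketch: the model name is a placeholder, the values simply restate the table, and Triton's max_queue_delay_microseconds is set separately in config.pbtxt rather than in vLLM.

from vllm import LLM

THROUGHPUT_PROFILE = dict(
    max_num_batched_tokens=4096,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.95,
    block_size=32,
    max_num_seqs=512,
)

LATENCY_PROFILE = dict(
    max_num_batched_tokens=2048,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.85,
    block_size=16,
    max_num_seqs=128,
)

# Pick the profile that matches your SLA, then tune from there.
llm = LLM(model="meta-llama/Llama-2-7b-hf", **THROUGHPUT_PROFILE)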

Signs of a healthy deployment under load:

  • GPU utilization greater than 85%
  • Memory pressure warnings = 0
  • Preemption events = 0
  • P99 latency within SLA
  • Throughput stable under load
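
To watch the first two checklist items during a load test, a simple polling sketch using nvidia-smi's CSV query output. The query fields are standard nvidia-smi options; the 30-second interval and the 85% threshold mirror the checklist and are otherwise arbitrary choices.

import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def poll_gpus(interval_s: int = 30) -> None:
    while True:
        output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        for line in output.strip().splitlines():
            idx, util, mem_used, mem_total = (int(x) for x in line.split(", "))
            flag = "" if util >= 85 else "  <-- below 85% target"
            print(f"GPU {idx}: util {util}%  mem {mem_used}/{mem_total} MiB{flag}")
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_gpus()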

[Interactive widget: batching strategy selector (workload characteristics → recommended config)]

Continuous batching transforms LLM serving from GPU-starved to saturated throughput by eliminating idle time between generation steps. The core mechanism—dynamic rearrangement of batches with immediate request replacement—delivers 2-4x throughput improvements while maintaining or improving latency percentiles.

Key implementation requirements:

  • PagedAttention for memory management
  • Chunked prefill for variable-length prompts
  • Proper parameter tuning for your workload

References:

  • OpenAI Pricing (openai.com/pricing) - current rates for gpt-4o and gpt-4o-mini (verified 2024-10-10).
  • Anthropic Models (docs.anthropic.com/en/docs/about-claude/models) - claude-3-5-sonnet and haiku-3.5 pricing (verified 2024-11-15).

Next steps:

  1. Benchmark your workload: Use the provided calculator to estimate improvements based on your current utilization and request patterns.
  2. Enable PagedAttention: Configure attn_implementation="paged" or "sdpa_paged" in your model initialization.
  3. Start with conservative parameters: Use max_num_batched_tokens=2048 and gpu_memory_utilization=0.9 as baseline.
  4. Monitor for preemption: Watch logs for “Sequence group is preempted” warnings and tune parameters accordingly.
  5. Validate with load testing: Use realistic request patterns before production rollout.