Modern LLM deployments face a fundamental tension: users demand sub-second response times (latency), while operations teams push for maximum hardware utilization (throughput). Request batching is the architectural lever that reconciles these competing demands. When implemented correctly, batching can reduce inference costs by 40-70% while maintaining strict SLA compliance. When misconfigured, it creates cascading failures and violates latency budgets.
The financial impact of batching decisions is immediate and measurable. Consider a production system processing 100,000 requests per day with an average context of 2,000 input tokens and 500 output tokens per request:
Without batching (request-by-request):
Hardware utilization: ~25-30% on typical GPUs
Effective cost per 1M tokens: Full retail price
Monthly cost at scale: $15,000-$20,000
With optimized batching:
Hardware utilization: 70-85%
Effective cost per 1M tokens: 40-60% discount through throughput gains
Monthly cost at scale: $6,000-$9,000
This 60% cost reduction doesn't require model changes or quality tradeoffs; it's pure architectural efficiency. However, aggressive batching introduces latency overhead. A poorly configured 32-request batch can add 500-800ms of queue time, violating user-facing SLAs.
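The arithmetic behind these figures is easy to reproduce. The sketch below is a back-of-the-envelope model; the per-token price and the 60% discount are illustrative assumptions chosen to line up with the figures above, not measured values:

```python
# Back-of-the-envelope cost model for the workload described above.
# PRICE_PER_1M_TOKENS and BATCHING_DISCOUNT are assumed values picked to
# match the $15k/$6k monthly figures in the text; real prices vary by
# provider and hardware.

REQUESTS_PER_DAY = 100_000
TOKENS_PER_REQUEST = 2_000 + 500          # input + output tokens
PRICE_PER_1M_TOKENS = 2.00                # assumed effective retail price, USD
BATCHING_DISCOUNT = 0.60                  # ~60% effective saving from throughput gains

monthly_tokens = REQUESTS_PER_DAY * TOKENS_PER_REQUEST * 30
unbatched_cost = monthly_tokens / 1_000_000 * PRICE_PER_1M_TOKENS
batched_cost = unbatched_cost * (1 - BATCHING_DISCOUNT)

print(f"Monthly tokens:        {monthly_tokens / 1e9:.1f}B")
print(f"Cost without batching: ${unbatched_cost:,.0f}")
print(f"Cost with batching:    ${batched_cost:,.0f}")
```

With these assumptions the model reproduces the numbers above: roughly 7.5B tokens per month, about $15,000 without batching, and about $6,000 with it.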
The challenge compounds with heterogeneous workloads: a mix of real-time chat requests, background document processing, and batch analytics all competing for the same compute resources. Each workload has different latency tolerance and priority, requiring sophisticated scheduling beyond simple FIFO queues.
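One way to move beyond FIFO is to tag each request with a workload class and schedule by deadline. The sketch below assumes three hypothetical classes and latency budgets; the names and numbers are illustrative and not taken from any particular serving framework:

```python
import heapq
import itertools
import time

# Hypothetical per-class latency budgets (seconds); real values depend on
# the SLAs negotiated for each traffic type.
LATENCY_BUDGET = {"chat": 0.5, "document": 5.0, "analytics": 60.0}

_counter = itertools.count()  # tie-breaker so the heap never compares request objects


class WorkloadQueue:
    """Orders requests by deadline (arrival time + class budget) instead of FIFO."""

    def __init__(self):
        self._heap = []

    def submit(self, request, workload_class):
        deadline = time.monotonic() + LATENCY_BUDGET[workload_class]
        heapq.heappush(self._heap, (deadline, next(_counter), request))

    def next_batch(self, max_size):
        """Pop up to max_size requests, most urgent deadlines first."""
        batch = []
        while self._heap and len(batch) < max_size:
            _, _, request = heapq.heappop(self._heap)
            batch.append(request)
        return batch
```

Under this policy a chat request submitted after a backlog of analytics jobs still jumps to the front of the queue, because its deadline arrives first.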
The optimal batch size isn't static; it changes based on request patterns, SLA requirements, and hardware utilization. Modern systems use three complementary strategies:
1. Memory-Aware Dynamic Batching
This approach continuously monitors GPU memory utilization and adjusts batch size in real-time to prevent OOM errors while maximizing throughput. The system maintains SLA constraints by tracking latency budgets and adjusting accordingly.
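A minimal sketch of such a controller is shown below. The watermarks, the memory-utilization input, and the latency inputs are assumptions standing in for whatever telemetry a serving stack exposes; this is not the exact policy from the cited paper:

```python
# Illustrative memory-aware batch-size controller. Thresholds are assumed
# values; gpu_mem_fraction would come from real telemetry (e.g. NVML) and
# queue_latency_ms from the server's request queue.

MIN_BATCH, MAX_BATCH = 1, 64
HIGH_WATERMARK = 0.90   # back off above this GPU memory fraction
LOW_WATERMARK = 0.70    # grow the batch below this fraction

def next_batch_size(current_size: int, gpu_mem_fraction: float,
                    queue_latency_ms: float, latency_budget_ms: float) -> int:
    """Adjust batch size to avoid OOM while respecting the latency budget."""
    if gpu_mem_fraction > HIGH_WATERMARK:
        # Shrink aggressively to avoid an out-of-memory failure.
        return max(MIN_BATCH, current_size // 2)
    if queue_latency_ms > latency_budget_ms:
        # Queueing delay is eating the SLA budget: stop growing the batch.
        return max(MIN_BATCH, current_size - 1)
    if gpu_mem_fraction < LOW_WATERMARK:
        # Headroom available: grow the batch to raise throughput.
        return min(MAX_BATCH, current_size + 1)
    return current_size
```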
2. Fairness-Aware Batch Formation
This prevents decode stalls by balancing prefill and decode tasks. Rather than prioritizing decode tasks excessively (which leads to underutilized decode slack and unnecessary prefill queuing delays), it enforces fair resource allocation between prefill and decode tasks, reducing TTFT tail latency by up to 2.29x while maintaining TPOT SLOs.
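A minimal sketch of fair batch formation under a per-iteration token budget follows. The budget, the 50/50 split, and the request fields are illustrative assumptions, not the FairBatching paper's exact policy:

```python
# Illustrative fairness-aware batch formation. Each iteration's token budget
# is split between decode tokens (one per in-flight request) and prefill
# tokens (prompt chunks), so neither phase starves the other. The budget and
# split ratio are assumed parameters.

TOKEN_BUDGET = 4096      # tokens processed per forward pass (assumed)
DECODE_SHARE = 0.5       # fraction of the budget reserved for decode

def form_batch(decode_requests, prefill_requests):
    """Fill the batch with decode tokens up to their share, then prefill chunks."""
    batch, used = [], 0
    decode_cap = int(TOKEN_BUDGET * DECODE_SHARE)

    for req in decode_requests:          # each decode step contributes ~1 token
        if used >= decode_cap:
            break                        # remaining decodes wait for the next step
        batch.append(("decode", req))
        used += 1

    for req in prefill_requests:         # chunk prompts into the leftover budget
        remaining = TOKEN_BUDGET - used
        if remaining <= 0:
            break
        chunk = min(req["remaining_prompt_tokens"], remaining)
        batch.append(("prefill", req, chunk))
        used += chunk

    return batch
```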
3. SLO-Aware Scheduling
This prioritizes requests based on their deadline constraints. Decode requests close to missing their Time-Between-Tokens (TBT) deadlines are prioritized, while prefill requests are reordered based on prompt length to reduce Time-To-First-Token (TTFT) delays.
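A minimal sketch of this ordering policy, assuming each request carries its next-token deadline and prompt length (the field names are hypothetical, not a real scheduler's API):

```python
# Illustrative SLO-aware ordering. Decode requests are ranked by how close
# they are to missing their time-between-tokens (TBT) deadline; prefill
# requests are ranked shortest-prompt-first to cut TTFT.
import time

def order_for_next_step(decode_requests, prefill_requests):
    now = time.monotonic()

    # Least slack first: requests about to violate TBT go to the front.
    decode_order = sorted(
        decode_requests,
        key=lambda r: r["next_token_deadline"] - now,
    )

    # Shortest prompt first: short prefills finish quickly, lowering median TTFT.
    prefill_order = sorted(
        prefill_requests,
        key=lambda r: r["prompt_length"],
    )

    return decode_order, prefill_order
```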
Request batching architecture is the most impactful lever for reducing LLM inference costs while maintaining performance. The key insight is that optimal batching is dynamic, not static: it must adapt to request patterns, SLA constraints, and hardware utilization in real time.
The research cited below confirms that dynamic, memory-aware batching with fairness constraints delivers both throughput gains (8-28%) and latency improvements while maintaining SLA compliance. Systems that ignore these principles risk both cost overruns and user-experience degradation.
Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching (arxiv.org): demonstrates 8-28% throughput gains and 22% capacity improvements over static batching.
FairBatching: Fairness-Aware Batch Formation for LLM Inference (arxiv.org): reduces TTFT tail latency by 2.29x while maintaining TPOT SLOs, achieving 20% single-node and 54% cluster-level capacity improvements.
Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents (arxiv.org): proves that work-conserving algorithms achieve maximum throughput; validates Orca and Sarathi-serve as throughput-optimal.
Optimal Scheduling Algorithms for LLM Inference: Theory and Practice (arxiv.org): introduces the SLAI scheduler, reducing median TTFT by 53% and increasing capacity by 26%.
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints (arxiv.org): provides a theoretical framework for near-optimal scheduling under memory constraints.