
Batching at Scale: Throughput Optimization for High-Volume Workloads

A misconfigured batch size can silently destroy your unit economics. One deployment processing 100,000 requests daily with a batch size of 1 instead of 16 is burning 60-70% more GPU capacity than necessary—translating to tens of thousands in wasted compute costs monthly. Batching isn’t just a performance optimization; it’s a financial imperative for any production LLM system handling high-volume workloads.

Batching directly impacts your bottom line in three measurable ways. First, compute utilization: GPUs process matrix operations in parallel, and small batches leave cores idle. Second, cost efficiency: Azure OpenAI’s Batch API offers a 50% cost reduction for asynchronous workloads with 24-hour turnaround windows. Third, infrastructure footprint: batching combined with quantization and memory optimization can cut GPU requirements in half, as demonstrated in Google Cloud’s benchmarks where AWQ quantization enabled Llama-2 13B to run on a single L4 GPU instead of two.

The throughput-latency tradeoff is non-negotiable. Larger batches raise throughput until the GPU saturates, but they introduce latency penalties due to prefill/decode interference. For context, vLLM’s continuous batching maximizes concurrent requests, but batch sizes that are too large cause decode stalls as the prefill phase monopolizes compute. The goal is finding your workload’s “sweet spot,” where you maximize tokens/second while staying within latency SLOs.

Static batching groups requests into fixed-size batches before processing. This approach is simpler to implement but suffers from inefficiency: if your batch size is 16 and only 8 requests arrive, you’re either waiting (increasing latency) or underutilizing capacity (wasting compute).
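
A minimal sketch of static batch formation, assuming requests are collected up front (the function and variable names here are illustrative, not from any particular serving framework):

# Chunk a fixed list of requests into fixed-size batches before processing.
# A trailing partial batch either waits for more requests (adding latency)
# or runs underutilized (wasting compute) -- the tradeoff described above.
def form_static_batches(requests: list, batch_size: int = 16) -> list:
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

batches = form_static_batches([f"request-{i}" for i in range(40)], batch_size=16)
# -> two full batches of 16 and one partial batch of 8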

When to use static batching:

  • Predictable, steady-state workloads (e.g., nightly batch processing)
  • Latency-insensitive tasks (report generation, data enrichment)
  • Cost-optimized scenarios where 24-hour turnaround is acceptable

Azure OpenAI’s Batch API exemplifies static batching: you submit a file with up to 100,000 requests, wait 24 hours, and receive 50% cost savings. This works for offline analytics, but fails for real-time applications.

Dynamic batching forms batches on-the-fly as requests arrive, using timing windows or queue thresholds. This maximizes GPU utilization but requires sophisticated scheduling to prevent latency explosions.

Key characteristics:

  • Queue-based triggers: Scale up when queue exceeds threshold (Google recommends 3-5 requests)
  • Timeout-based windows: Form batch after fixed wait (e.g., 100ms) or when max size reached
  • Adaptive sizing: Adjust batch size based on real-time metrics

Google Cloud’s GKE autoscaling documentation recommends queue size metrics over CPU/GPU utilization for LLM inference. The HPA (Horizontal Pod Autoscaler) should scale based on queue depth, not resource usage, because GPU utilization is misleadingly low during memory-bound operations.
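
A minimal sketch of exporting queue depth as a custom metric that an HPA (via a custom-metrics adapter) can scale on; this assumes the prometheus_client library, and the gauge name and port are illustrative:

# Expose the inference queue depth so the autoscaler can scale on it
# instead of GPU utilization.
from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("inference_queue_size", "Requests waiting to be batched")

def report_queue_depth(queue) -> None:
    QUEUE_DEPTH.set(len(queue))

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for Prometheus
    # ... inside the serving loop, call report_queue_depth(request_queue)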

| Feature | Static Batching | Dynamic Batching |
|---|---|---|
| Best For | Offline, latency-insensitive | Real-time, variable load |
| Implementation | Simple (file-based) | Complex (scheduler required) |
| Cost | 50% cheaper (Azure) | Standard pricing |
| Latency | High (24h SLA) | Low (sub-second) |
| Utilization | Lower (fixed batches) | High (adaptive) |

The optimal batch size is constrained by three factors:

First, memory: each request in a batch consumes KV cache memory proportional to its sequence length and the model size.

Memory calculation:

# KV cache memory per request (GB):
# 2 (K and V) * num_layers * hidden_dim * precision_bits * (prompt_tokens + max_new_tokens) / 8 / 1e9
# Example: Llama-2 7B (32 layers, hidden dim 4096, FP16) with 1,024 total tokens on an A100 (40 GB)
# 2 * 32 * 4096 * 16 * 1024 / 8 / 1e9 ≈ 0.53 GB per request
# Max batch size = (40 GB - 14 GB model weights) / 0.53 GB ≈ 49 requests
Second, compute: batch size also changes how prefill and decode interact.

  • Prefill bottleneck: Large batches slow down prompt processing
  • Decode stalls: Too many concurrent decodes reduce token generation rate
  • Memory bandwidth: Larger batches increase memory traffic

Third, your latency SLA: the batch size must satisfy latency(prefill) + latency(decode) ≤ SLA.

For a 100 ms SLA with 100 output tokens:

  • Prefill: 50 ms (fixed)
  • Decode: 0.5 ms per token
  • Decode budget: (100 ms - 50 ms) / 0.5 ms per token = 100 token-generation steps, which covers the 100 output tokens only while per-token decode latency stays at 0.5 ms; the batch can grow only until that per-token latency is exceeded
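
A small sketch of that budget check, using the numbers above (the measured prefill and per-token decode latencies at your candidate batch size are assumed inputs):

# Sanity-check a latency budget: prefill + (output tokens * per-token decode) must fit the SLA.
sla_ms = 100
prefill_ms = 50              # measured prefill latency at the candidate batch size
decode_ms_per_token = 0.5    # measured per-token decode latency at that batch size
output_tokens = 100

decode_budget_ms = sla_ms - prefill_ms
max_output_tokens = int(decode_budget_ms / decode_ms_per_token)   # 100 here
print(f"Decode budget: {decode_budget_ms} ms -> up to {max_output_tokens} output tokens")
# If per-token decode latency rises above 0.5 ms at a larger batch size, the budget no longer fits.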

The following architecture implements dynamic batching with queue-based autoscaling and memory-aware batch sizing:

graph TD
A[Client Requests] --> B[Request Queue]
B --> C{Queue Size > Threshold?}
C -->|Yes| D[Form Batch]
C -->|No| E[Wait for Timeout]
D --> F[Memory Check]
F --> G[Process Batch]
G --> H[Response Queue]
H --> I[Return Results]
E --> D

Key components:

  1. Queue Manager: Buffers incoming requests and triggers batch formation based on queue size or timeout
  2. Memory Monitor: Calculates available KV cache memory before forming batches
  3. Batch Scheduler: Implements continuous batching logic for vLLM/TGI or dynamic batching for static endpoints
  4. Autoscaler: Uses queue size metrics (3-5 threshold) to scale pods/nodes

For asynchronous workloads, Azure OpenAI’s Batch API provides 50% cost savings with 24-hour SLA:

# Submit a batch job; setting file expiration unlocks the 10,000-file limit.
# Client setup and file name below are illustrative; the AzureOpenAI client
# reads credentials, endpoint, and API version from the environment.
from openai import AzureOpenAI

client = AzureOpenAI()

with open("requests.jsonl", "rb") as f:
    file = client.files.create(
        file=f,
        purpose="batch",
        extra_body={"expires_after": {"seconds": 1209600, "anchor": "created_at"}}
    )
# Track job progress after submission; for self-hosted serving, monitor queue
# size metrics and configure the HPA for queue-based scaling (see above)

Configuration requirements:

  • File expiration must be set to exceed 7 days to unlock 10,000 file limit (default: 500 files)
  • Batch window: 24 hours for completion
  • Cost: 50% of standard pricing for input/output tokens
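
A hedged sketch of the rest of the flow with the OpenAI Python SDK: each line of the uploaded JSONL file is one request, and the batch job is created against the uploaded file. Exact field values (deployment name, `endpoint` path, API version) depend on your Azure resource:

# requests.jsonl -- one request per line, e.g.:
# {"custom_id": "task-1", "method": "POST", "url": "/chat/completions",
#  "body": {"model": "<your-deployment-name>", "messages": [{"role": "user", "content": "..."}]}}

batch_job = client.batches.create(
    input_file_id=file.id,
    endpoint="/chat/completions",   # path must match the "url" field in the JSONL lines
    completion_window="24h"
)

# Poll until the job finishes, then download results
import time
while True:
    batch_job = client.batches.retrieve(batch_job.id)
    if batch_job.status in ("completed", "failed", "cancelled", "expired"):
        break
    time.sleep(60)

if batch_job.status == "completed":
    results = client.files.content(batch_job.output_file_id).text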

Complete Dynamic Batching Server with Autoscaling Signals

import asyncio
import time
from typing import List, Dict, Any, Optional
import aiohttp
from dataclasses import dataclass
from enum import Enum


class BatchStatus(Enum):
    FORMING = "forming"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class BatchRequest:
    prompt: str
    max_tokens: int
    temperature: float
    future: asyncio.Future
    timestamp: float


class ProductionBatchScheduler:
    """
    Production-grade dynamic batching with:
    - Queue size monitoring for autoscaling
    - Memory-aware batch sizing
    - Timeout-based batching windows
    - Comprehensive metrics collection
    """

    def __init__(
        self,
        model_endpoint: str,
        max_batch_size: int = 32,
        queue_threshold: int = 5,
        max_wait_ms: int = 100,
        max_memory_gb: float = 80.0
    ):
        self.model_endpoint = model_endpoint
        self.max_batch_size = max_batch_size
        self.queue_threshold = queue_threshold
        self.max_wait_ms = max_wait_ms
        self.max_memory_gb = max_memory_gb
        self.request_queue: List[BatchRequest] = []
        self.metrics = {
            "batch_sizes": [],
            "queue_sizes": [],
            "processing_times": [],
            "scale_events": []
        }
        self._batch_task: Optional[asyncio.Task] = None

    async def add_request(
        self,
        prompt: str,
        max_tokens: int = 100,
        temperature: float = 0.7
    ) -> Dict[str, Any]:
        """
        Add request to queue with automatic batching.
        Returns response when batch completes.
        """
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            future=future,
            timestamp=time.time()
        )
        self.request_queue.append(request)
        self.metrics["queue_sizes"].append(len(self.request_queue))
        # Start the batching loop if it is not already running; the loop itself
        # enforces the timeout window, so requests below the queue threshold are
        # still served instead of waiting indefinitely.
        if self._batch_task is None or self._batch_task.done():
            self._batch_task = asyncio.create_task(self._process_batch_loop())
        # Wait for completion
        return await future

    async def _process_batch_loop(self):
        """
        Continuous batching loop with timeout window.
        """
        while self.request_queue:
            # Wait for the batching window so additional requests can accumulate
            await asyncio.sleep(self.max_wait_ms / 1000)
            # Form batch
            batch = self._form_batch()
            if not batch:
                continue
            # Process batch
            await self._execute_batch(batch)
            # Check for scale-up signal
            self._evaluate_scaling()

    def _form_batch(self) -> List[BatchRequest]:
        """
        Form batch respecting memory constraints and max size.
        """
        if not self.request_queue:
            return []
        # Estimate memory usage (simplified): assume a flat KV-cache cost per request.
        # Memory caps how many requests can be in flight at once.
        estimated_memory_per_request = 0.5  # GB per request
        memory_limited_size = int(self.max_memory_gb / estimated_memory_per_request)
        # Take minimum of: queue size, max batch size, memory limit
        batch_size = min(
            len(self.request_queue),
            self.max_batch_size,
            memory_limited_size
        )
        if batch_size <= 0:
            return []
        batch = self.request_queue[:batch_size]
        self.request_queue = self.request_queue[batch_size:]
        self.metrics["batch_sizes"].append(len(batch))
        return batch

    async def _execute_batch(self, batch: List[BatchRequest]):
        """
        Execute batch against model endpoint.
        """
        start_time = time.time()
        try:
            # Prepare batch payload
            payload = {
                "inputs": [req.prompt for req in batch],
                "parameters": {
                    "max_tokens": max(req.max_tokens for req in batch),
                    "temperature": batch[0].temperature,  # Use first request's temp
                    "return_full_text": False
                }
            }
            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{self.model_endpoint}/generate",
                    json=payload,
                    timeout=aiohttp.ClientTimeout(total=30)
                ) as response:
                    result = await response.json()
                    outputs = result.get("outputs", [])
            # Fulfill futures
            for i, req in enumerate(batch):
                if i < len(outputs):
                    req.future.set_result(outputs[i])
                else:
                    req.future.set_exception(Exception("Missing output"))
            processing_time = time.time() - start_time
            self.metrics["processing_times"].append(processing_time)
        except Exception as e:
            # Fail all requests in batch
            for req in batch:
                req.future.set_exception(e)
            self.metrics["processing_times"].append(-1)  # Error indicator

    def _evaluate_scaling(self):
        """
        Generate scaling signals based on queue dynamics.
        """
        if not self.metrics["queue_sizes"]:
            return
        recent_queue_sizes = self.metrics["queue_sizes"][-10:]
        avg_queue = sum(recent_queue_sizes) / len(recent_queue_sizes)
        current_queue = len(self.request_queue)
        # Scale-up signal
        if current_queue > self.queue_threshold * 2:
            self.metrics["scale_events"].append({
                "timestamp": time.time(),
                "action": "scale_up",
                "queue_size": current_queue,
                "reason": "queue_threshold_exceeded"
            })
            print(f"🚨 SCALE UP: Queue={current_queue}, Threshold={self.queue_threshold}")
        # Scale-down signal
        if current_queue == 0 and len(self.metrics["batch_sizes"]) > 10:
            avg_batch = sum(self.metrics["batch_sizes"][-10:]) / 10
            if avg_batch < 5:
                self.metrics["scale_events"].append({
                    "timestamp": time.time(),
                    "action": "scale_down",
                    "avg_batch": avg_batch,
                    "reason": "low_utilization"
                })
                print(f"✅ SCALE DOWN: Avg batch={avg_batch:.2f}")

    def get_metrics(self) -> Dict[str, Any]:
        """Return comprehensive metrics for monitoring."""
        if not self.metrics["processing_times"]:
            return {"status": "no_data"}
        valid_times = [t for t in self.metrics["processing_times"] if t > 0]
        avg_latency = sum(valid_times) / len(valid_times) if valid_times else 0
        return {
            "avg_latency_s": avg_latency,
            "total_requests": sum(self.metrics["batch_sizes"]),
            "avg_batch_size": sum(self.metrics["batch_sizes"]) / len(self.metrics["batch_sizes"]) if self.metrics["batch_sizes"] else 0,
            "scale_events": len(self.metrics["scale_events"]),
            "current_queue": len(self.request_queue)
        }


# Production usage example
async def production_example():
    scheduler = ProductionBatchScheduler(
        model_endpoint="http://vllm-service:8000",
        max_batch_size=32,
        queue_threshold=5,
        max_wait_ms=100
    )

    # Simulate production load
    async def generate_load():
        tasks = []
        for i in range(50):
            tasks.append(
                scheduler.add_request(
                    prompt=f"Analyze this transaction: {i}",
                    max_tokens=100,
                    temperature=0.7
                )
            )
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

    # Run load test
    results = await generate_load()
    metrics = scheduler.get_metrics()
    print(f"Metrics: {metrics}")
    return metrics


if __name__ == "__main__":
    asyncio.run(production_example())

Avoid these critical mistakes that silently degrade performance and inflate costs:

  1. Queue threshold too high: Setting threshold greater than 10 without adjusting HPA scale-up settings causes poor handling of traffic spikes. The queue backs up before scaling triggers.

  2. Static batch sizes: Using fixed batch sizes without monitoring queue dynamics misses optimization opportunities. Workloads vary; your batch size should adapt.

  3. Ignoring latency impact: Larger batches increase throughput but raise latency due to prefill/decode interference. Always validate against your latency SLOs.

  4. Missing file expiration: For Azure OpenAI Batch API, failing to set expires_after limits you to 500 files instead of 10,000, severely constraining throughput.

  5. No exponential backoff: Submitting batch jobs without retry logic causes failures when hitting token limits. Always implement backoff with jitter; a minimal retry sketch follows the pitfall list below.

  • Decode stalls: Large batches cause decode stalls as prefill monopolizes compute. Use stall-free scheduling or chunked prefills for decode-heavy workloads.
  • Memory exhaustion: Not monitoring KV cache memory leads to OOM errors. Calculate memory per request and enforce limits.
  • Over-quantization: Aggressive quantization (e.g., 4-bit on small models) can degrade quality beyond acceptable thresholds.
  • No batch size history: Failing to track batch size trends prevents capacity planning and autoscaling tuning.
  • Ignoring queue dynamics: Not monitoring queue depth over time misses patterns that inform optimal threshold settings.
  • Single-model endpoints: Mixing workloads on same endpoint reduces cache hit rate and increases latency. Separate deployments by workload.
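
A minimal retry sketch matching the backoff guidance above (the exception handling here is generic; in practice, catch the SDK's rate-limit or token-limit error types):

import random
import time

def submit_with_backoff(submit_fn, max_retries: int = 5, base_delay_s: float = 5.0):
    """Retry a batch submission with exponential backoff and jitter (5s initial, 2x multiplier)."""
    for attempt in range(max_retries):
        try:
            return submit_fn()
        except Exception:                    # replace with the SDK's rate-limit error in practice
            if attempt == max_retries - 1:
                raise
            delay = base_delay_s * (2 ** attempt) * (0.5 + random.random())  # jittered backoff
            time.sleep(delay)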

Use this formula to estimate maximum batch size for your deployment:

# Memory calculation for KV cache
memory_per_request = 2 * num_layers * hidden_dim * precision * (prompt_tokens + max_new_tokens) / 8 / 1e9
# Available memory (GPU total - model weights)
available_memory_gb = gpu_memory_gb - (model_params * precision / 8 / 1e9)
# Maximum batch size
max_batch_size = int(available_memory_gb / memory_per_request)
# Optimal batch size (80% of max for latency buffer)
optimal_batch_size = int(max_batch_size * 0.8)
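
As a worked example, a small helper that applies the formula above; the Llama-2 7B / A100 numbers are the same ones used in the earlier memory calculation, and the 80% headroom factor is the latency buffer suggested above:

def estimate_batch_size(
    num_layers: int,
    hidden_dim: int,
    precision_bits: int,
    total_tokens: int,        # prompt_tokens + max_new_tokens
    gpu_memory_gb: float,
    model_params: float,
) -> dict:
    # KV cache per request (GB): 2 (K and V) * layers * hidden dim * bytes per value * tokens
    memory_per_request = 2 * num_layers * hidden_dim * precision_bits * total_tokens / 8 / 1e9
    # GPU memory left after loading the model weights
    available_memory_gb = gpu_memory_gb - (model_params * precision_bits / 8 / 1e9)
    max_batch_size = int(available_memory_gb / memory_per_request)
    return {
        "memory_per_request_gb": memory_per_request,
        "max_batch_size": max_batch_size,
        "optimal_batch_size": int(max_batch_size * 0.8),   # 80% of max for latency buffer
    }

# Llama-2 7B in FP16 on a 40 GB A100, 1,024 total tokens per request
print(estimate_batch_size(32, 4096, 16, 1024, 40.0, 7e9))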

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Queue Threshold | 3-5 requests | Google GKE recommendation for throughput optimization |
| Max Wait Time | 50-100ms | Balances batching opportunity vs. latency |
| HPA Scale-Up | 30s cooldown, 2x scale | Prevents thrashing while handling spikes |
| Batch Size | 16-32 (static) | Sweet spot for most 7B-13B models on A100 |
| File Expiration | 1209600s (14 days) | Unlocks 10,000 file limit for Azure Batch API |
| Retry Backoff | 5s initial, 2x multiplier | Handles token limit errors gracefully |
Azure OpenAI Batch API limits:

  • Requests per file: 100,000
  • Files per resource: 500 (default) → 10,000 (with expiration)
  • Turnaround: 24 hours
  • Cost discount: 50% vs. standard
  • File size: 100MB max
| Model | Standard Input/1M | Batch Input/1M | Savings |
|---|---|---|---|
| GPT-4.1 | $5.00 | $2.50 | 50% |
| GPT-4.1-mini | $0.15 | $0.075 | 50% |
| GPT-4o | $5.00 | $2.50 | 50% |
| GPT-4o-mini | $0.15 | $0.075 | 50% |

Source: Azure OpenAI pricing

| Hardware | Model | Quantization | Batch Size | Throughput |
|---|---|---|---|---|
| 1x L4 GPU | Llama-2 13B | AWQ | 16 | 120 req/s |
| 1x A100 GPU | Llama-2 70B | FP16 | 32 | 200 req/s |
| 2x A100 GPU | Llama-2 70B | FP16 | 64 | 380 req/s |

Batch size calculator (hardware, model, latency SLA → optimal batch size)
