
Batching at Scale: Throughput Optimization for High-Volume Workloads

A misconfigured batch size can silently destroy your unit economics. One deployment processing 100,000 requests daily with a batch size of 1 instead of 16 is burning 60-70% more GPU capacity than necessary—translating to tens of thousands in wasted compute costs monthly. Batching isn’t just a performance optimization; it’s a financial imperative for any production LLM system handling high-volume workloads.

Batching directly impacts your bottom line in three measurable ways. First, compute utilization: GPUs process matrix operations in parallel, and small batches leave cores idle. Second, cost efficiency: Azure OpenAI’s Batch API offers a 50% cost reduction for asynchronous workloads with 24-hour turnaround windows. Third, infrastructure footprint: batching combined with quantization and memory optimization can cut GPU requirements in half, as demonstrated in Google Cloud’s benchmarks where AWQ quantization enabled Llama-2 13B to run on a single L4 GPU instead of two.

The throughput-latency tradeoff is non-negotiable. Larger batches raise throughput until the GPU saturates, but they introduce latency penalties due to prefill/decode interference. For context, vLLM’s continuous batching maximizes concurrent requests, but batch sizes that are too large cause decode stalls as the prefill phase monopolizes compute. The goal is finding your workload’s “sweet spot,” where you maximize tokens/second while staying within latency SLOs.

Static batching groups requests into fixed-size batches before processing. This approach is simpler to implement but suffers from inefficiency: if your batch size is 16 and only 8 requests arrive, you’re either waiting (increasing latency) or underutilizing capacity (wasting compute).
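
A minimal sketch of static batch formation, assuming requests are collected up front (the function and variable names here are illustrative, not from any particular serving framework):

# Chunk a fixed list of requests into fixed-size batches before processing.
# A trailing partial batch either waits for more requests (adding latency)
# or runs underutilized (wasting compute) -- the tradeoff described above.
def form_static_batches(requests: list, batch_size: int = 16) -> list:
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

batches = form_static_batches([f"request-{i}" for i in range(40)], batch_size=16)
# -> two full batches of 16 and one partial batch of 8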

When to use static batching:

  • Predictable, steady-state workloads (e.g., nightly batch processing)
  • Latency-insensitive tasks (report generation, data enrichment)
  • Cost-optimized scenarios where 24-hour turnaround is acceptable

Azure OpenAI’s Batch API exemplifies static batching: you submit a file with up to 100,000 requests, wait 24 hours, and receive 50% cost savings. This works for offline analytics, but fails for real-time applications.

Dynamic batching forms batches on-the-fly as requests arrive, using timing windows or queue thresholds. This maximizes GPU utilization but requires sophisticated scheduling to prevent latency explosions.

Key characteristics:

  • Queue-based triggers: Scale up when queue exceeds threshold (Google recommends 3-5 requests)
  • Timeout-based windows: Form batch after fixed wait (e.g., 100ms) or when max size reached
  • Adaptive sizing: Adjust batch size based on real-time metrics

Google Cloud’s GKE autoscaling documentation recommends queue size metrics over CPU/GPU utilization for LLM inference. The HPA (Horizontal Pod Autoscaler) should scale based on queue depth, not resource usage, because GPU utilization is misleadingly low during memory-bound operations.
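
A minimal sketch of exporting queue depth as a custom metric that an HPA (via a custom-metrics adapter) can scale on; this assumes the prometheus_client library, and the gauge name and port are illustrative:

# Expose the inference queue depth so the autoscaler can scale on it
# instead of GPU utilization.
from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("inference_queue_size", "Requests waiting to be batched")

def report_queue_depth(queue) -> None:
    QUEUE_DEPTH.set(len(queue))

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for Prometheus
    # ... inside the serving loop, call report_queue_depth(request_queue)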

| Feature | Static Batching | Dynamic Batching |
|---|---|---|
| Best For | Offline, latency-insensitive | Real-time, variable load |
| Implementation | Simple (file-based) | Complex (scheduler required) |
| Cost | 50% cheaper (Azure) | Standard pricing |
| Latency | High (24h SLA) | Low (sub-second) |
| Utilization | Lower (fixed batches) | High (adaptive) |

The optimal batch size is constrained by three factors:

First, memory: each request in a batch consumes KV cache memory proportional to its sequence length and the model size.

Memory calculation:

# KV cache memory per request (GB):
# 2 (K and V) * num_layers * hidden_dim * precision_bits * (prompt_tokens + max_new_tokens) / 8 / 1e9
# Example: Llama-2 7B (32 layers, hidden dim 4096, FP16) with 1,024 total tokens on an A100 (40 GB)
# 2 * 32 * 4096 * 16 * 1024 / 8 / 1e9 ≈ 0.53 GB per request
# Max batch size = (40 GB - 14 GB model weights) / 0.53 GB ≈ 49 requests
Second, compute: batch size also changes how prefill and decode interact.

  • Prefill bottleneck: Large batches slow down prompt processing
  • Decode stalls: Too many concurrent decodes reduce token generation rate
  • Memory bandwidth: Larger batches increase memory traffic

Third, your latency SLA: the batch size must satisfy latency(prefill) + latency(decode) ≤ SLA.

For a 100 ms SLA with 100 output tokens:

  • Prefill: 50 ms (fixed)
  • Decode: 0.5 ms per token
  • Decode budget: (100 ms - 50 ms) / 0.5 ms per token = 100 token-generation steps, which covers the 100 output tokens only while per-token decode latency stays at 0.5 ms; the batch can grow only until that per-token latency is exceeded
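
A small sketch of that budget check, using the numbers above (the measured prefill and per-token decode latencies at your candidate batch size are assumed inputs):

# Sanity-check a latency budget: prefill + (output tokens * per-token decode) must fit the SLA.
sla_ms = 100
prefill_ms = 50              # measured prefill latency at the candidate batch size
decode_ms_per_token = 0.5    # measured per-token decode latency at that batch size
output_tokens = 100

decode_budget_ms = sla_ms - prefill_ms
max_output_tokens = int(decode_budget_ms / decode_ms_per_token)   # 100 here
print(f"Decode budget: {decode_budget_ms} ms -> up to {max_output_tokens} output tokens")
# If per-token decode latency rises above 0.5 ms at a larger batch size, the budget no longer fits.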

The following architecture implements dynamic batching with queue-based autoscaling and memory-aware batch sizing:

graph TD
A[Client Requests] --> B[Request Queue]
B --> C{Queue Size > Threshold?}
C -->|Yes| D[Form Batch]
C -->|No| E[Wait for Timeout]
D --> F[Memory Check]
F --> G[Process Batch]
G --> H[Response Queue]
H --> I[Return Results]
E --> D

Key components:

  1. Queue Manager: Buffers incoming requests and triggers batch formation based on queue size or timeout
  2. Memory Monitor: Calculates available KV cache memory before forming batches
  3. Batch Scheduler: Implements continuous batching logic for vLLM/TGI or dynamic batching for static endpoints
  4. Autoscaler: Uses queue size metrics (3-5 threshold) to scale pods/nodes

For asynchronous workloads, Azure OpenAI’s Batch API provides 50% cost savings with 24-hour SLA:

# Submit a batch job; setting file expiration unlocks the 10,000-file limit.
# Client setup and file name below are illustrative; the AzureOpenAI client
# reads credentials, endpoint, and API version from the environment.
from openai import AzureOpenAI

client = AzureOpenAI()

with open("requests.jsonl", "rb") as f:
    file = client.files.create(
        file=f,
        purpose="batch",
        extra_body={"expires_after": {"seconds": 1209600, "anchor": "created_at"}}
    )
# Track job progress after submission; for self-hosted serving, monitor queue
# size metrics and configure the HPA for queue-based scaling (see above)

Configuration requirements:

  • File expiration must be set to exceed 7 days to unlock 10,000 file limit (default: 500 files)
  • Batch window: 24 hours for completion
  • Cost: 50% of standard pricing for input/output tokens
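
A hedged sketch of the rest of the flow with the OpenAI Python SDK: each line of the uploaded JSONL file is one request, and the batch job is created against the uploaded file. Exact field values (deployment name, `endpoint` path, API version) depend on your Azure resource:

# requests.jsonl -- one request per line, e.g.:
# {"custom_id": "task-1", "method": "POST", "url": "/chat/completions",
#  "body": {"model": "<your-deployment-name>", "messages": [{"role": "user", "content": "..."}]}}

batch_job = client.batches.create(
    input_file_id=file.id,
    endpoint="/chat/completions",   # path must match the "url" field in the JSONL lines
    completion_window="24h"
)

# Poll until the job finishes, then download results
import time
while True:
    batch_job = client.batches.retrieve(batch_job.id)
    if batch_job.status in ("completed", "failed", "cancelled", "expired"):
        break
    time.sleep(60)

if batch_job.status == "completed":
    results = client.files.content(batch_job.output_file_id).text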

Complete Dynamic Batching Server with Autoscaling Signals

import asyncio
import time
from typing import List, Dict, Any, Optional
import aiohttp
from dataclasses import dataclass
from enum import Enum


class BatchStatus(Enum):
    FORMING = "forming"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class BatchRequest:
    prompt: str
    max_tokens: int
    temperature: float
    future: asyncio.Future
    timestamp: float


class ProductionBatchScheduler:
    """
    Production-grade dynamic batching with:
    - Queue size monitoring for autoscaling
    - Memory-aware batch sizing
    - Timeout-based batching windows
    - Comprehensive metrics collection
    """

    def __init__(
        self,
        model_endpoint: str,
        max_batch_size: int = 32,
        queue_threshold: int = 5,
        max_wait_ms: int = 100,
        max_memory_gb: float = 80.0
    ):
        self.model_endpoint = model_endpoint
        self.max_batch_size = max_batch_size
        self.queue_threshold = queue_threshold
        self.max_wait_ms = max_wait_ms
        self.max_memory_gb = max_memory_gb
        self.request_queue: List[BatchRequest] = []
        self.metrics = {
            "batch_sizes": [],
            "queue_sizes": [],
            "processing_times": [],
            "scale_events": []
        }
        self._batch_task: Optional[asyncio.Task] = None

    async def add_request(
        self,
        prompt: str,
        max_tokens: int = 100,
        temperature: float = 0.7
    ) -> Dict[str, Any]:
        """
        Add request to queue with automatic batching.
        Returns response when batch completes.
        """
        future = asyncio.get_running_loop().create_future()
        request = BatchRequest(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            future=future,
            timestamp=time.time()
        )
        self.request_queue.append(request)
        self.metrics["queue_sizes"].append(len(self.request_queue))
        # Start the batching loop if it is not already running; the loop itself
        # enforces the timeout window, so requests below the queue threshold are
        # still served instead of waiting indefinitely.
        if self._batch_task is None or self._batch_task.done():
            self._batch_task = asyncio.create_task(self._process_batch_loop())
        # Wait for completion
        return await future

    async def _process_batch_loop(self):
        """
        Continuous batching loop with timeout window.
        """
        while self.request_queue:
            # Wait for the batching window so additional requests can accumulate
            await asyncio.sleep(self.max_wait_ms / 1000)
            # Form batch
            batch = self._form_batch()
            if not batch:
                continue
            # Process batch
            await self._execute_batch(batch)
            # Check for scale-up signal
            self._evaluate_scaling()

    def _form_batch(self) -> List[BatchRequest]:
        """
        Form batch respecting memory constraints and max size.
        """
        if not self.request_queue:
            return []
        # Estimate memory usage (simplified): assume a flat KV-cache cost per request.
        # Memory caps how many requests can be in flight at once.
        estimated_memory_per_request = 0.5  # GB per request
        memory_limited_size = int(self.max_memory_gb / estimated_memory_per_request)
        # Take minimum of: queue size, max batch size, memory limit
        batch_size = min(
            len(self.request_queue),
            self.max_batch_size,
            memory_limited_size
        )
        if batch_size <= 0:
            return []
        batch = self.request_queue[:batch_size]
        self.request_queue = self.request_queue[batch_size:]
        self.metrics["batch_sizes"].append(len(batch))
        return batch

    async def _execute_batch(self, batch: List[BatchRequest]):
        """
        Execute batch against model endpoint.
        """
        start_time = time.time()
        try:
            # Prepare batch payload
            payload = {
                "inputs": [req.prompt for req in batch],
                "parameters": {
                    "max_tokens": max(req.max_tokens for req in batch),
                    "temperature": batch[0].temperature,  # Use first request's temp
                    "return_full_text": False
                }
            }
            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{self.model_endpoint}/generate",
                    json=payload,
                    timeout=aiohttp.ClientTimeout(total=30)
                ) as response:
                    result = await response.json()
                    outputs = result.get("outputs", [])
            # Fulfill futures
            for i, req in enumerate(batch):
                if i < len(outputs):
                    req.future.set_result(outputs[i])
                else:
                    req.future.set_exception(Exception("Missing output"))
            processing_time = time.time() - start_time
            self.metrics["processing_times"].append(processing_time)
        except Exception as e:
            # Fail all requests in batch
            for req in batch:
                req.future.set_exception(e)
            self.metrics["processing_times"].append(-1)  # Error indicator

    def _evaluate_scaling(self):
        """
        Generate scaling signals based on queue dynamics.
        """
        if not self.metrics["queue_sizes"]:
            return
        recent_queue_sizes = self.metrics["queue_sizes"][-10:]
        avg_queue = sum(recent_queue_sizes) / len(recent_queue_sizes)
        current_queue = len(self.request_queue)
        # Scale-up signal
        if current_queue > self.queue_threshold * 2:
            self.metrics["scale_events"].append({
                "timestamp": time.time(),
                "action": "scale_up",
                "queue_size": current_queue,
                "reason": "queue_threshold_exceeded"
            })
            print(f"🚨 SCALE UP: Queue={current_queue}, Threshold={self.queue_threshold}")
        # Scale-down signal
        if current_queue == 0 and len(self.metrics["batch_sizes"]) > 10:
            avg_batch = sum(self.metrics["batch_sizes"][-10:]) / 10
            if avg_batch < 5:
                self.metrics["scale_events"].append({
                    "timestamp": time.time(),
                    "action": "scale_down",
                    "avg_batch": avg_batch,
                    "reason": "low_utilization"
                })
                print(f"✅ SCALE DOWN: Avg batch={avg_batch:.2f}")

    def get_metrics(self) -> Dict[str, Any]:
        """Return comprehensive metrics for monitoring."""
        if not self.metrics["processing_times"]:
            return {"status": "no_data"}
        valid_times = [t for t in self.metrics["processing_times"] if t > 0]
        avg_latency = sum(valid_times) / len(valid_times) if valid_times else 0
        return {
            "avg_latency_s": avg_latency,
            "total_requests": sum(self.metrics["batch_sizes"]),
            "avg_batch_size": sum(self.metrics["batch_sizes"]) / len(self.metrics["batch_sizes"]) if self.metrics["batch_sizes"] else 0,
            "scale_events": len(self.metrics["scale_events"]),
            "current_queue": len(self.request_queue)
        }


# Production usage example
async def production_example():
    scheduler = ProductionBatchScheduler(
        model_endpoint="http://vllm-service:8000",
        max_batch_size=32,
        queue_threshold=5,
        max_wait_ms=100
    )

    # Simulate production load
    async def generate_load():
        tasks = []
        for i in range(50):
            tasks.append(
                scheduler.add_request(
                    prompt=f"Analyze this transaction: {i}",
                    max_tokens=100,
                    temperature=0.7
                )
            )
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

    # Run load test
    results = await generate_load()
    metrics = scheduler.get_metrics()
    print(f"Metrics: {metrics}")
    return metrics


if __name__ == "__main__":
    asyncio.run(production_example())

Avoid these critical mistakes that silently degrade performance and inflate costs:

  1. Queue threshold too high: Setting threshold greater than 10 without adjusting HPA scale-up settings causes poor handling of traffic spikes. The queue backs up before scaling triggers.

  2. Static batch sizes: Using fixed batch sizes without monitoring queue dynamics misses optimization opportunities. Workloads vary; your batch size should adapt.

  3. Ignoring latency impact: Larger batches increase throughput but raise latency due to prefill/decode interference. Always validate against your latency SLOs.

  4. Missing file expiration: For Azure OpenAI Batch API, failing to set expires_after limits you to 500 files instead of 10,000, severely constraining throughput.

  5. No exponential backoff: Submitting batch jobs without retry logic causes failures when hitting token limits. Always implement backoff with jitter; a minimal retry sketch follows the pitfall list below.

  • Decode stalls: Large batches cause decode stalls as prefill monopolizes compute. Use stall-free scheduling or chunked prefills for decode-heavy workloads.
  • Memory exhaustion: Not monitoring KV cache memory leads to OOM errors. Calculate memory per request and enforce limits.
  • Over-quantization: Aggressive quantization (e.g., 4-bit on small models) can degrade quality beyond acceptable thresholds.
  • No batch size history: Failing to track batch size trends prevents capacity planning and autoscaling tuning.
  • Ignoring queue dynamics: Not monitoring queue depth over time misses patterns that inform optimal threshold settings.
  • Single-model endpoints: Mixing workloads on same endpoint reduces cache hit rate and increases latency. Separate deployments by workload.
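
A minimal retry sketch matching the backoff guidance above (the exception handling here is generic; in practice, catch the SDK's rate-limit or token-limit error types):

import random
import time

def submit_with_backoff(submit_fn, max_retries: int = 5, base_delay_s: float = 5.0):
    """Retry a batch submission with exponential backoff and jitter (5s initial, 2x multiplier)."""
    for attempt in range(max_retries):
        try:
            return submit_fn()
        except Exception:                    # replace with the SDK's rate-limit error in practice
            if attempt == max_retries - 1:
                raise
            delay = base_delay_s * (2 ** attempt) * (0.5 + random.random())  # jittered backoff
            time.sleep(delay)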

Use this formula to estimate maximum batch size for your deployment:

# Memory calculation for KV cache
memory_per_request = 2 * num_layers * hidden_dim * precision * (prompt_tokens + max_new_tokens) / 8 / 1e9
# Available memory (GPU total - model weights)
available_memory_gb = gpu_memory_gb - (model_params * precision / 8 / 1e9)
# Maximum batch size
max_batch_size = int(available_memory_gb / memory_per_request)
# Optimal batch size (80% of max for latency buffer)
optimal_batch_size = int(max_batch_size * 0.8)
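
As a worked example, a small helper that applies the formula above; the Llama-2 7B / A100 numbers are the same ones used in the earlier memory calculation, and the 80% headroom factor is the latency buffer suggested above:

def estimate_batch_size(
    num_layers: int,
    hidden_dim: int,
    precision_bits: int,
    total_tokens: int,        # prompt_tokens + max_new_tokens
    gpu_memory_gb: float,
    model_params: float,
) -> dict:
    # KV cache per request (GB): 2 (K and V) * layers * hidden dim * bytes per value * tokens
    memory_per_request = 2 * num_layers * hidden_dim * precision_bits * total_tokens / 8 / 1e9
    # GPU memory left after loading the model weights
    available_memory_gb = gpu_memory_gb - (model_params * precision_bits / 8 / 1e9)
    max_batch_size = int(available_memory_gb / memory_per_request)
    return {
        "memory_per_request_gb": memory_per_request,
        "max_batch_size": max_batch_size,
        "optimal_batch_size": int(max_batch_size * 0.8),   # 80% of max for latency buffer
    }

# Llama-2 7B in FP16 on a 40 GB A100, 1,024 total tokens per request
print(estimate_batch_size(32, 4096, 16, 1024, 40.0, 7e9))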

| Parameter | Recommended Value | Rationale |
|---|---|---|
| Queue Threshold | 3-5 requests | Google GKE recommendation for throughput optimization |
| Max Wait Time | 50-100ms | Balances batching opportunity vs. latency |
| HPA Scale-Up | 30s cooldown, 2x scale | Prevents thrashing while handling spikes |
| Batch Size | 16-32 (static) | Sweet spot for most 7B-13B models on A100 |
| File Expiration | 1209600s (14 days) | Unlocks 10,000 file limit for Azure Batch API |
| Retry Backoff | 5s initial, 2x multiplier | Handles token limit errors gracefully |
Azure OpenAI Batch API limits:

  • Requests per file: 100,000
  • Files per resource: 500 (default) → 10,000 (with expiration)
  • Turnaround: 24 hours
  • Cost discount: 50% vs. standard
  • File size: 100MB max
| Model | Standard Input/1M | Batch Input/1M | Savings |
|---|---|---|---|
| GPT-4.1 | $5.00 | $2.50 | 50% |
| GPT-4.1-mini | $0.15 | $0.075 | 50% |
| GPT-4o | $5.00 | $2.50 | 50% |
| GPT-4o-mini | $0.15 | $0.075 | 50% |

Source: Azure OpenAI pricing

| Hardware | Model | Quantization | Batch Size | Throughput |
|---|---|---|---|---|
| 1x L4 GPU | Llama-2 13B | AWQ | 16 | 120 req/s |
| 1x A100 GPU | Llama-2 70B | FP16 | 32 | 200 req/s |
| 2x A100 GPU | Llama-2 70B | FP16 | 64 | 380 req/s |

Batch size calculator (hardware, model, latency SLA → optimal batch size)
