
Load Balancing Strategies for LLM Endpoints


A major e-commerce platform’s AI customer service system crashed for 4 hours during a flash sale because their naive round-robin load balancer sent 40% of requests to a single failing endpoint. The resulting queue buildup cascaded across their entire infrastructure, costing an estimated $2.3M in lost revenue. This guide provides production-ready load balancing strategies that prevent such disasters by intelligently distributing traffic across LLM endpoints.

Traditional load balancing algorithms designed for stateless web services fail catastrophically with LLM endpoints. The fundamental differences include:

  • Variable request sizes: A single LLM request can range from 50 tokens to 200,000+ tokens, creating massive load imbalances with simple round-robin
  • Stateful connections: LLM endpoints maintain KV-cache state, making session affinity critical for performance
  • Rate limiting: Providers enforce strict token-per-minute limits (e.g., Anthropic’s 50,000 TPM for Claude 3.5 Sonnet)
  • Cost implications: Routing to expensive provisioned throughput when cheaper on-demand capacity is available wastes thousands of dollars monthly

According to Google Cloud’s documentation on advanced load balancing, their waterfall-by-region algorithm minimizes latency by routing to the closest region first, then overflowing to others only when capacity is exhausted. This approach reduces p99 latency by 40-60% compared to naive distribution.
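
The waterfall idea can be sketched in a few lines of Python. This is a hypothetical illustration of the pattern, not Google's implementation; the region dictionaries and capacity numbers are assumptions.

def pick_region(regions):
    """Pick the closest region with spare capacity; overflow to the next one.

    `regions` is assumed to be pre-sorted by network distance from the client,
    each entry shaped like {"name": ..., "in_flight": int, "capacity": int}.
    """
    for region in regions:
        if region["in_flight"] < region["capacity"]:
            return region          # closest region with headroom wins
    return None                    # every region is saturated: shed or queue


regions = [
    {"name": "us-east1", "in_flight": 48, "capacity": 50},     # closest
    {"name": "us-central1", "in_flight": 10, "capacity": 50},  # overflow target
]
print(pick_region(regions)["name"])  # us-east1 until it fills, then us-central1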

Simple round-robin distributes requests sequentially across endpoints. While easy to implement, it’s fundamentally flawed for LLM workloads.

Problems:

  • Ignores request size variance
  • No health awareness
  • No consideration for endpoint capacity
  • Cannot handle rate limits gracefully

When to use: Only for initial development or when all endpoints are identical in capacity, latency, and cost.
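
For contrast, here is roughly what naive round-robin looks like (the endpoint URLs are placeholders). Its brevity is the main reason it ends up in production despite the problems listed above.

import itertools

# Minimal round-robin selector: blind to request size, health, rate limits, and cost.
endpoints = ["https://llm-a.example.com", "https://llm-b.example.com"]
rotation = itertools.cycle(endpoints)

for _ in range(4):
    print(next(rotation))  # a, b, a, b -- regardless of load or failures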

Priority-based routing assigns weights to endpoints based on cost, performance, or reliability tiers. Higher-priority endpoints receive more traffic until capacity limits are reached.

Implementation approach:

  • Tier 1: Provisioned throughput endpoints (lowest latency, highest cost)
  • Tier 2: On-demand endpoints (moderate latency, moderate cost)
  • Tier 3: Batch/budget endpoints (higher latency, lowest cost)
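
One way to express these tiers as configuration (a sketch; the tier names, URLs, and ordering rule are assumptions):

# Hypothetical tier table for priority-based routing.
# Lower priority number = preferred tier; fall through when a tier is unavailable.
TIERS = [
    {"priority": 1, "name": "provisioned", "endpoints": ["https://ptu-1.example.com"]},
    {"priority": 2, "name": "on-demand",   "endpoints": ["https://od-1.example.com"]},
    {"priority": 3, "name": "batch",       "endpoints": ["https://batch-1.example.com"]},
]

def endpoints_in_priority_order():
    """Yield endpoint URLs tier by tier, highest-priority tier first."""
    for tier in sorted(TIERS, key=lambda t: t["priority"]):
        for url in tier["endpoints"]:
            yield url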

Latency-aware routing continuously monitors endpoint response times and routes traffic to the fastest available endpoints. This requires:

  • Active health checks with lightweight probes
  • Exponential weighted moving average (EWMA) for latency smoothing
  • Dynamic weight adjustment based on real-time performance
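
A minimal sketch of the EWMA piece, assuming a tunable smoothing factor alpha; the router would then prefer the endpoint with the lowest smoothed value.

class LatencyTracker:
    """Exponentially weighted moving average (EWMA) of observed latency."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha        # higher alpha reacts faster to latency changes
        self.ewma_ms = None       # no observations yet

    def observe(self, latency_ms):
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms
        return self.ewma_ms


trackers = {"endpoint-a": LatencyTracker(), "endpoint-b": LatencyTracker()}
trackers["endpoint-a"].observe(120.0)
trackers["endpoint-b"].observe(450.0)
fastest = min(trackers, key=lambda name: trackers[name].ewma_ms)
print(fastest)  # endpoint-a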

Adaptive balancing combines priority, latency, and health signals to make routing decisions. It’s the gold standard for production LLM deployments.

Queue buildup is the silent killer of LLM infrastructure. Without proper controls, a single slow endpoint can cause memory exhaustion and cascade failures.

When an endpoint becomes slow or unresponsive:

  1. Requests pile up in the load balancer’s pending queue
  2. Memory usage grows linearly with queue depth
  3. New requests are still routed to the failing endpoint
  4. Eventually, the entire system becomes unresponsive

Global queue threshold: Reject new requests when total pending requests exceed a limit (e.g., 50 concurrent requests).

Per-endpoint queue threshold: Stop routing to an endpoint when its pending queue exceeds capacity (e.g., 5 requests).

Graceful degradation: When overloaded, route only high-priority requests and shed low-priority ones.
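
A sketch of both thresholds using asyncio semaphores, with the example limits from above (50 global, 5 per endpoint); `send` stands in for whatever function actually calls the endpoint:

import asyncio

GLOBAL_LIMIT = 50        # total in-flight requests across all endpoints
PER_ENDPOINT_LIMIT = 5   # in-flight requests per endpoint

global_slots = asyncio.Semaphore(GLOBAL_LIMIT)
endpoint_slots = {}      # url -> asyncio.Semaphore(PER_ENDPOINT_LIMIT)


class Overloaded(Exception):
    """Raised when a request should be shed (map to HTTP 503 with Retry-After)."""


async def guarded_call(url, send):
    """Run send(url) only if both queue thresholds allow it; otherwise shed."""
    slots = endpoint_slots.setdefault(url, asyncio.Semaphore(PER_ENDPOINT_LIMIT))
    if global_slots.locked() or slots.locked():
        # locked() means no permits remain: reject instead of queueing forever
        raise Overloaded(f"queue limit reached for {url}")
    async with global_slots, slots:
        return await send(url)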

  1. Configure endpoint health checks

    Implement lightweight health checks that run every 30 seconds. Use a minimal request (e.g., 5 tokens) to verify responsiveness without consuming significant resources. (A probe sketch follows the full balancer example below.)

  2. Implement intelligent routing logic

    Build a routing algorithm that considers:

    • Endpoint health status
    • Current queue depth
    • Recent latency metrics
    • Priority tier
    • Rate limit state (respect 429 responses and Retry-After headers)
  3. Add request shedding mechanisms

    Implement both global and per-endpoint queue limits. When limits are exceeded:

    • Reject new requests with HTTP 503 (Service Unavailable)
    • Include Retry-After header based on queue depth
    • Log the event for capacity planning
  4. Monitor and adapt

    Continuously collect metrics:

    • Request throughput per endpoint
    • Latency percentiles (p50, p95, p99)
    • Error rates
    • Queue depths
    • Token consumption per endpoint

    Use these metrics to dynamically adjust routing weights and endpoint capacity.

smart_balancer.py
import asyncio
import random
import time
from typing import Dict, List, Optional

import httpx


class LLMEndpoint:
    """Represents a single LLM endpoint with health and priority state."""

    def __init__(self, url: str, priority: int, api_key: str):
        self.url = url
        self.priority = priority
        self.api_key = api_key
        self.is_healthy = True
        self.retry_after: Optional[float] = None  # absolute timestamp, not a duration
        self.consecutive_failures = 0
        self.total_requests = 0
        self.failed_requests = 0

    def is_available(self) -> bool:
        """Check if the endpoint can receive traffic (healthy, or cooled down)."""
        if self.retry_after is not None:
            if time.time() < self.retry_after:
                return False
            # Cooldown has expired: allow a trial request even if the last
            # attempt failed, so the endpoint has a chance to recover.
            return True
        return self.is_healthy

    def mark_failure(self, retry_after_seconds: Optional[float] = None):
        """Mark endpoint as failed, optionally honoring a Retry-After duration."""
        self.consecutive_failures += 1
        self.total_requests += 1
        self.failed_requests += 1
        self.is_healthy = False
        if retry_after_seconds:
            self.retry_after = time.time() + retry_after_seconds
        else:
            # Exponential backoff with jitter, capped at 60 seconds
            backoff = min(2 ** self.consecutive_failures, 60)
            jitter = random.uniform(0, backoff * 0.1)
            self.retry_after = time.time() + backoff + jitter

    def mark_success(self):
        """Reset failure state on a successful request."""
        self.consecutive_failures = 0
        self.is_healthy = True
        self.retry_after = None
        self.total_requests += 1


class SmartLLMBalancer:
    """Intelligent load balancer for LLM endpoints with priority and health awareness.

    Assumes every endpoint exposes an OpenAI-compatible /v1/chat/completions
    API with Bearer authentication.
    """

    def __init__(self, endpoints: List[Dict]):
        self.endpoints = [
            LLMEndpoint(ep['url'], ep['priority'], ep['api_key'])
            for ep in endpoints
        ]
        # Sort by priority (lower number = higher priority)
        self.endpoints.sort(key=lambda x: x.priority)
        self.client = httpx.AsyncClient(timeout=30.0)

    async def route_request(self, payload: Dict, max_retries: int = 3) -> Dict:
        """Route a request to the best available endpoint, retrying on failure."""
        for attempt in range(max_retries):
            # Collect endpoints that are currently available
            available = [ep for ep in self.endpoints if ep.is_available()]
            if not available:
                # All endpoints are down; try the highest-priority one anyway
                target = self.endpoints[0]
            else:
                # Group by priority and select from the highest-priority group
                by_priority: Dict[int, List[LLMEndpoint]] = {}
                for ep in available:
                    by_priority.setdefault(ep.priority, []).append(ep)
                highest_priority = min(by_priority.keys())
                candidates = by_priority[highest_priority]
                # Randomly select among endpoints that share the same priority
                target = random.choice(candidates)

            try:
                response = await self.client.post(
                    f"{target.url}/v1/chat/completions",
                    json=payload,
                    headers={"Authorization": f"Bearer {target.api_key}"},
                )
                if response.status_code == 200:
                    target.mark_success()
                    return response.json()
                elif response.status_code == 429:
                    # Rate limited: respect the Retry-After header if present
                    retry_after = response.headers.get("retry-after")
                    try:
                        retry_seconds = float(retry_after) if retry_after else 60.0
                    except ValueError:
                        retry_seconds = 60.0
                    target.mark_failure(retry_seconds)
                    # Log the throttling event
                    print(f"Endpoint {target.url} throttled. Retry after {retry_seconds}s")
                    # Continue to the next attempt with a different endpoint
                    continue
                elif response.status_code >= 500:
                    # Server error: mark as unhealthy and fail over
                    target.mark_failure()
                    continue
                else:
                    # Other errors: mark as a failure and fail over
                    target.mark_failure()
                    continue
            except (httpx.TimeoutException, httpx.RequestError) as e:
                print(f"Request error to {target.url}: {e}")
                target.mark_failure()
                continue

        # If all attempts fail, raise an error
        raise Exception("All endpoints failed after retries")

    def get_stats(self) -> Dict:
        """Get statistics about endpoint health and usage."""
        return {
            ep.url: {
                'priority': ep.priority,
                'healthy': ep.is_healthy,
                'available': ep.is_available(),
                'total_requests': ep.total_requests,
                'failed_requests': ep.failed_requests,
                'consecutive_failures': ep.consecutive_failures,
            }
            for ep in self.endpoints
        }


# Example usage
async def main():
    endpoints = [
        {
            "url": "https://api.openai.com",
            "priority": 1,
            "api_key": "sk-proj-..."
        },
        {
            "url": "https://api.anthropic.com",  # would need an OpenAI-compatible gateway
            "priority": 2,
            "api_key": "sk-ant-..."
        }
    ]
    balancer = SmartLLMBalancer(endpoints)
    payload = {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Hello, world!"}]
    }
    try:
        response = await balancer.route_request(payload)
        print("Success:", response)
    except Exception as e:
        print(f"Failed: {e}")

    print("\nEndpoint Stats:")
    print(balancer.get_stats())
    await balancer.client.aclose()


if __name__ == "__main__":
    asyncio.run(main())
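
The balancer above learns about failures only passively, from real traffic. The active health checks described in step 1 can run beside it as a background task; here is a hypothetical probe loop against the same LLMEndpoint objects (the probe model name is an assumption, and probes also count toward total_requests in this sketch).

# Hypothetical active health-check loop for SmartLLMBalancer (step 1 above).
# Sends a tiny completion request to every endpoint on a fixed interval and
# updates the endpoint's health state from the result.
async def health_check_loop(balancer: SmartLLMBalancer, interval: float = 30.0):
    probe = {
        "model": "gpt-4o-mini",  # assumed cheap probe model; adjust per endpoint
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 5,
    }
    while True:
        for ep in balancer.endpoints:
            try:
                resp = await balancer.client.post(
                    f"{ep.url}/v1/chat/completions",
                    json=probe,
                    headers={"Authorization": f"Bearer {ep.api_key}"},
                    timeout=10.0,
                )
                if resp.status_code == 200:
                    ep.mark_success()
                else:
                    ep.mark_failure()
            except httpx.HTTPError:
                ep.mark_failure()
        await asyncio.sleep(interval)

# Start it next to normal traffic, e.g.:
#     asyncio.create_task(health_check_loop(balancer))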
  • Pitfall 1: Ignoring 429 Rate Limit Responses

    LLM providers return HTTP 429 with Retry-After headers when you exceed rate limits. Ignoring these causes wasted requests and potential account suspension. Always respect the Retry-After header and implement exponential backoff.

  • Pitfall 2: No Request Shedding Mechanism

    Without queue limits, a single slow endpoint can consume all available memory. Implement both global (e.g., 50 concurrent) and per-endpoint (e.g., 5 concurrent) queue thresholds.

  • Pitfall 3: Treating All Requests Equally

    A 200,000-token context request consumes 4000x more resources than a 50-token request. Round-robin distribution will quickly overwhelm endpoints. Use least-connections or load-based routing instead.
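
A least-connections pick is a one-line change from round-robin once in-flight counts are tracked (sketch; the counts dictionary is assumed to be maintained by the request path).

# Choose the endpoint with the fewest requests currently in flight.
# Callers increment the count when a request starts and decrement it on completion.
in_flight = {"https://llm-a.example.com": 3, "https://llm-b.example.com": 0}

def least_connections(counts):
    return min(counts, key=counts.get)

print(least_connections(in_flight))  # https://llm-b.example.com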

  • Pitfall 4: Missing Health Checks

    Sending traffic to unhealthy endpoints wastes money and increases latency. Implement active health checks every 30 seconds with lightweight probes.

  • Pitfall 5: Client-Side Retries Without Server-Side Failover

    Client retries can create thundering herd problems when endpoints recover. Server-side failover with circuit breakers prevents this.

  • Pitfall 6: Not Monitoring Queue Depths

    Queue depth is the canary in the coal mine. Monitor it religiously and alert when it exceeds 70% of capacity.

  • Pitfall 7: Ignoring Cost Tiers

    Routing to expensive provisioned throughput (PTU) when cheaper on-demand capacity is available can increase costs by 5-10x. Implement priority-based routing to optimize costs.

  • Pitfall 8: No Geographic Awareness

    Routing cross-region adds 100-300ms of latency. Use region-aware routing to prioritize local endpoints.

When implementing cost-aware load balancing, reference these current pricing tiers (as of December 2025):

Model                 | Provider  | Input Cost | Output Cost | Context Window
----------------------|-----------|------------|-------------|---------------
Claude 3.5 Sonnet     | Anthropic | $3.00/1M   | $15.00/1M   | 200K tokens
Claude 3.5 Haiku      | Anthropic | $0.80/1M   | $4.00/1M    | 200K tokens
GPT-4o                | OpenAI    | $5.00/1M   | $15.00/1M   | 128K tokens
GPT-4o mini           | OpenAI    | $0.15/1M   | $0.60/1M    | 128K tokens
Gemini 2.0 Flash      | Google    | $0.10/1M   | $0.40/1M    | 1M tokens
Gemini 2.0 Flash Lite | Google    | $0.075/1M  | $0.30/1M    | 1M tokens
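
Cost per request follows directly from the table: input tokens times the input rate plus output tokens times the output rate, both expressed per million tokens. A quick worked example using the Claude 3.5 Sonnet row:

def request_cost(input_tokens, output_tokens, input_per_million, output_per_million):
    """Dollar cost of one request, given per-million-token rates."""
    return (input_tokens * input_per_million
            + output_tokens * output_per_million) / 1_000_000

# 2,000 prompt tokens + 500 completion tokens on Claude 3.5 Sonnet ($3 / $15 per 1M):
print(request_cost(2_000, 500, 3.00, 15.00))  # 0.0135 -> about 1.35 cents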

Quick Reference: Load Balancing Algorithms

Algorithm      | Best For                 | Latency | Cost   | Complexity
---------------|--------------------------|---------|--------|-----------
Round-Robin    | Development/testing      | High    | High   | Low
Priority-Based | Cost optimization        | Medium  | Low    | Medium
Latency-Aware  | Performance-critical     | Low     | Medium | Medium
Adaptive       | Production (all factors) | Low     | Low    | High
Queue-Aware    | High-traffic systems     | Low     | Medium | High


Circuit breakers prevent cascading failures by automatically stopping traffic to failing endpoints. Configure:

  • Failure threshold: 5 consecutive failures
  • Open state duration: 60 seconds (or respect Retry-After)
  • Half-open state: Allow 1 test request after cooldown
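
A minimal circuit breaker with those settings as defaults might look like the sketch below. It is simplified: a production half-open state would admit exactly one probe request rather than reopening fully.

import time
from typing import Optional


class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                    # closed: traffic flows
        if time.time() - self.opened_at >= self.open_seconds:
            return True                                    # half-open: allow a trial
        return False                                       # open: block traffic

    def record_success(self):
        self.failures = 0
        self.opened_at = None                              # close the circuit

    def record_failure(self, retry_after: Optional[float] = None):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()
            if retry_after:
                self.open_seconds = retry_after            # respect Retry-After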

For strict p99 latency targets, use request hedging: send duplicate requests to multiple endpoints, take the first response, and cancel the slower requests to avoid wasted resources.
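
With asyncio this is a FIRST_COMPLETED wait that cancels the losers (a sketch; send_request is a placeholder coroutine supplied by the caller).

import asyncio

async def hedged_request(send_request, endpoints, hedge_count=2):
    """Send the same request to hedge_count endpoints and keep the first reply."""
    tasks = [asyncio.create_task(send_request(ep)) for ep in endpoints[:hedge_count]]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                    # cancel the slower duplicates
    return next(iter(done)).result()     # raises if the fastest attempt failed

A common refinement is to delay the duplicate by roughly the endpoint's p95 latency, so hedges are only sent when the first attempt is already slow and the extra token spend stays small.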

Essential metrics to track:

  • Throughput: Requests per minute per endpoint
  • Latency: p50, p95, p99, p99.9
  • Error rates: Per-endpoint and global
  • Queue depth: Real-time and historical
  • Token consumption: Per-endpoint and per-customer
  • Cost per request: By endpoint tier

When adding new endpoints:

  • Start with 5% traffic
  • Monitor for 24 hours
  • Gradually increase to 100%
  • Keep rollback plan ready
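
The ramp can be implemented as a weighted coin flip in the router; a sketch with an illustrative 5% starting weight:

import random

def pick_endpoint(stable_pool, canary, canary_weight=0.05):
    """Send ~canary_weight of traffic to the new endpoint, the rest to the stable pool."""
    if random.random() < canary_weight:
        return canary
    return random.choice(stable_pool)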

Design your architecture so that losing any single endpoint doesn’t impact availability:

  • Minimum 2 endpoints per tier
  • Cross-region redundancy
  • Automated failover within 30 seconds

Case Study: Google Cloud’s Service Load Balancing


Google Cloud’s advanced load balancing demonstrates the power of intelligent routing. Their system uses:

Auto-capacity draining: Automatically removes unhealthy backends when less than 25% of instances pass health checks, triggering immediate failover (cloud.google.com).

Waterfall-by-region algorithm: Routes to the closest region first, then overflows to others only when capacity is exhausted. This approach reduces p99 latency by 40-60% compared to naive distribution (cloud.google.com).

Traffic isolation: Prevents noisy neighbor problems by isolating traffic patterns and ensuring fair resource allocation.

These features enable sub-second failover and maintain performance even during regional outages.

  • Load balancing is critical: LLM endpoints require more than simple round-robin distribution
  • Health awareness is mandatory: Active health checks prevent traffic to failing endpoints
  • Queue management prevents cascade failures: Implement global and per-endpoint thresholds
  • Cost optimization matters: Priority-based routing can reduce costs by 50-70%
  • Monitoring is essential: Track throughput, latency, errors, and queue depths
  • Adaptive algorithms win: Combine multiple signals for optimal routing decisions
  • Respect rate limits: Always handle 429 responses and Retry-After headers