A fintech startup processing 1M+ daily requests faced a nightmare: the LLM bill for its AI-powered fraud detection system spiked from $12K to $89K in a single week. The culprit? Unthrottled retries during a traffic surge. This guide teaches you to prevent that fate with SLA-aware rate limiting that protects both your budget and your service commitments.
Rate limiting has evolved from simple request counters into a sophisticated cost-control mechanism. With modern LLMs costing $3-15 per million tokens, a poorly configured system can burn through budgets in hours. According to Anthropic's pricing, Claude 3.5 Sonnet charges $3.00 per million input tokens and $15.00 per million output tokens (Anthropic Pricing), while OpenAI's GPT-4o costs $5.00 per million input tokens and $15.00 per million output tokens (OpenAI Pricing). At scale, these costs compound rapidly.
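To make that concrete, here is a rough back-of-the-envelope figure, assuming about 1,000 input and 500 output tokens per request (an assumption for illustration, not a measured workload):

```typescript
// Rough daily spend for 1M requests/day on GPT-4o ($5 input / $15 output per 1M tokens).
// The per-request token counts are illustrative assumptions.
const requestsPerDay = 1_000_000;
const inputTokensPerRequest = 1_000;
const outputTokensPerRequest = 500;

const costPerRequest =
  (inputTokensPerRequest / 1_000_000) * 5.0 +
  (outputTokensPerRequest / 1_000_000) * 15.0;        // $0.0125

const dailySpend = costPerRequest * requestsPerDay;   // $12,500/day
console.log(`~$${dailySpend.toFixed(0)}/day, ~$${(dailySpend * 30).toFixed(0)}/month`);
```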
The business impact is severe:
Budget overruns : Uncontrolled requests can 5-10x your monthly bill
SLA violations : Throttling without prioritization leads to dropped premium-tier requests
Revenue loss : Degraded user experience directly impacts retention
Operational risk : System instability during traffic spikes
Traditional rate limiters (token bucket, fixed window) treat all traffic equally. Modern systems need adaptive throttling that considers:
Request cost (token volume)
User tier (SLA commitments)
Model pricing (dynamic routing)
Time-of-day patterns
Queue depth
Adaptive throttling adjusts request processing rates based on real-time system state and business rules. Unlike static limiters, it dynamically responds to cost pressure, queue depth, and SLA requirements.
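As a rough sketch, the per-request signals such a throttler weighs might be modeled like this (field names are illustrative, not taken from any specific library):

```typescript
// Illustrative shape of the per-request signals an adaptive throttler weighs.
interface ThrottleContext {
  estimatedInputTokens: number;                          // request cost driver
  estimatedOutputTokens: number;
  tier: 'premium' | 'standard' | 'batch';                // SLA commitment
  modelPricePerMTok: { input: number; output: number };  // feeds dynamic routing
  hourOfDay: number;                                     // time-of-day pattern
  queueDepth: number;                                    // current backlog
  budgetUtilization: number;                             // 0..1, spend / budget in the active window
}
```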
1. Cost-Aware Token Bucket
Traditional token buckets refill at fixed rates. Cost-aware variants weight tokens by actual model cost:
```mermaid
flowchart TD
    A[Request Ingress] --> B{Cost Estimator}
    B -->|High Cost| C[Priority Queue]
    B -->|Low Cost| D[Standard Queue]
    C --> E{Token Bucket<br/>Weighted by $}
    D --> E
    E --> F[LLM Call]
    F --> G[Response + Token Metadata]
    G --> H[Update Cost Model]
```
Key Mechanism (a minimal cost-weighted bucket is sketched after this list):
Each request consumes tokens proportional to its estimated cost (estimated input/output token counts × the model's price)
Buckets refill based on budget window (e.g., $100/hour)
High-cost requests trigger priority queuing
Real-time metadata from LLM responses refines cost estimates
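A minimal sketch of the idea, assuming the bucket's "tokens" are dollars that refill linearly over the budget window (class and method names are illustrative):

```typescript
// Token bucket whose "tokens" are dollars: refills at budget/window and
// charges each request its estimated dollar cost. Illustrative sketch.
class CostWeightedBucket {
  private available: number;
  private lastRefill = Date.now();

  constructor(private budgetPerWindow: number, private windowMs: number) {
    this.available = budgetPerWindow;
  }

  tryConsume(estimatedCostUsd: number): boolean {
    this.refill();
    if (estimatedCostUsd > this.available) return false; // throttle or queue
    this.available -= estimatedCostUsd;
    return true;
  }

  // Called with actual token usage from the LLM response to correct the estimate.
  reconcile(estimatedCostUsd: number, actualCostUsd: number): void {
    this.available += estimatedCostUsd - actualCostUsd;
  }

  private refill(): void {
    const elapsed = Date.now() - this.lastRefill;
    const refillAmount = (elapsed / this.windowMs) * this.budgetPerWindow;
    this.available = Math.min(this.budgetPerWindow, this.available + refillAmount);
    this.lastRefill = Date.now();
  }
}
```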
2. Priority Queuing
Requests are tagged with SLA tiers (a minimal tiered queue is sketched after this list):
Tier 1 (Premium) : Guaranteed processing, bypasses throttling
Tier 2 (Standard) : Subject to cost-based throttling
Tier 3 (Batch) : Lowest priority, processed during low-cost periods
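One simple realization is a set of FIFO queues drained strictly in tier order; the sketch below is illustrative rather than a production scheduler:

```typescript
// Three FIFO queues drained strictly in tier order. Illustrative sketch.
type Tier = 'premium' | 'standard' | 'batch';

class TieredQueue<T> {
  private queues: Record<Tier, T[]> = { premium: [], standard: [], batch: [] };

  enqueue(tier: Tier, item: T): void {
    this.queues[tier].push(item);
  }

  // Premium always drains first; batch only when the other queues are empty.
  dequeue(): T | undefined {
    for (const tier of ['premium', 'standard', 'batch'] as Tier[]) {
      if (this.queues[tier].length > 0) return this.queues[tier].shift();
    }
    return undefined;
  }
}
```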
3. Dynamic Routing
Based on cost pressure, route requests to cheaper models:
High load → GPT-4o-mini ($0.15/$0.60 per 1M tokens)
Normal load → GPT-4o ($5/$15 per 1M tokens)
Critical requests → Claude 3.5 Sonnet ($3/$15 per 1M tokens)
4. Graceful Degradation
When budget thresholds are approached, degrade in steps (a minimal selector is sketched after this list):
Enable request coalescing (batch similar queries)
Reduce context window size
Switch to cached responses
Enable “low-cost mode” (shorter responses, reduced reasoning effort)
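A minimal selector for these steps, keyed off budget utilization (the thresholds are illustrative and should be tuned to your own windows):

```typescript
// Pick a degradation step from how much of the window's budget is already spent.
// Thresholds are illustrative.
type DegradationStep =
  | 'normal'
  | 'coalesce_requests'
  | 'reduce_context'
  | 'serve_cached'
  | 'low_cost_mode';

function degradationStep(budgetUtilization: number): DegradationStep {
  if (budgetUtilization < 0.6) return 'normal';
  if (budgetUtilization < 0.75) return 'coalesce_requests'; // batch similar queries
  if (budgetUtilization < 0.85) return 'reduce_context';    // trim context window
  if (budgetUtilization < 0.95) return 'serve_cached';      // prefer cached responses
  return 'low_cost_mode';                                    // shorter responses, reduced reasoning
}
```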
Create budget windows aligned with business cycles:
| Window | Budget | Action on Breach |
| --- | --- | --- |
| Hourly | $50 | Enable Tier 3 throttling |
| Daily | $500 | Route 50% to mini-models |
| Monthly | $5,000 | Emergency circuit breaker |
Use a centralized rate limiter that intercepts requests before LLM routing:
```typescript
interface RateLimitConfig {
  tier: 'premium' | 'standard' | 'batch';
  maxCostPerWindow: number;           // dollar budget per window for this tier
  window: 'hour' | 'day' | 'month';
}
```
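Per-tier configurations built on that interface might look like this (the dollar figures are placeholders, not recommendations):

```typescript
// Example per-tier budgets using the interface above; values are placeholders.
const tierConfigs: RateLimitConfig[] = [
  { tier: 'premium',  maxCostPerWindow: 200, window: 'hour' },
  { tier: 'standard', maxCostPerWindow: 50,  window: 'hour' },
  { tier: 'batch',    maxCostPerWindow: 500, window: 'day'  }
];
```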
Kong’s AI Rate Limiting Advanced plugin supports LLM provider-specific rate limiting with cost-aware strategies:

```yaml
- name: ai-rate-limiting-advanced
  config:
    llm_providers:
      - name: openai
        limit: [100000, 1000000]    # illustrative token limits per window
        window_size: [3600, 86400]  # hourly and daily windows
```

This configuration enforces per-provider limits while the application layer handles cost-based routing.
Track these metrics in your observability stack:
cost_per_request (p50, p95, p99)
requests_throttled_by_tier
budget_utilization_rate
model_switch_frequency
Set up alerts (a minimal checker is sketched after this list) when:
Daily cost exceeds 80% of budget
Tier 1 requests are throttled (SLA violation risk)
Cost per token increases more than 15% week-over-week
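A minimal check over already-collected metrics could look like the following; the metric names and input shape are assumptions, only the thresholds come from the list above:

```typescript
// Evaluate the alert conditions above from already-collected metrics.
// The metric names and the shape of this input are illustrative.
interface CostMetrics {
  dailySpend: number;
  dailyBudget: number;
  tier1Throttled: number;           // requests_throttled_by_tier for Tier 1
  costPerTokenThisWeek: number;
  costPerTokenLastWeek: number;
}

function evaluateAlerts(m: CostMetrics): string[] {
  const alerts: string[] = [];
  if (m.dailySpend > 0.8 * m.dailyBudget) alerts.push('Daily cost exceeds 80% of budget');
  if (m.tier1Throttled > 0) alerts.push('Tier 1 requests throttled: SLA violation risk');
  if (m.costPerTokenThisWeek > 1.15 * m.costPerTokenLastWeek) {
    alerts.push('Cost per token up more than 15% week-over-week');
  }
  return alerts;
}
```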
```typescript
import { Redis } from 'ioredis';

interface LimitDecision {
  allowed: boolean;
  costEstimate: number;
  alternateModel?: string;
  retryAfter?: number;
}

export class SLARateLimiter {
  // The per-user request cap is illustrative; tune it to your own traffic.
  constructor(private redis: Redis, private maxRequestsPerHour = 1000) {}

  private budgetWindows = {
    hour: { budget: 50, key: 'budget:hour' },
    day: { budget: 500, key: 'budget:day' },
    month: { budget: 5000, key: 'budget:month' }
  };

  // Model pricing per 1M tokens (verified 2025-12-27)
  private modelPricing: Record<string, { input: number; output: number }> = {
    'gpt-4o': { input: 5.0, output: 15.0 },
    'gpt-4o-mini': { input: 0.15, output: 0.60 },
    'claude-3-5-sonnet': { input: 3.0, output: 15.0 },
    'haiku-3.5': { input: 1.25, output: 5.0 }
  };

  async checkLimit(
    userId: string,
    tier: 'premium' | 'standard' | 'batch',
    estimatedInputTokens: number,
    estimatedOutputTokens: number,
    preferredModel: string
  ): Promise<LimitDecision> {
    // Calculate cost estimate
    const pricing = this.modelPricing[preferredModel];
    if (!pricing) throw new Error(`Unknown model: ${preferredModel}`);
    const costEstimate =
      (estimatedInputTokens / 1_000_000) * pricing.input +
      (estimatedOutputTokens / 1_000_000) * pricing.output;

    // Premium tier bypasses cost checks
    if (tier === 'premium') return { allowed: true, costEstimate };

    // Check all budget windows
    for (const [, config] of Object.entries(this.budgetWindows)) {
      const currentSpend = await this.redis.get(config.key);
      const remaining = config.budget - (parseFloat(currentSpend || '0') + costEstimate);
      if (remaining < 0) {
        // Budget exceeded - apply degradation
        if (tier === 'standard') {
          // Standard tier: route to cheaper model
          const altModel = this.findCheaperModel(preferredModel);
          return {
            allowed: true,
            alternateModel: altModel,
            costEstimate: this.calculateCost(estimatedInputTokens, estimatedOutputTokens, altModel)
          };
        }
        // Batch tier: defer until the window has headroom again
        return { allowed: false, retryAfter: 3600, costEstimate };
      }
    }

    // Check per-user rate limit
    const userKey = `user:${userId}:requests`;
    const userCount = await this.redis.incr(userKey);
    if (userCount === 1) await this.redis.expire(userKey, 3600);
    if (userCount > this.maxRequestsPerHour) return { allowed: false, retryAfter: 3600, costEstimate };

    return { allowed: true, costEstimate };
  }

  private findCheaperModel(preferred: string): string {
    const costs = Object.entries(this.modelPricing).map(([model, pricing]) => ({
      model,
      cost: (pricing.input + pricing.output) / 2
    }));
    costs.sort((a, b) => a.cost - b.cost);
    return costs[0].model;
  }

  private calculateCost(input: number, output: number, model: string): number {
    const pricing = this.modelPricing[model];
    return (input / 1_000_000) * pricing.input + (output / 1_000_000) * pricing.output;
  }

  // Feedback loop: record what the request actually cost once the response arrives.
  async recordActualCost(userId: string, model: string, actualInput: number, actualOutput: number) {
    const actualCost = this.calculateCost(actualInput, actualOutput, model);
    for (const config of Object.values(this.budgetWindows)) {
      await this.redis.incrbyfloat(config.key, actualCost);
      const ttl = await this.redis.ttl(config.key);
      if (ttl < 0) {
        const windowSeconds = config.key.includes('hour') ? 3600
          : config.key.includes('day') ? 86400 : 2_592_000; // ~30 days for the monthly window
        await this.redis.expire(config.key, windowSeconds);
      }
    }
    await this.redis.lpush(`cost:log:${userId}`, JSON.stringify({ model, actualCost, timestamp: Date.now() }));
    await this.redis.ltrim(`cost:log:${userId}`, 0, 99);
  }
}
```
```typescript
// Express middleware example
export const slarMiddleware = (limiter: SLARateLimiter) => {
  return async (req: any, res: any, next: any) => {
    // Assumes auth middleware has already populated req.user.
    const decision = await limiter.checkLimit(
      req.user.id,
      req.user.tier,
      req.body.estimatedInputTokens || 1000,
      req.body.estimatedOutputTokens || 500,
      req.body.model || 'gpt-4o'
    );

    if (!decision.allowed) {
      return res.status(429).json({
        error: 'Budget limit exceeded',
        retryAfter: decision.retryAfter,
        message: decision.retryAfter
          ? `Tier ${req.user.tier} budget exhausted. Retry after ${decision.retryAfter}s.`
          : 'Batch processing delayed due to high load.'
      });
    }

    // Attach routing decision to request
    req.llmRouting = {
      model: decision.alternateModel || req.body.model,
      costEstimate: decision.costEstimate
    };
    next();
  };
};
```
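Wiring the middleware into an Express app might look like this; the route path, Redis URL, and response handler are placeholders:

```typescript
// Example wiring; SLARateLimiter and slarMiddleware are defined above.
import express from 'express';
import { Redis } from 'ioredis';

const app = express();
app.use(express.json());

const limiter = new SLARateLimiter(new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379'));

app.post('/v1/completions', slarMiddleware(limiter), (req: any, res: any) => {
  // req.llmRouting.model carries the (possibly downgraded) model choice.
  res.json({ model: req.llmRouting.model, estimatedCost: req.llmRouting.costEstimate });
});

app.listen(3000);
```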
Risk : A flat 1000 req/hour limit ignores that 10 requests at $50 each cost more than 1000 requests at $0.50 each.
Solution : Always weight limits by estimated cost. Use the cost token count strategy in Kong’s AI Rate Limiting Advanced plugin to track actual spend rather than request count.
Risk : Without feedback from LLM responses, cost estimates drift from reality, causing budget breaches.
Solution : Implement a feedback loop that records actual token usage and adjusts future estimates. Kong’s plugin returns X-AI-RateLimit-Query-Cost headers that can be scraped for real-time cost tracking.
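A sketch of that feedback loop, assuming an OpenAI-style usage block in the response body; only the X-AI-RateLimit-Query-Cost header name comes from the text above, the rest is illustrative:

```typescript
// Feed actual usage back into the limiter; SLARateLimiter is defined above.
async function recordResponseCost(
  limiter: SLARateLimiter,
  userId: string,
  model: string,
  response: Response   // fetch API Response
): Promise<void> {
  const body = await response.clone().json();          // assumes an OpenAI-style usage block
  const inputTokens = body?.usage?.prompt_tokens ?? 0;
  const outputTokens = body?.usage?.completion_tokens ?? 0;
  await limiter.recordActualCost(userId, model, inputTokens, outputTokens);

  const gatewayCost = response.headers.get('X-AI-RateLimit-Query-Cost');
  if (gatewayCost) {
    console.log(`Gateway-reported query cost: ${gatewayCost}`); // cross-check our own estimate
  }
}
```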
Risk : Premium users get blocked during cost pressure, violating SLAs and causing churn.
Solution : Implement priority queuing. Premium tiers should bypass cost checks entirely or have dedicated budget pools. Use Kong’s consumer groups to segment rate limits by tier.
Risk : Binary allow/block decisions create poor user experience during budget exhaustion.
Solution : Implement the degradation ladder:
Route to cheaper models
Reduce context windows
Enable response caching
Batch non-urgent requests
Only then: block with clear retry guidance
Risk : Runaway costs during incidents can bankrupt a startup in hours.
Solution : Hard limits with automatic shutdown (a minimal breaker is sketched after this list):
Daily spend greater than 120% of budget : Route all non-premium traffic to mini-models
Monthly spend greater than 150% of budget : Block all non-critical requests, alert on-call
Cost spike greater than 500% in 10 minutes : Emergency circuit breaker, immediate alert
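A minimal breaker over those thresholds; the Redis key names and the ten-minute spike baseline keys are assumptions:

```typescript
import { Redis } from 'ioredis';

// Circuit breaker over the thresholds above.
type BreakerAction = 'none' | 'route_non_premium_to_mini' | 'block_non_critical' | 'emergency_stop';

async function checkCircuitBreaker(redis: Redis, dailyBudget = 500, monthlyBudget = 5000): Promise<BreakerAction> {
  const [daySpend, monthSpend, last10m, prev10m] = (
    await redis.mget('budget:day', 'budget:month', 'spend:10m:current', 'spend:10m:previous')
  ).map(v => parseFloat(v ?? '0'));

  if (prev10m > 0 && last10m > 5 * prev10m) return 'emergency_stop';      // >500% spike in 10 minutes
  if (monthSpend > 1.5 * monthlyBudget) return 'block_non_critical';      // >150% of monthly budget
  if (daySpend > 1.2 * dailyBudget) return 'route_non_premium_to_mini';   // >120% of daily budget
  return 'none';
}
```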
Risk : In multi-node deployments, rate limits drift, causing over-limit usage.
Solution : Use Redis-backed strategies for accurate distributed counting. Kong’s AI Rate Limiting Advanced plugin supports redis and cluster strategies for consistency across nodes (docs.konghq.com).
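If you roll your own counters instead, the read-then-increment pattern in the limiter above can race across nodes; a single Lua script makes the check-and-spend atomic (key names are illustrative):

```typescript
import { Redis } from 'ioredis';

// Atomic "check budget, then add cost" in one round trip, so two nodes
// cannot both pass the check at the same time. Key names are illustrative.
const CHECK_AND_SPEND = `
  local current = tonumber(redis.call('GET', KEYS[1]) or '0')
  local cost = tonumber(ARGV[1])
  local budget = tonumber(ARGV[2])
  if current + cost > budget then return 0 end
  redis.call('INCRBYFLOAT', KEYS[1], cost)
  return 1
`;

async function trySpend(redis: Redis, key: string, cost: number, budget: number): Promise<boolean> {
  const ok = (await redis.eval(CHECK_AND_SPEND, 1, key, cost, budget)) as number;
  return ok === 1;
}
```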
| Stage | Hourly | Daily | Monthly | Action on Breach |
| --- | --- | --- | --- | --- |
| Startup | $10 | $100 | $1,000 | Route to mini-models |
| Growth | $50 | $500 | $5,000 | Tier 3 throttling |
| Scale | $200 | $2,000 | $20,000 | Emergency circuit breaker |
Cost Pressure under 50% of budget → Preferred Model
Cost Pressure 50-80% → Cheapest Model (mini/haiku)
Cost Pressure over 80% → Cache Only + Queue
Premium Tier → Always Preferred Model
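A minimal routing function implementing the matrix above (model names follow the pricing table used earlier; the thresholds mirror the matrix):

```typescript
// Route according to the decision matrix above.
type RoutingDecision = { model: string; source: 'live' | 'cache' };

function routeRequest(
  tier: 'premium' | 'standard' | 'batch',
  preferredModel: string,
  budgetUtilization: number   // 0..1, spend / budget in the active window
): RoutingDecision {
  if (tier === 'premium') return { model: preferredModel, source: 'live' };
  if (budgetUtilization < 0.5) return { model: preferredModel, source: 'live' };
  if (budgetUtilization < 0.8) return { model: 'gpt-4o-mini', source: 'live' };
  return { model: preferredModel, source: 'cache' };   // cache only; queue live calls for later
}
```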
```yaml
# Global cost-based limits
- name: ai-rate-limiting-advanced
  config:
    llm_providers:
      - name: openai
        limit: [50, 500]            # illustrative cost limits per window
        window_size: [3600, 86400]  # hourly and daily windows
      - name: anthropic
        limit: [50, 500]
        window_size: [3600, 86400]
    tokens_count_strategy: cost
```
Budget Utilization Rate : spend / budget per window
Cost per Request : p50, p95, p99 with model breakdown
Throttle Rate by Tier : Should be 0% for Tier 1
Model Switch Frequency : Track optimization effectiveness
Queue Depth : Early warning for capacity issues
SLA-aware rate limiting is a financial safeguard and operational necessity for AI-powered systems. The key principles:
Cost-awareness over request-counting : Weight every request by actual token cost
Tier-based prioritization : Premium users must never be blocked by cost pressure
Dynamic routing : Automatically shift to cheaper models under load
Graceful degradation : Ladder of fallbacks before complete blocking
Distributed consistency : Redis-backed counters for multi-node accuracy
Expected Outcomes:
40-70% cost reduction through intelligent routing
Zero Tier 1 throttling incidents
Predictable budgets with automated circuit breakers
Improved reliability during traffic spikes
Next Steps:
Implement the SLARateLimiter class in your application
Configure Kong AI Rate Limiting Advanced with Redis strategy
Set up monitoring for budget utilization and throttle rates
Test degradation paths with chaos engineering
Review and adjust thresholds monthly based on actual usage