
Rate Limiting Without Losing Revenue: SLA-Aware Cost Control


A fintech startup processing 1M+ daily requests faced a nightmare: their AI-powered fraud detection system spiked from $12K to $89K in a single week. The culprit? Unthrottled retries during a traffic surge. This guide teaches you to prevent that fate using SLA-aware rate limiting that protects both your budget and your service commitments.

Rate limiting has evolved from simple request counters to sophisticated cost-control mechanisms. With modern LLMs costing $3-15 per million tokens, a poorly configured system can burn through budgets in hours. According to Anthropic’s pricing, Claude 3.5 Sonnet charges $3.00 per million input tokens and $15.00 per million output tokens (Anthropic Pricing), while OpenAI’s GPT-4o costs $5.00/$15.00 per million tokens (OpenAI Pricing). At scale, these costs compound rapidly.
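
A quick back-of-the-envelope calculation makes the scale concrete. The traffic profile below is purely illustrative (1M requests/day, roughly 1,000 input and 500 output tokens each) and uses the GPT-4o prices cited above:

// Hypothetical traffic profile -- adjust the numbers to your own workload.
const requestsPerDay = 1_000_000;
const inputTokensPerRequest = 1_000;
const outputTokensPerRequest = 500;

// GPT-4o list pricing per 1M tokens, as cited above: $5 input / $15 output.
const inputCost = (requestsPerDay * inputTokensPerRequest / 1_000_000) * 5.0;    // $5,000
const outputCost = (requestsPerDay * outputTokensPerRequest / 1_000_000) * 15.0; // $7,500

console.log(`Estimated daily spend: $${(inputCost + outputCost).toLocaleString()}`); // ~$12,500/day

A week of that traffic is roughly $87K, in the same ballpark as the bill from the opening anecdote.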

The business impact is severe:

  • Budget overruns: Uncontrolled requests can 5-10x your monthly bill
  • SLA violations: Throttling without prioritization leads to dropped premium-tier requests
  • Revenue loss: Degraded user experience directly impacts retention
  • Operational risk: System instability during traffic spikes

Traditional rate limiters (token bucket, fixed window) treat all traffic equally. Modern systems need adaptive throttling that considers:

  • Request cost (token volume)
  • User tier (SLA commitments)
  • Model pricing (dynamic routing)
  • Time-of-day patterns
  • Queue depth

Adaptive throttling adjusts request processing rates based on real-time system state and business rules. Unlike static limiters, it dynamically responds to cost pressure, queue depth, and SLA requirements.

1. Cost-Aware Token Bucket

Traditional token buckets refill at fixed rates. Cost-aware variants weight tokens by actual model cost:

graph TD
A[Request Ingress] --> B{Cost Estimator}
B -->|High Cost| C[Priority Queue]
B -->|Low Cost| D[Standard Queue]
C --> E{Token Bucket<br/>Weighted by $}
D --> E
E --> F[LLM Provider]
F --> G[Response + Token Metadata]
G --> H[Update Cost Model]
H --> B

Key Mechanism:

  • Each request consumes bucket tokens proportional to its estimated cost (input and output token estimates × the model’s per-token prices)
  • Buckets refill based on budget window (e.g., $100/hour)
  • High-cost requests trigger priority queuing
  • Real-time metadata from LLM responses refines cost estimates
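
Here is a minimal TypeScript sketch of a cost-weighted bucket. The class name, refill policy, and dollar-denominated capacity are illustrative assumptions, not an existing library API:

// Cost-weighted token bucket: capacity and refill are measured in dollars,
// so an expensive request drains the bucket faster than a cheap one.
class CostTokenBucket {
  private available: number;            // dollars currently available
  private lastRefill = Date.now();

  constructor(
    private budgetPerWindow: number,    // e.g. 100 for a $100/hour budget
    private windowMs: number            // e.g. 3_600_000 for one hour
  ) {
    this.available = budgetPerWindow;
  }

  // Refill proportionally to elapsed time, capped at the window budget.
  private refill(): void {
    const now = Date.now();
    const elapsed = now - this.lastRefill;
    this.available = Math.min(
      this.budgetPerWindow,
      this.available + (elapsed / this.windowMs) * this.budgetPerWindow
    );
    this.lastRefill = now;
  }

  // Returns true if the estimated cost fits the remaining budget.
  tryConsume(estimatedCostUsd: number): boolean {
    this.refill();
    if (estimatedCostUsd > this.available) return false;
    this.available -= estimatedCostUsd;
    return true;
  }
}

A bucket created with new CostTokenBucket(100, 3_600_000) enforces roughly $100/hour regardless of how many individual requests that budget buys.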

2. Priority Queuing

Requests are tagged with SLA tiers:

  • Tier 1 (Premium): Guaranteed processing, bypasses throttling
  • Tier 2 (Standard): Subject to cost-based throttling
  • Tier 3 (Batch): Lowest priority, processed during low-cost periods
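
A sketch of tier-tagged queuing in TypeScript (assuming a single in-process dispatcher; a production system would more likely use a durable queue, but the dequeue order is the same idea):

type Tier = 'premium' | 'standard' | 'batch';

interface QueuedRequest {
  tier: Tier;
  estimatedCostUsd: number;
  run: () => Promise<void>;
}

// Lower number = higher priority; premium work is always drained first.
const TIER_PRIORITY: Record<Tier, number> = { premium: 0, standard: 1, batch: 2 };

class TierQueue {
  private queues: Record<Tier, QueuedRequest[]> = { premium: [], standard: [], batch: [] };

  enqueue(request: QueuedRequest): void {
    this.queues[request.tier].push(request);
  }

  // Pop the next request in tier order; batch work only runs when the
  // premium and standard queues are empty.
  next(): QueuedRequest | undefined {
    const tiers = (Object.keys(TIER_PRIORITY) as Tier[]).sort(
      (a, b) => TIER_PRIORITY[a] - TIER_PRIORITY[b]
    );
    for (const tier of tiers) {
      const request = this.queues[tier].shift();
      if (request) return request;
    }
    return undefined;
  }
}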

3. Dynamic Routing

Based on cost pressure, route requests to cheaper models:

  • High load → GPT-4o-mini ($0.15/$0.60 per 1M tokens)
  • Normal load → GPT-4o ($5/$15 per 1M tokens)
  • Critical requests → Claude 3.5 Sonnet ($3/$15 per 1M tokens)

4. Graceful Degradation

When budget thresholds are approached:

  • Enable request coalescing (batch similar queries)
  • Reduce context window size
  • Switch to cached responses
  • Enable “low-cost mode” (shorter responses, reduced reasoning effort)

Step 1: Define Budget Windows

Create budget windows aligned with business cycles:

Window    Budget   Action on Breach
Hourly    $50      Enable Tier 3 throttling
Daily     $500     Route 50% to mini-models
Monthly   $5,000   Emergency circuit breaker

Step 2: Implement Adaptive Throttling Middleware


Use a centralized rate limiter that intercepts requests before LLM routing:

interface RateLimitConfig {
  tier: 'premium' | 'standard' | 'batch';
  model: string;
  maxCostPerWindow: number;
  window: 'hour' | 'day' | 'month';
}

interface CostMetadata {
  inputTokens: number;
  outputTokens: number;
  actualCost: number;
  timestamp: number;
}

Kong’s AI Rate Limiting Advanced plugin supports LLM provider-specific rate limiting with cost-aware strategies:

plugins:
  - name: ai-rate-limiting-advanced
    config:
      llm_providers:
        - name: openai
          limit: [100, 1000]
          window_size: [60, 3600]
        - name: anthropic
          limit: [50, 500]
          window_size: [60, 3600]
      strategy: redis
      sync_rate: 0.5

This configuration enforces per-provider limits while the application layer handles cost-based routing.

Track these metrics in your observability stack:

  • cost_per_request (p50, p95, p99)
  • requests_throttled_by_tier
  • budget_utilization_rate
  • model_switch_frequency
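
If you run Prometheus, a minimal prom-client registration for these series might look like this (the metric names are illustrative, not a standard):

import { Counter, Gauge, Histogram } from 'prom-client';

// Cost per request, labeled by model and tier, so p50/p95/p99 can be read per segment.
export const costPerRequest = new Histogram({
  name: 'llm_cost_per_request_usd',
  help: 'Estimated or actual USD cost of a single LLM request',
  labelNames: ['model', 'tier'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]
});

// Requests rejected or delayed by the rate limiter, broken down by tier.
export const requestsThrottled = new Counter({
  name: 'llm_requests_throttled_total',
  help: 'Requests throttled by the SLA-aware rate limiter',
  labelNames: ['tier']
});

// Spend divided by budget for each active window (hour, day, month).
export const budgetUtilization = new Gauge({
  name: 'llm_budget_utilization_ratio',
  help: 'Spend / budget for the current window',
  labelNames: ['window']
});

// Requests rerouted to a cheaper model under cost pressure.
export const modelSwitches = new Counter({
  name: 'llm_model_switch_total',
  help: 'Requests rerouted to a cheaper model',
  labelNames: ['from_model', 'to_model']
});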

Set up alerts when:

  • Daily cost exceeds 80% of budget
  • Tier 1 requests are throttled (SLA violation risk)
  • Cost-per-token increases by more than 15% week-over-week

import { Redis } from 'ioredis';

class SLARateLimiter {
  private redis: Redis;

  private budgetWindows = {
    hour: { budget: 50, key: 'budget:hour' },
    day: { budget: 500, key: 'budget:day' },
    month: { budget: 5000, key: 'budget:month' }
  };

  // Model pricing per 1M tokens (verified 2025-12-27)
  private modelPricing: Record<string, { input: number; output: number }> = {
    'gpt-4o': { input: 5.0, output: 15.0 },
    'gpt-4o-mini': { input: 0.15, output: 0.60 },
    'claude-3-5-sonnet': { input: 3.0, output: 15.0 },
    'haiku-3.5': { input: 1.25, output: 5.0 }
  };

  constructor(redis: Redis) {
    this.redis = redis;
  }

  async checkLimit(
    userId: string,
    tier: 'premium' | 'standard' | 'batch',
    estimatedInputTokens: number,
    estimatedOutputTokens: number,
    preferredModel: string
  ): Promise<{
    allowed: boolean;
    retryAfter?: number;
    alternateModel?: string;
    costEstimate: number;
  }> {
    // Calculate cost estimate
    const pricing = this.modelPricing[preferredModel];
    if (!pricing) throw new Error(`Unknown model: ${preferredModel}`);
    const costEstimate =
      (estimatedInputTokens / 1_000_000) * pricing.input +
      (estimatedOutputTokens / 1_000_000) * pricing.output;

    // Premium tier bypasses cost checks
    if (tier === 'premium') {
      return { allowed: true, costEstimate };
    }

    // Check all budget windows
    for (const config of Object.values(this.budgetWindows)) {
      const currentSpend = await this.redis.get(config.key);
      const remaining = config.budget - (parseFloat(currentSpend || '0') + costEstimate);
      if (remaining < 0) {
        // Budget exceeded - apply degradation
        if (tier === 'batch') {
          return { allowed: false, retryAfter: 3600, costEstimate };
        }
        // Standard tier: route to cheaper model
        const altModel = this.findCheaperModel(preferredModel);
        return {
          allowed: true,
          alternateModel: altModel,
          costEstimate: this.calculateCost(estimatedInputTokens, estimatedOutputTokens, altModel)
        };
      }
    }

    // Check per-user rate limit
    const userKey = `user:${userId}:requests`;
    const userCount = await this.redis.incr(userKey);
    if (userCount === 1) await this.redis.expire(userKey, 3600);
    if (userCount > 100) {
      return { allowed: false, retryAfter: 3600, costEstimate };
    }

    return { allowed: true, costEstimate };
  }

  // Returns the cheapest model by average per-token price as the fallback route.
  private findCheaperModel(preferred: string): string {
    const costs = Object.entries(this.modelPricing).map(([model, pricing]) => ({
      model,
      cost: (pricing.input + pricing.output) / 2
    }));
    costs.sort((a, b) => a.cost - b.cost);
    return costs[0].model;
  }

  private calculateCost(input: number, output: number, model: string): number {
    const pricing = this.modelPricing[model];
    return (input / 1_000_000) * pricing.input + (output / 1_000_000) * pricing.output;
  }

  async recordActualCost(
    userId: string,
    model: string,
    actualInput: number,
    actualOutput: number
  ): Promise<void> {
    const actualCost = this.calculateCost(actualInput, actualOutput, model);

    // Update budget windows
    for (const config of Object.values(this.budgetWindows)) {
      await this.redis.incrbyfloat(config.key, actualCost);
      const ttl = await this.redis.ttl(config.key);
      if (ttl === -1) {
        const windowSeconds = config.key.includes('hour') ? 3600
          : config.key.includes('day') ? 86400
          : 2592000;
        await this.redis.expire(config.key, windowSeconds);
      }
    }

    // Log for analytics
    await this.redis.lpush(
      `cost:log:${userId}`,
      JSON.stringify({ model, actualCost, timestamp: Date.now() })
    );
    await this.redis.ltrim(`cost:log:${userId}`, 0, 99);
  }
}

// Express middleware example
export const slarMiddleware = (limiter: SLARateLimiter) => {
  return async (req: any, res: any, next: any) => {
    try {
      const decision = await limiter.checkLimit(
        req.user.id,
        req.user.tier,
        req.body.estimatedInputTokens || 1000,
        req.body.estimatedOutputTokens || 500,
        req.body.model || 'gpt-4o'
      );

      if (!decision.allowed) {
        res.status(429).json({
          error: 'Budget limit exceeded',
          retryAfter: decision.retryAfter,
          message: decision.retryAfter
            ? `Tier ${req.user.tier} budget exhausted. Retry after ${decision.retryAfter}s.`
            : 'Batch processing delayed due to high load.'
        });
        return;
      }

      // Attach routing decision to request
      req.routingDecision = {
        model: decision.alternateModel || req.body.model || 'gpt-4o',
        costEstimate: decision.costEstimate,
        tier: req.user.tier
      };
      next();
    } catch (error) {
      next(error);
    }
  };
};
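
Wiring this into an Express app might look like the following sketch. The route path, port, and auth assumption (req.user being populated upstream) are placeholders; feed actual token usage back after the provider responds so the budget counters stay accurate:

import express from 'express';
import { Redis } from 'ioredis';

const app = express();
app.use(express.json());

// Assumes REDIS_URL points at the shared Redis used for distributed counting.
const limiter = new SLARateLimiter(new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379'));

app.post('/v1/completions', slarMiddleware(limiter), async (req: any, res: any) => {
  const { model, costEstimate } = req.routingDecision;
  // ...call the chosen LLM provider here, then record what it actually cost:
  // await limiter.recordActualCost(req.user.id, model, usage.inputTokens, usage.outputTokens);
  res.json({ model, costEstimate });
});

app.listen(3000);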

Risk: A flat 1000 req/hour limit ignores cost variance: 10 requests at $50 each cost as much as 1,000 requests at $0.50 each.

Solution: Always weight limits by estimated cost. Use the cost token count strategy in Kong’s AI Rate Limiting Advanced plugin to track actual spend rather than request count.

Risk: Without feedback from LLM responses, cost estimates drift from reality, causing budget breaches.

Solution: Implement a feedback loop that records actual token usage and adjusts future estimates. Kong’s plugin returns X-AI-RateLimit-Query-Cost headers that can be scraped for real-time cost tracking.

Risk: Premium users get blocked during cost pressure, violating SLAs and causing churn.

Solution: Implement priority queuing. Premium tiers should bypass cost checks entirely or have dedicated budget pools. Use Kong’s consumer groups to segment rate limits by tier.

Risk: Binary allow/block decisions create poor user experience during budget exhaustion.

Solution: Implement the degradation ladder:

  1. Route to cheaper models
  2. Reduce context windows
  3. Enable response caching
  4. Batch non-urgent requests
  5. Only then: block with clear retry guidance
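
A sketch of that ladder as a single decision function (the utilization thresholds are illustrative assumptions; tune them to your own budgets):

type DegradationAction =
  | { kind: 'route_cheaper_model' }
  | { kind: 'reduce_context'; maxContextTokens: number }
  | { kind: 'serve_cached' }
  | { kind: 'defer_to_batch' }
  | { kind: 'reject'; retryAfterSeconds: number };

// Walks the ladder from least to most disruptive based on budget utilization (0..1+).
function selectDegradation(
  budgetUtilization: number,
  cacheHitLikely: boolean
): DegradationAction | null {
  if (budgetUtilization < 0.8) return null;                              // no degradation needed
  if (budgetUtilization < 0.9) return { kind: 'route_cheaper_model' };
  if (budgetUtilization < 1.0) return { kind: 'reduce_context', maxContextTokens: 4_000 };
  if (cacheHitLikely) return { kind: 'serve_cached' };
  if (budgetUtilization < 1.1) return { kind: 'defer_to_batch' };
  return { kind: 'reject', retryAfterSeconds: 3600 };                    // last resort, with retry guidance
}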

Risk: Runaway costs during incidents can bankrupt a startup in hours.

Solution: Hard limits with automatic shutdown:

  • Daily spend exceeds 120% of budget: Route all non-premium traffic to mini-models
  • Monthly spend exceeds 150% of budget: Block all non-critical requests, alert on-call
  • Cost spike of more than 500% within 10 minutes: Emergency circuit breaker, immediate alert
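
Those hard stops can be encoded as a small state function (a sketch: the spend inputs would come from the same Redis counters used above, and the alerting hook is left out):

type BreakerState = 'normal' | 'degrade_to_mini' | 'block_non_critical' | 'tripped';

function evaluateBreaker(
  dailySpend: number, dailyBudget: number,
  monthlySpend: number, monthlyBudget: number,
  spendLast10Min: number, baseline10Min: number
): BreakerState {
  // Cost spike above 500% of the recent 10-minute baseline: trip immediately.
  if (baseline10Min > 0 && spendLast10Min > 5 * baseline10Min) return 'tripped';
  // Monthly spend above 150% of budget: block all non-critical requests.
  if (monthlySpend > 1.5 * monthlyBudget) return 'block_non_critical';
  // Daily spend above 120% of budget: route non-premium traffic to mini-models.
  if (dailySpend > 1.2 * dailyBudget) return 'degrade_to_mini';
  return 'normal';
}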

Risk: In multi-node deployments, rate limits drift, causing over-limit usage.

Solution: Use Redis-backed strategies for accurate distributed counting. Kong’s AI Rate Limiting Advanced plugin supports redis and cluster strategies for consistency across nodes docs.konghq.com.

Stage     Hourly   Daily    Monthly   Action on Breach
Startup   $10      $100     $1,000    Route to mini-models
Growth    $50      $500     $5,000    Tier 3 throttling
Scale     $200     $2,000   $20,000   Emergency circuit breaker

Routing rules by cost pressure (spend / budget in the active window):

  • Cost pressure < 50% → preferred model
  • Cost pressure 50-80% → cheapest model (mini/haiku)
  • Cost pressure > 80% → cache only + queue
  • Premium tier → always preferred model
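
The routing rules above can be encoded directly. This sketch reuses the model names from the pricing table earlier; the function itself is an assumption, not part of any library:

// costPressure = spend / budget for the active window.
function chooseRoute(
  tier: 'premium' | 'standard' | 'batch',
  costPressure: number,
  preferredModel: string,
  cheapModel = 'gpt-4o-mini'
): { model: string; cacheOnly: boolean } {
  // Premium tier always gets the preferred model.
  if (tier === 'premium') return { model: preferredModel, cacheOnly: false };
  if (costPressure < 0.5) return { model: preferredModel, cacheOnly: false };
  if (costPressure <= 0.8) return { model: cheapModel, cacheOnly: false };
  // Above 80%: serve from cache where possible and queue the rest.
  return { model: cheapModel, cacheOnly: true };
}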

Kong AI Rate Limiting Advanced Configuration

# Global cost-based limits
plugins:
  - name: ai-rate-limiting-advanced
    config:
      llm_providers:
        - name: openai
          limit: [1000, 10000]
          window_size: [3600, 86400]
        - name: anthropic
          limit: [500, 5000]
          window_size: [3600, 86400]
      strategy: redis
      sync_rate: 0.5
      tokens_count_strategy: cost
Key dashboard metrics:

  • Budget Utilization Rate: spend / budget per window
  • Cost per Request: p50, p95, p99 with model breakdown
  • Throttle Rate by Tier: Should be 0% for Tier 1
  • Model Switch Frequency: Track optimization effectiveness
  • Queue Depth: Early warning for capacity issues


SLA-aware rate limiting is a financial safeguard and operational necessity for AI-powered systems. The key principles:

  1. Cost-awareness over request-counting: Weight every request by actual token cost
  2. Tier-based prioritization: Premium users must never be blocked by cost pressure
  3. Dynamic routing: Automatically shift to cheaper models under load
  4. Graceful degradation: Ladder of fallbacks before complete blocking
  5. Distributed consistency: Redis-backed counters for multi-node accuracy

Expected Outcomes:

  • 40-70% cost reduction through intelligent routing
  • Zero Tier 1 throttling incidents
  • Predictable budgets with automated circuit breakers
  • Improved reliability during traffic spikes

Next Steps:

  1. Implement the SLARateLimiter class in your application
  2. Configure Kong AI Rate Limiting Advanced with Redis strategy
  3. Set up monitoring for budget utilization and throttle rates
  4. Test degradation paths with chaos engineering
  5. Review and adjust thresholds monthly based on actual usage