
Reducing Retries & Failed Calls: The Hidden Cost Driver

A single production incident at a mid-sized SaaS company burned through $12,000 in LLM costs in 6 hours. The culprit? A retry storm where a misconfigured API endpoint caused 500,000 failed requests, each consuming tokens before failing. Your retry rate isn’t just a reliability metric—it’s a direct multiplier on your AI budget that can silently destroy your unit economics.

In traditional software, retries are cheap. In LLM systems, every retry is a full-cost API call that may fail, consuming input and output tokens before you see an error. Unlike microservices where a retry costs milliseconds of compute, an LLM retry costs real dollars.

The math is brutal. Consider a system processing 1M requests/day with a 3% retry rate:

  • Successful calls: 970,000 × $0.01 = $9,700
  • Failed retries: 30,000 × $0.01 = $300
  • Total waste: $300/day = $9,000/month

But that’s optimistic. Most retry storms involve multiple retry attempts, context regeneration, and cascading failures. Real-world systems often see 5-10x cost inflation from poor error handling.
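The arithmetic above is simple enough to sketch directly. This snippet reuses the worked example's assumptions (1M requests/day, 3% retry rate, $0.01 average cost per call); the figures are illustrative, not benchmarks.

```typescript
// Illustrative sketch of the retry-waste arithmetic above.
// All inputs are the worked example's assumptions, not measured values.
function dailyRetryWaste(
  dailyRequests: number,
  retryRate: number,   // fraction of requests that fail and retry
  costPerCall: number  // average $ per API call
): number {
  const failedCalls = dailyRequests * retryRate;
  return failedCalls * costPerCall;
}

const perDay = dailyRetryWaste(1_000_000, 0.03, 0.01); // ≈ $300/day
const perMonth = perDay * 30;                          // ≈ $9,000/month
console.log(`Waste: $${perDay.toFixed(0)}/day, $${perMonth.toFixed(0)}/month`);
```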

Failed calls aren’t just wasted tokens—they’re wasted expensive tokens:

Failure Point         Tokens Consumed   Cost per Failed Call (Claude 3.5 Sonnet)
Input validation      100-500           $0.0003 - $0.0015
Context processing    2,000-10,000      $0.006 - $0.030
Partial output        500-5,000         $0.0075 - $0.075
Guardrail violation   100-1,000         $0.0003 - $0.003

Pricing based on Claude 3.5 Sonnet: $3.00 input / $15.00 output per 1M tokens
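The table's cost figures follow directly from those per-token rates. A minimal sketch of the conversion, assuming the $3.00/$15.00 per-1M-token prices quoted above:

```typescript
// Cost of a failed call given tokens consumed before the failure,
// using the Claude 3.5 Sonnet rates quoted above ($3/$15 per 1M tokens).
const INPUT_PER_1M = 3.0;
const OUTPUT_PER_1M = 15.0;

function failedCallCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * INPUT_PER_1M +
         (outputTokens / 1_000_000) * OUTPUT_PER_1M;
}

failedCallCost(500, 0);  // upper end of "Input validation": ≈ $0.0015
failedCallCost(0, 5000); // upper end of "Partial output":   ≈ $0.075
```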

Most LLM SDKs implement automatic retries for transient failures (rate limits, 500 errors). While often necessary, these defaults can silently compound costs.

The financial impact of retries extends beyond simple token waste—it fundamentally changes your cost structure and predictability. When error rates spike, costs scale linearly with failure frequency, not with successful business outcomes.

Consider these common failure patterns:

Scenario A: Input Validation Failures

  • User submits malformed data → API rejects after 500 token processing
  • Cost: $0.0015 per failure (Claude 3.5 Sonnet)
  • Impact: 1,000 daily failures = $45/month wasted

Scenario B: Context Window Overflow

  • System sends 250K tokens to 200K context model → rejected
  • Cost: $0.75 per failure (full context processing before rejection)
  • Impact: 100 daily failures = $2,250/month wasted

Scenario C: Guardrail Violations

  • Content policy violation detected mid-generation → partial output
  • Cost: $0.015 per failure (50% output tokens consumed)
  • Impact: 500 daily failures = $225/month wasted
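Each scenario's monthly figure is just daily failures × cost per failure × 30. A quick sketch reproducing Scenarios A and B:

```typescript
// Monthly waste for a recurring failure pattern (30-day month).
function monthlyWaste(dailyFailures: number, costPerFailure: number): number {
  return dailyFailures * costPerFailure * 30;
}

monthlyWaste(1000, 0.0015); // Scenario A: ≈ $45/month
monthlyWaste(100, 0.75);    // Scenario B: ≈ $2,250/month
```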

Implement validation before API calls to catch the bulk of avoidable failures before any tokens are spent:

// Validation gate that prevents costly API calls
interface ValidationResult {
  isValid: boolean;
  estimatedTokenCount: number;
  rejectionReason?: string;
}

// estimateTokenCount and containsRestrictedContent are app-specific helpers
// (e.g. a tokenizer-based estimate and a content filter)
async function validateRequest(
  input: string,
  context: any[]
): Promise<ValidationResult> {
  // Token estimation (critical for cost control)
  const estimatedTokens = estimateTokenCount(input + JSON.stringify(context));
  if (estimatedTokens > 200_000) {
    return {
      isValid: false,
      estimatedTokenCount: estimatedTokens,
      rejectionReason: "Context exceeds 200K token limit"
    };
  }
  // Content safety check
  if (containsRestrictedContent(input)) {
    return {
      isValid: false,
      estimatedTokenCount: estimatedTokens,
      rejectionReason: "Content violates usage policies"
    };
  }
  return { isValid: true, estimatedTokenCount: estimatedTokens };
}

// Usage
const validation = await validateRequest(userInput, context);
if (!validation.isValid) {
  // Reject immediately - zero token cost
  return { error: validation.rejectionReason };
}

Not all errors deserve retries. Implement error classification:

const RETRYABLE_ERRORS = [
  'rate_limit_exceeded',
  'overloaded',
  'timeout'
];
const NON_RETRYABLE_ERRORS = [
  'invalid_request',
  'content_policy_violation',
  'insufficient_quota',
  'context_length_exceeded'
];

async function intelligentRetry<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error: any) {
      const errorType = error?.error?.type || 'unknown';
      if (NON_RETRYABLE_ERRORS.includes(errorType)) {
        // Fail fast - no retry
        throw error;
      }
      if (!RETRYABLE_ERRORS.includes(errorType)) {
        // Unknown error - log but don't retry
        console.warn(`Non-retryable error: ${errorType}`);
        throw error;
      }
      // Out of attempts: surface the original error instead of sleeping
      if (attempt === maxRetries) {
        throw error;
      }
      // Exponential backoff with jitter
      const delay = Math.min(1000 * Math.pow(2, attempt) + Math.random() * 100, 10000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error('Max retries exceeded');
}
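The backoff schedule above grows exponentially and caps at 10 seconds. Factoring the delay calculation out makes the schedule easy to verify; jitter is shown as a parameter here so the deterministic part can be checked in isolation.

```typescript
// Exponential backoff with a 10s cap; jitter is passed in so the
// deterministic schedule is testable (production would use Math.random()).
function backoffDelay(attempt: number, jitterMs: number = Math.random() * 100): number {
  return Math.min(1000 * Math.pow(2, attempt) + jitterMs, 10_000);
}

backoffDelay(1, 0); // 2000 ms
backoffDelay(2, 0); // 4000 ms
backoffDelay(3, 0); // 8000 ms
backoffDelay(4, 0); // 16000 ms, capped to 10000 ms
```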

Prevent retry storms during outages:

class CircuitBreaker {
  private failures = 0;
  private lastFailureTime: number | null = null;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private readonly threshold = 5;
  private readonly timeout = 60000; // 1 minute

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      const timeSinceFailure = Date.now() - (this.lastFailureTime || 0);
      if (timeSinceFailure < this.timeout) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'half-open';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}

Set hard limits per request to prevent runaway costs:

class TokenBudget {
  private spent = 0;
  constructor(private readonly budget: number) {}

  check(estimated: number): boolean {
    return (this.spent + estimated) <= this.budget;
  }

  spend(actual: number) {
    this.spent += actual;
  }

  get remaining() {
    return this.budget - this.spent;
  }
}

// Usage: create a fresh budget per request so limits don't leak across requests
const budget = new TokenBudget(50000); // 50K tokens max per request

async function processWithBudget(input: string) {
  const estimated = estimateTokenCount(input);
  if (!budget.check(estimated)) {
    throw new Error(`Budget exceeded: ${budget.remaining} tokens remaining`);
  }
  const response = await apiCall(input);
  budget.spend(response.usage.total);
  return response;
}
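The same budget idea extends from per-request to per-user limits. A sketch of per-user tracking (the small `TokenBudget` class is repeated so the snippet stands alone; the 100K daily cap and user ID are arbitrary examples):

```typescript
// Minimal TokenBudget, repeated here so the sketch is self-contained.
class TokenBudget {
  private spent = 0;
  constructor(private readonly budget: number) {}
  check(estimated: number): boolean {
    return (this.spent + estimated) <= this.budget;
  }
  spend(actual: number) {
    this.spent += actual;
  }
  get remaining() {
    return this.budget - this.spent;
  }
}

// One budget per user, created lazily; the cap is illustrative.
const userBudgets = new Map<string, TokenBudget>();

function budgetFor(userId: string, dailyCap: number = 100_000): TokenBudget {
  let b = userBudgets.get(userId);
  if (!b) {
    b = new TokenBudget(dailyCap);
    userBudgets.set(userId, b);
  }
  return b;
}

const b = budgetFor("user-42");
b.spend(60_000);
b.check(50_000); // false: only 40K of the 100K cap remains
```

A real system would also reset these budgets on a schedule (e.g. daily) and persist them outside process memory.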

Here’s a complete production-ready error handling system that integrates all strategies:

import { Anthropic } from '@anthropic-ai/sdk';

// Configuration
const PRICING = {
  'claude-3-5-sonnet': { input: 3.0, output: 15.0 }, // per 1M tokens
  'gpt-4o': { input: 5.0, output: 15.0 },
} as const;

class LLMCostGuard {
  private client: Anthropic;
  private circuitBreaker: CircuitBreaker;
  private tokenBudget: TokenBudget;

  constructor(private config: {
    model: string;
    maxTokensPerRequest: number;
    maxRetries: number;
  }) {
    this.client = new Anthropic();
    this.circuitBreaker = new CircuitBreaker();
    this.tokenBudget = new TokenBudget(config.maxTokensPerRequest);
  }

  async generate(
    prompt: string,
    context: string[] = []
  ): Promise<{ success: boolean; cost: number; error?: string }> {
    // Step 1: Pre-validation
    const validation = await this.validate(prompt, context);
    if (!validation.isValid) {
      return { success: false, cost: 0, error: validation.reason };
    }

    // Step 2: Check budget
    const estimatedTokens = validation.estimatedTokens;
    if (!this.tokenBudget.check(estimatedTokens)) {
      return {
        success: false,
        cost: 0,
        error: `Token budget exceeded: ${this.tokenBudget.remaining} remaining`
      };
    }

    // Step 3: Execute with intelligent retry
    try {
      const result = await this.circuitBreaker.call(async () => {
        return await this.intelligentRetry(async () => {
          const response = await this.client.messages.create({
            model: this.config.model,
            max_tokens: 4096,
            messages: [
              { role: 'user', content: prompt }
            ]
          });
          // Calculate actual cost
          const inputTokens = response.usage.input_tokens;
          const outputTokens = response.usage.output_tokens;
          const pricing = PRICING[this.config.model as keyof typeof PRICING];
          const cost =
            (inputTokens / 1_000_000) * pricing.input +
            (outputTokens / 1_000_000) * pricing.output;
          this.tokenBudget.spend(inputTokens + outputTokens);
          return {
            content: response.content,
            cost: cost,
            usage: response.usage
          };
        }, this.config.maxRetries);
      });
      return { success: true, cost: result.cost };
    } catch (error: any) {
      // Log failed call cost (still burned tokens)
      const failedCost = this.calculateFailedCallCost(error);
      console.error(`Failed call cost: $${failedCost.toFixed(4)}`, {
        error: error.message,
        model: this.config.model
      });
      return {
        success: false,
        cost: failedCost,
        error: error.message
      };
    }
  }

  private async validate(prompt: string, context: string[]) {
    // Rough token estimation (~4 characters per token)
    const estimatedTokens = Math.ceil(prompt.length / 4) +
      context.reduce((sum, c) => sum + Math.ceil(c.length / 4), 0);
    if (estimatedTokens > 200_000) {
      return {
        isValid: false,
        estimatedTokens,
        reason: `Context exceeds 200K token limit (estimated: ${estimatedTokens})`
      };
    }
    // Basic content safety
    if (this.containsMaliciousPattern(prompt)) {
      return {
        isValid: false,
        estimatedTokens,
        reason: "Content violates safety policies"
      };
    }
    return { isValid: true, estimatedTokens, reason: "" };
  }

  private containsMaliciousPattern(input: string): boolean {
    // JavaScript regexes use the /i flag, not an inline (?i) modifier
    const patterns = [
      /jailbreak|system prompt override/i,
      /ignore previous instructions/i,
      /base64|encode.*decode/i
    ];
    return patterns.some(p => p.test(input));
  }

  private async intelligentRetry<T>(
    operation: () => Promise<T>,
    maxRetries: number
  ): Promise<T> {
    const retryableErrors = [
      'rate_limit_exceeded',
      'overloaded_error',
      'api_error',
      'timeout'
    ];
    const nonRetryableErrors = [
      'invalid_request_error',
      'content_policy_violation',
      'insufficient_quota',
      'context_length_exceeded',
      'authentication_error',
      'permission_error'
    ];
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error: any) {
        const errorType = error?.error?.type || 'unknown';
        // Fail fast on non-retryable errors
        if (nonRetryableErrors.includes(errorType)) {
          throw error;
        }
        // Log but don't retry unknown errors
        if (!retryableErrors.includes(errorType)) {
          console.warn(`Unknown error type: ${errorType}`);
          throw error;
        }
        // Out of attempts: surface the original error instead of sleeping
        if (attempt === maxRetries) {
          throw error;
        }
        // Exponential backoff with jitter
        const baseDelay = Math.pow(2, attempt) * 1000;
        const jitter = Math.random() * 1000;
        const delay = Math.min(baseDelay + jitter, 10000);
        console.log(`Retry ${attempt}/${maxRetries} after ${delay}ms delay`);
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
    throw new Error(`Max retries (${maxRetries}) exceeded`);
  }

  private calculateFailedCallCost(error: any): number {
    // Failed calls still consume tokens before failing;
    // estimate based on typical failure patterns
    const pricing = PRICING[this.config.model as keyof typeof PRICING];
    const errorType = error?.error?.type || 'unknown';
    let estimatedInputTokens: number;
    let estimatedOutputTokens: number;
    switch (errorType) {
      case 'context_length_exceeded':
        estimatedInputTokens = 150000; // Full context processed
        estimatedOutputTokens = 0;
        break;
      case 'content_policy_violation':
        estimatedInputTokens = 500; // Initial check
        estimatedOutputTokens = 200; // Partial generation
        break;
      case 'rate_limit_exceeded':
        estimatedInputTokens = 100; // Minimal processing
        estimatedOutputTokens = 0;
        break;
      default:
        estimatedInputTokens = 1000; // Typical partial processing
        estimatedOutputTokens = 500;
    }
    return (
      (estimatedInputTokens / 1_000_000) * pricing.input +
      (estimatedOutputTokens / 1_000_000) * pricing.output
    );
  }
}

// Usage example
const guard = new LLMCostGuard({
  model: 'claude-3-5-sonnet',
  maxTokensPerRequest: 50000,
  maxRetries: 3
});

const result = await guard.generate(
  "Analyze this customer feedback and provide sentiment analysis",
  ["Previous conversation context..."]
);

if (result.success) {
  console.log(`Success! Cost: $${result.cost.toFixed(4)}`);
} else {
  console.error(`Failed: ${result.error} | Cost burned: $${result.cost.toFixed(4)}`);
}

The Mistake: Retrying every error including validation failures and content policy violations.
The Cost: A 5% validation error rate with 3 retries = 15% wasted spend.
The Fix: Classify errors and only retry transient failures.

The Mistake: Allowing unlimited context per request.
The Cost: A single malformed request with 500K tokens can cost $1.50+ before failing.
The Fix: Implement hard token limits with pre-validation.

The Mistake: Continuous retries during provider outages.
The Cost: 1000 requests × 3 retries × $0.01 = $30 burned during a 5-minute outage.
The Fix: Implement circuit breakers that fail fast during systemic issues.

The Mistake: Treating all failures as zero-cost events.
The Cost: Guardrail violations often consume 50-80% of expected output tokens.
The Fix: Track and attribute costs for partial failures.

The Mistake: Treating all API costs as “successful call” expenses.
The Cost: Budget overruns without understanding root causes.
The Fix: Tag costs by error type for accurate unit economics.

Error Type                 Retry?   Typical Cost        Prevention Strategy
invalid_request_error      No       $0.0003 - $0.0015   Pre-validation
content_policy_violation   No       $0.0003 - $0.003    Content filtering
context_length_exceeded    No       $0.006 - $0.75      Token estimation
rate_limit_exceeded        Yes      $0.0003             Queue + backoff
overloaded_error           Yes      $0.0003             Circuit breaker
api_error                  Yes      $0.001 - $0.01      Retry with jitter

A production cost-control checklist:

  • Pre-flight validation on 100% of requests
  • Token estimation with 200K hard limit
  • Error classification for retry logic
  • Circuit breaker for systemic failures
  • Token budgets per request/user
  • Cost logging for all failures
  • Alerting on retry rate greater than 3%
  • Monthly audit of error costs

Model               Input/1M   Output/1M   Context
Claude 3.5 Sonnet   $3.00      $15.00      200K
Claude 3.5 Haiku    $1.25      $5.00       200K
GPT-4o              $5.00      $15.00      128K
GPT-4o Mini         $0.15      $0.60       128K

Source: Anthropic Pricing, OpenAI Pricing

Use this formula to calculate your retry waste:

Monthly Waste = (Daily Requests × Retry Rate × Avg Retries) × Cost per Failed Call × 30
Where:
- Daily Requests: Your system's daily volume
- Retry Rate: Percentage of requests that fail (e.g., 0.05 for 5%)
- Avg Retries: Average retry attempts per failure (e.g., 2.5)
- Cost per Failed Call: $0.001 - $0.01 depending on failure type

Example Calculation:

  • 500,000 daily requests
  • 3% retry rate (15,000 failures)
  • 2.5 average retries
  • $0.005 average cost per failed call

Monthly Waste: 15,000 × 2.5 × $0.005 × 30 = $5,625
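The worked example translates directly into code; all inputs below are the example's assumptions, not measured values.

```typescript
// Monthly retry waste per the formula above; inputs are the worked
// example's assumptions (500K daily requests, 3% retry rate, etc.).
function monthlyRetryWaste(
  dailyRequests: number,
  retryRate: number,        // e.g. 0.03 for 3%
  avgRetries: number,       // average retry attempts per failure
  costPerFailedCall: number // $ per failed call
): number {
  const dailyFailures = dailyRequests * retryRate;
  return dailyFailures * avgRetries * costPerFailedCall * 30;
}

monthlyRetryWaste(500_000, 0.03, 2.5, 0.005); // ≈ $5,625
```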

The strategies combine into a single decision flow per request:

Request Received
[Token Estimation > Limit?] → YES → Reject (Cost: $0)
↓ NO
[Content Safety Check] → FAIL → Reject (Cost: $0.0003)
↓ PASS
[API Call] → SUCCESS → Process (Cost: calculated)
↓ FAIL
[Error Type?] → Non-Retryable → Log & Alert
↓ Retryable
[Exponential Backoff] → Max Retries Exceeded → Circuit Breaker Opens
