
Prompt Caching & Token Reuse: 50% Discounts on Repetitive Calls

A single engineering team burned through $12,000 in API costs during a hackathon—reprocessing the same 2,000-token system prompt 3,000 times. Their mistake? They didn’t enable prompt caching. With caching, that same workload would have cost $6,000. At scale, this difference determines whether your AI feature is profitable or bankrupt.

Prompt caching isn’t a minor optimization—it’s a fundamental cost-control mechanism for production LLM applications. When your system prompt, RAG context, or conversation history exceeds the minimum token threshold (1,024+ tokens for most providers), every identical prefix token gets discounted. For applications with 10,000+ daily requests, this translates to $10,000-$50,000 monthly savings depending on model choice and reuse patterns.

The impact extends beyond cost. Cached prompts process 2-5x faster because the provider reuses the precomputed state for the cached prefix instead of running the full prefill again. This latency reduction directly improves user experience and throughput capacity. However, the economics only work if you understand the mechanics: cache keys, expiration policies, and token counting rules differ across providers.

Consider a typical RAG application. Your system prompt might be 1,500 tokens, and retrieved documents add another 3,000 tokens. If you process 1,000 requests/hour with the same base context, you’re burning 4.5M input tokens/hour uncached. With caching, that entire 4,500-token prefix is billed at the discounted rate on every request after the first. At GPT-4o pricing ($5/M uncached, $2.50/M cached), that’s roughly $11.25/hour vs $22.50/hour—a 50% savings that compounds across your entire infrastructure.
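To sanity-check those numbers, here is the arithmetic as a few lines of Python (a quick sketch; the request volume and prices are the figures quoted above):

requests_per_hour = 1_000
prefix_tokens = 1_500 + 3_000              # system prompt + shared retrieved documents
uncached_price, cached_price = 5.00, 2.50  # $ per 1M input tokens (GPT-4o)

millions_per_hour = requests_per_hour * prefix_tokens / 1e6          # 4.5M tokens/hour
print(f"uncached: ${millions_per_hour * uncached_price:.2f}/hour")   # $22.50
print(f"cached:   ${millions_per_hour * cached_price:.2f}/hour")     # $11.25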

Prompt caching operates on prefix matching. When you send a request, the provider checks if the first N tokens match a cached entry. If they do, those tokens are billed at a discount and processed faster. If any character differs within the cacheable prefix, you get a full cache miss.
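You can see what counts as "the same prefix" by tokenizing two requests and counting how many leading tokens they share. A small sketch using the tiktoken library (an assumption: your provider's tokenizer may differ slightly, but the prefix logic is the same):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-family encoding

# Stand-in for a long, static system prompt.
static = "You are a helpful assistant for the Acme support team. " * 40

def shared_prefix(a: str, b: str) -> int:
    """Count how many leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(enc.encode(a), enc.encode(b)):
        if x != y:
            break
        n += 1
    return n

# Static prefix first: the two requests share almost everything.
print(shared_prefix(static + "Question: reset my password", static + "Question: cancel my plan"))

# Timestamp first: the requests diverge after a handful of tokens, so nothing later counts.
print(shared_prefix("2025-01-01T00:00:01Z " + static, "2025-01-01T00:00:02Z " + static))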

Each provider implements caching differently. Understanding these distinctions is critical for multi-provider architectures.

OpenAI uses automatic caching for prompts longer than 1,024 tokens with identical prefixes. The discount is 50% on cached input tokens. Caches expire after 5-10 minutes of inactivity, with a hard limit of one hour. The cached token count appears in response.usage.prompt_tokens_details.cached_tokens.

Google (Gemini API) offers two modes. Implicit caching is enabled by default on Gemini 2.5 models with minimums of 1,024 tokens (Flash) or 4,096 tokens (Pro). Explicit caching lets you create a cache object with a TTL (time-to-live) of up to several hours. The discount is 90% on cached tokens for Gemini 2.5 models, significantly higher than OpenAI’s 50%.

Azure OpenAI mirrors OpenAI’s behavior but adds a prompt_cache_key parameter for custom cache management. Caches persist for 5-10 minutes of inactivity, with guaranteed cleanup within one hour.

The 1,024 token minimum is a hard floor. Prompts shorter than this don’t qualify for caching. More importantly, the cache key is the exact byte sequence of the first 1,024+ tokens. A single space, comma, or newline difference invalidates the cache.

Let’s calculate real-world impact. Assume:

  • System prompt: 2,000 tokens
  • Per-request context: 500 tokens
  • Total per request: 2,500 input tokens
  • Volume: 50,000 requests/day
  • Model: GPT-4o ($5.00/M uncached, $2.50/M cached)

Without caching: 50,000 × 2,500 = 125M tokens/day = $625/day

With caching (assuming the 2,000-token prefix hits the cache on every request after the first):

  • Unique tokens: 2,000 (first request) + (500 × 50,000) = 25.002M tokens
  • Cached tokens: (2,000 × 50,000) - 2,000 = 99.998M tokens
  • Cost: (25.002M × $5.00/M) + (99.998M × $2.50/M) = $125.01 + $250.00 ≈ $375/day

Savings: $250/day (40%). For a high-traffic application, this scales to $7,500/month.
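The same calculation as a reusable helper, so you can plug in your own traffic profile (a sketch; the prices passed in below are the GPT-4o figures used above):

def daily_cache_costs(static_tokens, dynamic_tokens, requests_per_day,
                      uncached_per_m, cached_per_m, hit_rate=1.0):
    """Return (uncached_cost, cached_cost) in dollars per day for input tokens.

    hit_rate is the fraction of requests whose static prefix is served from cache.
    """
    total = requests_per_day * (static_tokens + dynamic_tokens)
    cached = requests_per_day * static_tokens * hit_rate
    uncached_cost = total / 1e6 * uncached_per_m
    cached_cost = (total - cached) / 1e6 * uncached_per_m + cached / 1e6 * cached_per_m
    return uncached_cost, cached_cost

baseline, optimized = daily_cache_costs(2_000, 500, 50_000, 5.00, 2.50)
print(f"${baseline:.0f}/day without caching vs ${optimized:.0f}/day with caching")  # $625 vs $375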

  1. Identify cacheable content: Separate static context (system prompts, RAG documents, conversation history) from dynamic content (user queries, timestamps, random seeds). Static content should exceed 1,024 tokens.

  2. Structure prompts for prefix matching: Place all cacheable content at the beginning of your prompt. User messages and dynamic data go at the end. This ensures the cache key remains stable.

  3. Implement cache monitoring: Track cached_tokens in every response. Log hit rates and calculate actual savings. Without measurement, you can’t optimize.

  4. Set up cache warming: For critical workflows, send a “warmup” request after idle periods to repopulate the cache before real traffic arrives (a minimal sketch follows the code example below).

  5. Handle cache expiration: Design for cache misses. Your system should perform acceptably even when cold, though slower and more expensive.

import openai
from datetime import datetime

client = openai.OpenAI()

# ❌ WRONG - Cache miss on every request
def get_response_wrong(query):
    # The timestamp changes every call, so the prefix is never byte-identical
    system_prompt = f"Current time: {datetime.now()}. You are a helpful assistant."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return response

# ✅ CORRECT - Cacheable prefix
def get_response_correct(query):
    # Static content at the beginning (in production, at least 1,024 tokens to qualify)
    system_prompt = "You are a helpful assistant."
    # Dynamic content at the end
    user_message = f"Current time: {datetime.now()}. Question: {query}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    # Monitor cache performance
    cached_tokens = response.usage.prompt_tokens_details.cached_tokens
    total_tokens = response.usage.prompt_tokens
    hit_rate = cached_tokens / total_tokens if total_tokens > 0 else 0
    print(f"Cache hit rate: {hit_rate:.1%} ({cached_tokens}/{total_tokens} tokens)")
    return response
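For step 4, cache warming can be as simple as a background thread that re-sends the static prefix before the provider's inactivity window closes. A minimal sketch (the 4-minute interval is an assumption tuned to the 5-10 minute TTL described above):

import threading
import time

def keep_cache_warm(client, system_prompt, interval_seconds=240):
    """Periodically send a tiny request so the long static prefix stays cached."""
    def _warm():
        while True:
            try:
                client.chat.completions.create(
                    model="gpt-4o",
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": "ping"},  # cheap dynamic suffix
                    ],
                    max_tokens=1,
                )
            except Exception as exc:
                print(f"Cache warmup failed: {exc}")  # warming is best-effort
            time.sleep(interval_seconds)

    threading.Thread(target=_warm, daemon=True).start()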

Even with caching enabled, subtle implementation mistakes can eliminate your cost savings. Here are the most frequent failure modes that turn cache hits into expensive misses.

Pitfall: Including timestamps, request IDs, or random seeds in your system prompt.

# ❌ WRONG - Cache miss on every request
system_prompt = f"Current time: {datetime.now()}. You are a helpful assistant."
# The timestamp changes every second, breaking cache continuity
# ✅ CORRECT - Cacheable prefix
system_prompt = "You are a helpful assistant."
# Dynamic content goes at the end
user_message = f"Current time: {datetime.now()}. Question: {query}"

Impact: 100% cache miss rate, zero savings.

Pitfall: Expecting caching benefits for prompts under 1,024 tokens.

| Provider | Minimum Tokens | Discount |
| --- | --- | --- |
| OpenAI/Azure | 1,024 | 50% |
| Google Gemini 2.5 Flash | 1,024 | 90% |
| Google Gemini 2.5 Pro | 4,096 | 90% |
| Vertex AI (all models) | 2,048 | 90% |

Impact: No caching applied, full price for all tokens.

Pitfall: Changing image detail parameters or tool order between requests.

# ❌ WRONG - Different detail parameters between requests
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": img_url, "detail": "high"}},
        {"type": "text", "text": "Describe this"},
    ]}
]

# ✅ CORRECT - Consistent parameters
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": img_url, "detail": "low"}},  # Always "low"
        {"type": "text", "text": "Describe this"},
    ]}
]

Impact: Different tokenization = cache miss.

Pitfall: Assuming caches persist indefinitely.

Provider TTL Policies:

  • OpenAI/Azure: 5-10 minutes of inactivity, max 1 hour
  • Google (explicit): Configurable TTL (1 minute to several hours)
  • Google (implicit): Automatic, no guarantee

Mitigation: Implement cache warming or graceful degradation.

Pitfall: Deploying caching without tracking cached_tokens in responses.

# Essential monitoring
response = client.chat.completions.create(...)
cached = response.usage.prompt_tokens_details.cached_tokens
hit_rate = cached / response.usage.prompt_tokens
if hit_rate < 0.5:
    logger.warning(f"Low cache hit rate: {hit_rate:.1%}")

Pitfall: Not using prompt_cache_key for shared prefixes across different workflows.

Solution: Use consistent keys for related requests to improve hit rates.
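A sketch of what that looks like with the OpenAI Python SDK. The prompt_cache_key parameter is the one described for Azure OpenAI above; passing it through extra_body is an assumption for SDK versions that don't expose it as a named argument:

from openai import OpenAI

client = OpenAI()

def ask(query, workflow, system_prompt):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},  # shared static prefix
            {"role": "user", "content": query},
        ],
        # One stable key per workflow keeps related requests routed to the same cache.
        extra_body={"prompt_cache_key": f"support-bot-{workflow}"},
    )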

| Provider | Model | Min Tokens | Discount | Max TTL | Cache Key |
| --- | --- | --- | --- | --- | --- |
| OpenAI | GPT-4o | 1,024 | 50% | 1 hour | Prefix hash |
| OpenAI | GPT-4o mini | 1,024 | 50% | 1 hour | Prefix hash |
| OpenAI | o1-preview | 1,024 | 50% | 1 hour | Prefix hash |
| Azure OpenAI | GPT-4o+ | 1,024 | 50% | 1 hour | Prefix + prompt_cache_key |
| Google (Gemini API) | 2.5 Flash | 1,024 | 90% | Configurable | Cache object |
| Google (Gemini API) | 2.5 Pro | 4,096 | 90% | Configurable | Cache object |
| Google (Vertex AI) | All models | 2,048 | 90% | Configurable | Cache object |

Pricing (per 1M tokens):

| Model | Uncached Input | Cached Input | Output | Savings |
| --- | --- | --- | --- | --- |
| GPT-4o | $5.00 | $2.50 | $15.00 | 50% |
| GPT-4o mini | $0.15 | $0.075 | $0.60 | 50% |
| o1-preview | $15.00 | $7.50 | $60.00 | 50% |
| o1-mini | $3.00 | $1.50 | $12.00 | 50% |

OpenAI/Azure:

{
  "usage": {
    "prompt_tokens": 2500,
    "prompt_tokens_details": {
      "cached_tokens": 1472
    }
  }
}

Google (Vertex AI):

{
  "usageMetadata": {
    "promptTokenCount": 2500,
    "cachedContentTokenCount": 1472
  }
}
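Because the field names differ, a small helper keeps monitoring code provider-agnostic. A sketch that works on the raw usage payloads shown above (parsed as dicts):

def cached_token_stats(payload: dict) -> tuple[int, int]:
    """Return (cached_tokens, total_prompt_tokens) from either provider's usage payload."""
    if "usage" in payload:  # OpenAI / Azure shape
        usage = payload["usage"]
        cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        total = usage.get("prompt_tokens", 0)
    else:  # Google Gemini / Vertex AI shape
        usage = payload.get("usageMetadata", {})
        cached = usage.get("cachedContentTokenCount", 0)
        total = usage.get("promptTokenCount", 0)
    return cached, total

cached, total = cached_token_stats(
    {"usage": {"prompt_tokens": 2500, "prompt_tokens_details": {"cached_tokens": 1472}}}
)
print(f"Cache hit rate: {cached / total:.1%}")  # 58.9%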

Use Implicit Caching (automatic):

  • High-volume, repetitive prompts
  • No need for guaranteed cache persistence
  • Cost optimization is priority

Use Explicit Caching (manual):

  • Need guaranteed discount
  • Long-running batch jobs
  • Precise cache lifetime control
  • Multi-hour TTL required
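For explicit caching, the flow is: create a cache object holding the static content with a TTL, then reference that cache on each generate call. A sketch along the lines of the google-genai Python SDK; treat the exact class and field names as assumptions and verify them against your SDK version:

from google import genai
from google.genai import types

client = genai.Client()

# Static context to cache; must exceed the model's minimum (1,024 tokens for 2.5 Flash).
long_context = open("contract.txt").read()  # placeholder document

# 1. Create the cache with an explicit lifetime.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a contracts analyst.",
        contents=[long_context],
        ttl="3600s",  # one hour
    ),
)

# 2. Reference the cache on every request; only the new tokens are billed at full price.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What does the contract say about termination?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata.cached_content_token_count)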


Prompt caching transforms fixed infrastructure costs into variable savings that scale with reuse. The key insight: caching is a structural optimization, not a configuration toggle. Your prompt architecture determines success.

Core Takeaways:

  • 50-90% discounts on repetitive input tokens
  • 2-5x latency reduction for cached prompts
  • 1,024 token minimum is non-negotiable
  • Prefix matching demands exact byte-for-byte identity
  • 5-10 minute TTL requires cache warming strategies

Implementation Priority:

  1. Structure prompts with static content first (≥1,024 tokens)
  2. Monitor cached_tokens in every response
  3. Design for cache misses (graceful degradation)
  4. Calculate actual savings vs. implementation effort
  5. Scale caching strategy with request volume

When to Invest:

  • High-volume (>10K requests/day): Immediate ROI
  • Long contexts (RAG, code analysis): Massive savings
  • Chat applications: Conversation history caching
  • Multi-tenant systems: Per-tenant cache isolation

When to Skip:

  • Low volume (< 1K requests/day): Overhead exceeds savings
  • Short prompts (< 1,024 tokens): No caching available
  • Highly dynamic prompts with no stable prefix: Cache hits too rare to matter

OpenAI & Azure OpenAI

Google Gemini & Vertex AI

SDKs & Libraries

  • openai Python package (v1.45+) - Automatic prompt caching support
  • @google-cloud/vertexai - Explicit cache creation and management
  • openai Node.js (v4.50+) - TypeScript support for cached_tokens access

Monitoring & Debugging

  • Track cached_tokens in every response to measure actual savings
  • Use provider-specific usage metadata fields:
    • OpenAI: response.usage.prompt_tokens_details.cached_tokens
    • Google: response.usage_metadata.cached_content_token_count

Pricing Reference (per 1M tokens)

| Model | Uncached Input | Cached Input | Output | Discount |
| --- | --- | --- | --- | --- |
| GPT-4o | $5.00 | $2.50 | $15.00 | 50% |
| GPT-4o mini | $0.15 | $0.075 | $0.60 | 50% |
| o1-preview | $15.00 | $7.50 | $60.00 | 50% |
| o1-mini | $3.00 | $1.50 | $12.00 | 50% |

Token Thresholds by Provider

  • OpenAI/Azure: 1,024 tokens minimum
  • Google Gemini 2.5 Flash: 1,024 tokens minimum
  • Google Gemini 2.5 Pro: 4,096 tokens minimum
  • Google Vertex AI: 2,048 tokens minimum

Best Practices

  • Structure prompts with static content first (≥1,024 tokens)
  • Place dynamic content at the end of messages
  • Use consistent formatting and parameters across requests
  • Implement cache warming for critical workflows
  • Monitor hit rates and calculate actual ROI

Common Pitfalls to Avoid

  • ❌ Including timestamps or IDs in system prompts
  • ❌ Changing single characters in the first 1,024 tokens
  • ❌ Using sub-threshold prompts (< 1,024 tokens)
  • ❌ Inconsistent image/tool definitions
  • ❌ Assuming indefinite cache persistence

Provider Status

  • OpenAI: Feature generally available since Oct 2024
  • Azure OpenAI: Available for GPT-4o+ models
  • Google: Implicit caching enabled by default on Gemini 2.5
  • Vertex AI: Explicit caching with configurable TTL

Version Requirements

  • OpenAI API: v1.45.0+
  • Azure OpenAI: 2024-08-01 preview or later
  • Google Gemini API: Latest SDK recommended