A Series A startup recently discovered their RAG pipeline was costing $18,000 per month—roughly 90% of that spend was redundant context re-sent for the same 200 recurring user queries. After implementing a three-layer caching strategy, they reduced costs to $2,100 while improving response times by 65%. This guide shows you how to architect caching at multiple levels to achieve similar results.
LLM inference costs scale linearly with token volume, so at scale even small inefficiencies compound into massive bills. The provider pricing covered throughout this guide shows the stakes:
Consider a typical RAG application: 10,000 daily queries, average 5,000 tokens per request (context + response). Without caching:
Daily cost with GPT-4o: $1,000 input + $750 output = $1,750/day
Monthly cost: $52,500
With an 80% cache hit rate across all three layers:
Daily cost: $350 input + $150 output = $500/day
Monthly cost: $15,000
Savings: $37,500/month (71% reduction)
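For reference, here is the arithmetic behind those figures, taking the daily input/output costs quoted above as given (a minimal sketch; the 80% blended hit rate is the assumption stated in this example):

baseline_daily = 1_000 + 750   # USD/day: input + output, no caching
cached_daily = 350 + 150       # USD/day with ~80% hit rate across all three layers

monthly_baseline = baseline_daily * 30                 # $52,500
monthly_cached = cached_daily * 30                     # $15,000
monthly_savings = monthly_baseline - monthly_cached    # $37,500
reduction = monthly_savings / monthly_baseline         # ~0.71, i.e. a 71% reduction
print(monthly_baseline, monthly_cached, monthly_savings, round(reduction, 2))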
The math is undeniable: caching isn’t optional at scale—it’s survival.
Prompt caching stores entire conversation contexts or system prompts that remain constant across requests. This is the highest-value layer for most applications.
What gets cached:
System instructions (persona, rules, format requirements)
Few-shot examples (demonstration pairs)
RAG context that’s identical across queries
Long, static background documents
How it works:
Modern LLM APIs support prompt caching by checking if the prefix of your prompt matches previously processed prompts. If you send a 10,000-token system prompt followed by a 100-token user query, the API only charges for the 100 tokens on cache hits after the first request.
Verified pricing impact:
Anthropic’s prompt caching offers:
Write cost: $3.75/1M tokens (for cache writes)
Read cost: $0.30/1M tokens (for cache reads)
TTL: 5 minutes default, configurable up to 1 hour
Compare this to the normal input price of $3.00/1M tokens for Claude 3.5 Sonnet: the 25% write premium is recovered by the first cache hit, and every subsequent read costs a tenth of the normal rate.
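Here is a minimal sketch of what this looks like with Anthropic's cache_control marker; LONG_SYSTEM_PROMPT is a placeholder for a static prefix of at least 1,024 tokens, and the model name and query are illustrative:

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,               # static instructions + few-shot examples
            "cache_control": {"type": "ephemeral"},   # mark this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "What is your refund policy?"}],
)
# usage reports how many tokens were written to and read from the cache
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)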
Embedding caching stores the vector representations of frequently queried text, eliminating redundant embedding API calls and vector database searches.
What gets cached:
User query embeddings
Retrieved document embeddings
Similarity search results
RAG context assemblies
The cost structure:
Embedding models are cheaper than LLMs but still add up:
OpenAI text-embedding-3-small: $0.02/1M tokens
OpenAI text-embedding-3-large: $0.13/1M tokens
Cohere embed-english-v3: $0.10/1M tokens
For 10,000 daily queries averaging 500 tokens each:
Without cache: 5M tokens/day × $0.02 = $100/day
With 70% hit rate: $30/day
Annual savings: $25,550
But the real win is latency. Vector searches add 50-200ms per query. Cache hits reduce this to less than 5ms.
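A minimal sketch of the idea, assuming a local Redis instance and OpenAI's text-embedding-3-small (the fuller multi-layer implementation later in this guide caches retrieved document IDs rather than raw vectors):

import hashlib, json
import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()

def get_embedding(text: str, ttl: int = 86400) -> list[float]:
    key = "emb:v1:" + hashlib.sha256(text.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)          # ~1-5 ms, no API call
    vector = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    r.setex(key, ttl, json.dumps(vector))  # expire after 24 hours
    return vector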
Response caching stores complete LLM outputs for identical or semantically similar queries. This is the most aggressive caching strategy but offers massive savings for deterministic tasks.
What gets cached:
Exact query matches
Semantic similarity hits (>95% similarity; see the sketch at the end of this subsection)
Structured data generation (JSON, SQL, etc.)
Code generation for common patterns
Implementation considerations:
Response caching requires careful invalidation logic. Stale responses can cause:
Incorrect recommendations
Security vulnerabilities
Compliance issues
When to use it:
Customer support FAQs
Documentation generation
Data transformation tasks
Code review comments
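Exact-match response caching appears in the implementation below; for the semantic-similarity variant, one common approach is to embed each incoming query and compare it against embeddings of previously answered queries. A minimal in-memory sketch, assuming you already have query embeddings from the embedding layer:

import numpy as np

SIMILARITY_THRESHOLD = 0.95
_cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached response)

def lookup_semantic(query_embedding: np.ndarray) -> str | None:
    for cached_vec, cached_response in _cache:
        cos = float(np.dot(query_embedding, cached_vec) /
                    (np.linalg.norm(query_embedding) * np.linalg.norm(cached_vec)))
        if cos >= SIMILARITY_THRESHOLD:
            return cached_response          # close enough: reuse the earlier answer
    return None                             # miss: call the LLM, then store the new pair

def store_semantic(query_embedding: np.ndarray, response: str) -> None:
    _cache.append((query_embedding, response))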
Instrument your application to track query patterns, repetition rates, and token usage across all three layers. Without data, you’re guessing.
Implement prompt caching first: it's the lowest-hanging fruit, and most modern APIs support it natively with simple flags.
Add an embedding cache using Redis or your vector DB's native caching. Cache query-document pairs for 24-48 hours.
Deploy a response cache only after measuring query similarity. Start with exact matches, then expand to semantic similarity.
Set up cache invalidation based on your data freshness requirements. Use TTLs and event-driven invalidation.
Monitor hit rates and costs continuously. Aim for a >70% hit rate on the prompt cache and >50% on the embedding cache.
import hashlib
import json

import redis
from typing import Optional
from dataclasses import dataclass
from openai import OpenAI


@dataclass
class CacheConfig:
    """Configuration for multi-level caching"""
    prompt_ttl: int = 300          # 5 minutes
    embedding_ttl: int = 86400     # 24 hours
    response_ttl: int = 3600       # 1 hour
    similarity_threshold: float = 0.95


class MultiLevelCache:
    """Three-layer cache: prompt, embedding, response (assumes a Redis client with decode_responses=True)."""

    def __init__(self, redis_client: redis.Redis, openai_client: OpenAI):
        self.redis = redis_client
        self.openai = openai_client
        self.config = CacheConfig()

    def _hash_prompt(self, prompt: str) -> str:
        """Create deterministic hash for prompt prefix"""
        return hashlib.sha256(prompt.encode()).hexdigest()

    async def get_prompt_cache(self, system_prompt: str, user_query: str) -> Optional[str]:
        """Layer 1: caches system prompt + query prefix for 5 minutes"""
        # For prompt caching, we cache the full context assembly
        cache_key = f"prompt:v1:{self._hash_prompt(system_prompt + user_query[:100])}"
        cached = self.redis.get(cache_key)
        if cached:
            return cached

        # Cache miss - generate and store
        response = self.openai.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_query},
            ],
        )
        result = response.choices[0].message.content
        self.redis.setex(cache_key, self.config.prompt_ttl, result)
        return result

    async def get_embedding_cache(self, text: str, doc_ids: list) -> Optional[list]:
        """Layer 2: caches query embeddings + retrieved document IDs for 24 hours"""
        # Hash the text to create cache key
        text_hash = hashlib.sha256(text.encode()).hexdigest()
        cache_key = f"embedding:v1:{text_hash}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)

        # Cache miss - generate embedding
        embedding = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        # In production, you'd query your vector DB here
        # For demo, we simulate retrieval
        retrieved_docs = self._vector_search(embedding.data[0].embedding, doc_ids)

        # Cache the retrieved doc IDs (not full text)
        self.redis.setex(
            cache_key,
            self.config.embedding_ttl,
            json.dumps(retrieved_docs),
        )
        return retrieved_docs

    async def get_response_cache(self, query: str) -> Optional[str]:
        """Layer 3: caches exact query matches for 1 hour"""
        cache_key = f"response:v1:{self._hash_prompt(query)}"
        cached = self.redis.get(cache_key)
        return cached

    async def set_response_cache(self, query: str, response: str):
        """Store response in cache"""
        cache_key = f"response:v1:{self._hash_prompt(query)}"
        self.redis.setex(cache_key, self.config.response_ttl, response)

    def _vector_search(self, embedding: list, doc_ids: list) -> list:
        """Simulated vector search - replace with actual DB query"""
        # In production: query Pinecone, Weaviate, Qdrant, etc.
        return doc_ids[:5]  # Return top 5 matches


async def process_query(system_prompt: str, user_query: str, doc_ids: list):
    cache = MultiLevelCache(
        redis_client=redis.Redis(host='localhost', port=6379, decode_responses=True),
        openai_client=OpenAI(),
    )

    # Check response cache first (cheapest)
    cached_response = await cache.get_response_cache(user_query)
    if cached_response:
        return {"source": "response_cache", "data": cached_response}

    # Generate (or reuse) the completion; the prompt layer caches the assembled context
    prompt_result = await cache.get_prompt_cache(system_prompt, user_query)

    # Check embedding cache during RAG
    retrieved_docs = await cache.get_embedding_cache(user_query, doc_ids)

    await cache.set_response_cache(user_query, prompt_result)
    return {
        "source": "generated",
        "response": prompt_result,
        "retrieved_docs": retrieved_docs,
    }
import crypto from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

interface CacheConfig {
  promptTTL: number;      // seconds
  embeddingTTL: number;   // seconds
  responseTTL: number;    // seconds
  similarityThreshold: number;
}

interface CacheResult<T> {
  hit: boolean;
  data?: T;
  latencyMs: number;
}

class MultiLevelCache {
  private redis: Redis;
  private openai: OpenAI;
  private config: CacheConfig;

  constructor(redisUrl: string, openaiKey: string) {
    this.redis = new Redis(redisUrl);
    this.openai = new OpenAI({ apiKey: openaiKey });
    this.config = {
      promptTTL: 300,       // 5 minutes
      embeddingTTL: 86400,  // 24 hours
      responseTTL: 3600,    // 1 hour
      similarityThreshold: 0.95
    };
  }

  private hashPrompt(text: string): string {
    return crypto.createHash('sha256').update(text).digest('hex');
  }

  async getCachedResponse(query: string): Promise<CacheResult<string>> {
    const start = Date.now();
    const cacheKey = `response:v1:${this.hashPrompt(query)}`;
    const cached = await this.redis.get(cacheKey);
    if (cached) {
      return { hit: true, data: cached, latencyMs: Date.now() - start };
    }
    return { hit: false, latencyMs: Date.now() - start };
  }

  async cacheResponse(query: string, response: string): Promise<void> {
    const cacheKey = `response:v1:${this.hashPrompt(query)}`;
    await this.redis.setex(cacheKey, this.config.responseTTL, response);
  }

  async getPromptCompletion(
    systemPrompt: string,
    userQuery: string
  ): Promise<CacheResult<string>> {
    const start = Date.now();
    const cacheKey = `prompt:v1:${this.hashPrompt(systemPrompt + userQuery.slice(0, 100))}`;
    const cached = await this.redis.get(cacheKey);
    if (cached) {
      return { hit: true, data: cached, latencyMs: Date.now() - start };
    }
    const response = await this.openai.chat.completions.create({
      model: "gpt-4o",  // illustrative model choice
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: userQuery }
      ]
    });
    const result = response.choices[0].message.content || "";
    await this.redis.setex(cacheKey, this.config.promptTTL, result);
    return { hit: false, data: result, latencyMs: Date.now() - start };
  }

  async getEmbeddedDocs(query: string, docIds: string[]): Promise<string[]> {
    const hash = this.hashPrompt(query);
    const cacheKey = `embedding:v1:${hash}`;
    const cached = await this.redis.get(cacheKey);
    if (cached) {
      return JSON.parse(cached);
    }
    const embedding = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: query
    });
    // Simulate vector search; in production `embedding` would query your vector DB
    const retrieved = docIds.slice(0, 5);
    await this.redis.setex(cacheKey, this.config.embeddingTTL, JSON.stringify(retrieved));
    return retrieved;
  }
}
Cache Invalidation Failures
The most expensive caching mistake is serving stale data. A financial services company cached compliance responses for 24 hours and served outdated regulatory guidance, resulting in a $2M compliance fine. Their cache hit rate was 94%—but their risk exposure was 100%.
Problem: Changing a single character in the first 1,024 tokens invalidates the entire prompt cache.
Real-world example:
# Day 1: Cache miss (first request)
system_prompt = "You are a helpful assistant. Today's date is 2025-12-27."
# Cost: $0.05 for 2,000 tokens
# Day 2: Cache miss (date changed)
system_prompt = "You are a helpful assistant. Today's date is 2025-12-28."
# Cost: $0.05 again—cache invalidated by 1 character
# Solution: Keep static content truly static
system_prompt = "You are a helpful assistant." # Cached
user_context = "Current date: 2025-12-28" # Fresh
Impact: Azure OpenAI documentation confirms that “a single character difference in the first 1,024 tokens will result in a cache miss” (learn.microsoft.com).
Problem: Prefixes below the provider's minimum token count never get cached, and prefixes that barely clear it may not be re-read often enough to pay back the write premium.
The math (using the Anthropic pricing above):
Cache write: 25% premium on the first request ($3.75 vs $3.00 per 1M tokens)
Cache read: $0.30/1M tokens (vs $3.00 normal)
Break-even: a single cache hit within the TTL already outweighs the write premium; every further read is ~90% cheaper
Rule: Only cache when:
Prefix ≥ 1,024 tokens (OpenAI/Azure)
Prefix ≥ 1,024 tokens (Anthropic)
Prefix ≥ 2,048 tokens (Google Gemini 2.5 Pro)
You expect ≥ 5 requests within the cache lifetime
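A minimal guard that encodes the rule above (the chars-per-four token estimate is a rough placeholder for a real tokenizer such as tiktoken, and expected_reads is whatever your traffic analysis predicts):

MIN_CACHEABLE_TOKENS = {"openai": 1024, "azure": 1024, "anthropic": 1024, "gemini-2.5-pro": 2048}

def should_cache_prefix(prefix: str, provider: str, expected_reads: int) -> bool:
    """Return True only when the prefix clears the provider minimum and will be re-read enough."""
    approx_tokens = len(prefix) // 4      # crude estimate; swap in a real tokenizer
    return approx_tokens >= MIN_CACHEABLE_TOKENS[provider] and expected_reads >= 5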
Problem: Default cache lifetimes are short (5-10 minutes), so the cached prefix can expire between turns of a multi-turn conversation and later requests silently revert to full price.
Verified TTLs:
OpenAI: 5-10 minutes (standard), 60 minutes (GPT-5.1 Pro extended)
Anthropic: 5 minutes default, configurable to 1 hour
Google: 15 minutes (configurable), 3-5 minutes average for implicit
Azure OpenAI: 5-10 minutes, cleared after 1 hour of inactivity
Mitigation: For long sessions, implement keep-alive requests every 4 minutes or use extended caching tiers.
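One way to implement that keep-alive, sketched with the OpenAI async client (the 240-second interval, model name, and one-token "ping" are illustrative assumptions, and the loop should be cancelled when the session ends):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def keep_cache_warm(system_prompt: str, interval_s: int = 240):
    """Re-send the stable prefix every ~4 minutes so the provider-side prompt cache never lapses."""
    while True:
        await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},  # same stable prefix as the real traffic
                {"role": "user", "content": "ping"},           # minimal suffix
            ],
            max_tokens=1,  # keep the refresh request as cheap as possible
        )
        await asyncio.sleep(interval_s)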
Problem: Caching embeddings without size limits leads to Redis memory exhaustion.
Real failure: A startup cached every user query embedding indefinitely. Their Redis grew to 800GB in 3 weeks, costing $12,000/month in infrastructure before they hit OOM errors.
Solution:
cache.setex(key, ttl, value)    # Always set a TTL
maxmemory-policy allkeys-lru    # redis.conf: evict least-recently-used keys under memory pressure
Problem: Caching non-deterministic outputs or user-specific data.
Never cache:
Personalized recommendations
Real-time data (stock prices, weather)
Security-sensitive responses
Responses with random sampling (temperature > 0)
Safe to cache:
Static documentation
Code generation for common patterns
Data transformations (JSON to XML)
FAQ responses
Problem: Teams implement caching but don’t measure hit rates, so they don’t know if it’s working.
Minimum instrumentation:
# Track per-layer metrics
cache_metrics = {
    "prompt_cache_hits": 0,
    "prompt_cache_misses": 0,
    "embedding_cache_hits": 0,
    "embedding_cache_misses": 0,
    "response_cache_hits": 0,
    "response_cache_misses": 0,
}
Target metrics:
Prompt cache: >70% hit rate
Embedding cache: >50% hit rate
Response cache: >40% hit rate (if used)
| Use Case | Prompt Cache | Embedding Cache | Response Cache | Expected Savings |
| --- | --- | --- | --- | --- |
| Customer Support Bot | High | Medium | High | 60-80% |
| RAG with Static Docs | High | High | Low | 70-85% |
| Code Assistant | High | Low | Medium | 50-70% |
| Data Analysis | Low | Medium | Low | 30-50% |
| Chatbot (Multi-turn) | High | Low | Low | 40-60% |
OpenAI / Azure OpenAI:
Automatic caching (no code changes)
Minimum: 1,024 tokens
TTL: 5-10 min (standard), 60 min (Pro tier)
Cost: up to 90% discount on cached input tokens (model-dependent)
Key: Structure prompts with the stable prefix first
Anthropic (Claude):
Requires cache_control markers
Minimum: 1,024 tokens
TTL: 5 min default, 1 hour configurable
Cost: 1.25x write, 0.1x read
Key: Use cache_control on large text blocks
Google Gemini:
Implicit caching (2.5 Pro/Flash) or explicit cachedContent
Minimum: 1,024-2,048 tokens depending on model
TTL: 3-5 min average, 15 min configurable
Cost: 0.25x for cached tokens
Key: Push variations to the end of the prompt
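To confirm that OpenAI/Azure automatic caching is actually firing, inspect the usage details returned with each response; LONG_STATIC_PREFIX is a placeholder for a stable prefix of at least 1,024 tokens:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_STATIC_PREFIX},   # stable prefix first
        {"role": "user", "content": "Summarize today's open tickets."},
    ],
)
details = response.usage.prompt_tokens_details
print("cached prompt tokens:", details.cached_tokens if details else 0)  # > 0 means the prefix hit the cache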
✅ Time-based (TTL)
Short-lived data: 5-15 minutes
Medium-lived: 1-24 hours
Long-lived: 1 week+
✅ Event-driven
Database update → invalidate related cache keys
Document change → invalidate embedding cache
Policy update → invalidate response cache
✅ Pattern-based
Use cache key versioning (e.g., prompt:v1:<hash>)
Bump version on breaking changes
Old keys auto-expire via TTL
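A minimal sketch of the versioned-key and event-driven patterns above, reusing the key layout from earlier in this guide (the on_document_updated hook and the scan-based lookup are illustrative and would not scale to very large keyspaces):

import redis

r = redis.Redis()
CACHE_VERSION = "v2"   # bump on breaking prompt or schema changes; old v1 keys expire via their TTLs

def response_key(query_hash: str) -> str:
    return f"response:{CACHE_VERSION}:{query_hash}"

def on_document_updated(doc_id: str):
    """Event-driven invalidation: when a source document changes, drop embedding-cache entries that reference it."""
    for key in r.scan_iter(match=f"embedding:{CACHE_VERSION}:*"):
        cached_ids = r.get(key)
        if cached_ids and doc_id.encode() in cached_ids:
            r.delete(key)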
Total Savings = (Token Volume × Hit Rate × Discount %) − Cache Overhead
- Token Volume = total tokens processed
- Hit Rate = % of requests served from cache
- Discount % = provider cache discount (e.g., 0.9 for Anthropic cache reads)
- Cache Overhead = write premium on first requests (e.g., 25% extra on Anthropic cache writes)
Example:
10M tokens/day
70% hit rate
90% discount
Savings: 10M × 0.7 × 0.9 = 6.3M tokens/day
At $3/1M: $18.90/day → $567/month saved
# Multi-Level Cache ROI Calculator
def calculate_cache_savings(
    daily_queries: int,
    avg_tokens_per_query: int,
    hit_rate_prompt: float,
    hit_rate_embedding: float,
    hit_rate_response: float,
    model_input_cost: float,        # per 1M tokens
    model_output_cost: float,       # per 1M tokens
    embedding_cost: float = 0.02,   # per 1M tokens
) -> dict:
    """Calculate monthly savings from three-layer caching.
    All costs in USD per 1M tokens."""
    monthly_queries = daily_queries * 30
    input_tokens_monthly = monthly_queries * avg_tokens_per_query
    output_tokens_monthly = monthly_queries * (avg_tokens_per_query * 0.1)  # Assume 10% output
    # Base costs (no caching)
    base_input_cost = (input_tokens_monthly / 1_000_000) * model_input_cost
    base_output_cost = (output_tokens_monthly / 1_000_000) * model_output_cost
    base_total = base_input_cost + base_output_cost
    # Prompt cache: ~90% savings on the cached share of input tokens
    prompt_savings = base_input_cost * hit_rate_prompt * 0.9
    # Embedding cache: avoided embedding API calls
    embedding_base = (input_tokens_monthly / 1_000_000) * embedding_cost
    embedding_savings = embedding_base * hit_rate_embedding
    # Response cache: hits skip the LLM call entirely (input + output)
    response_savings = base_total * hit_rate_response
    # Simplified model: layers treated as independent (no double-counting adjustment)
    total_savings = prompt_savings + embedding_savings + response_savings
    return {
        "base_monthly_cost": round(base_total, 2),
        "prompt_cache_savings": round(prompt_savings, 2),
        "embedding_cache_savings": round(embedding_savings, 2),
        "response_cache_savings": round(response_savings, 2),
        "total_monthly_savings": round(total_savings, 2),
    }
# Cache Performance Monitor
class CacheMonitor:
    def __init__(self):
        self.hits = {"prompt": 0, "embedding": 0, "response": 0}
        self.misses = {"prompt": 0, "embedding": 0, "response": 0}
        self.tokens_saved = 0
        self.cost_saved = 0.0

    def record_hit(self, layer: str, tokens: int):
        self.hits[layer] += 1
        # Estimate savings: 90% of tokens * cost
        self.tokens_saved += tokens
        self.cost_saved += tokens * 0.000003 * 0.9  # $3/1M * 90%

    def record_miss(self, layer: str):
        self.misses[layer] += 1

    def get_hit_rate(self, layer: str) -> float:
        total = self.hits[layer] + self.misses[layer]
        if total == 0:
            return 0.0
        return self.hits[layer] / total

    def report(self):
        print("=== Cache Performance Report ===")
        for layer in ["prompt", "embedding", "response"]:
            rate = self.get_hit_rate(layer)
            print(f"{layer.title()} Cache: {rate:.1%} hit rate")
        print(f"Tokens Saved: {self.tokens_saved:,}")
        print(f"Cost Saved: ${self.cost_saved:.2f}")