
Caching at Multiple Levels: Prompt Cache, Embedding Cache, Response Cache


A Series A startup recently discovered their RAG pipeline was costing $18,000 per month; 90% of that spend was redundant context re-sent for the same 200 recurring user queries. After implementing a three-layer caching strategy, they reduced costs to $2,100 while improving response times by 65%. This guide shows you how to architect caching at multiple levels to achieve similar results.

LLM inference costs scale relentlessly with traffic, and at volume even small inefficiencies compound into massive bills. Verified pricing for four common models shows the stakes:

| Model | Input Cost / 1M tokens | Output Cost / 1M tokens | Context Window | Source |
| --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | Anthropic Docs |
| Claude 3.5 Haiku | $1.25 | $5.00 | 200K | Anthropic Docs |
| GPT-4o | $5.00 | $15.00 | 128K | OpenAI Pricing |
| GPT-4o-mini | $0.15 | $0.60 | 128K | OpenAI Pricing |

Consider a context-heavy RAG application: 10,000 daily queries, each carrying roughly 20,000 tokens of retrieved context and prompt and generating about 5,000 output tokens. Without caching:

  • Daily cost with GPT-4o: $1,000 input + $750 output = $1,750/day
  • Monthly cost: $52,500

With an 80% cache hit rate across all three layers:

  • Daily cost: $350 input + $150 output = $500/day
  • Monthly cost: $15,000
  • Savings: $37,500/month (71% reduction)

The math is undeniable: caching isn’t optional at scale—it’s survival.
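A quick sanity check of that arithmetic. The GPT-4o list prices come from the table above; the 80% blended hit rate and the 90% average discount per hit (across all three layers) mirror the scenario and are illustrative assumptions, not guarantees:

# Back-of-the-envelope check of the scenario above (GPT-4o list prices).
DAILY_QUERIES = 10_000
INPUT_TOKENS_PER_QUERY = 20_000
OUTPUT_TOKENS_PER_QUERY = 5_000
INPUT_PRICE_PER_TOKEN = 5.00 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 15.00 / 1_000_000

daily_input_cost = DAILY_QUERIES * INPUT_TOKENS_PER_QUERY * INPUT_PRICE_PER_TOKEN     # $1,000
daily_output_cost = DAILY_QUERIES * OUTPUT_TOKENS_PER_QUERY * OUTPUT_PRICE_PER_TOKEN  # $750

HIT_RATE = 0.80       # assumed blended hit rate across the three layers
HIT_DISCOUNT = 0.90   # assumed average discount on a cache hit

def effective(cost: float) -> float:
    """Cost after caching: misses pay full price, hits pay (1 - discount)."""
    return cost * (1 - HIT_RATE) + cost * HIT_RATE * (1 - HIT_DISCOUNT)

baseline_monthly = (daily_input_cost + daily_output_cost) * 30
cached_monthly = (effective(daily_input_cost) + effective(daily_output_cost)) * 30
print(f"baseline: ${baseline_monthly:,.0f}/month, with caching: ${cached_monthly:,.0f}/month")
# baseline: $52,500/month, with caching: $14,700/month -- in line with the ~$15,000 figure above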

1. Prompt Cache (Context Layer)

Prompt caching stores entire conversation contexts or system prompts that remain constant across requests. This is the highest-value layer for most applications.

What gets cached:

  • System instructions (persona, rules, format requirements)
  • Few-shot examples (demonstration pairs)
  • RAG context that’s identical across queries
  • Long, static background documents

How it works: Modern LLM APIs support prompt caching by checking whether the prefix of your prompt matches a previously processed prompt. If you send a 10,000-token system prompt followed by a 100-token user query, then after the first request every cache hit bills the 10,000-token prefix at a heavily discounted cache-read rate and only the 100-token query at the normal input price.

Verified pricing impact: Anthropic’s prompt caching offers:

  • Write cost: $3.75/1M tokens (for cache writes)
  • Read cost: $0.30/1M tokens (for cache reads)
  • TTL: 5 minutes default, configurable up to 1 hour

Compare this to normal input pricing of $3.00/1M tokens for Claude 3.5 Sonnet: the write premium is only $0.75/1M, while every cache read saves $2.70/1M, so caching pays for itself on the first read, as the quick calculation below shows.
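To make the break-even concrete, here is a quick calculation with those Claude 3.5 Sonnet numbers (the 10,000-token prefix is illustrative):

def prompt_cache_breakeven(prefix_tokens: int, reuses: int) -> dict:
    """Compare cached vs. uncached input cost for a shared prompt prefix.
    Rates: $3.00/1M normal input, $3.75/1M cache write, $0.30/1M cache read."""
    base = 3.00 / 1_000_000
    write = 3.75 / 1_000_000
    read = 0.30 / 1_000_000
    uncached = prefix_tokens * base * (1 + reuses)      # every request pays full price
    cached = prefix_tokens * (write + read * reuses)    # one write, then cheap reads
    return {"uncached": round(uncached, 4), "cached": round(cached, 4)}

# A 10,000-token prefix reused just once already comes out ahead:
print(prompt_cache_breakeven(10_000, reuses=1))  # {'uncached': 0.06, 'cached': 0.0405}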

2. Embedding Cache (Vector Retrieval Layer)


Embedding caching stores the vector representations of frequently queried text, eliminating redundant embedding API calls and vector database searches.

What gets cached:

  • User query embeddings
  • Retrieved document embeddings
  • Similarity search results
  • RAG context assemblies

The cost structure: Embedding models are far cheaper than LLMs, so the direct API savings are modest:

  • OpenAI text-embedding-3-small: $0.02/1M tokens
  • OpenAI text-embedding-3-large: $0.13/1M tokens
  • Cohere embed-english-v3: $0.10/1M tokens

For 10,000 daily queries averaging 500 tokens each:

  • Without cache: 5M tokens/day × $0.02/1M = $0.10/day
  • With a 70% hit rate: about $0.03/day
  • Annual savings: roughly $26 in embedding fees

The direct savings are pocket change; the real win is latency and load. A vector search adds 50-200ms per query, while a cache hit returns in under 5ms and skips both the embedding call and the vector database round-trip.

3. Response Cache (Output Layer)

Response caching stores complete LLM outputs for identical or semantically similar queries. This is the most aggressive caching strategy but offers massive savings for deterministic tasks.

What gets cached:

  • Exact query matches
  • Semantic similarity hits (>95% similarity; see the sketch below)
  • Structured data generation (JSON, SQL, etc.)
  • Code generation for common patterns

Implementation considerations: Response caching requires careful invalidation logic. Stale responses can cause:

  • Incorrect recommendations
  • Security vulnerabilities
  • Compliance issues

When to use it:

  • Customer support FAQs
  • Documentation generation
  • Data transformation tasks
  • Code review comments
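Exact-match caching is simple hashing (shown in the reference implementation later in this guide), but the semantic-similarity variant needs an embedding comparison. A minimal sketch, assuming an in-memory store and OpenAI embeddings; the 0.95 threshold and helper names are illustrative:

import numpy as np
from openai import OpenAI

client = OpenAI()
# In-memory store of (embedding, response) pairs; production would use a vector DB or Redis
semantic_cache: list[tuple[np.ndarray, str]] = []
SIMILARITY_THRESHOLD = 0.95  # illustrative; tune against your own traffic

def _embed(text: str) -> np.ndarray:
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(emb.data[0].embedding)

def semantic_lookup(query: str):
    """Return a cached response if a previously seen query is cosine-similar enough."""
    q = _embed(query)
    for cached_emb, cached_response in semantic_cache:
        sim = float(np.dot(q, cached_emb) / (np.linalg.norm(q) * np.linalg.norm(cached_emb)))
        if sim >= SIMILARITY_THRESHOLD:
            return cached_response
    return None

def semantic_store(query: str, response: str):
    semantic_cache.append((_embed(query), response))

A linear scan is fine for a few thousand cached entries; beyond that, store the cached query embeddings in the vector database you already run for RAG.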
A practical rollout sequence:

  1. Instrument your application to track query patterns, repetition rates, and token usage across all three layers. Without data, you’re guessing.

  2. Implement prompt caching first—it’s the lowest-hanging fruit. Most modern APIs support it natively with simple flags.

  3. Add embedding cache using Redis or your vector DB’s native caching. Cache query-document pairs for 24-48 hours.

  4. Deploy response cache only after measuring query similarity. Start with exact matches, then expand to semantic similarity.

  5. Set up cache invalidation based on your data freshness requirements. Use TTLs and event-driven invalidation.

  6. Monitor hit rates and costs continuously. Aim for >70% hit rate on the prompt cache and >50% on the embedding cache.

Here is a reference implementation that wires the three layers together, using Redis for storage and the OpenAI SDK for generation and embeddings:

import hashlib
import json
from dataclasses import dataclass
from typing import Optional

import redis
from openai import OpenAI


@dataclass
class CacheConfig:
    """Configuration for multi-level caching"""
    prompt_ttl: int = 300          # 5 minutes
    embedding_ttl: int = 86400     # 24 hours
    response_ttl: int = 3600       # 1 hour
    similarity_threshold: float = 0.95


class MultiLevelCache:
    def __init__(self, redis_client: redis.Redis, openai_client: OpenAI):
        self.redis = redis_client
        self.openai = openai_client
        self.config = CacheConfig()

    def _hash_prompt(self, prompt: str) -> str:
        """Create a deterministic hash for a prompt or query."""
        return hashlib.sha256(prompt.encode()).hexdigest()

    async def get_prompt_cache(self, system_prompt: str, user_query: str) -> Optional[str]:
        """
        Layer 1: Prompt Cache
        Caches the completion for a full context assembly (system prompt + query)
        for 5 minutes.
        """
        cache_key = f"prompt:v1:{self._hash_prompt(system_prompt + user_query)}"
        cached = self.redis.get(cache_key)
        if cached:
            return cached.decode()

        # Cache miss - generate and store
        response = self.openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_query},
            ],
        )
        result = response.choices[0].message.content
        self.redis.setex(cache_key, self.config.prompt_ttl, result)
        return result

    async def get_embedding_cache(self, text: str, doc_ids: list) -> list:
        """
        Layer 2: Embedding Cache
        Caches the retrieved document IDs for a query text for 24 hours.
        """
        text_hash = hashlib.sha256(text.encode()).hexdigest()
        cache_key = f"embedding:v1:{text_hash}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)

        # Cache miss - generate the embedding
        embedding = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        # In production, you'd query your vector DB here; for the demo we simulate retrieval
        retrieved_docs = self._vector_search(embedding.data[0].embedding, doc_ids)

        # Cache the retrieved doc IDs (not the full text)
        self.redis.setex(
            cache_key,
            self.config.embedding_ttl,
            json.dumps(retrieved_docs),
        )
        return retrieved_docs

    async def get_response_cache(self, query: str) -> Optional[str]:
        """
        Layer 3: Response Cache
        Returns the cached response for an exact query match (1-hour TTL), if any.
        """
        cache_key = f"response:v1:{self._hash_prompt(query)}"
        cached = self.redis.get(cache_key)
        if cached:
            return cached.decode()
        return None

    async def set_response_cache(self, query: str, response: str):
        """Store a response in the cache."""
        cache_key = f"response:v1:{self._hash_prompt(query)}"
        self.redis.setex(cache_key, self.config.response_ttl, response)

    def _vector_search(self, embedding: list, doc_ids: list) -> list:
        """Simulated vector search - replace with an actual DB query
        (Pinecone, Weaviate, Qdrant, etc.)."""
        return doc_ids[:5]  # Return top 5 matches


# Usage example
async def process_query(system_prompt: str, user_query: str, doc_ids: list):
    cache = MultiLevelCache(
        redis_client=redis.Redis(host="localhost", port=6379),
        openai_client=OpenAI(),
    )

    # Check the response cache first (cheapest)
    cached_response = await cache.get_response_cache(user_query)
    if cached_response:
        return {"source": "response_cache", "data": cached_response}

    # Check the prompt cache / generate the answer
    prompt_result = await cache.get_prompt_cache(system_prompt, user_query)

    # Check the embedding cache during RAG retrieval
    retrieved_docs = await cache.get_embedding_cache(user_query, doc_ids)

    # Store the final response for exact-match reuse
    await cache.set_response_cache(user_query, prompt_result)

    return {
        "source": "fresh",
        "response": prompt_result,
        "docs": retrieved_docs,
    }

Common Pitfalls

1. The “Single Character” Invalidation Trap

Problem: Changing a single character in the first 1,024 tokens invalidates the entire prompt cache.

Real-world example:

# Day 1: cache miss (first request) - the full prefix is billed at the normal input rate
system_prompt = "You are a helpful assistant. Today's date is 2025-12-27."

# Day 2: cache miss again - one changed character means the prefix no longer matches
system_prompt = "You are a helpful assistant. Today's date is 2025-12-28."

# Solution: keep the static content truly static and inject anything dynamic later in the prompt
system_prompt = "You are a helpful assistant."   # cached prefix
user_context = "Current date: 2025-12-28"        # appended fresh on each request

Impact: Azure OpenAI documentation confirms that “a single character difference in the first 1,024 tokens will result in a cache miss” learn.microsoft.com.

2. The “False Economy” of Micro-Caching


Problem: Prompts below the minimum cacheable prefix length never hit the cache at all, and short prefixes that are rarely reused don’t earn back the cache-write premium.

The math (Anthropic pricing):

  • Cache write overhead: 25% premium on the first request ($3.75 vs $3.00 per 1M tokens)
  • Cache read cost: $0.30/1M tokens (vs $3.00 normal)
  • Break-even point: a single read within the TTL; a prefix that is never reused just pays the premium

Rule: Only cache when:

  • Prefix ≥ 1,024 tokens (OpenAI/Azure)
  • Prefix ≥ 1,024 tokens (Anthropic Sonnet/Opus models; 2,048 for Haiku)
  • Prefix ≥ 2,048 tokens (Google Gemini 2.5 Pro)
  • You expect the prefix to be reused at least a few times within the cache lifetime (a pre-flight check is sketched below)
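A pre-flight check along those lines might look like this. It is a sketch: it uses tiktoken’s o200k_base tokenizer (GPT-4o’s) as a rough proxy for every provider, and the threshold table simply mirrors the list above:

import tiktoken

# Minimum cacheable prefix length per provider (mirrors the rule above)
MIN_PREFIX_TOKENS = {"openai": 1024, "anthropic": 1024, "gemini-2.5-pro": 2048}

def worth_caching(static_prefix: str, provider: str, expected_reuses: int) -> bool:
    """Rough gate: is the prefix long enough, and will it be reused before the TTL expires?"""
    enc = tiktoken.get_encoding("o200k_base")  # GPT-4o tokenizer, used as a rough proxy
    n_tokens = len(enc.encode(static_prefix))
    return n_tokens >= MIN_PREFIX_TOKENS.get(provider, 2048) and expected_reuses >= 1

print(worth_caching("You are a helpful assistant.", "openai", expected_reuses=5))  # False: far too short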

3. The “Expiring Mid-Conversation” TTL Trap

Problem: Default cache lifetimes are short (5-10 minutes). Multi-turn conversations can break unexpectedly when the cached prefix expires between turns.

Verified TTLs:

  • OpenAI: 5-10 minutes (standard), 60 minutes (GPT-5.1 Pro extended)
  • Anthropic: 5 minutes default, configurable to 1 hour
  • Google: 15 minutes (configurable), 3-5 minutes average for implicit
  • Azure OpenAI: 5-10 minutes, cleared after 1 hour of inactivity

Mitigation: For long sessions, implement keep-alive requests every 4 minutes (sketched below) or use extended caching tiers.
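A minimal keep-alive sketch for the Anthropic case, assuming the cached prefix lives in a system block marked with cache_control (each read refreshes the 5-minute TTL; run it in a background thread):

import time

import anthropic

client = anthropic.Anthropic()

def keep_cache_warm(system_prompt: str, interval_s: int = 240):
    """Re-read the cached prefix every ~4 minutes so its 5-minute TTL never lapses."""
    while True:
        client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1,  # the output is irrelevant; we only want the cache read
            system=[{
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": "ping"}],
        )
        time.sleep(interval_s)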

4. The “Unbounded Growth” Embedding Cache

Problem: Caching embeddings without size limits leads to Redis memory exhaustion.

Real failure: A startup cached every user query embedding indefinitely. Their Redis grew to 800GB in 3 weeks, costing $12,000/month in infrastructure before they hit OOM errors.

Solution:

# Application side: always set a TTL when writing cache entries
cache.setex(key, ttl, value)

# redis.conf: cap memory and evict least-recently-used (LRU) keys once the cap is hit
# maxmemory 2gb
# maxmemory-policy allkeys-lru

5. The “Cache Everything” Trap

Problem: Caching non-deterministic outputs or user-specific data.

Never cache:

  • Personalized recommendations
  • Real-time data (stock prices, weather)
  • Security-sensitive responses
  • Responses with random sampling (temperature > 0)

Safe to cache:

  • Static documentation
  • Code generation for common patterns
  • Data transformations (JSON to XML)
  • FAQ responses

6. The “Missing Cache Hit Rate” Blindspot


Problem: Teams implement caching but don’t measure hit rates, so they don’t know if it’s working.

Minimum instrumentation:

# Track per-layer metrics
metrics = {
    "prompt_cache_hits": 0,
    "prompt_cache_misses": 0,
    "embedding_cache_hits": 0,
    "embedding_cache_misses": 0,
    "response_cache_hits": 0,
    "response_cache_misses": 0,
    "tokens_saved": 0,
    "cost_saved": 0,
}

Target metrics:

  • Prompt cache: >70% hit rate
  • Embedding cache: >50% hit rate
  • Response cache: >40% hit rate (if used)

Expected savings by use case:

| Use Case | Prompt Cache | Embedding Cache | Response Cache | Expected Savings |
| --- | --- | --- | --- | --- |
| Customer Support Bot | High | Medium | High | 60-80% |
| RAG with Static Docs | High | High | Low | 70-85% |
| Code Assistant | High | Low | Medium | 50-70% |
| Data Analysis | Low | Medium | Low | 30-50% |
| Chatbot (Multi-turn) | High | Low | Low | 40-60% |

Provider quick reference:

OpenAI / Azure OpenAI:

  • Automatic caching (no code changes)
  • Minimum: 1,024 tokens
  • TTL: 5-10 min (standard), 60 min (Pro tier)
  • Cost: 50% discount on cached input tokens (GPT-4o family)
  • Key: Structure prompts with the stable prefix first (usage check below)
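To confirm automatic caching is actually kicking in, inspect the usage details on the response; the cached_tokens field reports how much of the prompt was served from cache. The system prompt here is a placeholder, padded only so it clears the 1,024-token minimum:

from openai import OpenAI

client = OpenAI()

# Stable prefix first: a long, unchanging system prompt (must exceed 1,024 tokens to be cacheable)
STATIC_SYSTEM_PROMPT = "You are a support agent for Acme Corp. " * 200

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": "What changed in the latest release?"},
    ],
)

# Prompt tokens served from the cache on this request (0 on the first call)
print(resp.usage.prompt_tokens_details.cached_tokens)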

Anthropic (Claude):

  • Requires cache_control markers
  • Minimum: 1,024 tokens
  • TTL: 5 min default, 1 hour configurable
  • Cost: 1.25x write, 0.1x read
  • Key: Use cache_control on large, stable text blocks (example below)
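A sketch of what those cache_control markers look like in practice, including the usage fields that confirm cache writes and reads (the document path is illustrative):

import anthropic

client = anthropic.Anthropic()

# A large static document to cache (must exceed the 1,024-token minimum)
manual = open("product_manual.txt", encoding="utf-8").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {"type": "text", "text": "You answer questions about the product manual."},
        {
            "type": "text",
            "text": manual,
            # Everything up to and including this block becomes the cached prefix
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I reset the device?"}],
)

# Tokens written to the cache on this call vs. served from it (reads are non-zero on repeat calls)
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)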

Google Gemini:

  • Implicit caching (2.5 Pro/Flash) or explicit cachedContent
  • Minimum: 1,024-2,048 tokens depending on model
  • TTL: 3-5 min average, 15 min configurable
  • Cost: 0.25x for cached tokens
  • Key: Push variations to end of prompt

Cache invalidation strategies:

Time-based (TTL)

  • Short-lived data: 5-15 minutes
  • Medium-lived: 1-24 hours
  • Long-lived: 1 week+

Event-driven

  • Database update → invalidate related cache keys
  • Document change → invalidate embedding cache
  • Policy update → invalidate response cache

Pattern-based

  • Use cache key versioning: v1:prompt:hash
  • Bump version on breaking changes
  • Old keys auto-expire via TTL (see the sketch below)
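A sketch of that version-bump pattern; the cache_version key and helper names are illustrative:

import hashlib

import redis

r = redis.Redis()

def versioned_key(namespace: str, payload: str) -> str:
    """Build cache keys like 'v3:prompt:<sha256>' so a version bump orphans old entries."""
    version = (r.get("cache_version") or b"0").decode()
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return f"v{version}:{namespace}:{digest}"

def invalidate_all_caches():
    """Bump the global version on breaking changes; orphaned keys age out via their TTLs."""
    r.incr("cache_version")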
The savings formula:

Total Savings = (Token Volume × Hit Rate × Discount) − Cache Overhead

Where:
  • Token Volume = total tokens processed
  • Hit Rate = fraction of requests served from cache
  • Discount = provider discount on cached tokens (0.9 for Anthropic cache reads, 0.75 for Gemini, 0.5 for GPT-4o)
  • Cache Overhead = any write premium (25% on Anthropic cache writes; none on OpenAI)

Example:

  • 10M tokens/day
  • 70% hit rate
  • 90% discount
  • Savings: 10M × 0.7 × 0.9 = 6.3M tokens/day
  • At $3/1M: $18.90/day → $567/month saved
Multi-Level Cache ROI Calculator

def calculate_cache_savings(
    daily_queries: int,
    avg_tokens_per_query: int,
    hit_rate_prompt: float,
    hit_rate_embedding: float,
    hit_rate_response: float,
    model_input_cost: float,       # per 1M tokens
    model_output_cost: float,      # per 1M tokens
    embedding_cost: float = 0.02,  # per 1M tokens
) -> dict:
    """
    Calculate monthly savings from three-layer caching.
    All costs in USD per 1M tokens. Layers are treated independently, so
    overlapping hits are slightly double-counted; treat the result as an
    upper-bound estimate.
    """
    # Monthly token volume
    monthly_queries = daily_queries * 30
    input_tokens_monthly = monthly_queries * avg_tokens_per_query
    output_tokens_monthly = monthly_queries * (avg_tokens_per_query * 0.1)  # assume 10% output

    # Base costs (no caching)
    base_input_cost = (input_tokens_monthly / 1_000_000) * model_input_cost
    base_output_cost = (output_tokens_monthly / 1_000_000) * model_output_cost
    base_embedding_cost = (input_tokens_monthly / 1_000_000) * embedding_cost
    base_total = base_input_cost + base_output_cost

    # Prompt cache: ~90% savings on the cached share of input tokens
    prompt_savings = base_input_cost * hit_rate_prompt * 0.9
    # Embedding cache: cached queries skip the embedding call entirely
    embedding_savings = base_embedding_cost * hit_rate_embedding
    # Response cache: cached queries skip the LLM call entirely
    response_savings = base_total * hit_rate_response

    total_savings = prompt_savings + embedding_savings + response_savings
    return {
        "base_monthly_cost": round(base_total + base_embedding_cost, 2),
        "prompt_cache_savings": round(prompt_savings, 2),
        "embedding_cache_savings": round(embedding_savings, 2),
        "response_cache_savings": round(response_savings, 2),
        "total_monthly_savings": round(total_savings, 2),
    }
Cache Performance Monitor

class CacheMetrics:
    def __init__(self):
        self.hits = {"prompt": 0, "embedding": 0, "response": 0}
        self.misses = {"prompt": 0, "embedding": 0, "response": 0}
        self.tokens_saved = 0
        self.cost_saved = 0.0

    def record_hit(self, layer: str, tokens: int):
        self.hits[layer] += 1
        # Estimate savings: 90% discount on tokens billed at $3/1M (Claude 3.5 Sonnet input)
        self.tokens_saved += tokens
        self.cost_saved += tokens * 0.000003 * 0.9

    def record_miss(self, layer: str):
        self.misses[layer] += 1

    def get_hit_rate(self, layer: str) -> float:
        total = self.hits[layer] + self.misses[layer]
        if total == 0:
            return 0.0
        return self.hits[layer] / total

    def report(self):
        print("=== Cache Performance Report ===")
        for layer in ["prompt", "embedding", "response"]:
            rate = self.get_hit_rate(layer)
            print(f"{layer.title()} Cache: {rate:.1%} hit rate")
        print(f"Tokens Saved: {self.tokens_saved:,}")
        print(f"Cost Saved: ${self.cost_saved:.2f}")
