A Series A startup recently discovered their RAG pipeline was costing $18,000 per month—roughly 90% of that spend was redundant context re-sent for the same 200 recurring user queries. After implementing a three-layer caching strategy, they reduced costs to $2,100 while improving response times by 65%. This guide shows you how to architect caching at multiple levels to achieve similar results.
LLM inference costs scale linearly with token volume, so at scale even small inefficiencies compound into massive bills. The provider pricing covered throughout this guide shows the stakes:
Consider a typical RAG application: 10,000 daily queries, average 5,000 tokens per request (context + response). Without caching:
Daily cost with GPT-4o: $1,000 input + $750 output = $1,750/day
Monthly cost: $52,500
With an 80% cache hit rate across all three layers:
Daily cost: $350 input + $150 output = $500/day
Monthly cost: $15,000
Savings: $37,500/month (71% reduction)
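For reference, here is the arithmetic behind those figures, taking the daily input/output costs quoted above as given (a minimal sketch; the 80% blended hit rate is the assumption stated in this example):

baseline_daily = 1_000 + 750   # USD/day: input + output, no caching
cached_daily = 350 + 150       # USD/day with ~80% hit rate across all three layers

monthly_baseline = baseline_daily * 30                 # $52,500
monthly_cached = cached_daily * 30                     # $15,000
monthly_savings = monthly_baseline - monthly_cached    # $37,500
reduction = monthly_savings / monthly_baseline         # ~0.71, i.e. a 71% reduction
print(monthly_baseline, monthly_cached, monthly_savings, round(reduction, 2))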
The math is undeniable: caching isn’t optional at scale—it’s survival.
Prompt caching stores entire conversation contexts or system prompts that remain constant across requests. This is the highest-value layer for most applications.
What gets cached:
System instructions (persona, rules, format requirements)
Few-shot examples (demonstration pairs)
RAG context that’s identical across queries
Long, static background documents
How it works:
Modern LLM APIs support prompt caching by checking if the prefix of your prompt matches previously processed prompts. If you send a 10,000-token system prompt followed by a 100-token user query, the API only charges for the 100 tokens on cache hits after the first request.
Verified pricing impact:
Anthropic’s prompt caching offers:
Write cost: $3.75/1M tokens (for cache writes)
Read cost: $0.30/1M tokens (for cache reads)
TTL: 5 minutes default, configurable up to 1 hour
Compare this to the normal input price of $3.00/1M tokens for Claude 3.5 Sonnet: the 25% write premium is recovered by the first cache hit, and every subsequent read costs a tenth of the normal rate.
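Here is a minimal sketch of what this looks like with Anthropic's cache_control marker; LONG_SYSTEM_PROMPT is a placeholder for a static prefix of at least 1,024 tokens, and the model name and query are illustrative:

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,               # static instructions + few-shot examples
            "cache_control": {"type": "ephemeral"},   # mark this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": "What is your refund policy?"}],
)
# usage reports how many tokens were written to and read from the cache
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)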
Embedding caching stores the vector representations of frequently queried text, eliminating redundant embedding API calls and vector database searches.
What gets cached:
User query embeddings
Retrieved document embeddings
Similarity search results
RAG context assemblies
The cost structure:
Embedding models are cheaper than LLMs but still add up:
OpenAI text-embedding-3-small: $0.02/1M tokens
OpenAI text-embedding-3-large: $0.13/1M tokens
Cohere embed-english-v3: $0.10/1M tokens
For 10,000 daily queries averaging 500 tokens each:
Without cache: 5M tokens/day × $0.02 = $100/day
With 70% hit rate: $30/day
Annual savings: $25,550
But the real win is latency. Vector searches add 50-200ms per query. Cache hits reduce this to less than 5ms.
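A minimal sketch of the idea, assuming a local Redis instance and OpenAI's text-embedding-3-small (the fuller multi-layer implementation later in this guide caches retrieved document IDs rather than raw vectors):

import hashlib, json
import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()

def get_embedding(text: str, ttl: int = 86400) -> list[float]:
    key = "emb:v1:" + hashlib.sha256(text.encode()).hexdigest()
    cached = r.get(key)
    if cached:
        return json.loads(cached)          # ~1-5 ms, no API call
    vector = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    r.setex(key, ttl, json.dumps(vector))  # expire after 24 hours
    return vector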
Response caching stores complete LLM outputs for identical or semantically similar queries. This is the most aggressive caching strategy but offers massive savings for deterministic tasks.
What gets cached:
Exact query matches
Semantic similarity hits (>95% similarity; see the sketch at the end of this subsection)
Structured data generation (JSON, SQL, etc.)
Code generation for common patterns
Implementation considerations:
Response caching requires careful invalidation logic. Stale responses can cause:
Incorrect recommendations
Security vulnerabilities
Compliance issues
When to use it:
Customer support FAQs
Documentation generation
Data transformation tasks
Code review comments
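Exact-match response caching appears in the implementation below; for the semantic-similarity variant, one common approach is to embed each incoming query and compare it against embeddings of previously answered queries. A minimal in-memory sketch, assuming you already have query embeddings from the embedding layer:

import numpy as np

SIMILARITY_THRESHOLD = 0.95
_cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached response)

def lookup_semantic(query_embedding: np.ndarray) -> str | None:
    for cached_vec, cached_response in _cache:
        cos = float(np.dot(query_embedding, cached_vec) /
                    (np.linalg.norm(query_embedding) * np.linalg.norm(cached_vec)))
        if cos >= SIMILARITY_THRESHOLD:
            return cached_response          # close enough: reuse the earlier answer
    return None                             # miss: call the LLM, then store the new pair

def store_semantic(query_embedding: np.ndarray, response: str) -> None:
    _cache.append((query_embedding, response))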
Instrument your application to track query patterns, repetition rates, and token usage across all three layers. Without data, you’re guessing.
Implement prompt caching first: it's the lowest-hanging fruit, and most modern APIs support it natively with simple flags.
Add an embedding cache using Redis or your vector DB's native caching. Cache query-document pairs for 24-48 hours.
Deploy a response cache only after measuring query similarity. Start with exact matches, then expand to semantic similarity.
Set up cache invalidation based on your data freshness requirements. Use TTLs and event-driven invalidation.
Monitor hit rates and costs continuously. Aim for a >70% hit rate on the prompt cache and >50% on the embedding cache.
import hashlib
import json

import redis
from typing import Optional
from dataclasses import dataclass
from openai import OpenAI


@dataclass
class CacheConfig:
    """Configuration for multi-level caching"""
    prompt_ttl: int = 300          # 5 minutes
    embedding_ttl: int = 86400     # 24 hours
    response_ttl: int = 3600       # 1 hour
    similarity_threshold: float = 0.95


class MultiLevelCache:
    """Three-layer cache: prompt, embedding, response (assumes a Redis client with decode_responses=True)."""

    def __init__(self, redis_client: redis.Redis, openai_client: OpenAI):
        self.redis = redis_client
        self.openai = openai_client
        self.config = CacheConfig()

    def _hash_prompt(self, prompt: str) -> str:
        """Create deterministic hash for prompt prefix"""
        return hashlib.sha256(prompt.encode()).hexdigest()

    async def get_prompt_cache(self, system_prompt: str, user_query: str) -> Optional[str]:
        """Layer 1: caches system prompt + query prefix for 5 minutes"""
        # For prompt caching, we cache the full context assembly
        cache_key = f"prompt:v1:{self._hash_prompt(system_prompt + user_query[:100])}"
        cached = self.redis.get(cache_key)
        if cached:
            return cached

        # Cache miss - generate and store
        response = self.openai.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_query},
            ],
        )
        result = response.choices[0].message.content
        self.redis.setex(cache_key, self.config.prompt_ttl, result)
        return result

    async def get_embedding_cache(self, text: str, doc_ids: list) -> Optional[list]:
        """Layer 2: caches query embeddings + retrieved document IDs for 24 hours"""
        # Hash the text to create cache key
        text_hash = hashlib.sha256(text.encode()).hexdigest()
        cache_key = f"embedding:v1:{text_hash}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)

        # Cache miss - generate embedding
        embedding = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        # In production, you'd query your vector DB here
        # For demo, we simulate retrieval
        retrieved_docs = self._vector_search(embedding.data[0].embedding, doc_ids)

        # Cache the retrieved doc IDs (not full text)
        self.redis.setex(
            cache_key,
            self.config.embedding_ttl,
            json.dumps(retrieved_docs),
        )
        return retrieved_docs

    async def get_response_cache(self, query: str) -> Optional[str]:
        """Layer 3: caches exact query matches for 1 hour"""
        cache_key = f"response:v1:{self._hash_prompt(query)}"
        cached = self.redis.get(cache_key)
        return cached

    async def set_response_cache(self, query: str, response: str):
        """Store response in cache"""
        cache_key = f"response:v1:{self._hash_prompt(query)}"
        self.redis.setex(cache_key, self.config.response_ttl, response)

    def _vector_search(self, embedding: list, doc_ids: list) -> list:
        """Simulated vector search - replace with actual DB query"""
        # In production: query Pinecone, Weaviate, Qdrant, etc.
        return doc_ids[:5]  # Return top 5 matches


async def process_query(system_prompt: str, user_query: str, doc_ids: list):
    cache = MultiLevelCache(
        redis_client=redis.Redis(host='localhost', port=6379, decode_responses=True),
        openai_client=OpenAI(),
    )

    # Check response cache first (cheapest)
    cached_response = await cache.get_response_cache(user_query)
    if cached_response:
        return {"source": "response_cache", "data": cached_response}

    # Generate (or reuse) the completion; the prompt layer caches the assembled context
    prompt_result = await cache.get_prompt_cache(system_prompt, user_query)

    # Check embedding cache during RAG
    retrieved_docs = await cache.get_embedding_cache(user_query, doc_ids)

    await cache.set_response_cache(user_query, prompt_result)
    return {
        "source": "generated",
        "response": prompt_result,
        "retrieved_docs": retrieved_docs,
    }
import crypto from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

interface CacheConfig {
  promptTTL: number;      // seconds
  embeddingTTL: number;   // seconds
  responseTTL: number;    // seconds
  similarityThreshold: number;
}

interface CacheResult<T> {
  hit: boolean;
  data?: T;
  latencyMs: number;
}

class MultiLevelCache {
  private redis: Redis;
  private openai: OpenAI;
  private config: CacheConfig;

  constructor(redisUrl: string, openaiKey: string) {
    this.redis = new Redis(redisUrl);
    this.openai = new OpenAI({ apiKey: openaiKey });
    this.config = {
      promptTTL: 300,       // 5 minutes
      embeddingTTL: 86400,  // 24 hours
      responseTTL: 3600,    // 1 hour
      similarityThreshold: 0.95
    };
  }

  private hashPrompt(text: string): string {
    return crypto.createHash('sha256').update(text).digest('hex');
  }

  async getCachedResponse(query: string): Promise<CacheResult<string>> {
    const start = Date.now();
    const cacheKey = `response:v1:${this.hashPrompt(query)}`;
    const cached = await this.redis.get(cacheKey);
    if (cached) {
      return { hit: true, data: cached, latencyMs: Date.now() - start };
    }
    return { hit: false, latencyMs: Date.now() - start };
  }

  async cacheResponse(query: string, response: string): Promise<void> {
    const cacheKey = `response:v1:${this.hashPrompt(query)}`;
    await this.redis.setex(cacheKey, this.config.responseTTL, response);
  }

  async getPromptCompletion(
    systemPrompt: string,
    userQuery: string
  ): Promise<CacheResult<string>> {
    const start = Date.now();
    const cacheKey = `prompt:v1:${this.hashPrompt(systemPrompt + userQuery.slice(0, 100))}`;
    const cached = await this.redis.get(cacheKey);
    if (cached) {
      return { hit: true, data: cached, latencyMs: Date.now() - start };
    }
    const response = await this.openai.chat.completions.create({
      model: "gpt-4o",  // illustrative model choice
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: userQuery }
      ]
    });
    const result = response.choices[0].message.content || "";
    await this.redis.setex(cacheKey, this.config.promptTTL, result);
    return { hit: false, data: result, latencyMs: Date.now() - start };
  }

  async getEmbeddedDocs(query: string, docIds: string[]): Promise<string[]> {
    const hash = this.hashPrompt(query);
    const cacheKey = `embedding:v1:${hash}`;
    const cached = await this.redis.get(cacheKey);
    if (cached) {
      return JSON.parse(cached);
    }
    const embedding = await this.openai.embeddings.create({
      model: "text-embedding-3-small",
      input: query
    });
    // Simulate vector search; in production `embedding` would query your vector DB
    const retrieved = docIds.slice(0, 5);
    await this.redis.setex(cacheKey, this.config.embeddingTTL, JSON.stringify(retrieved));
    return retrieved;
  }
}
Cache Invalidation Failures
The most expensive caching mistake is serving stale data. A financial services company cached compliance responses for 24 hours and served outdated regulatory guidance, resulting in a $2M compliance fine. Their cache hit rate was 94%—but their risk exposure was 100%.
Problem: Changing a single character in the first 1,024 tokens invalidates the entire prompt cache.
Real-world example:
# Day 1: Cache miss (first request)
system_prompt = "You are a helpful assistant. Today's date is 2025-12-27."
# Cost: $0.05 for 2,000 tokens
# Day 2: Cache miss (date changed)
system_prompt = "You are a helpful assistant. Today's date is 2025-12-28."
# Cost: $0.05 again—cache invalidated by 1 character
# Solution: Keep static content truly static
system_prompt = "You are a helpful assistant." # Cached
user_context = "Current date: 2025-12-28" # Fresh
Impact: Azure OpenAI documentation confirms that “a single character difference in the first 1,024 tokens will result in a cache miss” (learn.microsoft.com).
Problem: Prefixes below the provider's minimum token count never get cached, and prefixes that barely clear it may not be re-read often enough to pay back the write premium.
The math (using the Anthropic pricing above):
Cache write: 25% premium on the first request ($3.75 vs $3.00 per 1M tokens)
Cache read: $0.30/1M tokens (vs $3.00 normal)
Break-even: a single cache hit within the TTL already outweighs the write premium; every further read is ~90% cheaper
Rule: Only cache when:
Prefix ≥ 1,024 tokens (OpenAI/Azure)
Prefix ≥ 1,024 tokens (Anthropic)
Prefix ≥ 2,048 tokens (Google Gemini 2.5 Pro)
You expect ≥ 5 requests within the cache lifetime
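A minimal guard that encodes the rule above (the chars-per-four token estimate is a rough placeholder for a real tokenizer such as tiktoken, and expected_reads is whatever your traffic analysis predicts):

MIN_CACHEABLE_TOKENS = {"openai": 1024, "azure": 1024, "anthropic": 1024, "gemini-2.5-pro": 2048}

def should_cache_prefix(prefix: str, provider: str, expected_reads: int) -> bool:
    """Return True only when the prefix clears the provider minimum and will be re-read enough."""
    approx_tokens = len(prefix) // 4      # crude estimate; swap in a real tokenizer
    return approx_tokens >= MIN_CACHEABLE_TOKENS[provider] and expected_reads >= 5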
Problem: Default cache lifetimes are short (5-10 minutes), so the cached prefix can expire between turns of a multi-turn conversation and later requests silently revert to full price.
Verified TTLs:
OpenAI: 5-10 minutes (standard), 60 minutes (GPT-5.1 Pro extended)
Anthropic: 5 minutes default, configurable to 1 hour
Google: 15 minutes (configurable), 3-5 minutes average for implicit
Azure OpenAI: 5-10 minutes, cleared after 1 hour of inactivity
Mitigation: For long sessions, implement keep-alive requests every 4 minutes or use extended caching tiers.
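One way to implement that keep-alive, sketched with the OpenAI async client (the 240-second interval, model name, and one-token "ping" are illustrative assumptions, and the loop should be cancelled when the session ends):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def keep_cache_warm(system_prompt: str, interval_s: int = 240):
    """Re-send the stable prefix every ~4 minutes so the provider-side prompt cache never lapses."""
    while True:
        await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},  # same stable prefix as the real traffic
                {"role": "user", "content": "ping"},           # minimal suffix
            ],
            max_tokens=1,  # keep the refresh request as cheap as possible
        )
        await asyncio.sleep(interval_s)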
Problem: Caching embeddings without size limits leads to Redis memory exhaustion.
Real failure: A startup cached every user query embedding indefinitely. Their Redis grew to 800GB in 3 weeks, costing $12,000/month in infrastructure before they hit OOM errors.
Solution:
cache.setex(key, ttl, value)    # Always set a TTL
maxmemory-policy allkeys-lru    # redis.conf: evict least-recently-used keys under memory pressure
Problem: Caching non-deterministic outputs or user-specific data.
Never cache:
Personalized recommendations
Real-time data (stock prices, weather)
Security-sensitive responses
Responses with random sampling (temperature > 0)
Safe to cache:
Static documentation
Code generation for common patterns
Data transformations (JSON to XML)
FAQ responses
Problem: Teams implement caching but don’t measure hit rates, so they don’t know if it’s working.
Minimum instrumentation:
# Track per-layer metrics
cache_metrics = {
    "prompt_cache_hits": 0,
    "prompt_cache_misses": 0,
    "embedding_cache_hits": 0,
    "embedding_cache_misses": 0,
    "response_cache_hits": 0,
    "response_cache_misses": 0,
}
Target metrics:
Prompt cache: >70% hit rate
Embedding cache: >50% hit rate
Response cache: >40% hit rate (if used)
| Use Case | Prompt Cache | Embedding Cache | Response Cache | Expected Savings |
| --- | --- | --- | --- | --- |
| Customer Support Bot | High | Medium | High | 60-80% |
| RAG with Static Docs | High | High | Low | 70-85% |
| Code Assistant | High | Low | Medium | 50-70% |
| Data Analysis | Low | Medium | Low | 30-50% |
| Chatbot (Multi-turn) | High | Low | Low | 40-60% |
OpenAI / Azure OpenAI:
Automatic caching (no code changes)
Minimum: 1,024 tokens
TTL: 5-10 min (standard), 60 min (Pro tier)
Cost: up to 90% discount on cached input tokens (model-dependent)
Key: Structure prompts with the stable prefix first
Anthropic (Claude):
Requires cache_control markers
Minimum: 1,024 tokens
TTL: 5 min default, 1 hour configurable
Cost: 1.25x write, 0.1x read
Key: Use cache_control on large text blocks
Google Gemini:
Implicit caching (2.5 Pro/Flash) or explicit cachedContent
Minimum: 1,024-2,048 tokens depending on model
TTL: 3-5 min average, 15 min configurable
Cost: 0.25x for cached tokens
Key: Push variations to the end of the prompt
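To confirm that OpenAI/Azure automatic caching is actually firing, inspect the usage details returned with each response; LONG_STATIC_PREFIX is a placeholder for a stable prefix of at least 1,024 tokens:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_STATIC_PREFIX},   # stable prefix first
        {"role": "user", "content": "Summarize today's open tickets."},
    ],
)
details = response.usage.prompt_tokens_details
print("cached prompt tokens:", details.cached_tokens if details else 0)  # > 0 means the prefix hit the cache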
✅ Time-based (TTL)
Short-lived data: 5-15 minutes
Medium-lived: 1-24 hours
Long-lived: 1 week+
✅ Event-driven
Database update → invalidate related cache keys
Document change → invalidate embedding cache
Policy update → invalidate response cache
✅ Pattern-based
Use cache key versioning (e.g., prompt:v1:<hash>)
Bump version on breaking changes
Old keys auto-expire via TTL
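A minimal sketch of the versioned-key and event-driven patterns above, reusing the key layout from earlier in this guide (the on_document_updated hook and the scan-based lookup are illustrative and would not scale to very large keyspaces):

import redis

r = redis.Redis()
CACHE_VERSION = "v2"   # bump on breaking prompt or schema changes; old v1 keys expire via their TTLs

def response_key(query_hash: str) -> str:
    return f"response:{CACHE_VERSION}:{query_hash}"

def on_document_updated(doc_id: str):
    """Event-driven invalidation: when a source document changes, drop embedding-cache entries that reference it."""
    for key in r.scan_iter(match=f"embedding:{CACHE_VERSION}:*"):
        cached_ids = r.get(key)
        if cached_ids and doc_id.encode() in cached_ids:
            r.delete(key)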
Total Savings = (Token Volume × Hit Rate × Discount %) − Cache Overhead
- Token Volume = total tokens processed
- Hit Rate = % of requests served from cache
- Discount % = provider cache discount (e.g., 0.9 for Anthropic cache reads)
- Cache Overhead = write premium on first requests (e.g., 25% extra on Anthropic cache writes)
Example:
10M tokens/day
70% hit rate
90% discount
Savings: 10M × 0.7 × 0.9 = 6.3M tokens/day
At $3/1M: $18.90/day → $567/month saved
# Multi-Level Cache ROI Calculator
def calculate_cache_savings(
    daily_queries: int,
    avg_tokens_per_query: int,
    hit_rate_prompt: float,
    hit_rate_embedding: float,
    hit_rate_response: float,
    model_input_cost: float,        # per 1M tokens
    model_output_cost: float,       # per 1M tokens
    embedding_cost: float = 0.02,   # per 1M tokens
) -> dict:
    """Calculate monthly savings from three-layer caching.
    All costs in USD per 1M tokens."""
    monthly_queries = daily_queries * 30
    input_tokens_monthly = monthly_queries * avg_tokens_per_query
    output_tokens_monthly = monthly_queries * (avg_tokens_per_query * 0.1)  # Assume 10% output
    # Base costs (no caching)
    base_input_cost = (input_tokens_monthly / 1_000_000) * model_input_cost
    base_output_cost = (output_tokens_monthly / 1_000_000) * model_output_cost
    base_total = base_input_cost + base_output_cost
    # Prompt cache: ~90% savings on the cached share of input tokens
    prompt_savings = base_input_cost * hit_rate_prompt * 0.9
    # Embedding cache: avoided embedding API calls
    embedding_base = (input_tokens_monthly / 1_000_000) * embedding_cost
    embedding_savings = embedding_base * hit_rate_embedding
    # Response cache: hits skip the LLM call entirely (input + output)
    response_savings = base_total * hit_rate_response
    # Simplified model: layers treated as independent (no double-counting adjustment)
    total_savings = prompt_savings + embedding_savings + response_savings
    return {
        "base_monthly_cost": round(base_total, 2),
        "prompt_cache_savings": round(prompt_savings, 2),
        "embedding_cache_savings": round(embedding_savings, 2),
        "response_cache_savings": round(response_savings, 2),
        "total_monthly_savings": round(total_savings, 2),
    }
# Cache Performance Monitor
class CacheMonitor:
    def __init__(self):
        self.hits = {"prompt": 0, "embedding": 0, "response": 0}
        self.misses = {"prompt": 0, "embedding": 0, "response": 0}
        self.tokens_saved = 0
        self.cost_saved = 0.0

    def record_hit(self, layer: str, tokens: int):
        self.hits[layer] += 1
        # Estimate savings: 90% of tokens * cost
        self.tokens_saved += tokens
        self.cost_saved += tokens * 0.000003 * 0.9  # $3/1M * 90%

    def record_miss(self, layer: str):
        self.misses[layer] += 1

    def get_hit_rate(self, layer: str) -> float:
        total = self.hits[layer] + self.misses[layer]
        if total == 0:
            return 0.0
        return self.hits[layer] / total

    def report(self):
        print("=== Cache Performance Report ===")
        for layer in ["prompt", "embedding", "response"]:
            rate = self.get_hit_rate(layer)
            print(f"{layer.title()} Cache: {rate:.1%} hit rate")
        print(f"Tokens Saved: {self.tokens_saved:,}")
        print(f"Cost Saved: ${self.cost_saved:.2f}")