Token Compression Techniques: Reducing Context Length by 20-40%

Every token you send to a language model costs money, yet most production systems transmit verbose context without compression. A single RAG pipeline processing 10,000 documents daily can burn through $15,000 per month in unnecessary input tokens alone. Token compression techniques—summarization, semantic compression, and attention-based pruning—can reduce context length by 20-40% while preserving accuracy, directly impacting your bottom line.

Token costs follow a brutal multiplicative pattern. Consider a typical RAG application: you send a system prompt (500 tokens), 5 retrieved documents (2,000 tokens each = 10,000 tokens), conversation history (3,000 tokens), and the user query (100 tokens). That’s 13,600 input tokens per request. At 50,000 requests per day with GPT-4o ($5.00/1M input tokens), you’re spending $3,400 daily or $102,000 monthly. A 30% compression reduces this to $71,400—saving $30,600 monthly.
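
As a quick sanity check on that math (token counts and prices exactly as quoted above, nothing measured):

# Back-of-envelope cost of the uncompressed RAG request mix described above
TOKENS_PER_REQUEST = 500 + 5 * 2_000 + 3_000 + 100    # system + docs + history + query = 13,600
REQUESTS_PER_DAY = 50_000
PRICE_PER_M_INPUT = 5.00                               # GPT-4o input, $ per 1M tokens

daily = TOKENS_PER_REQUEST * REQUESTS_PER_DAY * PRICE_PER_M_INPUT / 1_000_000
print(f"daily: ${daily:,.0f}, monthly: ${daily * 30:,.0f}")          # ~$3,400 / ~$102,000
print(f"with 30% compression: ${daily * 30 * 0.7:,.0f} per month")   # ~$71,400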

The challenge is that naive compression (like simple truncation) destroys accuracy. Sophisticated compression maintains semantic meaning while eliminating redundancy. This article covers three proven techniques: summarization (condensing content while preserving meaning), semantic compression (embedding-based deduplication), and attention-based pruning (removing low-impact tokens).

Beyond direct API costs, uncompressed context creates cascading expenses:

  • Latency: Larger contexts increase time-to-first-token (TTFT) by 50-200%
  • Throughput: More tokens per request reduces requests-per-second capacity
  • Retry costs: Failed requests due to context length limits burn tokens without results
  • Storage: Storing uncompressed conversation history for compliance multiplies database costs

Summarization uses a smaller, faster model to condense context before sending it to your primary model. This is ideal for long documents, conversation history, and retrieved context.

The pattern is straightforward: intercept your context, summarize it, then use the summary instead of the original. For RAG systems, summarize each retrieved document individually before concatenation. For conversation history, summarize completed turns.

Python Implementation:

from anthropic import AsyncAnthropic
import asyncio


class ContextCompressor:
    def __init__(self, model="claude-3-5-haiku-20241022"):
        # Async client so multiple documents can be summarized concurrently
        self.client = AsyncAnthropic()
        self.model = model
        self.summary_prompt = """Summarize the following text, preserving all key facts, figures, and actionable insights.
Keep the summary at ~30% of original length. Return only the summary text.

Text to summarize:
{text}"""

    async def compress_document(self, text: str, max_tokens: int = 1000) -> str:
        """Compress a single document using a lightweight model."""
        response = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=0.1,
            messages=[{
                "role": "user",
                "content": self.summary_prompt.format(text=text)
            }]
        )
        return response.content[0].text

    async def compress_rag_context(self, documents: list[str]) -> str:
        """Compress multiple RAG documents in parallel."""
        tasks = [self.compress_document(doc) for doc in documents]
        summaries = await asyncio.gather(*tasks)
        return "\n\n".join(summaries)


# Usage example (run inside an async function, e.g. via asyncio.run)
compressor = ContextCompressor()

# Original context: 5 documents × 2,000 tokens = 10,000 tokens
documents = [
    "Long technical documentation about API endpoints...",
    "Customer support transcript with multiple turns...",
    "Product specification with detailed requirements...",
    "Research paper with methodology and results...",
    "Financial report with quarterly metrics..."
]

# Compressed context: ~3,000 tokens (70% reduction)
compressed_context = await compressor.compress_rag_context(documents)

Cost Analysis:

  • Before: 10,000 tokens × $5.00/1M = $0.05 per request
  • After: 3,000 tokens × $5.00/1M = $0.015 per request
  • Savings: 70% reduction in input costs

Semantic compression eliminates redundant information by clustering similar content and keeping only representative examples. This is particularly effective for conversation history and multi-document retrieval.

  1. Split context into semantic chunks
  2. Generate embeddings for each chunk
  3. Cluster chunks by similarity
  4. Keep one representative per cluster
  5. Reconstruct compressed context

Implementation:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np


class SemanticCompressor:
    def __init__(self):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.clustering = AgglomerativeClustering(
            n_clusters=None,
            distance_threshold=0.8
        )

    def compress(self, text_chunks: list[str]) -> str:
        """Compress by removing semantic duplicates."""
        if len(text_chunks) < 2:
            return "\n".join(text_chunks)
        # Generate embeddings
        embeddings = self.encoder.encode(text_chunks)
        # Cluster similar chunks
        clusters = self.clustering.fit_predict(embeddings)
        # Keep the first chunk from each cluster, preserving the original order
        compressed = []
        seen = set()
        for idx, cluster_id in enumerate(clusters):
            if cluster_id not in seen:
                seen.add(cluster_id)
                compressed.append(text_chunks[idx])
        return "\n".join(compressed)


# Example: Conversation history compression
conversation = [
    "User: How do I implement token compression?",
    "Assistant: You can use summarization or semantic compression...",
    "User: Which is better for RAG?",
    "Assistant: For RAG, summarization works well for documents...",
    "User: What about cost?",
    "Assistant: Compression can save 20-40% on token costs..."
]

compressor = SemanticCompressor()
compressed = compressor.compress(conversation)
# Result: removes repetitive explanations, keeps key points

Attention-based pruning removes tokens that contribute least to the model’s understanding. This requires analyzing attention weights or using gradient-based importance scores.

Simplified Approach:

def prune_low_attention_tokens(text: str, attention_scores: list[float], threshold: float = 0.1) -> str:
    """Remove tokens with attention scores below threshold."""
    tokens = text.split()
    pruned_tokens = [
        token for token, score in zip(tokens, attention_scores)
        if score >= threshold
    ]
    return " ".join(pruned_tokens)


# In practice, you'd extract attention scores from your model.
# For most applications, use keyword-based filtering as a proxy:
def keyword_prune(text: str, important_keywords: list[str]) -> str:
    """Keep sentences containing important keywords."""
    sentences = text.split('. ')
    kept = [
        s for s in sentences
        if any(kw.lower() in s.lower() for kw in important_keywords)
    ]
    return '. '.join(kept)
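
As a usage sketch, keyword pruning applied to a made-up API note keeps only the sentences that mention the terms you care about (the input text and keyword list are illustrative):

# Hypothetical example: prune an API note down to authentication-related sentences
doc = (
    "The /v1/users endpoint creates a new user. "
    "Authentication requires a Bearer token. "
    "The service was rewritten in 2019. "
    "Invalid tokens return a 401 error."
)
print(keyword_prune(doc, ["token", "401"]))
# -> "Authentication requires a Bearer token. Invalid tokens return a 401 error."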

Here’s a production-ready example that integrates compression into a RAG workflow:

import asyncio
from typing import List, Optional
from dataclasses import dataclass


@dataclass
class CompressionConfig:
    """Configuration for compression strategies."""
    summary_ratio: float = 0.3        # Keep ~30% via summarization
    semantic_threshold: float = 0.8   # Clustering threshold
    enable_parallel: bool = True
    fallback_to_original: bool = True


class CompressedRAGPipeline:
    def __init__(self, config: Optional[CompressionConfig] = None):
        self.config = config or CompressionConfig()
        self.compressor = ContextCompressor()
        self.semantic = SemanticCompressor()

    async def retrieve_and_compress(self, query: str, documents: List[str]) -> str:
        """Retrieve documents and apply multi-stage compression."""
        # Stage 1: Semantic deduplication
        deduped = self.semantic.compress(documents)

        # Stage 2: Summarization (SemanticCompressor joins kept chunks with '\n')
        if self.config.enable_parallel:
            compressed = await self.compressor.compress_rag_context(
                deduped.split('\n')
            )
        else:
            compressed = deduped

        # Stage 3: Validate compression quality (word counts as a rough token proxy)
        original_size = sum(len(d.split()) for d in documents)
        compressed_size = len(compressed.split())
        compression_ratio = compressed_size / original_size

        if compression_ratio > self.config.summary_ratio * 1.5:
            # Compression fell short of the target; fall back if configured
            if self.config.fallback_to_original:
                return "\n\n".join(documents[:3])  # Use top 3 original documents
        return compressed


# Production usage
async def main():
    config = CompressionConfig(summary_ratio=0.3)
    pipeline = CompressedRAGPipeline(config)

    # Simulate RAG retrieval
    documents = [
        "The API endpoint /v1/users accepts POST requests with JSON payload...",
        "User authentication requires Bearer token in Authorization header...",
        "The /v1/users endpoint returns 201 on success with user ID...",
        "Authentication errors return 401 with error message...",
        "Rate limiting is 1000 requests per hour per API key..."
    ]
    query = "How do I create a user?"
    compressed_context = await pipeline.retrieve_and_compress(query, documents)
    # Send compressed_context to the LLM

asyncio.run(main())

The most frequent failure mode is aggressive compression that strips critical context. When compressing technical documentation or legal contracts, removing specific terms or conditions can cause the model to generate incorrect or even harmful outputs. Always validate compressed context against a holdout set of queries (a sketch follows the red-flag list below).

Red flags:

  • Compression ratios exceeding 60% without quality testing
  • Removing domain-specific terminology (e.g., “force majeure” in contracts)
  • Eliminating numerical values or constraints
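
One minimal version of that holdout check replays a few known queries against the compressed context and verifies that the facts you care about still appear; the `answer_with_context` helper and the keyword-based pass rule here are illustrative assumptions, not a fixed API:

# Sketch: holdout validation of compressed context (helper names are hypothetical)
async def validate_compression(holdout: list[dict], compressed_context: str) -> float:
    """Each holdout item looks like {"query": ..., "expected_keywords": [...]}.
    Returns the fraction of queries whose answer still contains its expected facts."""
    passed = 0
    for case in holdout:
        # answer_with_context is your own LLM call that uses the compressed context
        answer = await answer_with_context(case["query"], compressed_context)
        if all(kw.lower() in answer.lower() for kw in case["expected_keywords"]):
            passed += 1
    return passed / len(holdout)

# e.g. require 90% of holdout queries to keep their key facts before shipping
# a new compression setting; otherwise fall back to the original context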

Using a heavy model (like GPT-4o or Claude Sonnet) as the compressor defeats the purpose: the compression step ends up costing more than it saves.

Correct approach:

  • Use Haiku-3.5 ($1.25/1M tokens) to compress for Sonnet-3.5 ($3.00/1M tokens)
  • Use gpt-4o-mini ($0.15/1M tokens) to compress for gpt-4o ($5.00/1M tokens)

Cost comparison (see the sketch after this list):

  • Wrong: Compress 10,000 tokens with Sonnet ($0.03) → save $0.02 on GPT-4o = net loss
  • Right: Compress 10,000 tokens with Haiku ($0.0125) → save $0.02 on GPT-4o = net gain
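
The break-even arithmetic is worth wiring into code so the choice of compressor is checked rather than assumed. This sketch counts only input-token costs (it ignores the compressor's output tokens) and uses the prices quoted above:

def compression_net_savings(tokens_in: int, ratio_kept: float,
                            primary_rate: float, compressor_rate: float) -> float:
    """Net dollar savings per request. Rates are $ per 1M input tokens."""
    saved = tokens_in * (1 - ratio_kept) * primary_rate / 1_000_000   # tokens no longer sent to the primary model
    overhead = tokens_in * compressor_rate / 1_000_000                # the compressor still reads the full input
    return saved - overhead

# Matching the bullets above (10,000 tokens in, ~$0.02 saved on the primary model, i.e. ratio_kept = 0.6):
#   compression_net_savings(10_000, 0.6, 5.00, 3.00)  -> -0.010   (Sonnet-priced compressor: net loss)
#   compression_net_savings(10_000, 0.6, 5.00, 1.25)  -> +0.0075  (Haiku-priced compressor: net gain)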

Summarizing conversation history can cause the model to “forget” important user preferences or constraints mentioned earlier. The summary might preserve facts but lose nuance.

Mitigation:

  • Keep the last 2-3 turns verbatim
  • Summarize only older turns (4+ turns back)
  • Extract and preserve explicit constraints: “User prefers JSON output”, “Maximum 100 words” (see the sketch below)
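
One way to implement that last point is a lightweight constraint extractor that runs before summarization and pins matching turns to the top of the compressed history; the trigger patterns below are illustrative, not exhaustive:

import re

# Hypothetical constraint extractor: keeps turns that look like explicit user preferences
CONSTRAINT_PATTERNS = [
    r"\bprefer(s|red)?\b", r"\bmust\b", r"\balways\b", r"\bnever\b",
    r"\bmax(imum)?\s+\d+", r"\bformat\b", r"\bjson\b",
]

def extract_constraints(turns: list[str]) -> list[str]:
    """Return turns that state explicit preferences or constraints."""
    found = []
    for turn in turns:
        if any(re.search(p, turn, re.IGNORECASE) for p in CONSTRAINT_PATTERNS):
            found.append(turn)
    return found

# Prepend extracted constraints to the summarized history so they survive compression:
# history = "\n".join(extract_constraints(old_turns)) + "\n" + summary_of(old_turns)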

Compression itself has overhead. The summarization prompt consumes input tokens on the compression model for every call, and the local embedding model (all-MiniLM-L6-v2) adds encoding compute and latency per chunk.

Real-world example:

  • Original: 10,000 tokens
  • Summarization prompt: 150 tokens
  • Summary output: 3,000 tokens
  • Total: 3,150 tokens (not 3,000)
  • Actual savings: 68.5%, not 70%

When compression fails or produces poor results, many systems lack graceful degradation. They either send the full context (wasting money) or send the poor summary (wasting the request).

Best practice:

# Always implement quality checks
async def safe_compress(self, text: str, min_quality: float = 0.7) -> str:
    compressed = await self.compress(text)
    quality_score = await self.assess_quality(text, compressed)  # your quality heuristic
    if quality_score < min_quality:
        # Return original or partial original
        return self.truncate_intelligently(text, ratio=0.5)
    return compressed

| Scenario | Recommended Technique | Expected Savings | Complexity |
| --- | --- | --- | --- |
| RAG with long documents | Summarization (Haiku) | 60-70% | Low |
| Conversation history | Semantic clustering | 40-50% | Medium |
| Multi-document retrieval | Summarization + deduplication | 50-65% | Medium |
| Real-time chat | Attention pruning | 20-30% | High |
| Code repositories | Structure-aware summarization | 45-60% | High |

Model Pricing Cheat Sheet (Verified 2024-11-15)

| Provider | Model | Input Cost/1M | Output Cost/1M | Best For |
| --- | --- | --- | --- | --- |
| OpenAI | gpt-4o | $5.00 | $15.00 | Primary generation |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 | Compression source |
| Anthropic | claude-3-5-sonnet | $3.00 | $15.00 | Primary generation |
| Anthropic | haiku-3.5 | $1.25 | $5.00 | Compression source |

Target compression ratios by content type:

  • Technical docs: 30-40% (preserve specifics)
  • Legal contracts: 20-30% (preserve all terms)
  • Conversation history: 50-60% (summarize old turns)
  • News articles: 60-70% (key facts only)
  • Research papers: 40-50% (methods + results)

# Fast compression for RAG
async def compress_rag_batch(docs: List[str], model: str = "gpt-4o-mini") -> str:
    """Compress multiple documents in parallel."""
    # Assumes your ContextCompressor is wired to the provider that serves `model`
    compressor = ContextCompressor(model=model)
    summaries = await asyncio.gather(*[
        compressor.compress_document(doc, max_tokens=500)
        for doc in docs
    ])
    return "\n\n".join(summaries)


# Conversation history compression
def compress_conversation(turns: List[dict], keep_last_n: int = 2) -> str:
    """Keep last N turns verbatim, summarize older ones."""
    if len(turns) <= keep_last_n:
        return str(turns)
    recent = turns[-keep_last_n:]
    older = turns[:-keep_last_n]
    older_summary = summarize_turns(older)  # Your summarization logic
    return older_summary + "\n" + str(recent)


Token compression is not a luxury—it’s a necessity for production LLM systems. The math is clear: 30% compression on 50,000 daily requests saves $30,600/month when using GPT-4o.

  1. Use cheaper models for compression: gpt-4o-mini or haiku-3.5