

Speculative Decoding: Achieve 2-4x Faster LLM Inference with Draft Models


A single LLM inference request can burn thousands of tokens and hundreds of milliseconds. For production systems processing millions of requests, this latency compounds into user abandonment and infrastructure costs that scale linearly with throughput. Speculative decoding breaks this tradeoff by generating multiple tokens per forward pass, delivering 1.4x to 4x speedups without sacrificing output quality.

Meta achieved 4 ms per token latency on Llama 4 Maverick using EAGLE speculative decoding—a 10% improvement over previous methods and a 1.4x-2.0x speedup at production scale (ai.meta.com). This guide shows how to implement these techniques in your inference stack.

Traditional autoregressive decoding generates one token per forward pass, creating a fundamental bottleneck. GPU utilization remains low for small batch sizes because the model spends most cycles waiting for the next token prediction. This problem worsens for real-time applications requiring low latency.

Speculative decoding addresses this by:

  • Generating K tokens ahead using a lightweight draft model
  • Verifying all K tokens in parallel with a single forward pass of the target model
  • Accepting matching tokens and regenerating mismatches

The result: 2-4x throughput improvements for latency-sensitive workloads, with zero quality degradation. For a system processing 100K requests/day at $0.01/request, a 2x speedup can reduce infrastructure costs by $15K/month (assuming 50% GPU utilization improvement).
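
As a toy illustration of that accept/reject rule (pure Python on hypothetical token lists, no real models involved): the target keeps the longest prefix of the draft that matches its own greedy predictions, then supplies its own token at the first disagreement.

def accept_longest_prefix(draft_tokens, target_tokens):
    """Keep draft tokens until the first disagreement with the target,
    then take the target's own token at that position as the correction."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if i < len(target_tokens) and target_tokens[i] == tok:
            accepted.append(tok)
        else:
            break
    # Target's prediction at the first mismatch (or its bonus K+1-th token)
    correction = target_tokens[len(accepted)] if len(accepted) < len(target_tokens) else None
    return accepted, correction

# Hypothetical example: 3 of 5 draft tokens survive verification
draft = ["The", " quick", " brown", " cat", " jumps"]
target = ["The", " quick", " brown", " fox", " jumps"]
print(accept_longest_prefix(draft, target))  # (['The', ' quick', ' brown'], ' fox')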

Consider deploying GPT-4o-level inference at scale:

| Metric | Standard Decoding | Speculative Decoding | Improvement |
| --- | --- | --- | --- |
| Tokens/sec/GPU | 150 | 450 | 3x |
| Latency (per 500 tokens) | 3.3 s | 1.1 s | 3x |
| Daily cost (1M requests) | $7,500 | $2,500 | 67% savings |

Costs based on GPT-4o pricing ($5.00/$15.00 per 1M tokens) and assuming 500 tokens average output.
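
A quick back-of-the-envelope check of the table's cost column (a sketch that counts output tokens only at the quoted $15.00 per 1M output tokens, and applies the speedup directly to cost the way the table does; input-token cost is ignored):

# Reproduce the "Daily cost (1M requests)" column from output-token pricing alone
requests_per_day = 1_000_000
avg_output_tokens = 500
output_price_per_1m = 15.00  # GPT-4o output pricing quoted above

daily_output_tokens = requests_per_day * avg_output_tokens        # 500M tokens/day
standard_cost = daily_output_tokens / 1_000_000 * output_price_per_1m
print(f"Standard decoding:    ${standard_cost:,.0f}/day")         # $7,500/day

# With the table's ~3x effective throughput gain, the same traffic
# needs roughly one third of the serving capacity
speculative_cost = standard_cost / 3
print(f"Speculative decoding: ${speculative_cost:,.0f}/day")      # $2,500/day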

Speculative decoding operates on a simple principle: verification is cheaper than generation. A small draft model proposes K candidate tokens in K cheap sequential forward passes; the target model then verifies all K candidates in a single expensive forward pass instead of K of them.
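
That intuition can be made quantitative with a standard cost model (a sketch under simplifying assumptions: each draft token is accepted independently with probability alpha, a draft forward pass costs a fraction c of a target pass, and every verification yields one extra token from the target even when all drafts are rejected):

def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup over standard decoding.

    alpha: probability each draft token is accepted (assumed independent)
    k:     number of draft tokens per cycle
    c:     draft forward-pass cost as a fraction of a target forward pass
    """
    # Expected tokens emitted per verification cycle, including the target's
    # bonus/correction token: (1 - alpha^(k+1)) / (1 - alpha)
    tokens_per_cycle = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_cycle = k * c + 1          # k cheap draft passes + 1 target pass
    return tokens_per_cycle / cost_per_cycle

# Illustrative values only: 70% acceptance, K=5, draft ~10% of target cost
print(f"{expected_speedup(alpha=0.7, k=5, c=0.1):.2f}x")   # ~1.96x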

Speculative decoding transforms inference economics. For production systems, latency is cost. Every millisecond saved translates to infrastructure savings and better user experience.

Consider a production system processing 1M requests/day with 500 output tokens per request:

Standard Decoding (GPT-4o-level model):

  • Throughput: 150 tokens/sec/GPU
  • Daily GPU hours needed: 46.3 hours
  • Daily cost: $7,500

With Speculative Decoding (2.5x speedup):

  • Throughput: 375 tokens/sec/GPU
  • Daily GPU hours needed: 18.5 hours
  • Daily cost: ~$3,000
  • Savings: ~$4,500/day (~$135K/month)

| Task Type | Speedup | Draft Tokens (K) | Why It Works |
| --- | --- | --- | --- |
| Code Completion | 3.0-4.0x | 5-8 | High n-gram overlap, predictable patterns |
| Chat/Dialogue | 2.0-3.0x | 3-5 | Moderate predictability, context-dependent |
| Summarization | 1.5-2.0x | 2-4 | Lower token-to-token correlation |
| Creative Writing | 1.2-1.8x | 2-3 | High randomness, lower acceptance |

Three main strategies exist, each with tradeoffs:

1. Draft-Target Models (Simplest)

  • Pros: No model modification, works with any pre-trained model
  • Cons: Memory overhead, requires model alignment
  • Best for: Quick deployment, heterogeneous model stacks

2. Medusa (Single-Model)

  • Pros: Single model, no alignment needed
  • Cons: Requires fine-tuning additional heads
  • Best for: Custom models where you control training

3. EAGLE (Hybrid)

  • Pros: Best speedups (1.4-2.0x at scale), production-optimized
  • Cons: Complex implementation, requires specific optimizations
  • Best for: Production systems with dedicated ML teams

Before deploying speculative decoding, verify these prerequisites:

  • GPU Memory: Target + draft model must fit simultaneously
  • Batch Size: Confirm your typical batch size is less than 32
  • Draft Model: Select 5-10x smaller than target model
  • Token Count: Start with K=3-5 draft tokens, tune based on acceptance rate
  • Verification: Implement tree attention for Medusa/EAGLE
  • Fallback: Handle cases where no tokens are accepted
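
A minimal preflight sketch for the first two checklist items (assumes PyTorch with CUDA available; the model sizes are estimates you supply, e.g. fp16 weights plus expected KV cache):

import torch

def preflight_check(target_gb: float, draft_gb: float, typical_batch: int,
                    headroom: float = 0.20) -> bool:
    """Rough go/no-go check for memory and batch-size prerequisites."""
    free_bytes, _ = torch.cuda.mem_get_info()            # free GPU memory in bytes
    free_gb = free_bytes / 1024 ** 3
    needed_gb = (target_gb + draft_gb) * (1 + headroom)  # both models + ~20% reserve

    if typical_batch >= 32:
        print("Batch size >= 32: speculation gains are likely marginal")
        return False
    if free_gb < needed_gb:
        print(f"Insufficient memory: need ~{needed_gb:.1f} GB, have {free_gb:.1f} GB")
        return False
    return True

# Hypothetical 6.7B fp16 target (~13 GB) with a 1.3B fp16 draft (~2.6 GB)
# print(preflight_check(target_gb=13.4, draft_gb=2.6, typical_batch=8))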

This implementation uses TensorRT-LLM for production deployment:

import tensorrt_llm
from tensorrt_llm import SamplingParams
from tensorrt_llm.executor import CppExecutor


class SpeculativeDecodingEngine:
    """
    Production-ready speculative decoding with a draft-target model architecture.
    Optimized for TensorRT-LLM inference.
    """

    def __init__(self, draft_model_dir: str, target_model_dir: str):
        """
        Initialize draft and target model executors.

        Args:
            draft_model_dir: Path to compiled draft model (e.g., GPT-125M)
            target_model_dir: Path to compiled target model (e.g., GPT-6.7B)
        """
        # Draft model: smaller, faster (roughly 10x parameter reduction)
        self.draft_executor = CppExecutor(draft_model_dir)
        # Target model: larger, accurate
        self.target_executor = CppExecutor(target_model_dir)
        # Track metrics for optimization
        self.metrics = {
            'total_draft_tokens': 0,
            'total_accepted': 0,
            'acceptance_rate': 0.0
        }

    def generate(self, prompt: str, max_draft_tokens: int = 5,
                 max_iterations: int = 10) -> str:
        """
        Generate text using speculative decoding.

        Args:
            prompt: Input prompt
            max_draft_tokens: Tokens to speculate per iteration
            max_iterations: Safety limit to prevent infinite loops

        Returns:
            Generated text
        """
        current_prompt = prompt
        generated_tokens = []

        for iteration in range(max_iterations):
            # Step 1: Draft model generates K tokens
            draft_params = SamplingParams(
                max_tokens=max_draft_tokens,
                temperature=0.0,  # Greedy for draft
                top_p=0.9
            )
            draft_result = self.draft_executor.generate(
                current_prompt,
                draft_params
            )
            if not draft_result.tokens:
                break

            draft_tokens = draft_result.tokens
            self.metrics['total_draft_tokens'] += len(draft_tokens)

            # Step 2: Target model verifies all drafts in a single forward pass
            extended_prompt = current_prompt + "".join(draft_tokens)
            target_params = SamplingParams(
                max_tokens=len(draft_tokens) + 1,
                temperature=0.0,
                return_log_probs=True
            )
            target_result = self.target_executor.generate(
                extended_prompt,
                target_params
            )

            # Step 3: Accept the longest matching prefix of draft tokens
            accepted_tokens = []
            for i, draft_token in enumerate(draft_tokens):
                if i < len(target_result.tokens) and \
                        target_result.tokens[i] == draft_token:
                    accepted_tokens.append(draft_token)
                else:
                    break  # Stop at first mismatch

            # Step 4: Update prompt and output
            if accepted_tokens:
                generated_tokens.extend(accepted_tokens)
                # Append only the accepted tokens, never the rejected tail
                current_prompt += "".join(accepted_tokens)
                self.metrics['total_accepted'] += len(accepted_tokens)

                # Update acceptance rate
                total_attempted = self.metrics['total_draft_tokens']
                if total_attempted > 0:
                    self.metrics['acceptance_rate'] = \
                        self.metrics['total_accepted'] / total_attempted

                # Naive completion check: stop on terminal punctuation
                if any(token in ['.', '!', '?'] for token in accepted_tokens[-1:]):
                    break
            else:
                # Fallback: take one corrected token from the target model
                if target_result.tokens:
                    single_token = target_result.tokens[0]
                    generated_tokens.append(single_token)
                    current_prompt += single_token

            # Safety: prevent runaway generation
            if len(generated_tokens) > 2000:
                break

        return "".join(generated_tokens)

    def get_performance_metrics(self) -> dict:
        """Return current performance metrics."""
        return self.metrics


# Example deployment
if __name__ == "__main__":
    # Initialize with compiled model paths
    # engine = SpeculativeDecodingEngine(
    #     draft_model_dir="/models/gpt-125m-trt",
    #     target_model_dir="/models/gpt-6.7b-trt"
    # )
    # result = engine.generate(
    #     prompt="The future of AI inference is",
    #     max_draft_tokens=4
    # )
    # print(f"Generated: {result}")
    # print(f"Metrics: {engine.get_performance_metrics()}")
    print("Deployment requires compiled TensorRT-LLM models")
    print("See: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/speculative_decoding")

Draft Model Selection:

  • Use GPT-125M as draft for GPT-6.7B target (53x smaller)
  • Use GPT-1.3B as draft for GPT-6.7B target (5x smaller)
  • Rule of thumb: Draft model should be 5-10x smaller than target

Token Count Tuning:

# Optimal K values by task complexity
TASK_CONFIGS = {
    "code_completion": {"k": 6, "speedup": 3.5},
    "dialogue": {"k": 4, "speedup": 2.5},
    "summarization": {"k": 3, "speedup": 1.8},
    "creative": {"k": 2, "speedup": 1.5},
}
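
One way to wire these presets into the engine shown earlier (a sketch that assumes the SpeculativeDecodingEngine class above; unknown task names fall back to a conservative K=3):

def generate_for_task(engine, prompt: str, task: str) -> str:
    """Pick K from the task preset, defaulting to K=3 for unknown tasks."""
    config = TASK_CONFIGS.get(task, {"k": 3})
    return engine.generate(prompt, max_draft_tokens=config["k"])

# Hypothetical usage, given an initialized engine:
# completion = generate_for_task(engine, "def quicksort(arr):", "code_completion")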

Memory Planning:

Standard decoding: 1x model + KV cache
Speculative decoding: 1x target + 1x draft + KV cache
Overhead: ~15-20% additional GPU memory
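
A rough sanity check of that overhead figure from parameter counts alone (a sketch assuming fp16 weights at 2 bytes per parameter; the draft model's KV cache is ignored, so treat the result as a lower bound):

def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate fp16 weight footprint in GB."""
    return params_billion * 1e9 * bytes_per_param / 1024 ** 3

target_gb = weight_memory_gb(6.7)   # ~12.5 GB
draft_gb = weight_memory_gb(1.3)    # ~2.4 GB
overhead = draft_gb / target_gb
print(f"Draft adds ~{overhead:.0%} on top of the target weights")  # ~19%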

Problem: Using a draft model that’s only 2-3x smaller than target. Impact: Speedup drops to 1.1-1.3x; memory overhead increases. Solution: Target 5-10x smaller draft models for optimal balance.

Problem: Setting K too high or too low for your workload. Impact: Low acceptance rate (<50%) wastes compute; high K increases latency. Solution: Start with K=3-5, tune based on measured acceptance rate.
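
A simple feedback loop for that tuning (a sketch that nudges K to keep the measured acceptance rate inside the 60-80% band recommended below):

def tune_k(current_k: int, acceptance_rate: float,
           k_min: int = 2, k_max: int = 8) -> int:
    """Adjust K after each measurement window based on acceptance rate."""
    if acceptance_rate > 0.85:
        return min(current_k + 1, k_max)  # drafts almost always accepted: speculate more
    if acceptance_rate < 0.50:
        return max(current_k - 1, k_min)  # too many rejections: speculate less
    return current_k                      # inside the healthy band: leave K alone

# e.g. print(tune_k(current_k=5, acceptance_rate=0.42))  # -> 4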

Problem: Deploying speculative decoding on fully saturated GPUs. Impact: Minimal speedup (1.1-1.3x) despite added complexity. Solution: Only deploy when batch sizes are less than 32 and GPU utilization is below 70%.

Problem: Failing to account for simultaneous model loading. Impact: OOM errors during production inference. Solution: Reserve 20% additional GPU memory capacity.

| Aspect | Recommendation | Notes |
| --- | --- | --- |
| Draft Model Size | 5-10x smaller than target | 125M draft for 6.7B target; 1.3B draft for 13B target |
| Draft Tokens (K) | 3-5 tokens | Tune based on acceptance rate; code=5-8, creative=2-3 |
| Batch Size | 1-32 | Speedups diminish beyond 32; GPU becomes compute-bound |
| Acceptance Rate Target | 60-80% | Below 50%: overhead exceeds benefit; above 85%: increase K |
| Memory Overhead | +15-20% GPU memory | Plan for simultaneous draft + target loading |
| Temperature | 0.0 for draft, match target for verification | Greedy drafting maximizes acceptance |

TASK_CONFIGS = {
    "code_completion": {"k": 6, "speedup": 3.5, "draft_model": "125M"},
    "dialogue": {"k": 4, "speedup": 2.5, "draft_model": "1.3B"},
    "summarization": {"k": 3, "speedup": 1.8, "draft_model": "1.3B"},
    "creative": {"k": 2, "speedup": 1.5, "draft_model": "125M"},
}

  • Acceptance Rate: 60-80% (tune K to stay in this range)
  • Speedup: 1.5-4x depending on workload
  • Latency Reduction: 30-70% for batch size 1-8
  • Memory Increase: 15-20% over standard decoding


  1. Speculative decoding delivers 1.4x-4x speedups by generating K tokens ahead with a draft model, then verifying them in parallel with the target model. Meta achieved 4 ms/token on Llama 4 Maverick using EAGLE.

  2. Draft model selection is critical: Use models 5-10x smaller than your target. A 125M parameter draft for a 6.7B target model balances speedup and memory overhead.

  3. Workload matters: Code completion sees 3-4x speedups due to high predictability, while creative writing sees 1.2-1.8x. Batch sizes of 1-32 provide optimal GPU utilization.

  4. Memory overhead is 15-20% when loading both models simultaneously. Plan GPU capacity accordingly.

  5. Three implementation paths:

    • Draft-Target: Simplest, works with any models, but higher memory overhead
    • Medusa: Single model with auxiliary heads, requires fine-tuning
    • EAGLE: Best production performance (1.4-2.0x at scale), but complex implementation

✅ Deploy when:

  • Batch sizes are consistently less than 32
  • GPU utilization is less than 70% without speculation
  • Output quality must be preserved exactly
  • You can allocate 15-20% additional GPU memory
  • Workload has predictable patterns (code, structured data)

❌ Avoid when:

  • Batch sizes are greater than 128 (GPU already saturated)
  • Memory is severely constrained
  • Using beam search (incompatible with Medusa/EAGLE)
  • Workload is highly random with low token-to-token correlation
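
The two checklists collapse into a small gating function (a sketch with hypothetical inputs; the thresholds are the ones quoted in this guide):

def should_use_speculative_decoding(batch_size: int, gpu_utilization: float,
                                    free_memory_fraction: float,
                                    uses_beam_search: bool) -> bool:
    """Encode the deploy/avoid checklist as a single gate."""
    if uses_beam_search:                 # incompatible with Medusa/EAGLE
        return False
    if batch_size >= 32:                 # GPU trending toward compute-bound
        return False
    if gpu_utilization >= 0.70:          # little idle capacity to exploit
        return False
    if free_memory_fraction < 0.20:      # need ~15-20% headroom for the draft model
        return False
    return True

# print(should_use_speculative_decoding(8, 0.55, 0.30, False))  # True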