

Speculative Decoding: Achieve 2-4x Faster LLM Inference with Draft Models


A single LLM inference request can burn thousands of tokens and hundreds of milliseconds. For production systems processing millions of requests, this latency compounds into user abandonment and infrastructure costs that scale linearly with throughput. Speculative decoding breaks this tradeoff by generating multiple tokens per forward pass, delivering 1.4x to 4x speedups without sacrificing output quality.

Meta achieved 4 ms per token latency on Llama 4 Maverick using EAGLE speculative decoding—a 10% improvement over previous methods and a 1.4x-2.0x speedup at production scale (ai.meta.com). This guide shows how to implement these techniques in your inference stack.

Traditional autoregressive decoding generates one token per forward pass, creating a fundamental bottleneck. GPU utilization remains low for small batch sizes because the model spends most cycles waiting for the next token prediction. This problem worsens for real-time applications requiring low latency.

Speculative decoding addresses this by:

  • Generating K tokens ahead using a lightweight draft model
  • Verifying all K tokens in parallel with a single forward pass of the target model
  • Accepting matching tokens and regenerating mismatches

The result: 2-4x throughput improvements for latency-sensitive workloads, with zero quality degradation. For a system processing 100K requests/day at $0.01/request, a 2x speedup can reduce infrastructure costs by $15K/month (assuming 50% GPU utilization improvement).
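
As a toy illustration of that accept/reject rule (pure Python on hypothetical token lists, no real models involved): the target keeps the longest prefix of the draft that matches its own greedy predictions, then supplies its own token at the first disagreement.

def accept_longest_prefix(draft_tokens, target_tokens):
    """Keep draft tokens until the first disagreement with the target,
    then take the target's own token at that position as the correction."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if i < len(target_tokens) and target_tokens[i] == tok:
            accepted.append(tok)
        else:
            break
    # Target's prediction at the first mismatch (or its bonus K+1-th token)
    correction = target_tokens[len(accepted)] if len(accepted) < len(target_tokens) else None
    return accepted, correction

# Hypothetical example: 3 of 5 draft tokens survive verification
draft = ["The", " quick", " brown", " cat", " jumps"]
target = ["The", " quick", " brown", " fox", " jumps"]
print(accept_longest_prefix(draft, target))  # (['The', ' quick', ' brown'], ' fox')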

Consider deploying GPT-4o-level inference at scale:

| Metric | Standard Decoding | Speculative Decoding | Improvement |
| --- | --- | --- | --- |
| Tokens/sec/GPU | 150 | 450 | 3x |
| Latency (per 500 tokens) | 3.3 s | 1.1 s | 3x |
| Daily cost (1M requests) | $7,500 | $2,500 | 67% savings |

Costs based on GPT-4o pricing ($5.00/$15.00 per 1M tokens) and assuming 500 tokens average output.
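
A quick back-of-the-envelope check of the table's cost column (a sketch that counts output tokens only at the quoted $15.00 per 1M output tokens, and applies the speedup directly to cost the way the table does; input-token cost is ignored):

# Reproduce the "Daily cost (1M requests)" column from output-token pricing alone
requests_per_day = 1_000_000
avg_output_tokens = 500
output_price_per_1m = 15.00  # GPT-4o output pricing quoted above

daily_output_tokens = requests_per_day * avg_output_tokens        # 500M tokens/day
standard_cost = daily_output_tokens / 1_000_000 * output_price_per_1m
print(f"Standard decoding:    ${standard_cost:,.0f}/day")         # $7,500/day

# With the table's ~3x effective throughput gain, the same traffic
# needs roughly one third of the serving capacity
speculative_cost = standard_cost / 3
print(f"Speculative decoding: ${speculative_cost:,.0f}/day")      # $2,500/day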

Speculative decoding operates on a simple principle: verification is cheaper than generation. A small draft model proposes K candidate tokens in K cheap sequential forward passes; the target model then verifies all K candidates in a single expensive forward pass instead of K of them.
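
That intuition can be made quantitative with a standard cost model (a sketch under simplifying assumptions: each draft token is accepted independently with probability alpha, a draft forward pass costs a fraction c of a target pass, and every verification yields one extra token from the target even when all drafts are rejected):

def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup over standard decoding.

    alpha: probability each draft token is accepted (assumed independent)
    k:     number of draft tokens per cycle
    c:     draft forward-pass cost as a fraction of a target forward pass
    """
    # Expected tokens emitted per verification cycle, including the target's
    # bonus/correction token: (1 - alpha^(k+1)) / (1 - alpha)
    tokens_per_cycle = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_cycle = k * c + 1          # k cheap draft passes + 1 target pass
    return tokens_per_cycle / cost_per_cycle

# Illustrative values only: 70% acceptance, K=5, draft ~10% of target cost
print(f"{expected_speedup(alpha=0.7, k=5, c=0.1):.2f}x")   # ~1.96x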

Speculative decoding transforms inference economics. For production systems, latency is cost. Every millisecond saved translates to infrastructure savings and better user experience.

Consider a production system processing 1M requests/day with 500 output tokens per request:

Standard Decoding (GPT-4o-level model):

  • Throughput: 150 tokens/sec/GPU
  • Daily GPU hours needed: 46.3 hours
  • Daily cost: $7,500

With Speculative Decoding (2.5x speedup):

  • Throughput: 375 tokens/sec/GPU
  • Daily GPU hours needed: 18.5 hours
  • Daily cost: ~$3,000
  • Savings: ~$4,500/day (~$135K/month)

| Task Type | Speedup | Draft Tokens (K) | Why It Works |
| --- | --- | --- | --- |
| Code Completion | 3.0-4.0x | 5-8 | High n-gram overlap, predictable patterns |
| Chat/Dialogue | 2.0-3.0x | 3-5 | Moderate predictability, context-dependent |
| Summarization | 1.5-2.0x | 2-4 | Lower token-to-token correlation |
| Creative Writing | 1.2-1.8x | 2-3 | High randomness, lower acceptance |

Three main strategies exist, each with tradeoffs:

1. Draft-Target Models (Simplest)

  • Pros: No model modification, works with any pre-trained model
  • Cons: Memory overhead, requires model alignment
  • Best for: Quick deployment, heterogeneous model stacks

2. Medusa (Single-Model)

  • Pros: Single model, no alignment needed
  • Cons: Requires fine-tuning additional heads
  • Best for: Custom models where you control training

3. EAGLE (Hybrid)

  • Pros: Best speedups (1.4-2.0x at scale), production-optimized
  • Cons: Complex implementation, requires specific optimizations
  • Best for: Production systems with dedicated ML teams

Before deploying speculative decoding, verify these prerequisites:

  • GPU Memory: Target + draft model must fit simultaneously
  • Batch Size: Confirm your typical batch size is less than 32
  • Draft Model: Select 5-10x smaller than target model
  • Token Count: Start with K=3-5 draft tokens, tune based on acceptance rate
  • Verification: Implement tree attention for Medusa/EAGLE
  • Fallback: Handle cases where no tokens are accepted
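
A minimal preflight sketch for the first two checklist items (assumes PyTorch with CUDA available; the model sizes are estimates you supply, e.g. fp16 weights plus expected KV cache):

import torch

def preflight_check(target_gb: float, draft_gb: float, typical_batch: int,
                    headroom: float = 0.20) -> bool:
    """Rough go/no-go check for memory and batch-size prerequisites."""
    free_bytes, _ = torch.cuda.mem_get_info()            # free GPU memory in bytes
    free_gb = free_bytes / 1024 ** 3
    needed_gb = (target_gb + draft_gb) * (1 + headroom)  # both models + ~20% reserve

    if typical_batch >= 32:
        print("Batch size >= 32: speculation gains are likely marginal")
        return False
    if free_gb < needed_gb:
        print(f"Insufficient memory: need ~{needed_gb:.1f} GB, have {free_gb:.1f} GB")
        return False
    return True

# Hypothetical 6.7B fp16 target (~13 GB) with a 1.3B fp16 draft (~2.6 GB)
# print(preflight_check(target_gb=13.4, draft_gb=2.6, typical_batch=8))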

This implementation uses TensorRT-LLM for production deployment:

import tensorrt_llm
from tensorrt_llm import SamplingParams
from tensorrt_llm.executor import CppExecutor


class SpeculativeDecodingEngine:
    """
    Production-ready speculative decoding with a draft-target model architecture.
    Optimized for TensorRT-LLM inference.
    """

    def __init__(self, draft_model_dir: str, target_model_dir: str):
        """
        Initialize draft and target model executors.

        Args:
            draft_model_dir: Path to compiled draft model (e.g., GPT-125M)
            target_model_dir: Path to compiled target model (e.g., GPT-6.7B)
        """
        # Draft model: smaller, faster (roughly 10x parameter reduction)
        self.draft_executor = CppExecutor(draft_model_dir)
        # Target model: larger, accurate
        self.target_executor = CppExecutor(target_model_dir)
        # Track metrics for optimization
        self.metrics = {
            'total_draft_tokens': 0,
            'total_accepted': 0,
            'acceptance_rate': 0.0
        }

    def generate(self, prompt: str, max_draft_tokens: int = 5,
                 max_iterations: int = 10) -> str:
        """
        Generate text using speculative decoding.

        Args:
            prompt: Input prompt
            max_draft_tokens: Tokens to speculate per iteration
            max_iterations: Safety limit to prevent infinite loops

        Returns:
            Generated text
        """
        current_prompt = prompt
        generated_tokens = []

        for iteration in range(max_iterations):
            # Step 1: Draft model generates K tokens
            draft_params = SamplingParams(
                max_tokens=max_draft_tokens,
                temperature=0.0,  # Greedy for draft
                top_p=0.9
            )
            draft_result = self.draft_executor.generate(
                current_prompt,
                draft_params
            )
            if not draft_result.tokens:
                break

            draft_tokens = draft_result.tokens
            self.metrics['total_draft_tokens'] += len(draft_tokens)

            # Step 2: Target model verifies all drafts in a single forward pass
            extended_prompt = current_prompt + "".join(draft_tokens)
            target_params = SamplingParams(
                max_tokens=len(draft_tokens) + 1,
                temperature=0.0,
                return_log_probs=True
            )
            target_result = self.target_executor.generate(
                extended_prompt,
                target_params
            )

            # Step 3: Accept the longest matching prefix of draft tokens
            accepted_tokens = []
            for i, draft_token in enumerate(draft_tokens):
                if i < len(target_result.tokens) and \
                        target_result.tokens[i] == draft_token:
                    accepted_tokens.append(draft_token)
                else:
                    break  # Stop at first mismatch

            # Step 4: Update prompt and output
            if accepted_tokens:
                generated_tokens.extend(accepted_tokens)
                # Append only the accepted tokens, never the rejected tail
                current_prompt += "".join(accepted_tokens)
                self.metrics['total_accepted'] += len(accepted_tokens)

                # Update acceptance rate
                total_attempted = self.metrics['total_draft_tokens']
                if total_attempted > 0:
                    self.metrics['acceptance_rate'] = \
                        self.metrics['total_accepted'] / total_attempted

                # Naive completion check: stop on terminal punctuation
                if any(token in ['.', '!', '?'] for token in accepted_tokens[-1:]):
                    break
            else:
                # Fallback: take one corrected token from the target model
                if target_result.tokens:
                    single_token = target_result.tokens[0]
                    generated_tokens.append(single_token)
                    current_prompt += single_token

            # Safety: prevent runaway generation
            if len(generated_tokens) > 2000:
                break

        return "".join(generated_tokens)

    def get_performance_metrics(self) -> dict:
        """Return current performance metrics."""
        return self.metrics


# Example deployment
if __name__ == "__main__":
    # Initialize with compiled model paths
    # engine = SpeculativeDecodingEngine(
    #     draft_model_dir="/models/gpt-125m-trt",
    #     target_model_dir="/models/gpt-6.7b-trt"
    # )
    # result = engine.generate(
    #     prompt="The future of AI inference is",
    #     max_draft_tokens=4
    # )
    # print(f"Generated: {result}")
    # print(f"Metrics: {engine.get_performance_metrics()}")
    print("Deployment requires compiled TensorRT-LLM models")
    print("See: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/speculative_decoding")

Draft Model Selection:

  • Use GPT-125M as draft for GPT-6.7B target (53x smaller)
  • Use GPT-1.3B as draft for GPT-6.7B target (5x smaller)
  • Rule of thumb: Draft model should be 5-10x smaller than target

Token Count Tuning:

# Optimal K values by task complexity
TASK_CONFIGS = {
    "code_completion": {"k": 6, "speedup": 3.5},
    "dialogue": {"k": 4, "speedup": 2.5},
    "summarization": {"k": 3, "speedup": 1.8},
    "creative": {"k": 2, "speedup": 1.5},
}
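
One way to wire these presets into the engine shown earlier (a sketch that assumes the SpeculativeDecodingEngine class above; unknown task names fall back to a conservative K=3):

def generate_for_task(engine, prompt: str, task: str) -> str:
    """Pick K from the task preset, defaulting to K=3 for unknown tasks."""
    config = TASK_CONFIGS.get(task, {"k": 3})
    return engine.generate(prompt, max_draft_tokens=config["k"])

# Hypothetical usage, given an initialized engine:
# completion = generate_for_task(engine, "def quicksort(arr):", "code_completion")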

Memory Planning:

Standard decoding: 1x model + KV cache
Speculative decoding: 1x target + 1x draft + KV cache
Overhead: ~15-20% additional GPU memory
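
A rough sanity check of that overhead figure from parameter counts alone (a sketch assuming fp16 weights at 2 bytes per parameter; the draft model's KV cache is ignored, so treat the result as a lower bound):

def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate fp16 weight footprint in GB."""
    return params_billion * 1e9 * bytes_per_param / 1024 ** 3

target_gb = weight_memory_gb(6.7)   # ~12.5 GB
draft_gb = weight_memory_gb(1.3)    # ~2.4 GB
overhead = draft_gb / target_gb
print(f"Draft adds ~{overhead:.0%} on top of the target weights")  # ~19%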

Problem: Using a draft model that’s only 2-3x smaller than target. Impact: Speedup drops to 1.1-1.3x; memory overhead increases. Solution: Target 5-10x smaller draft models for optimal balance.

Problem: Setting K too high or too low for your workload. Impact: Low acceptance rate (<50%) wastes compute; high K increases latency. Solution: Start with K=3-5, tune based on measured acceptance rate.
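
A simple feedback loop for that tuning (a sketch that nudges K to keep the measured acceptance rate inside the 60-80% band recommended below):

def tune_k(current_k: int, acceptance_rate: float,
           k_min: int = 2, k_max: int = 8) -> int:
    """Adjust K after each measurement window based on acceptance rate."""
    if acceptance_rate > 0.85:
        return min(current_k + 1, k_max)  # drafts almost always accepted: speculate more
    if acceptance_rate < 0.50:
        return max(current_k - 1, k_min)  # too many rejections: speculate less
    return current_k                      # inside the healthy band: leave K alone

# e.g. print(tune_k(current_k=5, acceptance_rate=0.42))  # -> 4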

Problem: Deploying speculative decoding on fully saturated GPUs. Impact: Minimal speedup (1.1-1.3x) despite added complexity. Solution: Only deploy when batch sizes are less than 32 and GPU utilization is below 70%.

Problem: Failing to account for simultaneous model loading. Impact: OOM errors during production inference. Solution: Reserve 20% additional GPU memory capacity.

| Aspect | Recommendation | Notes |
| --- | --- | --- |
| Draft Model Size | 5-10x smaller than target | 125M draft for 6.7B target; 1.3B draft for 13B target |
| Draft Tokens (K) | 3-5 tokens | Tune based on acceptance rate; code=5-8, creative=2-3 |
| Batch Size | 1-32 | Speedups diminish beyond 32; GPU becomes compute-bound |
| Acceptance Rate Target | 60-80% | Below 50%: overhead exceeds benefit; above 85%: increase K |
| Memory Overhead | +15-20% GPU memory | Plan for simultaneous draft + target loading |
| Temperature | 0.0 for draft, match target for verification | Greedy drafting maximizes acceptance |

TASK_CONFIGS = {
    "code_completion": {"k": 6, "speedup": 3.5, "draft_model": "125M"},
    "dialogue": {"k": 4, "speedup": 2.5, "draft_model": "1.3B"},
    "summarization": {"k": 3, "speedup": 1.8, "draft_model": "1.3B"},
    "creative": {"k": 2, "speedup": 1.5, "draft_model": "125M"},
}

  • Acceptance Rate: 60-80% (tune K to stay in this range)
  • Speedup: 1.5-4x depending on workload
  • Latency Reduction: 30-70% for batch size 1-8
  • Memory Increase: 15-20% over standard decoding


  1. Speculative decoding delivers 1.4x-4x speedups by generating K tokens ahead with a draft model, then verifying them in parallel with the target model. Meta achieved 4 ms/token on Llama 4 Maverick using EAGLE.

  2. Draft model selection is critical: Use models 5-10x smaller than your target. A 125M parameter draft for a 6.7B target model balances speedup and memory overhead.

  3. Workload matters: Code completion sees 3-4x speedups due to high predictability, while creative writing sees 1.2-1.8x. Batch sizes of 1-32 provide optimal GPU utilization.

  4. Memory overhead is 15-20% when loading both models simultaneously. Plan GPU capacity accordingly.

  5. Three implementation paths:

    • Draft-Target: Simplest, works with any models, but higher memory overhead
    • Medusa: Single model with auxiliary heads, requires fine-tuning
    • EAGLE: Best production performance (1.4-2.0x at scale), but complex implementation

✅ Deploy when:

  • Batch sizes are consistently less than 32
  • GPU utilization is less than 70% without speculation
  • Output quality must be preserved exactly
  • You can allocate 15-20% additional GPU memory
  • Workload has predictable patterns (code, structured data)

❌ Avoid when:

  • Batch sizes are greater than 128 (GPU already saturated)
  • Memory is severely constrained
  • Using beam search (incompatible with Medusa/EAGLE)
  • Workload is highly random with low token-to-token correlation
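
The two checklists collapse into a small gating function (a sketch with hypothetical inputs; the thresholds are the ones quoted in this guide):

def should_use_speculative_decoding(batch_size: int, gpu_utilization: float,
                                    free_memory_fraction: float,
                                    uses_beam_search: bool) -> bool:
    """Encode the deploy/avoid checklist as a single gate."""
    if uses_beam_search:                 # incompatible with Medusa/EAGLE
        return False
    if batch_size >= 32:                 # GPU trending toward compute-bound
        return False
    if gpu_utilization >= 0.70:          # little idle capacity to exploit
        return False
    if free_memory_fraction < 0.20:      # need ~15-20% headroom for the draft model
        return False
    return True

# print(should_use_speculative_decoding(8, 0.55, 0.30, False))  # True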