When your 70B parameter model won’t fit on a single A100, or your latency requirements demand more than one GPU can deliver, distributed inference becomes essential. Tensor parallelism and pipeline parallelism can deliver 2-10x speedups, but they also introduce communication overhead that can erase gains if implemented incorrectly. This guide covers when to parallelize, how to minimize overhead, and production-ready configurations.
The economics of LLM inference are shifting from API costs to infrastructure efficiency. While GPT-4o charges $5/$15 per 1M input/output tokens, self-hosted distributed inference can reduce per-token costs by 70-90% at scale. However, the complexity of multi-GPU orchestration introduces failure modes that don’t exist in single-GPU deployments.
Recent benchmarks show that naive multi-GPU deployments often achieve only 1.2-1.5x speedup instead of the expected 2x, due to communication overhead and memory imbalances. The difference between a poorly parallelized system and an optimized one can be the difference between profitable inference and cost overruns.
Before diving into implementation, understand the fundamental tradeoff:
| Model Size | Single GPU Fit? | Parallelism Strategy | Primary Benefit |
|---|---|---|---|
| Less than 7B parameters | Yes | None | Lowest latency |
| 7B - 13B | Yes (with quantization) | Optional tensor parallelism | Higher throughput |
| 13B - 30B | No | Tensor parallelism (2-4 GPUs) | Memory capacity |
| 30B - 70B | No | Tensor + Pipeline (4-8 GPUs) | Balanced throughput |
| Greater than 70B | No | Full pipeline (8+ GPUs) | Memory + throughput |
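For quick sizing, the thresholds in this table can be wrapped in a small helper. The sketch below is illustrative only: the breakpoints simply mirror the rows above and assume FP16 weights on 80 GB-class GPUs.

```python
def pick_parallelism(params_billions: float) -> str:
    """Map model size to a strategy, mirroring the table above (FP16 assumed)."""
    if params_billions < 7:
        return "single GPU, no parallelism (lowest latency)"
    if params_billions <= 13:
        return "single GPU with quantization, or optional TP for throughput"
    if params_billions <= 30:
        return "tensor parallelism across 2-4 GPUs"
    if params_billions <= 70:
        return "tensor + pipeline parallelism across 4-8 GPUs"
    return "full pipeline across 8+ GPUs"


print(pick_parallelism(70))  # tensor + pipeline parallelism across 4-8 GPUs
```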
Tensor parallelism distributes model weights across multiple GPUs. Each GPU holds a shard of the model’s parameters, and they collaborate on every forward pass.
When you split a 70B-parameter model across 8 GPUs, each GPU holds approximately 8.75B parameters. During inference, the work proceeds in four steps (sketched in code after this list):
Input distribution : The input tensor is replicated across all GPUs
Computation : Each GPU computes partial results for its weight shard
Communication : All-reduce operations synchronize results
Output : Combined result is available on all GPUs
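To make steps 2 and 3 concrete, here is a minimal sketch of a row-parallel linear layer built directly on torch.distributed primitives. This is not vLLM's or Megatron-LM's implementation; the class name, shapes, and torchrun launch are assumptions for illustration.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


class RowParallelLinear(nn.Module):
    """Each rank owns a slice of the weight along the input dimension;
    partial outputs are summed with an all-reduce (step 3 above)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.world_size = dist.get_world_size()
        self.rank = dist.get_rank()
        assert in_features % self.world_size == 0
        self.in_per_rank = in_features // self.world_size
        # This rank's shard of the full (out_features, in_features) weight
        self.weight = nn.Parameter(torch.randn(out_features, self.in_per_rank) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) The input is replicated on every rank; take the feature slice this rank owns
        start = self.rank * self.in_per_rank
        x_shard = x[..., start:start + self.in_per_rank]
        # 2) Each GPU computes a partial result with its weight shard
        partial = x_shard @ self.weight.t()
        # 3) All-reduce sums the partials across ranks
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        # 4) The combined result is now available on every rank
        return partial


if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=2 row_parallel_linear.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    torch.manual_seed(0)  # same seed, so the "replicated" input is identical on every rank
    layer = RowParallelLinear(4096, 4096).cuda()
    x = torch.randn(8, 4096, device="cuda")
    y = layer(x)  # (8, 4096), same summed result on all ranks
    print(f"rank {dist.get_rank()}: output {tuple(y.shape)}")
    dist.destroy_process_group()
```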
Google Cloud’s documentation emphasizes that “tensor parallelism is a technique that distributes computational load across multiple GPUs, which is essential when you run large models that exceed single GPU memory capacity.”
Use it when:
Model parameters exceed single GPU memory (e.g., 70B model needs ~140GB for FP16)
You need to serve larger models without quantization
Your hardware consists of identical GPUs
Latency is more critical than throughput
Avoid it when:
Model fits on single GPU (overhead exceeds benefit)
GPUs are heterogeneous (performance mismatches cause stragglers)
Batch size is small (communication overhead dominates)
Pipeline parallelism places different layers of the model on different GPUs. Unlike tensor parallelism, which splits individual operations, pipeline parallelism splits the model architecture itself.
Pipeline parallelism achieves efficiency through micro-batching. While GPU 0 processes layers 1-8 of one micro-batch, GPU 1 processes layers 9-16 of a different micro-batch, so the GPUs compute simultaneously.
The challenge is the "bubble" problem: GPUs sit idle while waiting on upstream stages. Modern frameworks shrink the bubble with techniques like the following (the bubble fraction is quantified after this list):
Gradient accumulation : Process multiple micro-batches before weight updates
Interleaved scheduling : Assign multiple pipeline stages to each GPU
Memory optimization : Offload activations to CPU/NVMe
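The cost of the bubble is easy to estimate: with p pipeline stages and m micro-batches per batch, a naive schedule leaves GPUs idle for roughly (p - 1) / (m + p - 1) of the time, which is why these techniques all push toward more micro-batches in flight. A one-line calculator:

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle fraction of a naive pipeline schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# 4 stages with 8 micro-batches: ~27% idle; 32 micro-batches shrinks it to ~9%
print(pipeline_bubble_fraction(4, 8), pipeline_bubble_fraction(4, 32))
```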
Use it when:
Model is very deep (many layers)
You want to maximize throughput over latency
Hardware is distributed across nodes (pipeline stages need far less inter-node bandwidth than tensor parallelism)
You need to serve multiple models simultaneously
Avoid it when:
Model is shallow (insufficient layers to parallelize)
Latency is the primary metric (pipeline adds overhead)
Batch sizes are inconsistent (causes pipeline stalls)
Assess your requirements
Calculate model memory: parameters × bytes per parameter (2 for FP16), then budget roughly 2× that for KV cache and runtime overhead
Determine throughput needs: queries per second
Define latency budget: time-to-first-token (TTFT) and tokens-per-second (TPS)
Choose your parallelism strategy
Tensor parallelism for memory capacity
Pipeline parallelism for throughput
Hybrid for very large models (70B+)
Configure your serving engine
vLLM: Set tensor_parallel_size and pipeline_parallel_size
TensorRT-LLM: Define parallel configuration in engine build
Custom PyTorch: Implement torch.distributed primitives
Deploy with proper orchestration
Use Kubernetes for production deployments
Configure health checks and auto-scaling
Monitor communication overhead metrics
```python
import os
import time

from vllm import LLM, SamplingParams

# Configure tensor parallelism for multi-GPU inference
# This distributes model weights across 2 GPUs
os.environ['VLLM_USE_RAY'] = '0'

# Initialize vLLM with tensor parallelism
# tensor_parallel_size determines how many GPUs to split the model across
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,       # Distributes across 2 GPUs
    gpu_memory_utilization=0.95,  # Use 95% of GPU memory
    max_model_len=4096,           # Context window size
    enable_prefix_caching=True,   # Optimizes repeated prompt prefixes
)

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# Generate outputs with distributed inference
prompts = ["Explain quantum computing in simple terms:"]
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

for output in outputs:
    completion = output.outputs[0]
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {completion.text}")
    print(f"Tokens/sec: {len(completion.token_ids) / elapsed:.2f}")
```
```python
import os

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe  # available up to PyTorch 2.3


class TransformerBlock(nn.Module):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Self-attention with residual connection
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward with residual connection
        ff_out = self.feed_forward(x)
        return self.norm2(x + ff_out)


def create_pipeline_model(num_layers, hidden_size, num_heads, num_gpus):
    """Create a model sharded across GPUs using pipeline parallelism."""
    layers_per_gpu = num_layers // num_gpus
    stages = []
    for i in range(num_gpus):
        # Each pipeline stage holds a contiguous slice of layers on its own GPU
        stage = nn.Sequential(
            *[TransformerBlock(hidden_size, num_heads) for _ in range(layers_per_gpu)]
        ).to(f'cuda:{i}')
        stages.append(stage)
    model = nn.Sequential(*stages)
    # Wrap in Pipe for pipeline parallelism
    return Pipe(model, chunks=8)  # Split each batch into 8 micro-batches


if __name__ == "__main__":
    num_gpus = 4
    if torch.cuda.device_count() < num_gpus:
        raise SystemExit(f"This example requires {num_gpus} GPUs, found {torch.cuda.device_count()}")

    # Pipe relies on the RPC framework internally, even on a single node
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("worker", rank=0, world_size=1)

    # Create pipeline model across 4 GPUs (layer count and sizes are illustrative)
    pipeline_model = create_pipeline_model(
        num_layers=32, hidden_size=4096, num_heads=32, num_gpus=num_gpus
    )

    # Input lives on the first stage's device; Pipe moves activations between stages
    input_data = torch.randn(64, 1024, 4096).to('cuda:0')
    output = pipeline_model(input_data).local_value()
    print(f"Processing complete, output shape {tuple(output.shape)}")

    rpc.shutdown()
```
A Kubernetes deployment for the tensor-parallel vLLM server looks roughly like this (abridged to the fields present in the original manifest; the env values mirror the Python example above):

```yaml
# Abridged vLLM Deployment on GKE
containers:
- image: vllm/vllm-openai:latest
  env:
  - name: MODEL_NAME
    value: "meta-llama/Llama-2-7b-hf"
  - name: TENSOR_PARALLEL_SIZE
    value: "2"
  - name: GPU_MEMORY_UTILIZATION
    value: "0.95"
  - name: ENABLE_PREFIX_CACHING
    value: "true"
volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: model-cache-pvc
nodeSelector:
  cloud.google.com/gke-accelerator-count: "2"
---
# Matching Service
metadata:
  name: vllm-llama-7b-service
```
```python
from typing import Tuple

import torch
import torch.nn as nn
import torch.distributed as dist


class LadderResidualBlock(nn.Module):
    """Ladder Residual for communication-computation overlapping"""

    def __init__(self, hidden_size, num_heads, rank):
        super().__init__()
        self.rank = rank
        self.world_size = dist.get_world_size()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.ladder_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Main path: Self-attention
        attn_out, _ = self.attention(x, x, x)
        x_main = self.norm1(x + attn_out)

        # Ladder path: Independent computation
        ladder_out = self.ladder_proj(x)

        # Overlap communication with computation: launch the all-reduce asynchronously
        ladder_comm, work = self._async_all_reduce(ladder_out)

        # Continue main path while communication happens
        ff_out = self.feed_forward(x_main)
        x_main = self.norm2(x_main + ff_out)

        # Wait for the all-reduce to finish before combining the two paths
        work.wait()
        combined = x_main + ladder_comm
        return combined, ladder_out

    def _async_all_reduce(self, tensor: torch.Tensor):
        # Non-blocking all-reduce: returns the tensor plus the async work handle
        work = dist.all_reduce(tensor, async_op=True)
        return tensor, work


# Example usage requires torchrun
```
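A minimal driver for the block above might look like the following. It assumes LadderResidualBlock is defined in the same file, uses illustrative shapes, and must be launched with torchrun so that the process group exists.

```python
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=2 ladder_residual.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Hypothetical sizes; any (batch, seq, hidden) with hidden divisible by num_heads works
block = LadderResidualBlock(hidden_size=1024, num_heads=16, rank=rank).to(f"cuda:{rank}")
x = torch.randn(8, 128, 1024, device=f"cuda:{rank}")
combined, ladder = block(x)
print(f"rank {rank}: combined output {tuple(combined.shape)}")

dist.destroy_process_group()
```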
Avoid these mistakes that can negate the benefits of distributed inference:
Using tensor parallelism for small models
Problem : For models under 7B parameters, communication overhead exceeds computational gains
Impact : 1.2-1.5x speedup instead of expected 2x
Solution : Use single GPU with quantization (AWQ/GPTQ) for models less than 13B
Ignoring communication overhead
Problem : All-reduce operations block computation, especially across nodes
Impact : Efficiency drops to 60-70% beyond 8 GPUs
Solution : Use intra-node TP with inter-node PP; enable communication overlap
Over-allocating GPU memory
Problem : Setting gpu_memory_utilization greater than 0.95 causes OOM during inference
Impact : Service crashes under load
Solution : Set utilization to 0.90-0.95 and monitor with Prometheus
Not enabling prefix caching
Problem : Repeated prompt prefixes are recomputed
Impact : 2-3x lower throughput for chat/conversational workloads
Solution : Set enable_prefix_caching=True in vLLM
Using standard HPA metrics
Problem : CPU/memory scaling doesn’t match LLM workload patterns
Impact : Slow response to traffic spikes, over-provisioning during lulls
Solution : Use custom metrics: queue length, batch size, or tokens/sec
Deploying without health checks
Problem : Model loading failures go undetected
Impact : Cascading failures across replicas
Solution : Implement readiness/liveness probes checking /health endpoint
Failing to overlap communication
Problem : GPUs idle during all-reduce operations
Impact : 20-30% performance loss
Solution : Use Ladder Residual architecture or async communication primitives
Single GPU for high QPS
Problem : Model fits but can’t handle request volume
Impact : High latency, request timeouts
Solution : Use tensor parallelism even if model fits on one GPU
Ignoring pipeline parallelism for deep models
Problem : Tensor parallelism alone doesn’t optimize for layer depth
Impact : Suboptimal throughput for 30B+ models
Solution : Combine TP (2-4 GPUs) with PP (2-4 stages) for balanced scaling
Not monitoring communication efficiency
Problem : Can’t detect when scaling becomes inefficient
Impact : Wasted compute budget
Solution : Track all_reduce time vs. compute time ratio; target less than 15%
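As a starting point for that last item, the all-reduce-to-compute ratio can be measured directly with CUDA events. This is a rough sketch, not a production profiler, and it assumes a process group is already initialized (e.g. under torchrun); in practice you would more likely pull these numbers from torch.profiler traces or your serving engine's metrics.

```python
import torch
import torch.distributed as dist


def _timed_ms(fn) -> float:
    """Time a GPU operation in milliseconds using CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)


def comm_overhead_ratio(activations: torch.Tensor, compute_fn) -> float:
    """Ratio of all-reduce time to compute time for one layer's worth of work.

    Keep this below ~0.15 per the guidance above; a rising ratio usually means
    the tensor-parallel group has outgrown the interconnect.
    """
    compute_ms = _timed_ms(lambda: compute_fn(activations))
    comm_ms = _timed_ms(lambda: dist.all_reduce(activations.clone()))
    return comm_ms / compute_ms
```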
| Scenario | Strategy | GPUs | Expected Speedup | Notes |
|---|---|---|---|---|
| 7B model, low QPS | None | 1 | 1x | Use quantization if needed |
| 7B model, high QPS | TP | 2 | 1.8x | Enable prefix caching |
| 13B model | TP | 2 | 1.7x | Fits on 2x L4 (24 GB each) |
| 30B model | TP | 4 | 3.2x | Requires 4x L4 or 2x A100 |
| 70B model | TP + PP | 8 | 6.5x | 4x TP + 2x PP recommended |
| 100B+ model | Full pipeline | 16+ | 12x+ | Consider TPU v6e |
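For the 70B row, the recommended 4-way tensor parallel × 2-stage pipeline split maps directly onto vLLM's engine arguments. A sketch with illustrative model name and limits:

```python
from vllm import LLM

# 70B model on 8 GPUs: 4-way tensor parallelism inside the node,
# 2 pipeline stages across the tensor-parallel groups
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)
```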
FP16: parameters × 2 bytes
INT8: parameters × 1 byte
INT4: parameters × 0.5 bytes
Llama-2 7B: 7B × 2 = 14GB (FP16)
Llama-2 13B: 13B × 2 = 26GB (FP16)
Llama-2 70B: 70B × 2 = 140GB (FP16)
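These rules of thumb are trivial to script. The helper below counts weight memory only; KV cache and activations come on top, which is why the deployment configs elsewhere in this guide budget roughly 2× the model size.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Weight memory only: parameters × bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


for size in (7, 13, 70):
    print(f"Llama-2 {size}B: {weight_memory_gb(size):.0f} GB (FP16)")
# Prints 14, 26 and 140 GB, matching the figures above
```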
```python
from vllm import LLM

# Minimal production config
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.95,
    enable_prefix_caching=True,
    quantization="awq",            # Optional: reduces memory by ~50% (requires an AWQ checkpoint)
    max_num_seqs=256,              # Max concurrent requests
    max_num_batched_tokens=4096,
)
```

Per replica requirements (Kubernetes resource request fragment):

```yaml
resources:
  requests:
    memory: 40Gi  # 2× model size + overhead
```
Distributed LLM inference is a trade-off between memory capacity, throughput, latency, and cost. Key takeaways:
When to parallelize:
Tensor Parallelism : Use for models greater than 13B parameters or when you need to serve larger models without quantization
Pipeline Parallelism : Use for deep models (30B+) where you want to maximize throughput over latency
Hybrid (TP + PP) : Use for very large models (70B+) to balance memory and compute efficiency
When to avoid parallelism:
Models that fit comfortably on a single GPU (overhead exceeds benefit)
Low QPS scenarios where latency is acceptable
Heterogeneous GPU clusters without proper load balancing
Critical success factors:
Enable communication overlap (Ladder Residual or async primitives)
Monitor communication efficiency (target less than 15% overhead)
Use prefix caching for conversational workloads
Implement proper health checks and auto-scaling
Match parallelism strategy to hardware topology (intra-node TP, inter-node PP)
The goal isn’t maximum parallelism—it’s optimal efficiency for your specific workload and constraints.