When your 70B parameter model won’t fit on a single A100, or your latency requirements demand more than one GPU can deliver, distributed inference becomes essential. Tensor parallelism and pipeline parallelism can deliver 2-10x speedups, but they also introduce communication overhead that can erase gains if implemented incorrectly. This guide covers when to parallelize, how to minimize overhead, and production-ready configurations.
The economics of LLM inference are shifting from API costs to infrastructure efficiency. While GPT-4o charges $5/$15 per 1M input/output tokens, self-hosted distributed inference can reduce per-token costs by 70-90% at scale. However, the complexity of multi-GPU orchestration introduces failure modes that don’t exist in single-GPU deployments.
Recent benchmarks show that naive multi-GPU deployments often achieve only 1.2-1.5x speedup instead of the expected 2x, due to communication overhead and memory imbalances. The difference between a poorly parallelized system and an optimized one can be the difference between profitable inference and cost overruns.
Before diving into implementation, understand the fundamental tradeoff:
| Model Size | Single GPU Fit? | Parallelism Strategy | Primary Benefit |
|---|---|---|---|
| Less than 7B parameters | Yes | None | Lowest latency |
| 7B - 13B | Yes (with quantization) | Optional tensor parallelism | Higher throughput |
| 13B - 30B | No | Tensor parallelism (2-4 GPUs) | Memory capacity |
| 30B - 70B | No | Tensor + Pipeline (4-8 GPUs) | Balanced throughput |
| Greater than 70B | No | Full pipeline (8+ GPUs) | Memory + throughput |
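For quick sizing, the thresholds in this table can be wrapped in a small helper. The sketch below is illustrative only: the breakpoints simply mirror the rows above and assume FP16 weights on 80 GB-class GPUs.

```python
def pick_parallelism(params_billions: float) -> str:
    """Map model size to a strategy, mirroring the table above (FP16 assumed)."""
    if params_billions < 7:
        return "single GPU, no parallelism (lowest latency)"
    if params_billions <= 13:
        return "single GPU with quantization, or optional TP for throughput"
    if params_billions <= 30:
        return "tensor parallelism across 2-4 GPUs"
    if params_billions <= 70:
        return "tensor + pipeline parallelism across 4-8 GPUs"
    return "full pipeline across 8+ GPUs"


print(pick_parallelism(70))  # tensor + pipeline parallelism across 4-8 GPUs
```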
Tensor parallelism distributes model weights across multiple GPUs. Each GPU holds a shard of the model’s parameters, and they collaborate on every forward pass.
When you split a 70B-parameter model across 8 GPUs, each GPU holds approximately 8.75B parameters. During inference, the work proceeds in four steps (sketched in code after this list):
Input distribution : The input tensor is replicated across all GPUs
Computation : Each GPU computes partial results for its weight shard
Communication : All-reduce operations synchronize results
Output : Combined result is available on all GPUs
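To make steps 2 and 3 concrete, here is a minimal sketch of a row-parallel linear layer built directly on torch.distributed primitives. This is not vLLM's or Megatron-LM's implementation; the class name, shapes, and torchrun launch are assumptions for illustration.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


class RowParallelLinear(nn.Module):
    """Each rank owns a slice of the weight along the input dimension;
    partial outputs are summed with an all-reduce (step 3 above)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.world_size = dist.get_world_size()
        self.rank = dist.get_rank()
        assert in_features % self.world_size == 0
        self.in_per_rank = in_features // self.world_size
        # This rank's shard of the full (out_features, in_features) weight
        self.weight = nn.Parameter(torch.randn(out_features, self.in_per_rank) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) The input is replicated on every rank; take the feature slice this rank owns
        start = self.rank * self.in_per_rank
        x_shard = x[..., start:start + self.in_per_rank]
        # 2) Each GPU computes a partial result with its weight shard
        partial = x_shard @ self.weight.t()
        # 3) All-reduce sums the partials across ranks
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        # 4) The combined result is now available on every rank
        return partial


if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=2 row_parallel_linear.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    torch.manual_seed(0)  # same seed, so the "replicated" input is identical on every rank
    layer = RowParallelLinear(4096, 4096).cuda()
    x = torch.randn(8, 4096, device="cuda")
    y = layer(x)  # (8, 4096), same summed result on all ranks
    print(f"rank {dist.get_rank()}: output {tuple(y.shape)}")
    dist.destroy_process_group()
```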
Google Cloud’s documentation emphasizes that “tensor parallelism is a technique that distributes computational load across multiple GPUs, which is essential when you run large models that exceed single GPU memory capacity.”
Use it when:
Model parameters exceed single GPU memory (e.g., 70B model needs ~140GB for FP16)
You need to serve larger models without quantization
Your hardware consists of identical GPUs
Latency is more critical than throughput
Avoid it when:
Model fits on single GPU (overhead exceeds benefit)
GPUs are heterogeneous (performance mismatches cause stragglers)
Batch size is small (communication overhead dominates)
Pipeline parallelism places different layers of the model on different GPUs. Unlike tensor parallelism, which splits individual operations, pipeline parallelism splits the model architecture itself.
Pipeline parallelism achieves efficiency through micro-batching. While GPU 0 processes layers 1-8 of one micro-batch, GPU 1 processes layers 9-16 of a different micro-batch, so the GPUs compute simultaneously.
The challenge is the "bubble" problem: GPUs sit idle while waiting on upstream stages. Modern frameworks shrink the bubble with techniques like the following (the bubble fraction is quantified after this list):
Gradient accumulation : Process multiple micro-batches before weight updates
Interleaved scheduling : Assign multiple pipeline stages to each GPU
Memory optimization : Offload activations to CPU/NVMe
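The cost of the bubble is easy to estimate: with p pipeline stages and m micro-batches per batch, a naive schedule leaves GPUs idle for roughly (p - 1) / (m + p - 1) of the time, which is why these techniques all push toward more micro-batches in flight. A one-line calculator:

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle fraction of a naive pipeline schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# 4 stages with 8 micro-batches: ~27% idle; 32 micro-batches shrinks it to ~9%
print(pipeline_bubble_fraction(4, 8), pipeline_bubble_fraction(4, 32))
```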
Use it when:
Model is very deep (many layers)
You want to maximize throughput over latency
Hardware is distributed across nodes (pipeline stages need far less inter-node bandwidth than tensor parallelism)
You need to serve multiple models simultaneously
Avoid it when:
Model is shallow (insufficient layers to parallelize)
Latency is the primary metric (pipeline adds overhead)
Batch sizes are inconsistent (causes pipeline stalls)
Assess your requirements
Calculate model memory: parameters × bytes per parameter (2 for FP16), then budget roughly 2× that for KV cache and runtime overhead
Determine throughput needs: queries per second
Define latency budget: time-to-first-token (TTFT) and tokens-per-second (TPS)
Choose your parallelism strategy
Tensor parallelism for memory capacity
Pipeline parallelism for throughput
Hybrid for very large models (70B+)
Configure your serving engine
vLLM: Set tensor_parallel_size and pipeline_parallel_size
TensorRT-LLM: Define parallel configuration in engine build
Custom PyTorch: Implement torch.distributed primitives
Deploy with proper orchestration
Use Kubernetes for production deployments
Configure health checks and auto-scaling
Monitor communication overhead metrics
```python
import os
import time

from vllm import LLM, SamplingParams

# Configure tensor parallelism for multi-GPU inference
# This distributes model weights across 2 GPUs
os.environ['VLLM_USE_RAY'] = '0'

# Initialize vLLM with tensor parallelism
# tensor_parallel_size determines how many GPUs to split the model across
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,       # Distributes across 2 GPUs
    gpu_memory_utilization=0.95,  # Use 95% of GPU memory
    max_model_len=4096,           # Context window size
    enable_prefix_caching=True,   # Optimizes repeated prompt prefixes
)

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# Generate outputs with distributed inference
prompts = ["Explain quantum computing in simple terms:"]
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

for output in outputs:
    completion = output.outputs[0]
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {completion.text}")
    print(f"Tokens/sec: {len(completion.token_ids) / elapsed:.2f}")
```
```python
import os

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe  # available up to PyTorch 2.3


class TransformerBlock(nn.Module):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Self-attention with residual connection
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward with residual connection
        ff_out = self.feed_forward(x)
        return self.norm2(x + ff_out)


def create_pipeline_model(num_layers, hidden_size, num_heads, num_gpus):
    """Create a model sharded across GPUs using pipeline parallelism."""
    layers_per_gpu = num_layers // num_gpus
    stages = []
    for i in range(num_gpus):
        # Each pipeline stage holds a contiguous slice of layers on its own GPU
        stage = nn.Sequential(
            *[TransformerBlock(hidden_size, num_heads) for _ in range(layers_per_gpu)]
        ).to(f'cuda:{i}')
        stages.append(stage)
    model = nn.Sequential(*stages)
    # Wrap in Pipe for pipeline parallelism
    return Pipe(model, chunks=8)  # Split each batch into 8 micro-batches


if __name__ == "__main__":
    num_gpus = 4
    if torch.cuda.device_count() < num_gpus:
        raise SystemExit(f"This example requires {num_gpus} GPUs, found {torch.cuda.device_count()}")

    # Pipe relies on the RPC framework internally, even on a single node
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("worker", rank=0, world_size=1)

    # Create pipeline model across 4 GPUs (layer count and sizes are illustrative)
    pipeline_model = create_pipeline_model(
        num_layers=32, hidden_size=4096, num_heads=32, num_gpus=num_gpus
    )

    # Input lives on the first stage's device; Pipe moves activations between stages
    input_data = torch.randn(64, 1024, 4096).to('cuda:0')
    output = pipeline_model(input_data).local_value()
    print(f"Processing complete, output shape {tuple(output.shape)}")

    rpc.shutdown()
```
A Kubernetes deployment for the tensor-parallel vLLM server looks roughly like this (abridged to the fields present in the original manifest; the env values mirror the Python example above):

```yaml
# Abridged vLLM Deployment on GKE
containers:
- image: vllm/vllm-openai:latest
  env:
  - name: MODEL_NAME
    value: "meta-llama/Llama-2-7b-hf"
  - name: TENSOR_PARALLEL_SIZE
    value: "2"
  - name: GPU_MEMORY_UTILIZATION
    value: "0.95"
  - name: ENABLE_PREFIX_CACHING
    value: "true"
volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: model-cache-pvc
nodeSelector:
  cloud.google.com/gke-accelerator-count: "2"
---
# Matching Service
metadata:
  name: vllm-llama-7b-service
```
```python
from typing import Tuple

import torch
import torch.nn as nn
import torch.distributed as dist


class LadderResidualBlock(nn.Module):
    """Ladder Residual for communication-computation overlapping"""

    def __init__(self, hidden_size, num_heads, rank):
        super().__init__()
        self.rank = rank
        self.world_size = dist.get_world_size()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.ladder_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Main path: Self-attention
        attn_out, _ = self.attention(x, x, x)
        x_main = self.norm1(x + attn_out)

        # Ladder path: Independent computation
        ladder_out = self.ladder_proj(x)

        # Overlap communication with computation: launch the all-reduce asynchronously
        ladder_comm, work = self._async_all_reduce(ladder_out)

        # Continue main path while communication happens
        ff_out = self.feed_forward(x_main)
        x_main = self.norm2(x_main + ff_out)

        # Wait for the all-reduce to finish before combining the two paths
        work.wait()
        combined = x_main + ladder_comm
        return combined, ladder_out

    def _async_all_reduce(self, tensor: torch.Tensor):
        # Non-blocking all-reduce: returns the tensor plus the async work handle
        work = dist.all_reduce(tensor, async_op=True)
        return tensor, work


# Example usage requires torchrun
```
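A minimal driver for the block above might look like the following. It assumes LadderResidualBlock is defined in the same file, uses illustrative shapes, and must be launched with torchrun so that the process group exists.

```python
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=2 ladder_residual.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Hypothetical sizes; any (batch, seq, hidden) with hidden divisible by num_heads works
block = LadderResidualBlock(hidden_size=1024, num_heads=16, rank=rank).to(f"cuda:{rank}")
x = torch.randn(8, 128, 1024, device=f"cuda:{rank}")
combined, ladder = block(x)
print(f"rank {rank}: combined output {tuple(combined.shape)}")

dist.destroy_process_group()
```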
Avoid these mistakes that can negate the benefits of distributed inference:
Using tensor parallelism for small models
Problem : For models under 7B parameters, communication overhead exceeds computational gains
Impact : 1.2-1.5x speedup instead of expected 2x
Solution : Use single GPU with quantization (AWQ/GPTQ) for models less than 13B
Ignoring communication overhead
Problem : All-reduce operations block computation, especially across nodes
Impact : Efficiency drops to 60-70% beyond 8 GPUs
Solution : Use intra-node TP with inter-node PP; enable communication overlap
Over-allocating GPU memory
Problem : Setting gpu_memory_utilization greater than 0.95 causes OOM during inference
Impact : Service crashes under load
Solution : Set utilization to 0.90-0.95 and monitor with Prometheus
Not enabling prefix caching
Problem : Repeated prompt prefixes are recomputed
Impact : 2-3x lower throughput for chat/conversational workloads
Solution : Set enable_prefix_caching=True in vLLM
Using standard HPA metrics
Problem : CPU/memory scaling doesn’t match LLM workload patterns
Impact : Slow response to traffic spikes, over-provisioning during lulls
Solution : Use custom metrics: queue length, batch size, or tokens/sec
Deploying without health checks
Problem : Model loading failures go undetected
Impact : Cascading failures across replicas
Solution : Implement readiness/liveness probes checking /health endpoint
Failing to overlap communication
Problem : GPUs idle during all-reduce operations
Impact : 20-30% performance loss
Solution : Use Ladder Residual architecture or async communication primitives
Single GPU for high QPS
Problem : Model fits but can’t handle request volume
Impact : High latency, request timeouts
Solution : Use tensor parallelism even if model fits on one GPU
Ignoring pipeline parallelism for deep models
Problem : Tensor parallelism alone doesn’t optimize for layer depth
Impact : Suboptimal throughput for 30B+ models
Solution : Combine TP (2-4 GPUs) with PP (2-4 stages) for balanced scaling
Not monitoring communication efficiency
Problem : Can’t detect when scaling becomes inefficient
Impact : Wasted compute budget
Solution : Track all_reduce time vs. compute time ratio; target less than 15%
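As a starting point for that last item, the all-reduce-to-compute ratio can be measured directly with CUDA events. This is a rough sketch, not a production profiler, and it assumes a process group is already initialized (e.g. under torchrun); in practice you would more likely pull these numbers from torch.profiler traces or your serving engine's metrics.

```python
import torch
import torch.distributed as dist


def _timed_ms(fn) -> float:
    """Time a GPU operation in milliseconds using CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)


def comm_overhead_ratio(activations: torch.Tensor, compute_fn) -> float:
    """Ratio of all-reduce time to compute time for one layer's worth of work.

    Keep this below ~0.15 per the guidance above; a rising ratio usually means
    the tensor-parallel group has outgrown the interconnect.
    """
    compute_ms = _timed_ms(lambda: compute_fn(activations))
    comm_ms = _timed_ms(lambda: dist.all_reduce(activations.clone()))
    return comm_ms / compute_ms
```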
| Scenario | Strategy | GPUs | Expected Speedup | Notes |
|---|---|---|---|---|
| 7B model, low QPS | None | 1 | 1x | Use quantization if needed |
| 7B model, high QPS | TP | 2 | 1.8x | Enable prefix caching |
| 13B model | TP | 2 | 1.7x | Fits on 2x L4 (24 GB each) |
| 30B model | TP | 4 | 3.2x | Requires 4x L4 or 2x A100 |
| 70B model | TP + PP | 8 | 6.5x | 4x TP + 2x PP recommended |
| 100B+ model | Full pipeline | 16+ | 12x+ | Consider TPU v6e |
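For the 70B row, the recommended 4-way tensor parallel × 2-stage pipeline split maps directly onto vLLM's engine arguments. A sketch with illustrative model name and limits:

```python
from vllm import LLM

# 70B model on 8 GPUs: 4-way tensor parallelism inside the node,
# 2 pipeline stages across the tensor-parallel groups
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)
```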
FP16: parameters × 2 bytes
INT8: parameters × 1 byte
INT4: parameters × 0.5 bytes
Llama-2 7B: 7B × 2 = 14GB (FP16)
Llama-2 13B: 13B × 2 = 26GB (FP16)
Llama-2 70B: 70B × 2 = 140GB (FP16)
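These rules of thumb are trivial to script. The helper below counts weight memory only; KV cache and activations come on top, which is why the deployment configs elsewhere in this guide budget roughly 2× the model size.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Weight memory only: parameters × bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


for size in (7, 13, 70):
    print(f"Llama-2 {size}B: {weight_memory_gb(size):.0f} GB (FP16)")
# Prints 14, 26 and 140 GB, matching the figures above
```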
```python
from vllm import LLM

# Minimal production config
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.95,
    enable_prefix_caching=True,
    quantization="awq",            # Optional: reduces memory by ~50% (requires an AWQ checkpoint)
    max_num_seqs=256,              # Max concurrent requests
    max_num_batched_tokens=4096,
)
```

Per replica requirements (Kubernetes resource request fragment):

```yaml
resources:
  requests:
    memory: 40Gi  # 2× model size + overhead
```
Distributed LLM inference is a trade-off between memory capacity, throughput, latency, and cost. Key takeaways:
When to parallelize:
Tensor Parallelism : Use for models greater than 13B parameters or when you need to serve larger models without quantization
Pipeline Parallelism : Use for deep models (30B+) where you want to maximize throughput over latency
Hybrid (TP + PP) : Use for very large models (70B+) to balance memory and compute efficiency
When to avoid parallelism:
Models that fit comfortably on a single GPU (overhead exceeds benefit)
Low QPS scenarios where latency is acceptable
Heterogeneous GPU clusters without proper load balancing
Critical success factors:
Enable communication overlap (Ladder Residual or async primitives)
Monitor communication efficiency (target less than 15% overhead)
Use prefix caching for conversational workloads
Implement proper health checks and auto-scaling
Match parallelism strategy to hardware topology (intra-node TP, inter-node PP)
The goal isn’t maximum parallelism—it’s optimal efficiency for your specific workload and constraints.