
Distributed LLM Inference: Tensor Parallelism & Pipeline Parallelism

Distributed LLM Inference: Scaling Beyond Single GPU Limits


When your 70B parameter model won’t fit on a single A100, or your latency requirements demand more than one GPU can deliver, distributed inference becomes essential. Tensor parallelism and pipeline parallelism can deliver 2-10x speedups, but they also introduce communication overhead that can erase gains if implemented incorrectly. This guide covers when to parallelize, how to minimize overhead, and production-ready configurations.

The economics of LLM inference are shifting from API costs to infrastructure efficiency. While GPT-4o charges $5/$15 per 1M input/output tokens, self-hosted distributed inference can reduce per-token costs by 70-90% at scale. However, the complexity of multi-GPU orchestration introduces failure modes that don’t exist in single-GPU deployments.

Recent benchmarks show that naive multi-GPU deployments often achieve only 1.2-1.5x speedup instead of the expected 2x, due to communication overhead and memory imbalances. The difference between a poorly parallelized system and an optimized one can be the difference between profitable inference and cost overruns.

Before diving into implementation, understand the fundamental tradeoff:

Model Size              | Single GPU Fit?         | Parallelism Strategy          | Primary Benefit
------------------------|-------------------------|-------------------------------|---------------------
Less than 7B parameters | Yes                     | None                          | Lowest latency
7B - 13B                | Yes (with quantization) | Optional tensor parallelism   | Higher throughput
13B - 30B               | No                      | Tensor parallelism (2-4 GPUs) | Memory capacity
30B - 70B               | No                      | Tensor + Pipeline (4-8 GPUs)  | Balanced throughput
Greater than 70B        | No                      | Full pipeline (8+ GPUs)       | Memory + throughput
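
As a rough decision aid, the table above can be expressed as a tiny helper; the thresholds are this guide's rules of thumb (assuming FP16 weights), not hard limits, and the function name is illustrative:

# Hypothetical strategy picker mirroring the table above.
def pick_strategy(params_billion: float) -> str:
    if params_billion < 7:
        return "single GPU, no parallelism (lowest latency)"
    if params_billion <= 13:
        return "single GPU with quantization, or optional TP=2 for throughput"
    if params_billion <= 30:
        return "tensor parallelism across 2-4 GPUs"
    if params_billion <= 70:
        return "tensor + pipeline parallelism across 4-8 GPUs"
    return "full pipeline parallelism across 8+ GPUs"

print(pick_strategy(70))  # tensor + pipeline parallelism across 4-8 GPUs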

Tensor parallelism distributes model weights across multiple GPUs. Each GPU holds a shard of the model’s parameters, and they collaborate on every forward pass.

When you split a 70B parameter model across 8 GPUs, each GPU holds approximately 8.75B parameters. During inference:

  1. Input distribution: The input tensor is replicated across all GPUs
  2. Computation: Each GPU computes partial results for its weight shard
  3. Communication: All-reduce operations synchronize results
  4. Output: Combined result is available on all GPUs
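
A minimal sketch of this pattern in raw PyTorch, assuming torch.distributed is already initialized with one process per GPU; the class and shapes are illustrative, not vLLM or Megatron internals:

import torch
import torch.distributed as dist
import torch.nn.functional as F

class RowParallelLinear(torch.nn.Module):
    """Each rank owns a slice of the input dimension (a weight shard)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        self.local_in = in_features // world_size
        # In practice this shard is loaded from a checkpoint, not left empty.
        self.weight = torch.nn.Parameter(torch.empty(out_features, self.local_in))

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # 1) Each GPU computes a partial result from its own weight shard.
        partial = F.linear(x_shard, self.weight)
        # 2) An all-reduce sums the partials so every GPU ends up with the
        #    full output, matching steps 3-4 above.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial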

Google Cloud’s documentation emphasizes that “tensor parallelism is a technique that distributes computational load across multiple GPUs, which is essential when you run large models that exceed single GPU memory capacity.”

Use it when:

  • Model parameters exceed single GPU memory (e.g., 70B model needs ~140GB for FP16)
  • You need to serve larger models without quantization
  • Your hardware consists of identical GPUs
  • Latency is more critical than throughput

Avoid it when:

  • Model fits on single GPU (overhead exceeds benefit)
  • GPUs are heterogeneous (performance mismatches cause stragglers)
  • Batch size is small (communication overhead dominates)
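
A quick way to check the first point is to compare the model's weight footprint against local GPU memory; a rough, weights-only sketch (KV cache and activations need extra headroom, so the helper leaves about 20% free):

import torch

def fits_on_single_gpu(params_billion: float, bytes_per_param: float = 2.0,
                       headroom: float = 0.8) -> bool:
    # Weights-only estimate; bytes_per_param is 2 for FP16, 0.5 for INT4.
    weight_bytes = params_billion * 1e9 * bytes_per_param
    usable_bytes = headroom * torch.cuda.get_device_properties(0).total_memory
    return weight_bytes <= usable_bytes

print(fits_on_single_gpu(7))    # True on a 40GB or 80GB A100
print(fits_on_single_gpu(70))   # False: ~140GB of FP16 weights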

Pipeline parallelism places different layers of the model on different GPUs. Unlike tensor parallelism, which splits individual operations, pipeline parallelism splits the model architecture itself.

Pipeline parallelism achieves efficiency through micro-batching. While GPU 0 processes layers 1-8 of one micro-batch, GPU 1 processes layers 9-16 of another, so both GPUs compute simultaneously.

The challenge is the “bubble” problem—GPUs sit idle when waiting for dependencies. Modern frameworks use techniques like:

  • Micro-batch scheduling: Keep several micro-batches in flight so every pipeline stage has work to do
  • Interleaved scheduling: Assign multiple pipeline stages to each GPU
  • Memory optimization: Offload activations to CPU/NVMe
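
A single-process sketch of the micro-batching idea, assuming two CUDA devices; the layer counts and sizes are illustrative, and production engines schedule this far more aggressively:

import torch
import torch.nn as nn

# Two pipeline stages: layers 1-8 on GPU 0, layers 9-16 on GPU 1.
stage0 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).to("cuda:0")
stage1 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).to("cuda:1")

batch = torch.randn(32, 1024)
micro_batches = batch.chunk(4)   # smaller micro-batches shrink the pipeline bubble

outputs = []
with torch.inference_mode():
    for mb in micro_batches:
        # Kernel launches are asynchronous, so GPU 0 can start the next
        # micro-batch while GPU 1 is still working on the previous one.
        hidden = stage0(mb.to("cuda:0"))
        outputs.append(stage1(hidden.to("cuda:1")))

result = torch.cat([o.cpu() for o in outputs])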

Use it when:

  • Model is very deep (many layers)
  • You want to maximize throughput over latency
  • Hardware is distributed across nodes (pipeline parallelism needs far less inter-node bandwidth than tensor parallelism)
  • You need to serve multiple models simultaneously

Avoid it when:

  • Model is shallow (insufficient layers to parallelize)
  • Latency is the primary metric (pipeline adds overhead)
  • Batch sizes are inconsistent (causes pipeline stalls)

To put these pieces into practice:

  1. Assess your requirements

    • Calculate model memory: parameters × bytes per parameter (2 for FP16), plus headroom for KV cache and activations
    • Determine throughput needs: queries per second
    • Define latency budget: time-to-first-token (TTFT) and tokens-per-second (TPS)
  2. Choose your parallelism strategy

    • Tensor parallelism for memory capacity
    • Pipeline parallelism for throughput
    • Hybrid for very large models (70B+)
  3. Configure your serving engine

    • vLLM: Set tensor_parallel_size and pipeline_parallel_size
    • TensorRT-LLM: Define parallel configuration in engine build
    • Custom PyTorch: Implement torch.distributed primitives
  4. Deploy with proper orchestration

    • Use Kubernetes for production deployments
    • Configure health checks and auto-scaling
    • Monitor communication overhead metrics
tensor_parallel_vllm.py
import os
from vllm import LLM, SamplingParams

# Configure tensor parallelism for multi-GPU inference:
# model weights are sharded across 2 GPUs.
os.environ['VLLM_USE_RAY'] = '0'

# Initialize vLLM with tensor parallelism.
# tensor_parallel_size determines how many GPUs the model is split across.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,          # Distribute weights across 2 GPUs
    gpu_memory_utilization=0.95,     # Use up to 95% of GPU memory
    max_model_len=4096,              # Context window size
    enable_prefix_caching=True,      # Reuse KV cache for repeated prompt prefixes
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

# Generate outputs with distributed inference
outputs = llm.generate(
    prompts=["Explain quantum computing in simple terms:"],
    sampling_params=sampling_params,
)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
    print(f"Generated tokens: {len(output.outputs[0].token_ids)}")

Avoid these mistakes that can negate the benefits of distributed inference:

  1. Using tensor parallelism for small models

    • Problem: For models under 7B parameters, communication overhead exceeds computational gains
    • Impact: 1.2-1.5x speedup instead of expected 2x
    • Solution: Use single GPU with quantization (AWQ/GPTQ) for models less than 13B
  2. Ignoring communication overhead

    • Problem: All-reduce operations block computation, especially across nodes
    • Impact: Efficiency drops to 60-70% beyond 8 GPUs
    • Solution: Use intra-node TP with inter-node PP; enable communication overlap
  3. Over-allocating GPU memory

    • Problem: Setting gpu_memory_utilization greater than 0.95 causes OOM during inference
    • Impact: Service crashes under load
    • Solution: Set utilization to 0.90-0.95 and monitor with Prometheus
  4. Not enabling prefix caching

    • Problem: Repeated prompt prefixes are recomputed
    • Impact: 2-3x lower throughput for chat/conversational workloads
    • Solution: Set enable_prefix_caching=True in vLLM
  5. Using standard HPA metrics

    • Problem: CPU/memory scaling doesn’t match LLM workload patterns
    • Impact: Slow response to traffic spikes, over-provisioning during lulls
    • Solution: Use custom metrics: queue length, batch size, or tokens/sec
  6. Deploying without health checks

    • Problem: Model loading failures go undetected
    • Impact: Cascading failures across replicas
    • Solution: Implement readiness/liveness probes checking /health endpoint
  7. Failing to overlap communication

    • Problem: GPUs idle during all-reduce operations
    • Impact: 20-30% performance loss
    • Solution: Use Ladder Residual architecture or async communication primitives
  8. Single GPU for high QPS

    • Problem: Model fits but can’t handle request volume
    • Impact: High latency, request timeouts
    • Solution: Use tensor parallelism even if model fits on one GPU
  9. Ignoring pipeline parallelism for deep models

    • Problem: Tensor parallelism alone doesn’t optimize for layer depth
    • Impact: Suboptimal throughput for 30B+ models
    • Solution: Combine TP (2-4 GPUs) with PP (2-4 stages) for balanced scaling
  10. Not monitoring communication efficiency

    • Problem: Can’t detect when scaling becomes inefficient
    • Impact: Wasted compute budget
    • Solution: Track the all_reduce time vs. compute time ratio; target less than 15% (a measurement sketch follows this list)
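
A rough way to measure that ratio on a live tensor-parallel setup, assuming torch.distributed is already initialized with the NCCL backend and one process per GPU; the matrix sizes below are arbitrary stand-ins for your model's real workload:

import time
import torch
import torch.distributed as dist

def timed(fn, iters: int = 20) -> float:
    # Average wall-clock seconds per call, with full device synchronization.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

compute_s = timed(lambda: x @ w)              # proxy for shard computation
comm_s = timed(lambda: dist.all_reduce(x))    # proxy for TP synchronization

ratio = comm_s / (comm_s + compute_s)
print(f"communication share: {ratio:.1%}")    # the guide's target is under 15%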
Scenario           | Strategy      | GPUs | Expected Speedup | Notes
-------------------|---------------|------|------------------|----------------------------
7B model, low QPS  | None          | 1    | 1x               | Use quantization if needed
7B model, high QPS | TP            | 2    | 1.8x             | Enable prefix caching
13B model          | TP            | 2    | 1.7x             | Fits on 2x L4 (24GB each)
30B model          | TP            | 4    | 3.2x             | Requires 4x L4 or 2x A100
70B model          | TP + PP       | 8    | 6.5x             | 4x TP + 2x PP recommended
100B+ model        | Full pipeline | 16+  | 12x+             | Consider TPU v6e
Model Size Calculation:

  • FP16: parameters × 2 bytes
  • FP8: parameters × 1 byte
  • INT8: parameters × 1 byte
  • INT4: parameters × 0.5 bytes

Examples:

  • Llama-2 7B: 7B × 2 = 14GB (FP16)
  • Llama-2 13B: 13B × 2 = 26GB (FP16)
  • Llama-2 70B: 70B × 2 = 140GB (FP16)
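
The same arithmetic as a small helper that also estimates the minimum GPU count for a given card; the 0.9 factor mirrors a typical gpu_memory_utilization setting, and the figures are weights-only (KV cache comes on top):

import math

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    # Weights only; KV cache and activations need additional headroom.
    return params_billion * BYTES_PER_PARAM[precision]

def min_gpus(params_billion: float, gpu_memory_gb: float,
             precision: str = "fp16", utilization: float = 0.9) -> int:
    usable_gb = gpu_memory_gb * utilization
    return math.ceil(weight_memory_gb(params_billion, precision) / usable_gb)

print(weight_memory_gb(70))    # 140.0 GB, matching the Llama-2 70B example
print(min_gpus(70, 80))        # 2 x A100 80GB just to hold the weights
print(min_gpus(30, 24))        # 3 x L4 24GB before any KV cache headroom
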
# Minimal production config
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.95,
    max_model_len=4096,
    enable_prefix_caching=True,
    quantization="awq",            # Optional: ~50% less memory (needs an AWQ-quantized checkpoint)
    max_num_seqs=256,              # Max concurrent requests
    max_num_batched_tokens=4096,
)
# Per replica requirements
resources:
  requests:
    nvidia.com/gpu: 2
    memory: 40Gi        # 2x model size + overhead
    cpu: 16
  limits:
    nvidia.com/gpu: 2
    memory: 45Gi
    cpu: 16


Distributed LLM inference is a trade-off between memory capacity, throughput, latency, and cost. Key takeaways:

When to parallelize:

  • Tensor Parallelism: Use for models greater than 13B parameters or when you need to serve larger models without quantization
  • Pipeline Parallelism: Use for deep models (30B+) where you want to maximize throughput over latency
  • Hybrid (TP + PP): Use for very large models (70B+) to balance memory and compute efficiency

When to avoid parallelism:

  • Models that fit comfortably on a single GPU (overhead exceeds benefit)
  • Low QPS scenarios where latency is acceptable
  • Heterogeneous GPU clusters without proper load balancing

Critical success factors:

  • Enable communication overlap (Ladder Residual or async primitives)
  • Monitor communication efficiency (target less than 15% overhead)
  • Use prefix caching for conversational workloads
  • Implement proper health checks and auto-scaling
  • Match parallelism strategy to hardware topology (intra-node TP, inter-node PP)

The goal isn’t maximum parallelism—it’s optimal efficiency for your specific workload and constraints.