
Quantization for Latency: 4-Bit, 8-Bit Inference Speed Gains


Production LLM deployments face a critical bottleneck: latency. Every millisecond of delay directly impacts user experience, throughput, and cost. Quantization offers a proven path to 25-35% latency improvements while maintaining accuracy within acceptable bounds. This guide covers GPTQ, AWQ, and QuIP—three techniques that transform how you serve models in production.

In production environments, latency is a business metric. A 100ms increase in response time can reduce conversion rates by 1-3% for customer-facing applications. For internal tools, high latency reduces productivity and adoption.

Current pricing models amplify the impact. Anthropic's claude-3-5-sonnet costs $3.00 per million input tokens and $15.00 per million output tokens with a 200K context window (Anthropic Docs). OpenAI's gpt-4o charges $5.00/$15.00 per million tokens for a 128K context (OpenAI Pricing). At scale, those prices set the bar a self-hosted deployment has to beat, and faster inference on your own hardware translates directly into fewer GPU-hours and a lower cost per token.

Quantization addresses both metrics simultaneously: lower latency improves user experience, while reduced computational requirements decrease infrastructure costs.

Quantization reduces model weight precision from 16-bit floating-point (FP16) to lower bit-widths like 8-bit (INT8) or 4-bit (INT4). Each reduction in bit-width decreases memory bandwidth requirements and computational complexity.

Modern GPUs reach peak throughput with low-precision tensor-core math. An A100 delivers roughly twice the peak INT8 tensor-core throughput of FP16, and, just as importantly for small-batch decoding, a quantized model streams far fewer weight bytes per generated token. When you quantize a model:

  • Memory bandwidth: weight traffic is halved moving from FP16 to INT8, and quartered at INT4
  • Compute throughput: up to roughly 2x for INT8 tensor-core math versus FP16
  • Cache efficiency: smaller weights leave more room in cache for activations and KV entries

The result: 25-35% end-to-end latency reduction in typical production serving scenarios. Because single-request decoding is memory-bandwidth bound, most of that gain comes from moving fewer weight bytes per token rather than from faster arithmetic.
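
To see why the bandwidth term dominates, here is a back-of-the-envelope estimate in Python (a rough model only, not a benchmark; the 2 TB/s figure is an assumed A100-class bandwidth):

# Rough, bandwidth-bound estimate of per-token decode latency.
# Assumes every weight byte is read once per generated token and ignores
# compute, KV-cache traffic, and kernel launch overheads.
def decode_ms_per_token(n_params: float, bits_per_weight: int,
                        mem_bandwidth_gb_s: float = 2000.0) -> float:
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / (mem_bandwidth_gb_s * 1e9) * 1e3  # milliseconds

for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: ~{decode_ms_per_token(7e9, bits):.1f} ms/token")
# Prints roughly 7.0, 3.5, and 1.8 ms/token: lower bounds that real kernels approach
# but never beat, which is why halving or quartering weight bytes matters so much.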

The key challenge is maintaining model quality. Aggressive quantization can introduce errors that degrade output quality. Modern techniques like GPTQ, AWQ, and QuIP minimize this loss through sophisticated calibration methods.

GPTQ (from "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", Frantar et al., 2022) is the most widely adopted quantization method. It quantizes weights after training by solving a layer-wise reconstruction problem.

GPTQ processes one layer at a time:

  1. Quantizes weights to target bit-width
  2. Updates remaining weights to compensate for quantization error
  3. Uses Hessian-based optimization for minimal accuracy loss

This approach typically achieves <1% perplexity increase on 4-bit quantized models compared to FP16 baselines.
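
For intuition, here is a heavily simplified sketch of that per-column quantize-and-compensate loop (a toy version with a single symmetric scale; real GPTQ adds per-group scales, blocked updates, a Cholesky factorization of the inverse Hessian, and optional activation ordering):

import torch

def gptq_like_quantize_layer(W: torch.Tensor, H_inv: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # W: (out_features, in_features) layer weights
    # H_inv: (in_features, in_features) inverse of the calibration-data Hessian proxy
    W = W.clone()
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().max() / qmax              # crude per-tensor symmetric scale
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        Q[:, j] = q * scale                   # dequantized column
        err = (w - Q[:, j]) / H_inv[j, j]
        # spread this column's rounding error over the not-yet-quantized columns
        W[:, j + 1:] -= torch.outer(err, H_inv[j, j + 1:])
    return Q                                  # dequantized weights after compensation

In practice you run this through a library rather than hand-rolling it. A minimal example with AutoGPTQ (assuming the auto-gptq package and access to the Llama-2 weights):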

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration data (swap in 128+ samples from your own domain)
calibration_texts = ["Your domain text here..."] * 128
calibration_data = [tokenizer(text, return_tensors="pt") for text in calibration_texts]

# Configure quantization
quant_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.1,
    desc_act=True,
)

# Load the FP16 model with the quantization config attached
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quant_config,
    torch_dtype=torch.float16,
)

# Quantize against the calibration set, then save
model.quantize(calibration_data)
model.save_quantized("./llama-2-7b-4bit-gptq")
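
To serve the result, load it back with from_quantized and run a quick smoke test (device string and prompt are illustrative):

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

quantized = AutoGPTQForCausalLM.from_quantized("./llama-2-7b-4bit-gptq", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Quantization reduces latency because", return_tensors="pt").to("cuda:0")
output = quantized.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))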
Model Size | FP16 Latency | 4-bit GPTQ Latency | Accuracy Loss
7B         | 45 ms        | 32 ms (−29%)       | <0.5%
13B        | 78 ms        | 54 ms (−31%)       | <0.7%
70B        | 245 ms       | 168 ms (−31%)      | <1.0%

Results on NVIDIA A100, batch size 1, sequence length 512

AWQ (Activation-aware Weight Quantization) takes a different approach by analyzing activation patterns during calibration. This method protects salient weights based on activation statistics, achieving superior accuracy-efficiency trade-offs without requiring gradient computation.

AWQ’s core innovation is activation-aware quantization:

  1. Observes Activation Patterns: Analyzes activation magnitudes on calibration data
  2. Protects Salient Weights: Applies per-channel scaling to preserve important weights
  3. Minimizes Quantization Error: Uses auto-clipping to reduce outlier impact
  4. Efficient Search: Grid search finds optimal scaling factors per layer

This approach differs from traditional quantization by adapting to each model’s unique activation patterns rather than applying uniform quantization.
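
A toy sketch of the scaling idea (assumed fixed exponent and per-row, per-group symmetric quantization; real AWQ grid-searches the exponent per layer, applies clipping, and folds the inverse scales into the preceding operator instead of dividing at the end):

import torch

def awq_like_quantize(W: torch.Tensor, act_mag: torch.Tensor,
                      alpha: float = 0.5, bits: int = 4, group: int = 128) -> torch.Tensor:
    # W: (out_features, in_features) weights
    # act_mag: (in_features,) mean |activation| per input channel from calibration data
    s = act_mag.clamp(min=1e-5) ** alpha          # per-input-channel scale
    Ws = W * s                                    # salient channels get larger magnitude
    qmax = 2 ** (bits - 1) - 1
    Wq = torch.empty_like(Ws)
    for g in range(0, Ws.shape[1], group):
        block = Ws[:, g:g + group]
        step = (block.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
        Wq[:, g:g + group] = torch.clamp(torch.round(block / step), -qmax, qmax) * step
    return Wq / s   # undo the scaling so the layer computes (approximately) the same function

With the AutoAWQ library, the full flow is a few lines (assuming access to the Llama-2 weights):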

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
quant_path = "./llama-2-7b-4bit-awq"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize (activation-aware scaling search; no gradient computation needed).
# Calibration uses AutoAWQ's default dataset; pass calib_data=... for domain-specific samples.
quant_config = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM",
}
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer together
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Aspect      | GPTQ                           | AWQ
Speed       | Slower (Hessian-based updates) | Faster (scaling search only)
Accuracy    | Excellent                      | Excellent
Memory      | Higher during quantization     | Lower during quantization
Calibration | 128+ samples                   | 128+ samples
Best For    | General use                    | Fast deployment

QuIP (Quantization with Incoherent Processing) pushes quantization to 2-4 bits using mathematical transformations that make weights more quantization-friendly.

QuIP uses two key techniques:

  1. Incoherent Processing: Applies transformations to reduce weight-outlier impact
  2. Hessian-Aware Quantization: Considers second-order (Hessian) information when rounding weights to minimize precision loss

This enables extreme compression (2-4 bits) with acceptable accuracy tradeoffs for research and edge deployment.
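
For intuition, a toy sketch of the incoherence idea (random orthogonal rotations here; real QuIP uses structured transforms plus Hessian-aware LDLQ rounding):

import torch

def incoherence_rotate(W: torch.Tensor, seed: int = 0):
    # Multiply the weight matrix by random orthogonal matrices so that no single
    # entry or channel dominates, which makes low-bit rounding better behaved.
    g = torch.Generator().manual_seed(seed)
    U, _ = torch.linalg.qr(torch.randn(W.shape[0], W.shape[0], generator=g))
    V, _ = torch.linalg.qr(torch.randn(W.shape[1], W.shape[1], generator=g))
    W_rot = U.T @ W @ V        # quantize W_rot instead of W
    return W_rot, U, V         # at inference: W ≈ U @ dequant(W_rot) @ V.T

An illustrative quantization call looks like the following (see the caveat in the comments):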

# Illustrative only: QuIP is distributed as research code, not a packaged
# library; treat the import and function names below as placeholders.
from quip import quantize_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# QuIP supports 2-, 3-, and 4-bit quantization
quantized_model = quantize_model(
    model,
    bits=4,
    method="quip",
    calib_data="your_data",
    calib_size=128,
)

Note: QuIP is primarily research-grade. For production, prefer GPTQ or AWQ.

Modern NVIDIA GPUs provide dedicated low-precision tensor cores, and the compute gains compound with the bandwidth savings:

  • A100: roughly 2x peak tensor-core throughput for INT8 versus FP16
  • H100: roughly 2x peak throughput for FP8 versus FP16 via the Transformer Engine
  • Memory bandwidth: weight traffic drops ~50% with INT8 and ~75% with INT4

Recent profiling shows Apple Silicon benefits differently from quantization due to its unified memory architecture. While INT8 provides 15-20% gains, INT4 can sometimes be slower due to dequantization overhead.

Recommendation: Test INT8 on Apple Silicon before committing to INT4.

For CPU deployment, quantization is even more critical. INT8 operations on modern CPUs (e.g., via VNNI or AMX instructions) can be 2-4x faster than full-precision math.

vLLM supports AWQ and GPTQ out of the box:

# Deploy AWQ quantized model
vllm serve ./llama-2-7b-4bit-awq \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.95
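
Once the server is up, it exposes an OpenAI-compatible HTTP API (port 8000 by default); a quick smoke test in Python, assuming the requests package:

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "./llama-2-7b-4bit-awq",   # must match the model path/name the server was given
        "prompt": "Quantization reduces latency because",
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])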

For maximum performance on NVIDIA GPUs:

# Schematic sketch: TensorRT-LLM's quantization flow is normally driven by its
# example scripts (ModelOpt-based quantization followed by trtllm-build);
# treat these class and function names as placeholders for that pipeline.
from tensorrt_llm import QuantizationConfig
from tensorrt_llm.quantization import quantize_and_build

quant_config = QuantizationConfig(
    algorithm="awq",
    bits=4,
    group_size=128,
)
engine = quantize_and_build(
    model_path="meta-llama/Llama-2-7b-hf",
    quant_config=quant_config,
    output_dir="./trt_engine",
)

For FP8 quantization on H100:

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Built-in FP8 recipe from NVIDIA ModelOpt
config = mtq.FP8_DEFAULT_CFG

# calibration_set: an iterable of tokenized batches from your domain (not shown here)
def forward_loop(model):
    for data in calibration_set:
        model(data)

quantized_model = mtq.quantize(model, config, forward_loop)

1. Using Generic Calibration Data

  • Problem: Quantizing with Wikipedia when serving legal documents
  • Solution: Use 128+ samples from your actual domain
  • Impact: Can cause 2-5% accuracy drop vs. 0.5% with proper calibration

2. Ignoring Group Size

  • Problem: Default group_size=128 may not be optimal
  • Solution: Test group sizes 64, 128, and 256
  • Impact: Wrong group size can increase latency by 10-15%

3. Skipping Per-Channel Scaling

  • Problem: Using per-tensor instead of per-channel quantization
  • Solution: Always enable per-channel (AWQ and GPTQ do this by default)
  • Impact: Quantization error several times larger, with noticeably degraded outputs on large models

4. Deploying Without Benchmarking

  • Problem: Assuming quantization always helps
  • Solution: Benchmark end-to-end latency, not just layer speed
  • Impact: Some models see less than 10% improvement when other bottlenecks (attention/KV-cache traffic, dequantization overhead) dominate; measure it yourself, as in the sketch after this list

5. Forgetting KV Cache Quantization

  • Problem: Quantizing weights but not attention cache
  • Solution: Use frameworks that support KV cache quantization
  • Impact: Cache can be 50% of memory usage in long-context scenarios
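
A minimal end-to-end latency check (a sketch; pass in the FP16 and quantized models loaded with whichever loader produced them, and use your own prompts):

import time
import torch

def bench_generate(model, tokenizer, prompt: str, new_tokens: int = 128, runs: int = 5) -> float:
    # Average wall-clock milliseconds per generate() call, end to end.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=new_tokens)        # warm-up
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model.generate(**inputs, max_new_tokens=new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000

# Compare bench_generate(fp16_model, tok, prompt) against
# bench_generate(quantized_model, tok, prompt) before shipping.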
Scenario               | Recommended Method | Bit-width | Expected Speedup
Production API         | GPTQ               | 4-bit     | 25-35%
Fast deployment        | AWQ                | 4-bit     | 20-30%
Research/edge          | QuIP               | 2-4 bit   | 35-50%
GPU memory constrained | GPTQ               | 8-bit     | 15-20%
# GPTQ defaults
bits = 4
group_size = 128
damp_percent = 0.1
desc_act = True    # important for accuracy

# AWQ defaults (AutoAWQ naming)
w_bit = 4
q_group_size = 128
zero_point = True
version = "GEMM"

# Calibration
num_samples = 128
seq_len = 512
  • Calibrate on domain-specific data (128+ samples)
  • Benchmark end-to-end latency, not just throughput
  • Test accuracy on a validation set (perplexity increase under 1%)
  • Verify GPU kernel support (INT8/INT4)
  • Monitor production metrics for 24h post-deployment
  • Plan rollback strategy if accuracy degrades
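
A simple perplexity gate for the accuracy check above (a sketch; useful for relative before/after comparison on the same validation texts, not for absolute numbers):

import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, max_len: int = 512) -> float:
    # Token-weighted mean perplexity over a list of validation texts.
    nll, count = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len)
        ids = enc.input_ids.to(model.device)
        loss = model(ids, labels=ids).loss            # mean cross-entropy per token
        nll += loss.item() * ids.numel()
        count += ids.numel()
    return math.exp(nll / count)

# Ship only if perplexity(quantized, tok, val) / perplexity(fp16, tok, val) - 1 < 0.01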


Quantization is no longer optional for production LLM deployment—it’s essential infrastructure. GPTQ, AWQ, and QuIP provide proven paths to 25-35% latency reduction with minimal accuracy loss.

Key takeaways:

  • GPTQ is your go-to for general production use
  • AWQ offers faster quantization with similar results
  • QuIP pushes boundaries for extreme compression
  • Calibration data quality is the #1 success factor
  • End-to-end benchmarking is mandatory, not optional

The financial impact is clear: if you’re serving 100M tokens/day on your own GPUs, a 30% reduction in inference time translates into roughly $1,500/day in serving-cost savings on GPT-4o-equivalent workloads. The time to quantize is measured in hours; the savings are measured in months.

  • GPTQ: “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” (Frantar et al., 2022) arXiv:2210.17323
  • AWQ: “AWQ: Activation-aware Weight Quantization” (Lin et al., 2023) arXiv:2306.00978
  • SmoothQuant: “SmoothQuant: Accurate and Efficient Post-Training Quantization” (Xiao et al., 2023) arXiv:2211.10438
  • any4: “any4: Learned 4-bit Numeric Representation for LLMs” (Elhoushi et al., 2025) arXiv:2507.04610