
Vector Search Optimization: ANN Algorithms & Indexing Strategies

A production RAG pipeline at a Series C startup was delivering 450ms query latency—violating their 100ms SLA and costing them enterprise deals. The culprit wasn’t their embedding model or LLM inference time. It was a brute-force vector search scanning 2.3M documents on every query. After implementing HNSW indexing with proper parameter tuning, latency dropped to 28ms and they scaled to 500 QPS on the same infrastructure. This guide covers the approximate nearest neighbor (ANN) algorithms, index tuning strategies, and sharding patterns that separate production-ready vector search from toy demos.

Vector search performance directly impacts user experience and infrastructure costs. A poorly configured index can 10x your compute bill while delivering unacceptable latency. According to Databricks’ performance benchmarks, standard vector search SKUs plateau at ~30 QPS when scaling beyond a single search unit, while optimized configurations can deliver 200+ QPS with less than 50ms latency.

The cost implications are significant. Using brute-force search (exact k-NN) for datasets exceeding 20,000 filtered documents becomes economically unviable. Each query requires O(n) comparisons, so 1M float32 vectors at 768 dimensions means roughly 3GB of data scanned per query. At 100 QPS, that's ~300GB/s of memory bandwidth—forcing you onto expensive GPU or high-end CPU instances.
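
A quick back-of-the-envelope sketch of that scan cost, assuming float32 storage (4 bytes per dimension); the numbers are illustrative, not a benchmark:

# Brute-force scan cost, assuming float32 vectors (4 bytes per dimension)
num_vectors = 1_000_000
dimension = 768
qps = 100

bytes_per_query = num_vectors * dimension * 4    # ~3.07 GB scanned per query
bandwidth_needed = bytes_per_query * qps         # ~307 GB/s sustained

print(f"Per query: {bytes_per_query / 1e9:.2f} GB")
print(f"At {qps} QPS: {bandwidth_needed / 1e9:.0f} GB/s of memory bandwidth")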

For AI engineering teams, understanding ANN algorithms isn’t optional. It’s the difference between a system that scales to millions of queries at reasonable cost versus one that requires constant infrastructure upgrades and delivers poor user experience.

Approximate Nearest Neighbor algorithms trade perfect accuracy for orders-of-magnitude speed improvements. Instead of comparing every vector, they use clever data structures to prune the search space dramatically.

HNSW (Hierarchical Navigable Small World) is the current gold standard for recall-latency tradeoffs. It builds a multi-layer graph in which each upper layer contains a sparse subset of the nodes below it, allowing roughly logarithmic-time search.

How it works:

  • Construction: Build a graph where nodes (vectors) connect to their nearest neighbors
  • Search: Start at a coarse entry point, greedily traverse to closer neighbors
  • Layers: Multiple layers allow fast coarse navigation (top layers) followed by fine-grained search (bottom layers)
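
To make the traversal concrete, here is a deliberately simplified, single-layer sketch of the greedy search idea (real HNSW maintains multiple layers and an ef-sized candidate list; the helper names below are purely illustrative):

import numpy as np

def build_knn_graph(vectors, m=8):
    """Connect each node to its m nearest neighbors (brute force, demo-sized data only)."""
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    return [list(np.argsort(row)[1:m + 1]) for row in dists]  # skip self at position 0

def greedy_search(vectors, graph, query, entry=0):
    """Hop to whichever neighbor is closer to the query until no neighbor improves."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist, improved = neighbor, d, True
    return current

rng = np.random.default_rng(0)
data = rng.random((500, 32)).astype("float32")
graph = build_knn_graph(data)
q = rng.random(32).astype("float32")
print("approx nearest:", greedy_search(data, graph, q))
print("exact nearest: ", int(np.argmin(np.linalg.norm(data - q, axis=1))))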

Key parameters:

  • M (4-64): Number of bi-directional links per node. Higher M = better accuracy but more memory. Databricks recommends M=16-32 for most use cases.
  • ef_construction (100-400): Size of candidate list during construction. Higher = better index quality but slower build time.
  • ef_search (64-512): Search-time width. Higher = better recall but slower queries.

Memory usage: Approximately (dimension * 4 + M * 2 * 4) bytes per vector. For 768-dim vectors with M=16: ~3.1KB per vector.

IVF-PQ (Inverted File + Product Quantization)

IVF-PQ is the memory champion. It partitions vectors into clusters (inverted files) and compresses vectors using product quantization.

How it works:

  • IVF: Partition space into nlist clusters using k-means
  • PQ: Decompose vectors into m subvectors and compress each to code_size bits
  • Search: Only search nprobe clusters, then compute approximate distances using quantized codes

Key parameters:

  • nlist (4√N to 16√N): Number of clusters. For 1M vectors, nlist=1024-4096
  • m: Subvectors (must divide dimension). Common: 32 for 256-dim, 48 for 768-dim
  • code_size (4-8 bits): Bits per subvector. 8-bit = good balance
  • nprobe (nlist/64 to nlist/16): Clusters to search. Start with nprobe = nlist/64

Memory savings: 768-dim float32 = 3KB. 768-dim PQ (m=48, 8-bit) = 48 bytes. ~64x reduction.
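
A minimal FAISS sketch wiring these parameters together; the specific values (nlist, m, nbits, nprobe) are illustrative starting points for roughly 100k vectors, not tuned recommendations:

import faiss
import numpy as np

d = 768                                   # vector dimension
nlist = 1024                              # clusters (~4*sqrt(N) for this dataset size)
m = 48                                    # subvectors; must divide d (768 / 48 = 16 dims each)
nbits = 8                                 # bits per subvector code

quantizer = faiss.IndexFlatL2(d)          # coarse quantizer for the IVF partitioning
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.random((100_000, d)).astype("float32")
index.train(xb)                           # learns the k-means centroids and PQ codebooks
index.add(xb)

index.nprobe = max(1, nlist // 64)        # clusters scanned per query
xq = np.random.random((5, d)).astype("float32")
distances, ids = index.search(xq, 10)
print(ids[0])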

ScaNN (Scalable Nearest Neighbors) is Google’s ANN library that excels at anisotropic vector compression. It optimizes for the specific distribution of your data rather than using uniform quantization.

Key features:

  • Anisotropic quantization: Weights different parts of the vector space based on data density
  • Voronoi diagram: More sophisticated partitioning than simple k-means
  • SIMD acceleration: Heavily optimized for modern x86 CPUs (AVX2/AVX-512)

ScaNN is particularly effective when your vectors have non-uniform density or when you need to optimize for a specific recall target.
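
A sketch of the typical ScaNN builder flow, following the pattern from the library's documentation; the tree and quantization parameter values here are illustrative and should be tuned for your recall target:

import numpy as np
import scann

dataset = np.random.random((100_000, 768)).astype("float32")
# ScaNN's anisotropic quantization targets inner-product search,
# so normalize vectors if you want cosine similarity.
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=50_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)   # anisotropic hashing
    .reorder(100)                                          # exact rescoring of top candidates
    .build()
)

query = np.random.random(768).astype("float32")
query /= np.linalg.norm(query)
neighbors, distances = searcher.search(query, final_num_neighbors=10)
print(neighbors)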

The relationship between parameters is non-linear. Small changes can dramatically affect both performance and recall.

Construction vs. Search Tradeoffs:

  • High ef_construction + High M: Best recall, slow builds, large memory
  • Low ef_construction + Low M: Fast builds, poor recall, smaller memory
  • Balanced: M=16-24, ef_construction=150-250 for most production workloads

Search Parameter Tuning: ef_search (HNSW) and nprobe (IVF) are the primary search-time knobs. Sweep them against a recall target on held-out queries rather than guessing—the sketch below shows the shape of such a sweep.
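
A small, self-contained sweep on synthetic data (illustrative sizes only; on real data, compute ground truth from an exact IndexFlat index over your own vectors):

import time
import faiss
import numpy as np

d, n, n_queries, k = 128, 50_000, 200, 10
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((n_queries, d)).astype("float32")

exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, gt = exact.search(xq, k)                   # ground-truth neighbors

hnsw = faiss.IndexHNSWFlat(d, 16)
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)

for ef in (16, 32, 64, 128, 256):
    hnsw.hnsw.efSearch = ef
    t0 = time.time()
    _, ids = hnsw.search(xq, k)
    latency_ms = (time.time() - t0) / n_queries * 1000
    recall = np.mean([len(set(ids[i]) & set(gt[i])) / k for i in range(n_queries)])
    print(f"efSearch={ef:3d}  recall@{k}={recall:.3f}  avg latency={latency_ms:.2f} ms")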

Building production vector search requires more than understanding theory—you need battle-tested patterns for deployment, monitoring, and scaling. The following sections cover implementation strategies that work at scale.

Before going live, verify these critical configurations:

Index Construction:

  • Pre-train IVF-PQ indexes on 10% of your dataset (minimum 1,000 vectors)
  • Set HNSW M between 16-32 for optimal recall-latency balance
  • Configure ef_construction 150-250 for quality index builds
  • Validate vector normalization when using cosine similarity
  • Test recall@k on holdout data before production deployment
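
A sketch covering the last two checklist items—normalizing vectors for cosine similarity and gating deployment on recall@k against exact ground truth. Sizes and parameters are illustrative:

import faiss
import numpy as np

d, n, n_holdout, k = 768, 50_000, 500, 10
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((n_holdout, d)).astype("float32")

faiss.normalize_L2(xb)          # with unit vectors, inner product == cosine similarity
faiss.normalize_L2(xq)

exact = faiss.IndexFlatIP(d)    # exact ground truth for the holdout queries
exact.add(xb)
_, gt = exact.search(xq, k)

ann = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
ann.hnsw.efConstruction = 200
ann.hnsw.efSearch = 64
ann.add(xb)
_, approx = ann.search(xq, k)

recall_at_k = np.mean([len(set(approx[i]) & set(gt[i])) / k for i in range(n_holdout)])
print(f"recall@{k} = {recall_at_k:.3f}")   # gate deployment on e.g. >= 0.95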

Infrastructure:

  • Use service principals with OAuth instead of PATs (saves 100ms+ latency)
  • Enable circuit breakers (e.g., knn.memory.circuit_breaker.limit=60 in OpenSearch)
  • Configure proper shard count based on data size and QPS requirements
  • Set up monitoring for latency, QPS, and recall metrics
  • Plan for query spikes with exponential backoff and retry logic
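
A minimal sketch of the backoff-and-retry pattern from the last bullet; search_fn is a placeholder for whichever client call your vector store exposes, and the broad except should be narrowed to the transient or throttling errors that client actually raises:

import random
import time

def search_with_backoff(search_fn, *args, max_retries=5, base_delay=0.05, max_delay=2.0, **kwargs):
    """Retry a query with exponential backoff and jitter during spikes or throttling."""
    for attempt in range(max_retries):
        try:
            return search_fn(*args, **kwargs)
        except Exception:                              # narrow to your client's transient errors
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids synchronized retries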

Query Optimization:

  • Reuse index objects across queries (avoid reinitialization overhead)
  • Set num_results in 10-100 range unless application requires more
  • Use ANN queries unless hybrid search is explicitly needed
  • Implement batch processing for insertions and queries
  • Cache frequently accessed indexes in memory
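
For the batching point, a small FAISS sketch (illustrative sizes): one batched search call instead of hundreds of single-query calls amortizes per-call overhead and lets the library parallelize across queries.

import faiss
import numpy as np

d, k = 768, 10
index = faiss.IndexHNSWFlat(d, 16)
index.add(np.random.random((100_000, d)).astype("float32"))

queries = np.random.random((256, d)).astype("float32")

# One batched call instead of 256 separate index.search(q, k) calls
distances, ids = index.search(queries, k)     # result shapes: (256, k)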

Horizontal Scaling: When single-instance performance plateaus, use these patterns:

  1. Split indexes across endpoints: Host multiple indexes on separate endpoints to increase total bandwidth
  2. Replicate indexes: Duplicate high-traffic indexes across endpoints and split traffic evenly
  3. User-defined sharding: Partition by tenant_id or region to reduce cross-node traffic
  4. Multitenancy optimization: Use logical partitioning with payload filters instead of separate collections
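
A sketch of user-defined sharding by tenant: a stable hash routes each tenant's writes and queries to the same shard, so queries never fan out across nodes. The endpoint list and routing scheme are illustrative, not a specific product's API:

import hashlib

shard_endpoints = [
    "https://search-shard-0.internal",
    "https://search-shard-1.internal",
    "https://search-shard-2.internal",
    "https://search-shard-3.internal",
]

def shard_for_tenant(tenant_id):
    """Stable hash so a given tenant always lands on the same shard."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return shard_endpoints[int.from_bytes(digest[:4], "big") % len(shard_endpoints)]

print(shard_for_tenant("acme-corp"))   # always resolves to the same endpoint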

Vertical Scaling:

  • Standard SKU: best for <320M vectors; 20-50ms latency; 30-200+ QPS
  • Storage-optimized SKU: best for 10M+ vectors; 300-500ms latency; ~7x lower cost per vector

Complete HNSW Implementation with Memory Optimization

This example demonstrates a production-ready HNSW index with parameter tuning, memory estimation, and error handling.

import faiss
import numpy as np
import time
from typing import Tuple, Dict


class OptimizedVectorSearch:
    """
    Production-ready vector search with HNSW indexing.
    Optimized for sub-50ms latency at scale.
    """

    def __init__(self, dimension: int, max_elements: int = 1000000):
        self.dimension = dimension
        self.max_elements = max_elements
        # HNSW parameters tuned for production
        # M=16 provides good recall with reasonable memory
        self.M = 16
        self.ef_construction = 200  # Quality during build
        self.ef_search = 64         # Quality during query
        self.index = None
        self.is_trained = False

    def create_index(self) -> faiss.IndexHNSWFlat:
        """Create HNSW index with optimized parameters."""
        try:
            # Initialize HNSW index
            index = faiss.IndexHNSWFlat(self.dimension, self.M)
            # Set construction parameters
            index.hnsw.efConstruction = self.ef_construction
            # Set search parameters (can be adjusted per query)
            index.hnsw.efSearch = self.ef_search
            # Memory estimation
            bytes_per_vector = (self.dimension * 4 + self.M * 2 * 4)
            memory_gb = (bytes_per_vector * self.max_elements) / (1024**3)
            print("✓ HNSW Index created:")
            print(f"  Dimension: {self.dimension}")
            print(f"  M (links/node): {self.M}")
            print(f"  ef_construction: {self.ef_construction}")
            print(f"  ef_search: {self.ef_search}")
            print(f"  Max elements: {self.max_elements:,}")
            print(f"  Estimated memory: {memory_gb:.2f} GB")
            self.index = index
            self.is_trained = True
            return index
        except Exception as e:
            print(f"✗ Index creation failed: {e}")
            raise

    def add_vectors(self, vectors: np.ndarray, batch_size: int = 50000) -> None:
        """Add vectors in batches to avoid memory issues."""
        if not self.is_trained:
            raise ValueError("Index not initialized")
        if vectors.shape[0] > self.max_elements:
            raise ValueError(
                f"Vector count {vectors.shape[0]} exceeds max_elements {self.max_elements}"
            )
        start_time = time.time()
        total_added = 0
        for i in range(0, len(vectors), batch_size):
            batch = vectors[i:i + batch_size]
            self.index.add(batch.astype('float32'))
            total_added += len(batch)
            if i % 100000 == 0 and i > 0:
                print(f"  Progress: {total_added:,} vectors added")
        elapsed = time.time() - start_time
        print(f"✓ Added {total_added:,} vectors in {elapsed:.2f}s "
              f"({total_added / elapsed:.0f} vectors/sec)")

    def search(self, query: np.ndarray, k: int = 10,
               ef_search: int = None) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors.

        Args:
            query: Query vector (1, dimension)
            k: Number of results
            ef_search: Search width (higher = better recall, slower)

        Returns:
            distances, indices
        """
        if not self.is_trained:
            raise ValueError("Index not initialized")
        # Adjust search width if needed
        if ef_search:
            self.index.hnsw.efSearch = ef_search
        # Validate query
        if query.shape[1] != self.dimension:
            raise ValueError(f"Query dimension mismatch: expected {self.dimension}")
        start_time = time.time()
        distances, indices = self.index.search(query.astype('float32'), k)
        elapsed = time.time() - start_time
        print(f"Query: {elapsed * 1000:.2f}ms, k={k}, ef={self.index.hnsw.efSearch}")
        return distances, indices

    def tune_for_recall(self, test_queries: np.ndarray,
                        test_neighbors: np.ndarray,
                        target_recall: float = 0.95) -> Dict:
        """
        Auto-tune ef_search to achieve target recall.
        Requires ground truth neighbors for validation.
        """
        print(f"\nTuning for {target_recall:.0%} recall...")
        best_ef = None
        best_recall = 0
        for ef in [32, 64, 128, 256, 512]:
            self.index.hnsw.efSearch = ef
            correct = 0
            for i, query in enumerate(test_queries):
                query = query.reshape(1, -1)
                _, indices = self.search(query, k=test_neighbors.shape[1])
                # Check whether the top ground-truth neighbor appears in the results
                if test_neighbors[i][0] in indices[0]:
                    correct += 1
            recall = correct / len(test_queries)
            print(f"  ef={ef}: recall={recall:.3f}")
            if recall >= target_recall and recall > best_recall:
                best_ef = ef
                best_recall = recall
            if recall >= target_recall:
                break
        if best_ef:
            self.index.hnsw.efSearch = best_ef
            print(f"✓ Optimal ef_search: {best_ef} (recall: {best_recall:.3f})")
        else:
            print("⚠ Could not achieve target recall with available ef values")
        return {'optimal_ef': best_ef, 'recall': best_recall}


# Production usage example
if __name__ == "__main__":
    # Configuration for 1M vectors, 768-dim
    DIM = 768
    N_VECTORS = 100000  # Demo with 100k

    print("=" * 60)
    print("PRODUCTION VECTOR SEARCH SETUP")
    print("=" * 60)

    # Initialize
    search = OptimizedVectorSearch(dimension=DIM, max_elements=N_VECTORS)
    search.create_index()

    # Generate sample data
    print("\nGenerating sample vectors...")
    np.random.seed(42)
    vectors = np.random.random((N_VECTORS, DIM)).astype('float32')

    # Add vectors
    print("\nAdding vectors to index...")
    search.add_vectors(vectors)

    # Search
    print("\n" + "=" * 60)
    print("SEARCH PERFORMANCE")
    print("=" * 60)
    query = np.random.random((1, DIM)).astype('float32')
    distances, indices = search.search(query, k=10)
    print(f"Top results: {indices[0]}")

Avoiding common configuration mistakes is often more impactful than fine-grained parameter tuning. The reference tables below consolidate the parameter ranges, memory formulas, and performance targets that recur across production vector search deployments and performance analyses.

Algorithm | Parameter       | Recommended Range          | Impact
HNSW      | M               | 16-32                      | Higher = better recall, more memory
HNSW      | ef_construction | 150-250                    | Higher = better index quality, slower build
HNSW      | ef_search       | 64-512                     | Higher = better recall, slower queries
IVF-PQ    | nlist           | 4√N to 16√N                | More clusters = faster search, lower recall
IVF-PQ    | nprobe          | nlist/64 to nlist/16       | Higher = better recall, slower queries
IVF-PQ    | m               | dimension/8 to dimension/4 | Must divide the dimension
IVF-PQ    | code_size       | 8 bits                     | Balance between memory and accuracy

HNSW: (dimension * 4 + M * 2 * 4) bytes per vector
Example: 768-dim, M=16 → ~3.1KB/vector → 3.1GB for 1M vectors

IVF-PQ: (code_size / 8) * m bytes of PQ codes per vector, plus roughly 24 bytes of per-vector overhead (IDs and bookkeeping)
Example: 768-dim, m=48, 8-bit → 48 bytes of codes per vector (~72 bytes with overhead) → ~48-72MB for 1M vectors

Metric        | Target    | Notes
Latency (p95) | < 50ms    | Including embedding generation if applicable
Recall@10     | > 0.95    | Adjust based on application requirements
QPS Capacity  | 100-200+  | Depends on SKU and infrastructure
Memory Budget | < 50% RAM | Leave headroom for OS and caching

Use the calculator below to estimate your index configuration and resource requirements.

class VectorSearchCalculator:
    """
    Calculate optimal index configuration and resource requirements
    based on your dataset and performance targets.
    """

    def __init__(self, num_vectors, dimension, target_qps, target_latency_ms):
        self.N = num_vectors
        self.D = dimension
        self.target_qps = target_qps
        self.target_latency = target_latency_ms

    def calculate_hnsw_config(self):
        """Calculate HNSW parameters and memory usage."""
        # M: balance recall and memory
        if self.N < 100000:
            M = 16
        elif self.N < 1000000:
            M = 24
        else:
            M = 32
        # ef_construction: quality during build
        ef_construction = min(250, max(150, int(200 * (self.N / 1000000))))
        # ef_search: quality during query (adjustable)
        ef_search = 64 if self.target_qps > 100 else 128
        # Memory estimation
        bytes_per_vector = (self.D * 4 + M * 2 * 4)
        memory_gb = (bytes_per_vector * self.N) / (1024**3)
        # Expected QPS (rough estimate)
        expected_qps = 200 if self.N < 1000000 else 30
        return {
            'algorithm': 'HNSW',
            'M': M,
            'ef_construction': ef_construction,
            'ef_search': ef_search,
            'memory_gb': round(memory_gb, 2),
            'expected_qps': expected_qps,
            'meets_qps': expected_qps >= self.target_qps
        }

    def calculate_ivf_pq_config(self):
        """Calculate IVF-PQ parameters and memory usage."""
        # nlist: number of clusters
        nlist = int(max(4 * (self.N ** 0.5), 1000))
        # m: number of PQ subvectors (must divide the dimension evenly)
        m = next((cand for cand in (48, 32, 16, 8, 4) if self.D % cand == 0), 1)
        # nprobe: clusters to search
        nprobe = max(16, nlist // 64)
        # Memory estimation (PQ codes plus per-vector overhead, no full vectors)
        bytes_per_vector = (8 // 8) * m + 24  # 8-bit PQ codes + ~24B overhead
        memory_gb = (bytes_per_vector * self.N) / (1024**3)
        # Expected QPS and recall (rough estimates)
        expected_qps = 50
        expected_recall = 0.85
        return {
            'algorithm': 'IVF-PQ',
            'nlist': nlist,
            'm': m,
            'nprobe': nprobe,
            'memory_gb': round(memory_gb, 2),
            'expected_qps': expected_qps,
            'expected_recall': expected_recall,
            'meets_qps': expected_qps >= self.target_qps
        }

    def recommend_sku(self):
        """Recommend Databricks SKU based on requirements."""
        if self.N < 320_000_000 and self.target_latency < 100:
            return {
                'sku': 'Standard',
                'latency': '20-50ms',
                'qps_range': '30-200+',
                'cost_efficiency': 'Good for <320M vectors'
            }
        elif self.N >= 10_000_000:
            return {
                'sku': 'Storage-Optimized',
                'latency': '300-500ms',
                'qps_range': '30-50',
                'cost_efficiency': '7x cheaper per vector'
            }
        else:
            return {
                'sku': 'Standard (with caution)',
                'latency': '50-100ms',
                'qps_range': '30-100',
                'note': 'Consider sharding or replication'
            }


# Example usage
calculator = VectorSearchCalculator(
    num_vectors=1_000_000,
    dimension=768,
    target_qps=100,
    target_latency_ms=50
)

print("=== HNSW Configuration ===")
hnsw = calculator.calculate_hnsw_config()
for k, v in hnsw.items():
    print(f"{k}: {v}")

print("\n=== IVF-PQ Configuration ===")
ivf = calculator.calculate_ivf_pq_config()
for k, v in ivf.items():
    print(f"{k}: {v}")

print("\n=== Databricks SKU Recommendation ===")
sku = calculator.recommend_sku()
for k, v in sku.items():
    print(f"{k}: {v}")
