
Vector Search Optimization: ANN Algorithms & Indexing Strategies

A production RAG pipeline at a Series C startup was delivering 450ms query latency—violating their 100ms SLA and costing them enterprise deals. The culprit wasn’t their embedding model or LLM inference time. It was a brute-force vector search scanning 2.3M documents on every query. After implementing HNSW indexing with proper parameter tuning, latency dropped to 28ms and they scaled to 500 QPS on the same infrastructure. This guide covers the approximate nearest neighbor (ANN) algorithms, index tuning strategies, and sharding patterns that separate production-ready vector search from toy demos.

Vector search performance directly impacts user experience and infrastructure costs. A poorly configured index can 10x your compute bill while delivering unacceptable latency. According to Databricks’ performance benchmarks, standard vector search SKUs plateau at ~30 QPS when scaling beyond a single search unit, while optimized configurations can deliver 200+ QPS with less than 50ms latency.

The cost implications are significant. Using brute-force search (exact k-NN) for datasets exceeding 20,000 filtered documents becomes economically unviable. Each query requires O(n) comparisons, so 1M float32 vectors at 768 dimensions means roughly 3GB of data scanned per query. At 100 QPS, that's ~300GB/s of memory bandwidth—forcing you onto expensive GPU or high-end CPU instances.
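
A quick back-of-the-envelope sketch of that scan cost, assuming float32 storage (4 bytes per dimension); the numbers are illustrative, not a benchmark:

# Brute-force scan cost, assuming float32 vectors (4 bytes per dimension)
num_vectors = 1_000_000
dimension = 768
qps = 100

bytes_per_query = num_vectors * dimension * 4    # ~3.07 GB scanned per query
bandwidth_needed = bytes_per_query * qps         # ~307 GB/s sustained

print(f"Per query: {bytes_per_query / 1e9:.2f} GB")
print(f"At {qps} QPS: {bandwidth_needed / 1e9:.0f} GB/s of memory bandwidth")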

For AI engineering teams, understanding ANN algorithms isn’t optional. It’s the difference between a system that scales to millions of queries at reasonable cost versus one that requires constant infrastructure upgrades and delivers poor user experience.

Approximate Nearest Neighbor algorithms trade perfect accuracy for orders-of-magnitude speed improvements. Instead of comparing every vector, they use clever data structures to prune the search space dramatically.

HNSW (Hierarchical Navigable Small World) is the current gold standard for recall-latency tradeoffs. It builds a multi-layer graph in which each upper layer contains a sparse subset of the nodes below it, allowing roughly logarithmic-time search.

How it works:

  • Construction: Build a graph where nodes (vectors) connect to their nearest neighbors
  • Search: Start at a coarse entry point, greedily traverse to closer neighbors
  • Layers: Multiple layers allow fast coarse navigation (top layers) followed by fine-grained search (bottom layers)
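
To make the traversal concrete, here is a deliberately simplified, single-layer sketch of the greedy search idea (real HNSW maintains multiple layers and an ef-sized candidate list; the helper names below are purely illustrative):

import numpy as np

def build_knn_graph(vectors, m=8):
    """Connect each node to its m nearest neighbors (brute force, demo-sized data only)."""
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    return [list(np.argsort(row)[1:m + 1]) for row in dists]  # skip self at position 0

def greedy_search(vectors, graph, query, entry=0):
    """Hop to whichever neighbor is closer to the query until no neighbor improves."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist, improved = neighbor, d, True
    return current

rng = np.random.default_rng(0)
data = rng.random((500, 32)).astype("float32")
graph = build_knn_graph(data)
q = rng.random(32).astype("float32")
print("approx nearest:", greedy_search(data, graph, q))
print("exact nearest: ", int(np.argmin(np.linalg.norm(data - q, axis=1))))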

Key parameters:

  • M (4-64): Number of bi-directional links per node. Higher M = better accuracy but more memory. Databricks recommends M=16-32 for most use cases.
  • ef_construction (100-400): Size of candidate list during construction. Higher = better index quality but slower build time.
  • ef_search (64-512): Search-time width. Higher = better recall but slower queries.

Memory usage: Approximately (dimension * 4 + M * 2 * 4) bytes per vector. For 768-dim vectors with M=16: ~3.1KB per vector.

IVF-PQ (Inverted File + Product Quantization)

IVF-PQ is the memory champion. It partitions vectors into clusters (inverted files) and compresses vectors using product quantization.

How it works:

  • IVF: Partition space into nlist clusters using k-means
  • PQ: Decompose vectors into m subvectors and compress each to code_size bits
  • Search: Only search nprobe clusters, then compute approximate distances using quantized codes

Key parameters:

  • nlist (4√N to 16√N): Number of clusters. For 1M vectors, nlist=1024-4096
  • m: Subvectors (must divide dimension). Common: 32 for 256-dim, 48 for 768-dim
  • code_size (4-8 bits): Bits per subvector. 8-bit = good balance
  • nprobe (nlist/64 to nlist/16): Clusters to search. Start with nprobe = nlist/64

Memory savings: 768-dim float32 = 3KB. 768-dim PQ (m=48, 8-bit) = 48 bytes. ~64x reduction.
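
A minimal FAISS sketch wiring these parameters together; the specific values (nlist, m, nbits, nprobe) are illustrative starting points for roughly 100k vectors, not tuned recommendations:

import faiss
import numpy as np

d = 768                                   # vector dimension
nlist = 1024                              # clusters (~4*sqrt(N) for this dataset size)
m = 48                                    # subvectors; must divide d (768 / 48 = 16 dims each)
nbits = 8                                 # bits per subvector code

quantizer = faiss.IndexFlatL2(d)          # coarse quantizer for the IVF partitioning
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.random((100_000, d)).astype("float32")
index.train(xb)                           # learns the k-means centroids and PQ codebooks
index.add(xb)

index.nprobe = max(1, nlist // 64)        # clusters scanned per query
xq = np.random.random((5, d)).astype("float32")
distances, ids = index.search(xq, 10)
print(ids[0])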

ScaNN (Scalable Nearest Neighbors) is Google’s ANN library that excels at anisotropic vector compression. It optimizes for the specific distribution of your data rather than using uniform quantization.

Key features:

  • Anisotropic quantization: Weights different parts of the vector space based on data density
  • Voronoi diagram: More sophisticated partitioning than simple k-means
  • SIMD acceleration: Heavily optimized for modern x86 CPUs (AVX2/AVX-512)

ScaNN is particularly effective when your vectors have non-uniform density or when you need to optimize for a specific recall target.
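
A sketch of the typical ScaNN builder flow, following the pattern from the library's documentation; the tree and quantization parameter values here are illustrative and should be tuned for your recall target:

import numpy as np
import scann

dataset = np.random.random((100_000, 768)).astype("float32")
# ScaNN's anisotropic quantization targets inner-product search,
# so normalize vectors if you want cosine similarity.
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=50_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)   # anisotropic hashing
    .reorder(100)                                          # exact rescoring of top candidates
    .build()
)

query = np.random.random(768).astype("float32")
query /= np.linalg.norm(query)
neighbors, distances = searcher.search(query, final_num_neighbors=10)
print(neighbors)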

The relationship between parameters is non-linear. Small changes can dramatically affect both performance and recall.

Construction vs. Search Tradeoffs:

  • High ef_construction + High M: Best recall, slow builds, large memory
  • Low ef_construction + Low M: Fast builds, poor recall, smaller memory
  • Balanced: M=16-24, ef_construction=150-250 for most production workloads

Search Parameter Tuning: ef_search (HNSW) and nprobe (IVF) are the primary search-time knobs. Sweep them against a recall target on held-out queries rather than guessing—the sketch below shows the shape of such a sweep.
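
A small, self-contained sweep on synthetic data (illustrative sizes only; on real data, compute ground truth from an exact IndexFlat index over your own vectors):

import time
import faiss
import numpy as np

d, n, n_queries, k = 128, 50_000, 200, 10
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((n_queries, d)).astype("float32")

exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, gt = exact.search(xq, k)                   # ground-truth neighbors

hnsw = faiss.IndexHNSWFlat(d, 16)
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)

for ef in (16, 32, 64, 128, 256):
    hnsw.hnsw.efSearch = ef
    t0 = time.time()
    _, ids = hnsw.search(xq, k)
    latency_ms = (time.time() - t0) / n_queries * 1000
    recall = np.mean([len(set(ids[i]) & set(gt[i])) / k for i in range(n_queries)])
    print(f"efSearch={ef:3d}  recall@{k}={recall:.3f}  avg latency={latency_ms:.2f} ms")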

Building production vector search requires more than understanding theory—you need battle-tested patterns for deployment, monitoring, and scaling. The following sections cover implementation strategies that work at scale.

Before going live, verify these critical configurations:

Index Construction:

  • Pre-train IVF-PQ indexes on 10% of your dataset (minimum 1,000 vectors)
  • Set HNSW M between 16-32 for optimal recall-latency balance
  • Configure ef_construction 150-250 for quality index builds
  • Validate vector normalization when using cosine similarity
  • Test recall@k on holdout data before production deployment
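
A sketch covering the last two checklist items—normalizing vectors for cosine similarity and gating deployment on recall@k against exact ground truth. Sizes and parameters are illustrative:

import faiss
import numpy as np

d, n, n_holdout, k = 768, 50_000, 500, 10
xb = np.random.random((n, d)).astype("float32")
xq = np.random.random((n_holdout, d)).astype("float32")

faiss.normalize_L2(xb)          # with unit vectors, inner product == cosine similarity
faiss.normalize_L2(xq)

exact = faiss.IndexFlatIP(d)    # exact ground truth for the holdout queries
exact.add(xb)
_, gt = exact.search(xq, k)

ann = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
ann.hnsw.efConstruction = 200
ann.hnsw.efSearch = 64
ann.add(xb)
_, approx = ann.search(xq, k)

recall_at_k = np.mean([len(set(approx[i]) & set(gt[i])) / k for i in range(n_holdout)])
print(f"recall@{k} = {recall_at_k:.3f}")   # gate deployment on e.g. >= 0.95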

Infrastructure:

  • Use service principals with OAuth instead of PATs (saves 100ms+ latency)
  • Enable circuit breakers (e.g., knn.memory.circuit_breaker.limit=60 in OpenSearch)
  • Configure proper shard count based on data size and QPS requirements
  • Set up monitoring for latency, QPS, and recall metrics
  • Plan for query spikes with exponential backoff and retry logic
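
A minimal sketch of the backoff-and-retry pattern from the last bullet; search_fn is a placeholder for whichever client call your vector store exposes, and the broad except should be narrowed to the transient or throttling errors that client actually raises:

import random
import time

def search_with_backoff(search_fn, *args, max_retries=5, base_delay=0.05, max_delay=2.0, **kwargs):
    """Retry a query with exponential backoff and jitter during spikes or throttling."""
    for attempt in range(max_retries):
        try:
            return search_fn(*args, **kwargs)
        except Exception:                              # narrow to your client's transient errors
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids synchronized retries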

Query Optimization:

  • Reuse index objects across queries (avoid reinitialization overhead)
  • Set num_results in 10-100 range unless application requires more
  • Use ANN queries unless hybrid search is explicitly needed
  • Implement batch processing for insertions and queries
  • Cache frequently accessed indexes in memory
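
For the batching point, a small FAISS sketch (illustrative sizes): one batched search call instead of hundreds of single-query calls amortizes per-call overhead and lets the library parallelize across queries.

import faiss
import numpy as np

d, k = 768, 10
index = faiss.IndexHNSWFlat(d, 16)
index.add(np.random.random((100_000, d)).astype("float32"))

queries = np.random.random((256, d)).astype("float32")

# One batched call instead of 256 separate index.search(q, k) calls
distances, ids = index.search(queries, k)     # result shapes: (256, k)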

Horizontal Scaling: When single-instance performance plateaus, use these patterns:

  1. Split indexes across endpoints: Host multiple indexes on separate endpoints to increase total bandwidth
  2. Replicate indexes: Duplicate high-traffic indexes across endpoints and split traffic evenly
  3. User-defined sharding: Partition by tenant_id or region to reduce cross-node traffic
  4. Multitenancy optimization: Use logical partitioning with payload filters instead of separate collections
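
A sketch of user-defined sharding by tenant: a stable hash routes each tenant's writes and queries to the same shard, so queries never fan out across nodes. The endpoint list and routing scheme are illustrative, not a specific product's API:

import hashlib

shard_endpoints = [
    "https://search-shard-0.internal",
    "https://search-shard-1.internal",
    "https://search-shard-2.internal",
    "https://search-shard-3.internal",
]

def shard_for_tenant(tenant_id):
    """Stable hash so a given tenant always lands on the same shard."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return shard_endpoints[int.from_bytes(digest[:4], "big") % len(shard_endpoints)]

print(shard_for_tenant("acme-corp"))   # always resolves to the same endpoint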

Vertical Scaling:

  • Standard SKU: best for <320M vectors; 20-50ms latency; 30-200+ QPS
  • Storage-optimized SKU: best for 10M+ vectors; 300-500ms latency; ~7x lower cost per vector

Complete HNSW Implementation with Memory Optimization

This example demonstrates a production-ready HNSW index with parameter tuning, memory estimation, and error handling.

import faiss
import numpy as np
import time
from typing import Tuple, Dict


class OptimizedVectorSearch:
    """
    Production-ready vector search with HNSW indexing.
    Optimized for sub-50ms latency at scale.
    """

    def __init__(self, dimension: int, max_elements: int = 1000000):
        self.dimension = dimension
        self.max_elements = max_elements
        # HNSW parameters tuned for production
        # M=16 provides good recall with reasonable memory
        self.M = 16
        self.ef_construction = 200  # Quality during build
        self.ef_search = 64         # Quality during query
        self.index = None
        self.is_trained = False

    def create_index(self) -> faiss.IndexHNSWFlat:
        """Create HNSW index with optimized parameters."""
        try:
            # Initialize HNSW index
            index = faiss.IndexHNSWFlat(self.dimension, self.M)
            # Set construction parameters
            index.hnsw.efConstruction = self.ef_construction
            # Set search parameters (can be adjusted per query)
            index.hnsw.efSearch = self.ef_search
            # Memory estimation
            bytes_per_vector = (self.dimension * 4 + self.M * 2 * 4)
            memory_gb = (bytes_per_vector * self.max_elements) / (1024**3)
            print("✓ HNSW Index created:")
            print(f"  Dimension: {self.dimension}")
            print(f"  M (links/node): {self.M}")
            print(f"  ef_construction: {self.ef_construction}")
            print(f"  ef_search: {self.ef_search}")
            print(f"  Max elements: {self.max_elements:,}")
            print(f"  Estimated memory: {memory_gb:.2f} GB")
            self.index = index
            self.is_trained = True
            return index
        except Exception as e:
            print(f"✗ Index creation failed: {e}")
            raise

    def add_vectors(self, vectors: np.ndarray, batch_size: int = 50000) -> None:
        """Add vectors in batches to avoid memory issues."""
        if not self.is_trained:
            raise ValueError("Index not initialized")
        if vectors.shape[0] > self.max_elements:
            raise ValueError(
                f"Vector count {vectors.shape[0]} exceeds max_elements {self.max_elements}"
            )
        start_time = time.time()
        total_added = 0
        for i in range(0, len(vectors), batch_size):
            batch = vectors[i:i + batch_size]
            self.index.add(batch.astype('float32'))
            total_added += len(batch)
            if i % 100000 == 0 and i > 0:
                print(f"  Progress: {total_added:,} vectors added")
        elapsed = time.time() - start_time
        print(f"✓ Added {total_added:,} vectors in {elapsed:.2f}s "
              f"({total_added / elapsed:.0f} vectors/sec)")

    def search(self, query: np.ndarray, k: int = 10,
               ef_search: int = None) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors.

        Args:
            query: Query vector (1, dimension)
            k: Number of results
            ef_search: Search width (higher = better recall, slower)

        Returns:
            distances, indices
        """
        if not self.is_trained:
            raise ValueError("Index not initialized")
        # Adjust search width if needed
        if ef_search:
            self.index.hnsw.efSearch = ef_search
        # Validate query
        if query.shape[1] != self.dimension:
            raise ValueError(f"Query dimension mismatch: expected {self.dimension}")
        start_time = time.time()
        distances, indices = self.index.search(query.astype('float32'), k)
        elapsed = time.time() - start_time
        print(f"Query: {elapsed * 1000:.2f}ms, k={k}, ef={self.index.hnsw.efSearch}")
        return distances, indices

    def tune_for_recall(self, test_queries: np.ndarray,
                        test_neighbors: np.ndarray,
                        target_recall: float = 0.95) -> Dict:
        """
        Auto-tune ef_search to achieve target recall.
        Requires ground truth neighbors for validation.
        """
        print(f"\nTuning for {target_recall:.0%} recall...")
        best_ef = None
        best_recall = 0
        for ef in [32, 64, 128, 256, 512]:
            self.index.hnsw.efSearch = ef
            correct = 0
            for i, query in enumerate(test_queries):
                query = query.reshape(1, -1)
                _, indices = self.search(query, k=test_neighbors.shape[1])
                # Check whether the top ground-truth neighbor appears in the results
                if test_neighbors[i][0] in indices[0]:
                    correct += 1
            recall = correct / len(test_queries)
            print(f"  ef={ef}: recall={recall:.3f}")
            if recall >= target_recall and recall > best_recall:
                best_ef = ef
                best_recall = recall
            if recall >= target_recall:
                break
        if best_ef:
            self.index.hnsw.efSearch = best_ef
            print(f"✓ Optimal ef_search: {best_ef} (recall: {best_recall:.3f})")
        else:
            print("⚠ Could not achieve target recall with available ef values")
        return {'optimal_ef': best_ef, 'recall': best_recall}


# Production usage example
if __name__ == "__main__":
    # Configuration for 1M vectors, 768-dim
    DIM = 768
    N_VECTORS = 100000  # Demo with 100k

    print("=" * 60)
    print("PRODUCTION VECTOR SEARCH SETUP")
    print("=" * 60)

    # Initialize
    search = OptimizedVectorSearch(dimension=DIM, max_elements=N_VECTORS)
    search.create_index()

    # Generate sample data
    print("\nGenerating sample vectors...")
    np.random.seed(42)
    vectors = np.random.random((N_VECTORS, DIM)).astype('float32')

    # Add vectors
    print("\nAdding vectors to index...")
    search.add_vectors(vectors)

    # Search
    print("\n" + "=" * 60)
    print("SEARCH PERFORMANCE")
    print("=" * 60)
    query = np.random.random((1, DIM)).astype('float32')
    distances, indices = search.search(query, k=10)
    print(f"Top results: {indices[0]}")

Avoiding common configuration mistakes is often more impactful than fine-grained parameter tuning. The reference tables below consolidate the parameter ranges, memory formulas, and performance targets that recur across production vector search deployments and performance analyses.

Algorithm | Parameter       | Recommended Range          | Impact
HNSW      | M               | 16-32                      | Higher = better recall, more memory
HNSW      | ef_construction | 150-250                    | Higher = better index quality, slower build
HNSW      | ef_search       | 64-512                     | Higher = better recall, slower queries
IVF-PQ    | nlist           | 4√N to 16√N                | More clusters = faster search, lower recall
IVF-PQ    | nprobe          | nlist/64 to nlist/16       | Higher = better recall, slower queries
IVF-PQ    | m               | dimension/8 to dimension/4 | Must divide the dimension
IVF-PQ    | code_size       | 8 bits                     | Balance between memory and accuracy

HNSW: (dimension * 4 + M * 2 * 4) bytes per vector
Example: 768-dim, M=16 → ~3.1KB/vector → 3.1GB for 1M vectors

IVF-PQ: (code_size / 8) * m bytes of PQ codes per vector, plus roughly 24 bytes of per-vector overhead (IDs and bookkeeping)
Example: 768-dim, m=48, 8-bit → 48 bytes of codes per vector (~72 bytes with overhead) → ~48-72MB for 1M vectors

Metric        | Target    | Notes
Latency (p95) | < 50ms    | Including embedding generation if applicable
Recall@10     | > 0.95    | Adjust based on application requirements
QPS Capacity  | 100-200+  | Depends on SKU and infrastructure
Memory Budget | < 50% RAM | Leave headroom for OS and caching

Use the calculator below to estimate your index configuration and resource requirements.

class VectorSearchCalculator:
    """
    Calculate optimal index configuration and resource requirements
    based on your dataset and performance targets.
    """

    def __init__(self, num_vectors, dimension, target_qps, target_latency_ms):
        self.N = num_vectors
        self.D = dimension
        self.target_qps = target_qps
        self.target_latency = target_latency_ms

    def calculate_hnsw_config(self):
        """Calculate HNSW parameters and memory usage."""
        # M: balance recall and memory
        if self.N < 100000:
            M = 16
        elif self.N < 1000000:
            M = 24
        else:
            M = 32
        # ef_construction: quality during build
        ef_construction = min(250, max(150, int(200 * (self.N / 1000000))))
        # ef_search: quality during query (adjustable)
        ef_search = 64 if self.target_qps > 100 else 128
        # Memory estimation
        bytes_per_vector = (self.D * 4 + M * 2 * 4)
        memory_gb = (bytes_per_vector * self.N) / (1024**3)
        # Expected QPS (rough estimate)
        expected_qps = 200 if self.N < 1000000 else 30
        return {
            'algorithm': 'HNSW',
            'M': M,
            'ef_construction': ef_construction,
            'ef_search': ef_search,
            'memory_gb': round(memory_gb, 2),
            'expected_qps': expected_qps,
            'meets_qps': expected_qps >= self.target_qps
        }

    def calculate_ivf_pq_config(self):
        """Calculate IVF-PQ parameters and memory usage."""
        # nlist: number of clusters
        nlist = int(max(4 * (self.N ** 0.5), 1000))
        # m: number of PQ subvectors (must divide the dimension evenly)
        m = next((cand for cand in (48, 32, 16, 8, 4) if self.D % cand == 0), 1)
        # nprobe: clusters to search
        nprobe = max(16, nlist // 64)
        # Memory estimation (PQ codes plus per-vector overhead, no full vectors)
        bytes_per_vector = (8 // 8) * m + 24  # 8-bit PQ codes + ~24B overhead
        memory_gb = (bytes_per_vector * self.N) / (1024**3)
        # Expected QPS and recall (rough estimates)
        expected_qps = 50
        expected_recall = 0.85
        return {
            'algorithm': 'IVF-PQ',
            'nlist': nlist,
            'm': m,
            'nprobe': nprobe,
            'memory_gb': round(memory_gb, 2),
            'expected_qps': expected_qps,
            'expected_recall': expected_recall,
            'meets_qps': expected_qps >= self.target_qps
        }

    def recommend_sku(self):
        """Recommend Databricks SKU based on requirements."""
        if self.N < 320_000_000 and self.target_latency < 100:
            return {
                'sku': 'Standard',
                'latency': '20-50ms',
                'qps_range': '30-200+',
                'cost_efficiency': 'Good for <320M vectors'
            }
        elif self.N >= 10_000_000:
            return {
                'sku': 'Storage-Optimized',
                'latency': '300-500ms',
                'qps_range': '30-50',
                'cost_efficiency': '7x cheaper per vector'
            }
        else:
            return {
                'sku': 'Standard (with caution)',
                'latency': '50-100ms',
                'qps_range': '30-100',
                'note': 'Consider sharding or replication'
            }


# Example usage
calculator = VectorSearchCalculator(
    num_vectors=1_000_000,
    dimension=768,
    target_qps=100,
    target_latency_ms=50
)

print("=== HNSW Configuration ===")
hnsw = calculator.calculate_hnsw_config()
for k, v in hnsw.items():
    print(f"{k}: {v}")

print("\n=== IVF-PQ Configuration ===")
ivf = calculator.calculate_ivf_pq_config()
for k, v in ivf.items():
    print(f"{k}: {v}")

print("\n=== Databricks SKU Recommendation ===")
sku = calculator.recommend_sku()
for k, v in sku.items():
    print(f"{k}: {v}")
