A Google Cloud e-commerce customer reduced retrieval latency from 500ms to under 100ms while improving recall@100 to 95% by implementing multi-stage retrieval with two-tower architecture and semantic reranking. The pipeline retrieves 100 candidates via vector search in 20ms, reranks them in 80ms using Vertex AI Ranking API, and generates final answers in 150ms—total end-to-end latency of 350ms versus 2+ seconds with single-stage LLM reranking. This guide covers production-ready multi-stage retrieval patterns, cost optimization strategies, and implementation code for building high-performance RAG systems.
Single-stage retrieval forces an impossible tradeoff: vector search alone misses nuanced relevance, while LLM reranking every candidate is prohibitively expensive and slow. Multi-stage architectures solve this by combining the scalability of bi-encoders (vector search) with the precision of cross-encoders (reranking).
Consider a knowledge base with 10 million documents. A naive approach using LLM reranking for all candidates would require:
- **Cost**: 10M LLM calls per query at $5/1M tokens = $50 per query
- **Latency**: 10M × 10ms = 100,000 seconds (roughly 27 hours)
- A practical impossibility for production systems
Multi-stage retrieval reduces this to:
- **Stage 1**: Vector search retrieves the top-100 candidates in 20ms
- **Stage 2**: Rerank the 100 candidates using a specialized API (~100ms) or an LLM (1-2s)
- **Stage 3**: Generate the answer from the top-10 candidates (150ms)
- **Total cost**: ~$0.0005 per query, a 99.999% reduction (see the arithmetic sketch below)
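For readers who want to check the arithmetic, here is a back-of-the-envelope sketch using the illustrative figures above (these are the article's example numbers, not measured benchmarks):

```python
# Rough arithmetic behind the comparison above.
DOCS = 10_000_000

# Single-stage: one LLM call per document, ~10ms each, serialized.
single_stage_latency_s = DOCS * 0.010            # = 100,000 s ≈ 27.8 hours
single_stage_cost = 50.00                        # per-query figure quoted above

# Multi-stage: vector search + rerank top-100 + generate from top-10.
multi_stage_latency_s = 0.020 + 0.100 + 0.150    # 20ms + 100ms + 150ms
multi_stage_cost = 0.0002 + 0.0003 + 0.0001      # ≈ $0.0006 per query

print(f"Single-stage: ~${single_stage_cost:.2f}/query, ~{single_stage_latency_s / 3600:.1f} hours")
print(f"Multi-stage:  ~${multi_stage_cost:.4f}/query, ~{multi_stage_latency_s * 1000:.0f} ms")
```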
**Required for:**

- Knowledge bases larger than 100K documents
- Sub-500ms end-to-end latency requirements
- High query volumes (>10K queries/day)
- Complex queries requiring nuanced relevance scoring

**Optional for:**

- Small knowledge bases (<10K documents) where single-stage retrieval is sufficient
- Batch processing where latency is not critical
- Applications with simple keyword-based retrieval needs
The two-tower architecture separates query and candidate encoding, enabling precomputation and independent scaling.
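A minimal sketch of the idea, where `query_tower` and `candidate_tower` stand in for two separately trained encoders and a brute-force dot product stands in for the ANN index:

```python
import numpy as np

def build_index(documents, candidate_tower):
    # Candidate embeddings are computed once, offline, and stored
    # (the "candidate tower" side of the architecture).
    return np.stack([candidate_tower(doc) for doc in documents])

def retrieve(query, query_tower, index, top_k=100):
    # At query time only the query is encoded; relevance is a dot product
    # against the precomputed candidate matrix, which an ANN index
    # approximates at scale.
    query_vec = query_tower(query)
    scores = index @ query_vec
    return np.argsort(-scores)[:top_k]
```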
Multi-stage retrieval directly impacts your bottom line. A 10M-document knowledge base with single-stage LLM reranking costs $50 per query and takes 27 hours per query. Multi-stage retrieval reduces this to $0.0005 per query and 350ms: a 99.999% cost reduction and a 277,000x latency improvement.
The key insight: not all retrieval operations are equal. Vector search (a bi-encoder) excels at scalable candidate generation but misses nuanced relevance. Cross-encoders capture fine-grained relevance but are far too expensive to run over an entire corpus. Multi-stage combines both: approximately O(log n) ANN retrieval over n documents, followed by expensive cross-encoder reranking over only k candidates, where k << n.
| Stage | Single-Stage (LLM Rerank All) | Multi-Stage (Vector + Rerank Top-100) |
|---|---|---|
| Retrieval | 10M LLM calls: $50/query | Vector search: $0.0002/query |
| Reranking | N/A | 100 API calls: $0.0003/query |
| Generation | Included above | 1 LLM call: $0.0001/query |
| Latency | 27 hours | 350ms |
| **Total cost** | **$50/query** | **$0.0006/query** |
Based on Vertex AI pricing: Gemini 2.0 Flash ($0.10 input / $0.40 output per 1M tokens) and the Ranking API (free tier available).
**Vertex AI Ranking API (semantic reranker):**

- **Latency**: <100ms
- **Cost**: Per-request pricing (free tier available)
- **Use case**: Real-time applications, high query volume
- **Accuracy**: State-of-the-art performance

**LLM reranker (Gemini):**

- **Latency**: 1-2 seconds
- **Cost**: LLM token pricing ($0.10/$0.40 per 1M tokens)
- **Use case**: Complex queries requiring nuanced understanding
- **Accuracy**: Model-dependent, potentially higher than the semantic reranker
```mermaid
flowchart TD
    A[Query] --> B{Knowledge Base Size}
    B -->|"< 10K docs"| C[Single-Stage Vector Search]
    B -->|"10K-1M docs"| D[Multi-Stage: Vector + Semantic Rerank]
    B -->|"> 1M docs"| E[Two-Tower + Semantic Rerank]
    D --> G[Top-100 Candidates]
    E --> G
    G --> H[Semantic Reranking]
```
**For Vertex AI RAG Engine users:**

1. Create a RAG corpus with a Vector Search backend
2. Configure `RagRetrievalConfig` with `ranking.rank_service`
3. Set `top_k` to 20-100 for retrieval
4. Enable the ranking API via the Discovery Engine API
5. Monitor latency: target <100ms for the reranking stage
**For a custom two-tower pipeline:**

1. Train separate query and candidate encoders
2. Precompute candidate embeddings
3. Deploy them to a Vector Search index
4. Implement an async reranking pipeline
5. Batch reranking requests for throughput
- **Precompute embeddings**: The two-tower architecture amortizes encoding cost across queries
- **Batch reranking**: Process multiple queries in parallel
- **Tiered retrieval**: Use keyword search for simple queries, vector search for complex ones
- **Caching**: Cache reranked results for identical queries (see the sketch below)
- **Model selection**: Use smaller models for retrieval, larger models for generation
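A minimal sketch of result caching for identical queries, assuming an in-process cache is acceptable and that each candidate dict carries an `id`; the class and eviction policy are illustrative, not a specific library:

```python
import hashlib
from typing import Callable, Dict, List

def _cache_key(query: str, candidate_ids: tuple) -> str:
    # Identical query + identical candidate set -> identical reranked order.
    raw = query + "|" + ",".join(candidate_ids)
    return hashlib.sha256(raw.encode()).hexdigest()

class RerankCache:
    """In-memory cache for reranked results; swap for Redis/Memorystore in production."""

    def __init__(self, max_entries: int = 10_000):
        self._store: Dict[str, List] = {}
        self._max_entries = max_entries

    def get_or_compute(self, query: str, candidates: List[Dict], rerank_fn: Callable):
        key = _cache_key(query, tuple(str(c["id"]) for c in candidates))
        if key not in self._store:
            if len(self._store) >= self._max_entries:
                self._store.pop(next(iter(self._store)))  # naive eviction
            self._store[key] = rerank_fn(query, candidates)
        return self._store[key]
```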
End-to-end example using the Vertex AI RAG Engine with the Ranking API (project, location, and corpus values are placeholders):

```python
import time

import vertexai
from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"  # choose your region
RAG_CORPUS_NAME = f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/your-corpus"
RANKER_MODEL = "semantic-ranker-default@latest"  # <100ms latency
LLM_MODEL = "gemini-2.0-flash"                   # Fast generation

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Stage 1: Configure retrieval with semantic reranking.
# top_k=20 retrieves 20 candidates, which are then reranked.
config = rag.RagRetrievalConfig(
    top_k=20,
    ranking=rag.Ranking(
        rank_service=rag.RankService(model_name=RANKER_MODEL)
    ),
)

# Stage 2: Create the retrieval tool backed by the RAG corpus.
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=RAG_CORPUS_NAME)],
            rag_retrieval_config=config,
        ),
    )
)

# Stage 3: Generate with the LLM, grounded in the reranked candidates.
model = GenerativeModel(LLM_MODEL, tools=[rag_retrieval_tool])

# Execute the pipeline with timing
def execute_pipeline(query: str):
    start = time.time()
    # Vector search + reranking happens automatically inside the tool call
    response = model.generate_content(query)
    latency = time.time() - start
    print(f"Latency: {latency:.2f}s")
    print(f"Answer: {response.text[:200]}...")
    return response

response = execute_pipeline(
    "What is the primary benefit of multi-stage retrieval?"
)
```
For a fully custom pipeline (your own Vector Search index, the Discovery Engine Ranking API, and an OpenAI-compatible embedding/LLM client), the skeleton below sketches the same three stages; the injected client interfaces and placeholder names are illustrative rather than a drop-in implementation:

```python
from typing import Dict, List, Tuple

from google.cloud import discoveryengine_v1 as discoveryengine


class MultiStagePipeline:
    """Skeleton of the three-stage pipeline; client interfaces are illustrative."""

    def __init__(self, vector_index, rerank_client, llm_client):
        self.vector_index = vector_index    # Vertex AI Vector Search index (ANN search)
        self.rerank_client = rerank_client  # discoveryengine.RankServiceAsyncClient
        self.llm_client = llm_client        # OpenAI-compatible async client (embeddings + chat)

    async def retrieve_stage(self, query: str, top_k: int = 100) -> List[Dict]:
        """Stage 1: Fast vector search."""
        query_embedding = await self.llm_client.embeddings.create(
            model="text-embedding-3-large",
            input=query,
        )
        # Vector search: sublinear ANN lookup over precomputed candidate embeddings
        return self.vector_index.search(
            query_embedding.data[0].embedding,
            k=top_k,
        )

    async def rerank_stage(self, query: str, candidates: List[Dict]) -> List[Tuple[Dict, float]]:
        """Stage 2: Accurate reranking (Ranking API: <100ms; LLM reranker: 1-2s)."""
        request = discoveryengine.RankRequest(
            ranking_config="projects/{project}/locations/global/rankingConfigs/default_ranking_config",
            model="semantic-ranker-default@latest",
            query=query,
            records=[
                discoveryengine.RankingRecord(
                    id=str(i),
                    title=c.get("title", ""),
                    content=c["content"],
                )
                for i, c in enumerate(candidates)
            ],
        )
        response = await self.rerank_client.rank(request=request)
        # Records come back sorted by score; map them back to candidates by id.
        return [(candidates[int(r.id)], r.score) for r in response.records]

    async def generate_stage(self, query: str, top_reranked: List[Dict]) -> str:
        """Stage 3: Final generation."""
        context = "\n\n".join(c["content"] for c in top_reranked[:3])
        prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
        response = await self.llm_client.chat.completions.create(
            model="gemini-2.0-flash",  # model name as exposed by the OpenAI-compatible endpoint
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def execute(self, query: str) -> str:
        """Execute the full pipeline."""
        # Stage 1: Retrieve 100 candidates (~20ms)
        candidates = await self.retrieve_stage(query, top_k=100)
        # Stage 2: Rerank the top 100 (~100ms)
        reranked = await self.rerank_stage(query, candidates)
        top_10 = [c for c, _ in reranked[:10]]
        # Stage 3: Generate (~150ms)
        return await self.generate_stage(query, top_10)


pipeline = MultiStagePipeline(
    vector_index=vertex_ai_vector_search_index,
    rerank_client=vertex_ai_ranking_api,
    llm_client=openai_compatible_llm_client,
)
```
Multi-stage retrieval introduces new failure modes that don’t exist in single-stage systems. Here are the most critical pitfalls to avoid:
The Mistake: Retrieving 1000+ candidates and reranking all of them.
Why It Fails:
- **Cost explosion**: The Vertex AI Ranking API charges per request; 1,000 candidates = 1,000 requests
- **Latency death**: 1,000 × 100ms = 100 seconds (semantic reranking) or 1,000 × 2s = 2,000 seconds (LLM reranking)
- **Diminishing returns**: The top-100 candidates typically contain 95%+ of the relevant documents
The Fix: Limit reranking to 50-200 candidates. Use this formula:
```python
# Optimal candidate count based on corpus size
def optimal_rerank_candidates(corpus_size: int) -> int:
    if corpus_size < 10_000:        # "small" threshold assumed from the sizing guidance above
        return 10   # Small corpus, minimal reranking needed
    elif corpus_size < 1_000_000:
        return 50   # Medium corpus, moderate reranking
    else:
        return 100  # Large corpus, comprehensive reranking

# Never exceed 200 candidates for cost/latency reasons
MAX_RERANK_CANDIDATES = 200
```
The Mistake: Adding reranking without accounting for cumulative latency.
Real-World Impact:
- **Target**: 500ms end-to-end latency
- **Vector search**: 20ms
- **Semantic reranking**: 100ms
- **LLM generation**: 150ms
- **Total**: 270ms (within budget)
- **With LLM reranking**: 20ms + 2,000ms + 150ms = 2,170ms (4.3x over budget)
The Fix: Choose reranker based on latency budget:
| Latency Budget | Reranker Choice | Expected Accuracy |
|---|---|---|
| <200ms | No reranking or cached results | Baseline |
| 200-500ms | Vertex AI Ranking API | High |
| 500ms-2s | LLM reranker (small model) | Very High |
| >2s | LLM reranker (large model) | Maximum |
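A small helper that encodes the table above; the returned strategy names are labels for this guide's options, not API values:

```python
def choose_reranker(latency_budget_ms: float) -> str:
    """Map a latency budget to the reranking strategy from the table above."""
    if latency_budget_ms < 200:
        return "none_or_cached"        # Baseline accuracy
    if latency_budget_ms < 500:
        return "vertex_ranking_api"    # High accuracy, <100ms
    if latency_budget_ms < 2000:
        return "llm_reranker_small"    # Very high accuracy, ~1s
    return "llm_reranker_large"        # Maximum accuracy, 2s+

assert choose_reranker(350) == "vertex_ranking_api"
```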
The Mistake: Computing candidate embeddings on-the-fly for every query.
Why It Fails:
- **Redundant computation**: The same documents are embedded repeatedly
- **Latency spike**: 10M documents × 10ms/embedding = 27 hours per query
- **Cost waste**: Paying for the same inference over and over
The Fix: Precompute and cache candidate embeddings:
```python
import asyncio

# BAD: On-the-fly embedding
async def bad_retrieval(query: str):
    query_emb = await embed(query)
    candidates = await fetch_documents(query)  # 10M docs
    candidate_embs = await asyncio.gather(*[embed(c) for c in candidates])  # Expensive!
    return vector_search(query_emb, candidate_embs)

# GOOD: Precomputed embeddings
class PrecomputedRetriever:
    def __init__(self, vector_index):
        self.index = vector_index  # Prebuilt with candidate embeddings

    async def good_retrieval(self, query: str):
        query_emb = await embed(query)              # Only embed the query
        return self.index.search(query_emb, k=100)  # Instant retrieval
```
The Mistake: Using the same embedding model for both retrieval and reranking.
Why It Fails:
- **Misses nuanced relevance**: Bi-encoders encode query and document separately; cross-encoders see them together
- **No interaction signal**: Cross-encoders capture query-document interaction features
- **Suboptimal ranking**: The top-100 from a bi-encoder may not surface the best answers first
The Fix: Use cross-encoder for reranking:
```python
# Bi-encoder (good for retrieval, bad for reranking)
#   Query: "AI infrastructure costs"
#   Doc:   "GPU pricing is $3/hour"
#   Similarity: moderate (different surface words)

# Cross-encoder (good for reranking)
#   Input:  "[AI infrastructure costs] [GPU pricing is $3/hour]"
#   Output: high score (recognizes the "costs" ↔ "pricing" relationship)
```
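For illustration, here is what cross-encoder reranking looks like with the open-source sentence-transformers library, used as a local stand-in for the managed Ranking API (the checkpoint name is a commonly used public model, not something prescribed by this guide):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, document) pairs jointly, capturing interaction signals.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "AI infrastructure costs"
candidates = [
    "GPU pricing is $3/hour",
    "Kubernetes is a container orchestrator",
    "Cloud TPU billing is per chip-hour",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```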
The Mistake: Sending reranking requests one-by-one.
Why It Fails:
- **Throughput bottleneck**: ~100 QPS with single requests vs. 1,000+ QPS with batching
- **Latency overhead**: Network round-trips add 10-50ms per request
- **Cost inefficiency**: APIs charge per request; batching reduces overhead
The Fix: Batch reranking requests:
```python
import asyncio
from typing import Dict, List

# BAD: Sequential reranking
async def rerank_sequential(query: str, candidates: List[Dict]):
    results = []
    for candidate in candidates:
        results.append(await ranking_api.rerank(query, candidate))  # ~100ms each
    return results  # ~100ms × len(candidates) total

# GOOD: Parallel reranking
async def rerank_parallel(query: str, candidates: List[Dict]):
    tasks = [ranking_api.rerank(query, c) for c in candidates]
    results = await asyncio.gather(*tasks)  # ~100ms total
    return results
```
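In practice you will usually also cap concurrency so the parallel calls don't trip API rate limits; a sketch using `asyncio.Semaphore`, reusing the same placeholder `ranking_api` client (the limit of 20 is an arbitrary example):

```python
import asyncio

async def rerank_bounded(query, candidates, max_concurrency: int = 20):
    # Cap in-flight requests so bursts don't exhaust quota or rate limits.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def rerank_one(candidate):
        async with semaphore:
            return await ranking_api.rerank(query, candidate)

    return await asyncio.gather(*(rerank_one(c) for c in candidates))
```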
The Mistake: Using default ANN parameters without tuning.
Why It Fails:
- **Recall@100**: 70% (default) vs. 95% (tuned)
- **Latency**: 20ms (default) vs. 50ms (tuned)
- **Result**: Missing 25% of relevant documents for a marginal latency gain
The Fix: Tune ANN parameters:
```python
# Vertex AI Vector Search index tuning (tree-AH)
index_config = {
    "approximate_neighbors_count": 150,   # Increase for higher recall
    "leaf_nodes_to_search_percent": 7,    # Increase for higher recall
    "distance_measure_type": "DOT_PRODUCT_DISTANCE",
}

# Evaluate on a validation set
# Target: >90% recall@100 with <50ms latency
```
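A minimal sketch of that evaluation, assuming you have exact (brute-force) top-k results for a held-out query set and an index object with the same `search(query, k=...)` interface used earlier:

```python
def recall_at_k(approx_ids, exact_ids, k: int = 100) -> float:
    """Fraction of the true top-k neighbors that the ANN index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def mean_recall(index, queries, exact_results, k: int = 100) -> float:
    # exact_results[i] holds the ground-truth top-k ids for queries[i]
    per_query = [
        recall_at_k(index.search(q, k=k), exact_results[i], k)
        for i, q in enumerate(queries)
    ]
    return sum(per_query) / len(per_query)

# Tune approximate_neighbors_count / leaf_nodes_to_search_percent until
# mean recall@100 exceeds 0.90 while query latency stays under 50ms.
```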
The Mistake: Using zero-shot LLM prompts for reranking.
Why It Fails:
- **Inconsistent scoring**: LLMs give different scores to semantically similar pairs
- **No calibration**: Without examples, scores aren't comparable across candidates
- **Poor accuracy**: 20-30% worse than few-shot prompting
The Fix: Use few-shot examples:
```python
# Zero-shot (inconsistent)
prompt = f"Rate relevance (0-1): Query='{query}' Doc='{doc}'"

# Few-shot (calibrated)
prompt = f"""Rate the relevance (0-1) of the document to the query.

Query: "What is AI?" Doc: "Artificial intelligence..." → 0.95
Query: "What is AI?" Doc: "Machine learning..." → 0.85
Query: "What is AI?" Doc: "Cloud computing..." → 0.30

Query: "{query}" Doc: "{doc}" →"""
```
The Mistake: New documents can’t be reranked because they lack embeddings.
Why It Fails:
- **Freshness gap**: New items are invisible to the retrieval system
- **Manual intervention**: Requires reprocessing the entire index
- **User frustration**: "Why isn't my new document showing up?"
The Fix: Two-tower architecture with streaming updates:
```python
from google.cloud import aiplatform

# Vertex AI Vector Search index with streaming updates
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag_index",
    contents_delta_uri="gs://bucket/new_embeddings.json",
    dimensions=768,                       # must match the candidate tower's output size
    approximate_neighbors_count=150,
    index_update_method="STREAM_UPDATE",  # Real-time updates
)

async def add_new_document(doc: Dict):
    # 1. Generate the embedding via the candidate tower
    embedding = await candidate_tower.encode(doc)
    # 2. Write it to GCS in the JSON format Vector Search expects
    await write_embeddings_to_gcs([{"id": doc["id"], "embedding": embedding}])
    # 3. Vector Search picks up the change without a full index rebuild
```
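With a `STREAM_UPDATE` index you can also push individual vectors directly instead of staging files in GCS; a sketch using `upsert_datapoints` (the dimensionality and IDs are placeholders):

```python
from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import IndexDatapoint

def upsert_document_embedding(index: aiplatform.MatchingEngineIndex,
                              doc_id: str, embedding: list) -> None:
    # Streams a single datapoint into a STREAM_UPDATE index; it becomes
    # searchable without a batch rebuild.
    index.upsert_datapoints(
        datapoints=[IndexDatapoint(datapoint_id=doc_id, feature_vector=embedding)]
    )
```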
The Mistake: LLM reranking costs explode at scale.
Cost Projection (1M queries/day):
| Reranker | Cost/Query | Daily Cost | Monthly Cost |
|---|---|---|---|
| No reranking | $0.0001 | $100 | $3,000 |
| Semantic API | $0.0003 | $300 | $9,000 |
| LLM reranker | $0.05 | $50,000 | $1,500,000 |
The Fix: Use semantic reranking for high-volume scenarios.
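The projection above is straightforward arithmetic; a small sketch you can adapt to your own query volume and pricing assumptions:

```python
def monthly_rerank_cost(cost_per_query: float, queries_per_day: int = 1_000_000,
                        days: int = 30) -> float:
    return cost_per_query * queries_per_day * days

for name, cost in [("No reranking", 0.0001), ("Semantic API", 0.0003), ("LLM reranker", 0.05)]:
    print(f"{name:>13}: ${monthly_rerank_cost(cost):,.0f}/month")
```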