Multi-Stage Retrieval: Fast Retrieval → Accurate Reranking

A Google Cloud e-commerce customer reduced retrieval latency from 500ms to under 100ms while improving recall@100 to 95% by implementing multi-stage retrieval with a two-tower architecture and semantic reranking. The pipeline retrieves 100 candidates via vector search in 20ms, reranks them in 80ms using the Vertex AI Ranking API, and generates the final answer in 150ms, for an end-to-end latency of roughly 350ms (including pipeline overhead) versus 2+ seconds with single-stage LLM reranking. This guide covers production-ready multi-stage retrieval patterns, cost optimization strategies, and implementation code for building high-performance RAG systems.

Single-stage retrieval forces an impossible tradeoff: vector search alone misses nuanced relevance, while LLM reranking every candidate is prohibitively expensive and slow. Multi-stage architectures solve this by combining the scalability of bi-encoders (vector search) with the precision of cross-encoders (reranking).

The Cost Problem with Single-Stage Reranking

Consider a knowledge base with 10 million documents. A naive approach using LLM reranking for all candidates would require:

  • 10M LLM scoring calls per query, costing roughly $50 per query in tokens (at $5 per 1M tokens)
  • Latency: 10M × 10ms = 100,000 seconds (27 hours)
  • Practical impossibility for production systems

Multi-stage retrieval reduces this to:

  • Stage 1: Vector search retrieves top-100 candidates in 20ms
  • Stage 2: Rerank 100 candidates using specialized API (100ms) or LLM (1-2s)
  • Stage 3: Generate answer with top-10 candidates (150ms)
  • Total cost: roughly $0.0006 per query (a 99.999% reduction; see the sketch below)
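
To make the comparison concrete, here is a back-of-the-envelope sketch in Python; the per-stage unit costs are the illustrative figures from the cost breakdown table later in this guide, not quoted prices.

# Back-of-the-envelope cost comparison using the per-query figures above.
# Unit costs are illustrative assumptions, not quoted prices.
SINGLE_STAGE_COST = 50.0  # LLM-scores all 10M documents

MULTI_STAGE_COST = (
    0.0002    # Stage 1: vector search over the full corpus
    + 0.0003  # Stage 2: rerank the top-100 candidates
    + 0.0001  # Stage 3: one LLM generation call
)

savings = 1 - MULTI_STAGE_COST / SINGLE_STAGE_COST
print(f"Multi-stage: ${MULTI_STAGE_COST:.4f}/query ({savings:.5%} cheaper)")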

Required for:

  • Knowledge bases greater than 100K documents
  • Sub-500ms end-to-end latency requirements
  • High query volumes (greater than 10K queries/day)
  • Complex queries requiring nuanced relevance scoring

Optional for:

  • Small knowledge bases (less than 10K documents) where single-stage is sufficient
  • Batch processing where latency is not critical
  • Applications with simple keyword-based retrieval needs

Pattern 1: Two-Tower Retrieval + Semantic Reranking

The two-tower architecture separates query and candidate encoding, enabling precomputation and independent scaling.

Multi-stage retrieval directly impacts your bottom line. A 10M-document knowledge base with single-stage LLM reranking costs $50 per query and takes 27 hours. Multi-stage reduces this to roughly $0.0006 per query and 350ms—a 99.999% cost reduction and a roughly 277,000x latency improvement.

The key insight: not all retrieval operations are equal. Vector search (bi-encoder) excels at scalable candidate generation but misses nuanced relevance. Cross-encoders capture fine-grained relevance but require a full forward pass for every query-document pair, which makes scoring the whole corpus per query prohibitively expensive. Multi-stage combines both: approximately O(log n) ANN retrieval to select k candidates, then O(k) cross-encoder passes, where k << n.

Cost Breakdown: Single-Stage vs. Multi-Stage

Stage | Single-Stage (LLM Rerank All) | Multi-Stage (Vector + Rerank Top-100)
Retrieval | 10M LLM calls: $50/query | Vector search: $0.0002/query
Reranking | N/A | 100 API calls: $0.0003/query
Generation | Included above | 1 LLM call: $0.0001/query
Latency | 27 hours | 350ms
Total Cost | $50/query | $0.0006/query

Based on Vertex AI pricing: Gemini 2.0 Flash ($0.10 input / $0.40 output per 1M tokens), Ranking API (free tier available)

Vertex AI Ranking API (Semantic Reranker):

  • Latency: less than 100ms
  • Cost: Per-request pricing (free tier available)
  • Use case: Real-time applications, high query volume
  • Accuracy: State-of-the-art performance

LLM Reranker (Gemini):

  • Latency: 1-2 seconds
  • Cost: LLM token pricing ($0.10/$0.40 per 1M tokens)
  • Use case: Complex queries requiring nuanced understanding
  • Accuracy: Model-dependent, higher than semantic reranker
graph TD
    A[Query] --> B{Knowledge Base Size}
    B -->|Under 10K docs| C[Single-Stage Vector Search]
    B -->|10K-1M docs| D[Multi-Stage: Vector + Semantic Rerank]
    B -->|Over 1M docs| E[Two-Tower + Semantic Rerank]
    C --> F[Generate Answer]
    D --> G[Top-100 Candidates]
    E --> G
    G --> H[Semantic Reranking]
    H --> I[Top-10 Reranked]
    I --> F
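
For the LLM reranker option above, a minimal pointwise scoring sketch looks like the following. It reuses the Gemini GenerativeModel API shown later in this guide; the prompt wording and the 0.0 fallback score are illustrative assumptions.

from vertexai.generative_models import GenerativeModel

# Pointwise LLM reranking sketch: score each candidate 0-1, then sort.
# One LLM call per candidate, so latency grows linearly with candidate count.
def llm_rerank(query: str, candidates: list[dict], top_n: int = 10) -> list[dict]:
    model = GenerativeModel("gemini-2.0-flash")
    scored = []
    for cand in candidates:
        prompt = (
            "Rate the relevance of the document to the query on a 0-1 scale. "
            "Reply with only the number.\n"
            f"Query: {query}\nDocument: {cand['content']}\nScore:"
        )
        try:
            score = float(model.generate_content(prompt).text.strip())
        except ValueError:
            score = 0.0  # Fallback if the model returns non-numeric text
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:top_n]]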

For Vertex AI RAG Engine Users:

  1. Create RAG corpus with Vector Search backend
  2. Configure RagRetrievalConfig with ranking.rank_service
  3. Set top_k to 20-100 for retrieval
  4. Enable ranking API via Discovery Engine API
  5. Monitor latency: target less than 100ms for reranking stage

For Custom Two-Tower:

  1. Train separate query and candidate encoders
  2. Precompute candidate embeddings
  3. Deploy to Vector Search index
  4. Implement async reranking pipeline
  5. Batch reranking requests for throughput

Cost Optimization Strategies:

  1. Precompute embeddings: Two-tower architecture amortizes training cost
  2. Batch reranking: Process multiple queries in parallel
  3. Tiered retrieval: Use keyword search for simple queries, vector for complex
  4. Caching: Cache reranked results for identical queries (see the sketch below)
  5. Model selection: Use smaller models for retrieval, larger for generation
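
As a concrete illustration of strategy 4, the minimal in-process cache below keys reranked results on a hash of the normalized query; the TTL and keying scheme are illustrative assumptions (use Redis or Memcached for multi-instance deployments).

import hashlib
import time
from typing import Dict, List, Tuple

class RerankCache:
    """Cache reranked results for identical (normalized) queries."""
    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, List[dict]]] = {}

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]  # Cache hit: skip retrieval and reranking
        return None

    def put(self, query: str, reranked: List[dict]) -> None:
        self._store[self._key(query)] = (time.time(), reranked)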

Production-Ready Vertex AI Multi-Stage Pipeline

from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool
import vertexai
import time

# Configuration
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
RAG_CORPUS_NAME = f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/your-corpus"
RANKER_MODEL = "semantic-ranker-default@latest"  # <100ms latency
LLM_MODEL = "gemini-2.0-flash"  # Fast generation

# Initialize
vertexai.init(project=PROJECT_ID, location=LOCATION)

# Stage 1: Configure retrieval with semantic reranking
# top_k=20 retrieves 20 candidates, then reranks them
config = rag.RagRetrievalConfig(
    top_k=20,
    ranking=rag.Ranking(
        rank_service=rag.RankService(model_name=RANKER_MODEL)
    )
)

# Stage 2: Create retrieval tool
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=RAG_CORPUS_NAME)]
        ),
        rag_retrieval_config=config
    )
)

# Stage 3: Generate with LLM
model = GenerativeModel(
    model_name=LLM_MODEL,
    tools=[rag_retrieval_tool]
)

# Execute pipeline with timing
def execute_pipeline(query: str):
    start = time.time()
    # Vector search + reranking happens automatically
    response = model.generate_content(query)
    latency = time.time() - start
    print(f"Query: {query}")
    print(f"Latency: {latency:.2f}s")
    print(f"Answer: {response.text[:200]}...")
    return response

# Example usage
response = execute_pipeline(
    "What is the primary benefit of multi-stage retrieval?"
)

Custom Multi-Stage Pipeline

The same three stages can also be wired by hand against generic clients (vector index, reranking API, LLM):

import asyncio
from typing import List, Dict, Tuple
import numpy as np

class MultiStagePipeline:
    def __init__(self, vector_index, rerank_client, llm_client):
        self.vector_index = vector_index    # Vertex AI Vector Search
        self.rerank_client = rerank_client  # Ranking API or LLM
        self.llm_client = llm_client

    async def retrieve_stage(self, query: str, top_k: int = 100) -> List[Dict]:
        """Stage 1: Fast vector search"""
        # Encode query
        query_embedding = await self.llm_client.embeddings.create(
            model="text-embedding-3-large",
            input=query
        )
        # Vector search: O(log n) complexity
        candidates = self.vector_index.search(
            query_embedding.data[0].embedding,
            k=top_k
        )
        return candidates

    async def rerank_stage(self, query: str, candidates: List[Dict]) -> List[Tuple[Dict, float]]:
        """Stage 2: Accurate reranking"""
        # Vertex AI Ranking API: <100ms
        # Or LLM reranker: 1-2s for higher accuracy
        # For Vertex AI Ranking API:
        response = await self.rerank_client.ranking_configs.rank(
            ranking_config="projects/{project}/locations/global/rankingConfigs/default_ranking_config",
            model="semantic-ranker-default@latest",
            query=query,
            records=[{
                "id": c["id"],
                "title": c.get("title", ""),
                "content": c["content"]
            } for c in candidates]
        )
        # Pair candidates with scores and sort by score (highest first)
        scored = [(c, r.score) for c, r in zip(candidates, response.records)]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    async def generate_stage(self, query: str, top_reranked: List[Dict]) -> str:
        """Stage 3: Final generation"""
        context = "\n\n".join([c["content"] for c in top_reranked[:3]])
        prompt = f"""
Answer based on context:
Context:
{context}
Query: {query}
"""
        response = await self.llm_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    async def execute(self, query: str) -> str:
        """Execute full pipeline"""
        # Stage 1: Retrieve 100 candidates (20ms)
        candidates = await self.retrieve_stage(query, top_k=100)
        # Stage 2: Rerank top 100 (100ms)
        reranked = await self.rerank_stage(query, candidates)
        top_10 = [c[0] for c in reranked[:10]]
        # Stage 3: Generate (150ms)
        answer = await self.generate_stage(query, top_10)
        return answer

# Usage (vector_index, rerank_client, and llm_client are placeholders for your
# initialized Vector Search index, Ranking API client, and LLM client)
async def main():
    pipeline = MultiStagePipeline(
        vector_index=vertex_ai_vector_search_index,
        rerank_client=vertex_ai_ranking_api,
        llm_client=openai
    )
    answer = await pipeline.execute(
        "What is the primary benefit of multi-stage retrieval?"
    )
    print(answer)

asyncio.run(main())

Multi-stage retrieval introduces new failure modes that don’t exist in single-stage systems. Here are the most critical pitfalls to avoid:

1. Reranking Too Many Candidates

The Mistake: Retrieving 1000+ candidates and reranking all of them.

Why It Fails:

  • Cost explosion: Vertex AI Ranking API charges per request; 1000 candidates = 1000 requests
  • Latency death: 1000 × 100ms = 100 seconds (semantic reranking) or 1000 × 2s = 2000 seconds (LLM reranking)
  • Diminishing returns: Top-100 candidates typically contain 95%+ of relevant documents

The Fix: Limit reranking to 50-200 candidates. Use this formula:

# Optimal candidate count based on corpus size
def optimal_rerank_candidates(corpus_size: int) -> int:
    if corpus_size < 10_000:
        return 10   # Small corpus, minimal reranking needed
    elif corpus_size < 1_000_000:
        return 50   # Medium corpus, moderate reranking
    else:
        return 100  # Large corpus, comprehensive reranking

# Never exceed 200 candidates for cost/latency reasons
MAX_RERANK_CANDIDATES = 200

2. Ignoring the Latency Budget

The Mistake: Adding reranking without accounting for cumulative latency.

Real-World Impact:

  • Target: 500ms end-to-end latency
  • With semantic reranking: 20ms (vector search) + 100ms (reranking) + 150ms (LLM generation) = 270ms (within budget)
  • With LLM reranking: 20ms + 2000ms + 150ms = 2170ms (4.3x over budget)

The Fix: Choose reranker based on latency budget:

Latency Budget | Reranker Choice | Expected Accuracy
<200ms | No reranking or cached results | Baseline
200-500ms | Vertex AI Ranking API | High
500ms-2s | LLM reranker (small model) | Very High
>2s | LLM reranker (large model) | Maximum
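
A small helper that encodes the table above; the thresholds come straight from the table, and the return labels are arbitrary names.

def choose_reranker(latency_budget_ms: int) -> str:
    """Pick a reranking strategy from the end-to-end latency budget."""
    if latency_budget_ms < 200:
        return "none_or_cached"      # No reranking or cached results
    if latency_budget_ms <= 500:
        return "vertex_ranking_api"  # Semantic reranker, <100ms
    if latency_budget_ms <= 2000:
        return "llm_reranker_small"  # Small LLM reranker, ~1-2s
    return "llm_reranker_large"      # Large LLM reranker, maximum accuracy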

3. Not Precomputing Embeddings in Two-Tower

The Mistake: Computing candidate embeddings on-the-fly for every query.

Why It Fails:

  • Redundant computation: Same documents embedded repeatedly
  • Latency spike: 10M documents × 10ms/embedding = 27 hours per query
  • Cost waste: Paying for same inference repeatedly

The Fix: Precompute and cache candidate embeddings:

# BAD: On-the-fly embedding
async def bad_retrieval(query: str):
    query_emb = await embed(query)
    candidates = await fetch_documents(query)  # 10M docs
    candidate_embs = await asyncio.gather(*[embed(c) for c in candidates])  # Expensive!
    return vector_search(query_emb, candidate_embs)

# GOOD: Precomputed embeddings
class TwoTowerRetriever:
    def __init__(self, vector_index):
        self.index = vector_index  # Prebuilt with candidate embeddings

    async def good_retrieval(self, query: str):
        query_emb = await embed(query)  # Only embed query
        return self.index.search(query_emb, k=100)  # Instant retrieval
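
To populate the prebuilt index, candidate embeddings can be computed offline in batches and written in the {"id": ..., "embedding": [...]} JSON-lines format that Vector Search ingests; embed_batch() below is a hypothetical helper wrapping your embedding model, and the batch size is an arbitrary assumption.

import json

async def precompute_embeddings(documents: list[dict], out_path: str,
                                batch_size: int = 256) -> None:
    """Embed documents offline and write them in Vector Search's JSON format."""
    with open(out_path, "w") as f:
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            vectors = await embed_batch([d["content"] for d in batch])  # Hypothetical helper
            for doc, vec in zip(batch, vectors):
                f.write(json.dumps({"id": doc["id"], "embedding": vec}) + "\n")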

4. Using the Same Model for Retrieval and Reranking

The Mistake: Using the same embedding model for both retrieval and reranking.

Why It Fails:

  • Misses nuanced relevance: Bi-encoders encode query and document separately; cross-encoders see them together
  • No interaction signal: Cross-encoders capture query-document interaction features
  • Suboptimal ranking: Top-100 from bi-encoder may not contain best answers

The Fix: Use cross-encoder for reranking:

# Bi-encoder (good for retrieval, bad for reranking)
# Query: "AI infrastructure costs"
# Doc: "GPU pricing is $3/hour"
# Similarity: Moderate (different words)
# Cross-encoder (good for reranking)
# Input: "[AI infrastructure costs] [GPU pricing is $3/hour]"
# Output: High score (recognizes "costs" ↔ "pricing" relationship)
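
If you want to run the cross-encoder locally rather than through an API, a minimal sketch with the sentence-transformers library looks like this; the ms-marco checkpoint is one common choice, not a requirement.

from sentence_transformers import CrossEncoder

def cross_encoder_rerank(query: str, docs: list[str], top_n: int = 10) -> list[str]:
    """Score query-document pairs jointly and return the top_n documents."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in docs])  # One score per pair
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]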

5. Sending Reranking Requests Sequentially

The Mistake: Sending reranking requests one-by-one.

Why It Fails:

  • Throughput bottleneck: 100 QPS with single requests vs. 1000+ QPS with batching
  • Latency overhead: Network round-trips add 10-50ms per request
  • Cost inefficiency: API charges per request; batching reduces overhead

The Fix: Batch reranking requests:

# Sequential (slow)
async def rerank_sequential(query: str, candidates: List[Dict]):
    results = []
    for candidate in candidates:
        result = await ranking_api.rerank(query, candidate)  # 100ms each
        results.append(result)
    return results  # 1000ms total

# Parallel (fast)
async def rerank_parallel(query: str, candidates: List[Dict]):
    tasks = [ranking_api.rerank(query, c) for c in candidates]
    results = await asyncio.gather(*tasks)  # 100ms total
    return results
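
If the reranking backend enforces rate limits, the parallel version can be capped with a semaphore; the limit of 20 concurrent requests is an arbitrary assumption.

# Parallel with a concurrency cap (stays under API rate limits)
async def rerank_parallel_capped(query: str, candidates: List[Dict],
                                 max_concurrent: int = 20):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def rerank_one(candidate: Dict):
        async with semaphore:
            return await ranking_api.rerank(query, candidate)

    return await asyncio.gather(*(rerank_one(c) for c in candidates))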

6. Not Evaluating Recall vs. Latency Tradeoff

The Mistake: Using default ANN parameters without tuning.

Why It Fails:

  • Recall@100: 70% (default) vs. 95% (tuned)
  • Latency: 20ms (default) vs. 50ms (tuned)
  • Result: Missing 25% of relevant documents for marginal latency gain

The Fix: Tune ANN parameters:

# Vertex AI Vector Search tuning
index_config = {
    "approximate_neighbors_count": 150,  # Increase for higher recall
    "leaf_nodes_to_search_percent": 7,   # Increase for higher recall
    "distance_measure_type": "DOT_PRODUCT_DISTANCE"
}
# Evaluate on validation set
# Target: >90% recall@100 with <50ms latency
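
A minimal evaluation loop for that target; it assumes a labeled validation set mapping each query embedding to the IDs of its known-relevant documents.

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 100) -> float:
    """Fraction of known-relevant documents found in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def evaluate_recall(index, validation_set: list[dict], k: int = 100) -> float:
    scores = []
    for example in validation_set:
        candidates = index.search(example["query_embedding"], k=k)
        retrieved_ids = [c["id"] for c in candidates]
        scores.append(recall_at_k(retrieved_ids, set(example["relevant_ids"]), k))
    return sum(scores) / len(scores)  # Target: >0.90 recall@100 at <50ms per query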

7. LLM Reranking Without Few-Shot Examples

The Mistake: Using zero-shot LLM prompts for reranking.

Why It Fails:

  • Inconsistent scoring: LLMs give different scores for semantically similar pairs
  • No calibration: Without examples, scores aren’t comparable across candidates
  • Poor accuracy: 20-30% worse than few-shot prompting

The Fix: Use few-shot examples:

# Zero-shot (inconsistent)
prompt = f"Rate relevance (0-1): Query='{query}' Doc='{doc}'"
# Few-shot (consistent)
prompt = f"""Rate relevance (0-1) of document to query.
Examples:
Query: "What is AI?" Doc: "Artificial intelligence..." → 0.95
Query: "What is AI?" Doc: "Machine learning..." → 0.85
Query: "What is AI?" Doc: "Cloud computing..." → 0.30
Query: "{query}"
Doc: "{doc}"
Score:"""

8. No Real-Time Path for New Documents

The Mistake: New documents can’t be reranked because they lack embeddings.

Why It Fails:

  • Freshness gap: New items invisible to retrieval system
  • Manual intervention: Requires reprocessing entire index
  • User frustration: “Why isn’t my new document showing up?”

The Fix: Two-tower architecture with streaming updates:

from typing import Dict
from google.cloud import aiplatform

# Vertex AI Vector Search with STREAM_UPDATE
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag_index",
    contents_delta_uri="gs://bucket/new_embeddings.json",
    dimensions=768,  # Must match your embedding model's output size
    index_update_method="STREAM_UPDATE"  # Real-time updates
)

# New document pipeline (candidate_tower and write_embeddings_to_gcs are
# placeholders for your encoder and GCS writer)
async def add_new_document(doc: Dict):
    # 1. Generate embedding via candidate tower
    embedding = await candidate_tower.encode(doc)
    # 2. Write to GCS in correct format
    await write_embeddings_to_gcs([{"id": doc["id"], "embedding": embedding}])
    # 3. Vector Search automatically picks up changes
    # No retraining needed!

9. Underestimating Reranking Cost at Scale

The Mistake: LLM reranking costs explode at scale.

Cost Projection (1M queries/day):

Reranker | Cost/Query | Daily Cost | Monthly Cost
No reranking | $0.0001 | $100 | $3,000
Semantic API | $0.0003 | $300 | $9,000
LLM Reranker | $0.05 | $50,000 | $1,500,000

The Fix: Use semantic reranking for high-volume scenarios.
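
The table's arithmetic as a small projection helper, so you can plug in your own query volume and per-query costs:

def project_cost(cost_per_query: float, queries_per_day: int = 1_000_000) -> dict:
    """Project daily and monthly reranking spend from per-query cost and volume."""
    daily = cost_per_query * queries_per_day
    return {"daily": daily, "monthly": daily * 30}

for name, cost in [("no_rerank", 0.0001), ("semantic_api", 0.0003), ("llm_rerank", 0.05)]:
    projection = project_cost(cost)
    print(f"{name}: ${projection['daily']:,.0f}/day, ${projection['monthly']:,.0f}/month")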
