A Google Cloud e-commerce customer reduced retrieval latency from 500ms to under 100ms while improving recall@100 to 95% by implementing multi-stage retrieval with two-tower architecture and semantic reranking. The pipeline retrieves 100 candidates via vector search in 20ms, reranks them in 80ms using Vertex AI Ranking API, and generates final answers in 150ms—total end-to-end latency of 350ms versus 2+ seconds with single-stage LLM reranking. This guide covers production-ready multi-stage retrieval patterns, cost optimization strategies, and implementation code for building high-performance RAG systems.
Single-stage retrieval forces an impossible tradeoff: vector search alone misses nuanced relevance, while LLM reranking every candidate is prohibitively expensive and slow. Multi-stage architectures solve this by combining the scalability of bi-encoders (vector search) with the precision of cross-encoders (reranking).
Consider a knowledge base with 10 million documents. A naive approach using LLM reranking for all candidates would require:
- **Cost**: 10M LLM calls per query at $5/1M tokens = $50 per query
- **Latency**: 10M × 10ms = 100,000 seconds (roughly 27 hours)
- A practical impossibility for production systems
Multi-stage retrieval reduces this to:
- **Stage 1**: Vector search retrieves the top-100 candidates in 20ms
- **Stage 2**: Rerank the 100 candidates using a specialized API (~100ms) or an LLM (1-2s)
- **Stage 3**: Generate the answer from the top-10 candidates (150ms)
- **Total cost**: ~$0.0005 per query, a 99.999% reduction (see the arithmetic sketch below)
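For readers who want to check the arithmetic, here is a back-of-the-envelope sketch using the illustrative figures above (these are the article's example numbers, not measured benchmarks):

```python
# Rough arithmetic behind the comparison above.
DOCS = 10_000_000

# Single-stage: one LLM call per document, ~10ms each, serialized.
single_stage_latency_s = DOCS * 0.010            # = 100,000 s ≈ 27.8 hours
single_stage_cost = 50.00                        # per-query figure quoted above

# Multi-stage: vector search + rerank top-100 + generate from top-10.
multi_stage_latency_s = 0.020 + 0.100 + 0.150    # 20ms + 100ms + 150ms
multi_stage_cost = 0.0002 + 0.0003 + 0.0001      # ≈ $0.0006 per query

print(f"Single-stage: ~${single_stage_cost:.2f}/query, ~{single_stage_latency_s / 3600:.1f} hours")
print(f"Multi-stage:  ~${multi_stage_cost:.4f}/query, ~{multi_stage_latency_s * 1000:.0f} ms")
```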
**Required for:**

- Knowledge bases larger than 100K documents
- Sub-500ms end-to-end latency requirements
- High query volumes (>10K queries/day)
- Complex queries requiring nuanced relevance scoring

**Optional for:**

- Small knowledge bases (<10K documents) where single-stage retrieval is sufficient
- Batch processing where latency is not critical
- Applications with simple keyword-based retrieval needs
The two-tower architecture separates query and candidate encoding, enabling precomputation and independent scaling.
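A minimal sketch of the idea, where `query_tower` and `candidate_tower` stand in for two separately trained encoders and a brute-force dot product stands in for the ANN index:

```python
import numpy as np

def build_index(documents, candidate_tower):
    # Candidate embeddings are computed once, offline, and stored
    # (the "candidate tower" side of the architecture).
    return np.stack([candidate_tower(doc) for doc in documents])

def retrieve(query, query_tower, index, top_k=100):
    # At query time only the query is encoded; relevance is a dot product
    # against the precomputed candidate matrix, which an ANN index
    # approximates at scale.
    query_vec = query_tower(query)
    scores = index @ query_vec
    return np.argsort(-scores)[:top_k]
```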
Multi-stage retrieval directly impacts your bottom line. A 10M-document knowledge base with single-stage LLM reranking costs $50 per query and takes 27 hours per query. Multi-stage retrieval reduces this to $0.0005 per query and 350ms: a 99.999% cost reduction and a 277,000x latency improvement.
The key insight: not all retrieval operations are equal. Vector search (a bi-encoder) excels at scalable candidate generation but misses nuanced relevance. Cross-encoders capture fine-grained relevance but are far too expensive to run over an entire corpus. Multi-stage combines both: approximately O(log n) ANN retrieval over n documents, followed by expensive cross-encoder reranking over only k candidates, where k << n.
| Stage | Single-Stage (LLM Rerank All) | Multi-Stage (Vector + Rerank Top-100) |
|---|---|---|
| Retrieval | 10M LLM calls: $50/query | Vector search: $0.0002/query |
| Reranking | N/A | 100 API calls: $0.0003/query |
| Generation | Included above | 1 LLM call: $0.0001/query |
| Latency | 27 hours | 350ms |
| **Total cost** | **$50/query** | **$0.0006/query** |
Based on Vertex AI pricing: Gemini 2.0 Flash ($0.10 input / $0.40 output per 1M tokens) and the Ranking API (free tier available).
**Vertex AI Ranking API (semantic reranker):**

- **Latency**: <100ms
- **Cost**: Per-request pricing (free tier available)
- **Use case**: Real-time applications, high query volume
- **Accuracy**: State-of-the-art performance

**LLM reranker (Gemini):**

- **Latency**: 1-2 seconds
- **Cost**: LLM token pricing ($0.10/$0.40 per 1M tokens)
- **Use case**: Complex queries requiring nuanced understanding
- **Accuracy**: Model-dependent, potentially higher than the semantic reranker
```mermaid
flowchart TD
    A[Query] --> B{Knowledge Base Size}
    B -->|"< 10K docs"| C[Single-Stage Vector Search]
    B -->|"10K-1M docs"| D[Multi-Stage: Vector + Semantic Rerank]
    B -->|"> 1M docs"| E[Two-Tower + Semantic Rerank]
    D --> G[Top-100 Candidates]
    E --> G
    G --> H[Semantic Reranking]
```
**For Vertex AI RAG Engine users:**

1. Create a RAG corpus with a Vector Search backend
2. Configure `RagRetrievalConfig` with `ranking.rank_service`
3. Set `top_k` to 20-100 for retrieval
4. Enable the ranking API via the Discovery Engine API
5. Monitor latency: target <100ms for the reranking stage
**For a custom two-tower pipeline:**

1. Train separate query and candidate encoders
2. Precompute candidate embeddings
3. Deploy them to a Vector Search index
4. Implement an async reranking pipeline
5. Batch reranking requests for throughput
- **Precompute embeddings**: The two-tower architecture amortizes encoding cost across queries
- **Batch reranking**: Process multiple queries in parallel
- **Tiered retrieval**: Use keyword search for simple queries, vector search for complex ones
- **Caching**: Cache reranked results for identical queries (see the sketch below)
- **Model selection**: Use smaller models for retrieval, larger models for generation
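A minimal sketch of result caching for identical queries, assuming an in-process cache is acceptable and that each candidate dict carries an `id`; the class and eviction policy are illustrative, not a specific library:

```python
import hashlib
from typing import Callable, Dict, List

def _cache_key(query: str, candidate_ids: tuple) -> str:
    # Identical query + identical candidate set -> identical reranked order.
    raw = query + "|" + ",".join(candidate_ids)
    return hashlib.sha256(raw.encode()).hexdigest()

class RerankCache:
    """In-memory cache for reranked results; swap for Redis/Memorystore in production."""

    def __init__(self, max_entries: int = 10_000):
        self._store: Dict[str, List] = {}
        self._max_entries = max_entries

    def get_or_compute(self, query: str, candidates: List[Dict], rerank_fn: Callable):
        key = _cache_key(query, tuple(str(c["id"]) for c in candidates))
        if key not in self._store:
            if len(self._store) >= self._max_entries:
                self._store.pop(next(iter(self._store)))  # naive eviction
            self._store[key] = rerank_fn(query, candidates)
        return self._store[key]
```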
End-to-end example using the Vertex AI RAG Engine with the Ranking API (project, location, and corpus values are placeholders):

```python
import time

import vertexai
from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"  # choose your region
RAG_CORPUS_NAME = f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/your-corpus"
RANKER_MODEL = "semantic-ranker-default@latest"  # <100ms latency
LLM_MODEL = "gemini-2.0-flash"                   # Fast generation

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Stage 1: Configure retrieval with semantic reranking.
# top_k=20 retrieves 20 candidates, which are then reranked.
config = rag.RagRetrievalConfig(
    top_k=20,
    ranking=rag.Ranking(
        rank_service=rag.RankService(model_name=RANKER_MODEL)
    ),
)

# Stage 2: Create the retrieval tool backed by the RAG corpus.
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=RAG_CORPUS_NAME)],
            rag_retrieval_config=config,
        ),
    )
)

# Stage 3: Generate with the LLM, grounded in the reranked candidates.
model = GenerativeModel(LLM_MODEL, tools=[rag_retrieval_tool])

# Execute the pipeline with timing
def execute_pipeline(query: str):
    start = time.time()
    # Vector search + reranking happens automatically inside the tool call
    response = model.generate_content(query)
    latency = time.time() - start
    print(f"Latency: {latency:.2f}s")
    print(f"Answer: {response.text[:200]}...")
    return response

response = execute_pipeline(
    "What is the primary benefit of multi-stage retrieval?"
)
```
For a fully custom pipeline (your own Vector Search index, the Discovery Engine Ranking API, and an OpenAI-compatible embedding/LLM client), the skeleton below sketches the same three stages; the injected client interfaces and placeholder names are illustrative rather than a drop-in implementation:

```python
from typing import Dict, List, Tuple

from google.cloud import discoveryengine_v1 as discoveryengine


class MultiStagePipeline:
    """Skeleton of the three-stage pipeline; client interfaces are illustrative."""

    def __init__(self, vector_index, rerank_client, llm_client):
        self.vector_index = vector_index    # Vertex AI Vector Search index (ANN search)
        self.rerank_client = rerank_client  # discoveryengine.RankServiceAsyncClient
        self.llm_client = llm_client        # OpenAI-compatible async client (embeddings + chat)

    async def retrieve_stage(self, query: str, top_k: int = 100) -> List[Dict]:
        """Stage 1: Fast vector search."""
        query_embedding = await self.llm_client.embeddings.create(
            model="text-embedding-3-large",
            input=query,
        )
        # Vector search: sublinear ANN lookup over precomputed candidate embeddings
        return self.vector_index.search(
            query_embedding.data[0].embedding,
            k=top_k,
        )

    async def rerank_stage(self, query: str, candidates: List[Dict]) -> List[Tuple[Dict, float]]:
        """Stage 2: Accurate reranking (Ranking API: <100ms; LLM reranker: 1-2s)."""
        request = discoveryengine.RankRequest(
            ranking_config="projects/{project}/locations/global/rankingConfigs/default_ranking_config",
            model="semantic-ranker-default@latest",
            query=query,
            records=[
                discoveryengine.RankingRecord(
                    id=str(i),
                    title=c.get("title", ""),
                    content=c["content"],
                )
                for i, c in enumerate(candidates)
            ],
        )
        response = await self.rerank_client.rank(request=request)
        # Records come back sorted by score; map them back to candidates by id.
        return [(candidates[int(r.id)], r.score) for r in response.records]

    async def generate_stage(self, query: str, top_reranked: List[Dict]) -> str:
        """Stage 3: Final generation."""
        context = "\n\n".join(c["content"] for c in top_reranked[:3])
        prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
        response = await self.llm_client.chat.completions.create(
            model="gemini-2.0-flash",  # model name as exposed by the OpenAI-compatible endpoint
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def execute(self, query: str) -> str:
        """Execute the full pipeline."""
        # Stage 1: Retrieve 100 candidates (~20ms)
        candidates = await self.retrieve_stage(query, top_k=100)
        # Stage 2: Rerank the top 100 (~100ms)
        reranked = await self.rerank_stage(query, candidates)
        top_10 = [c for c, _ in reranked[:10]]
        # Stage 3: Generate (~150ms)
        return await self.generate_stage(query, top_10)


pipeline = MultiStagePipeline(
    vector_index=vertex_ai_vector_search_index,
    rerank_client=vertex_ai_ranking_api,
    llm_client=openai_compatible_llm_client,
)
```
Multi-stage retrieval introduces new failure modes that don’t exist in single-stage systems. Here are the most critical pitfalls to avoid:
The Mistake: Retrieving 1000+ candidates and reranking all of them.
Why It Fails:
- **Cost explosion**: The Vertex AI Ranking API charges per request; 1,000 candidates = 1,000 requests
- **Latency death**: 1,000 × 100ms = 100 seconds (semantic reranking) or 1,000 × 2s = 2,000 seconds (LLM reranking)
- **Diminishing returns**: The top-100 candidates typically contain 95%+ of the relevant documents
The Fix: Limit reranking to 50-200 candidates. Use this formula:
```python
# Optimal candidate count based on corpus size
def optimal_rerank_candidates(corpus_size: int) -> int:
    if corpus_size < 10_000:        # "small" threshold assumed from the sizing guidance above
        return 10   # Small corpus, minimal reranking needed
    elif corpus_size < 1_000_000:
        return 50   # Medium corpus, moderate reranking
    else:
        return 100  # Large corpus, comprehensive reranking

# Never exceed 200 candidates for cost/latency reasons
MAX_RERANK_CANDIDATES = 200
```
The Mistake: Adding reranking without accounting for cumulative latency.
Real-World Impact:
- **Target**: 500ms end-to-end latency
- **Vector search**: 20ms
- **Semantic reranking**: 100ms
- **LLM generation**: 150ms
- **Total**: 270ms (within budget)
- **With LLM reranking**: 20ms + 2,000ms + 150ms = 2,170ms (4.3x over budget)
The Fix: Choose reranker based on latency budget:
| Latency Budget | Reranker Choice | Expected Accuracy |
|---|---|---|
| <200ms | No reranking or cached results | Baseline |
| 200-500ms | Vertex AI Ranking API | High |
| 500ms-2s | LLM reranker (small model) | Very High |
| >2s | LLM reranker (large model) | Maximum |
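A small helper that encodes the table above; the returned strategy names are labels for this guide's options, not API values:

```python
def choose_reranker(latency_budget_ms: float) -> str:
    """Map a latency budget to the reranking strategy from the table above."""
    if latency_budget_ms < 200:
        return "none_or_cached"        # Baseline accuracy
    if latency_budget_ms < 500:
        return "vertex_ranking_api"    # High accuracy, <100ms
    if latency_budget_ms < 2000:
        return "llm_reranker_small"    # Very high accuracy, ~1s
    return "llm_reranker_large"        # Maximum accuracy, 2s+

assert choose_reranker(350) == "vertex_ranking_api"
```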
The Mistake: Computing candidate embeddings on-the-fly for every query.
Why It Fails:
- **Redundant computation**: The same documents are embedded repeatedly
- **Latency spike**: 10M documents × 10ms/embedding = 27 hours per query
- **Cost waste**: Paying for the same inference over and over
The Fix: Precompute and cache candidate embeddings:
```python
import asyncio

# BAD: On-the-fly embedding
async def bad_retrieval(query: str):
    query_emb = await embed(query)
    candidates = await fetch_documents(query)  # 10M docs
    candidate_embs = await asyncio.gather(*[embed(c) for c in candidates])  # Expensive!
    return vector_search(query_emb, candidate_embs)

# GOOD: Precomputed embeddings
class PrecomputedRetriever:
    def __init__(self, vector_index):
        self.index = vector_index  # Prebuilt with candidate embeddings

    async def good_retrieval(self, query: str):
        query_emb = await embed(query)              # Only embed the query
        return self.index.search(query_emb, k=100)  # Instant retrieval
```
The Mistake: Using the same embedding model for both retrieval and reranking.
Why It Fails:
- **Misses nuanced relevance**: Bi-encoders encode query and document separately; cross-encoders see them together
- **No interaction signal**: Cross-encoders capture query-document interaction features
- **Suboptimal ranking**: The top-100 from a bi-encoder may not surface the best answers first
The Fix: Use cross-encoder for reranking:
```python
# Bi-encoder (good for retrieval, bad for reranking)
#   Query: "AI infrastructure costs"
#   Doc:   "GPU pricing is $3/hour"
#   Similarity: moderate (different surface words)

# Cross-encoder (good for reranking)
#   Input:  "[AI infrastructure costs] [GPU pricing is $3/hour]"
#   Output: high score (recognizes the "costs" ↔ "pricing" relationship)
```
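For illustration, here is what cross-encoder reranking looks like with the open-source sentence-transformers library, used as a local stand-in for the managed Ranking API (the checkpoint name is a commonly used public model, not something prescribed by this guide):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, document) pairs jointly, capturing interaction signals.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "AI infrastructure costs"
candidates = [
    "GPU pricing is $3/hour",
    "Kubernetes is a container orchestrator",
    "Cloud TPU billing is per chip-hour",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```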
The Mistake: Sending reranking requests one-by-one.
Why It Fails:
- **Throughput bottleneck**: ~100 QPS with single requests vs. 1,000+ QPS with batching
- **Latency overhead**: Network round-trips add 10-50ms per request
- **Cost inefficiency**: APIs charge per request; batching reduces overhead
The Fix: Batch reranking requests:
```python
import asyncio
from typing import Dict, List

# BAD: Sequential reranking
async def rerank_sequential(query: str, candidates: List[Dict]):
    results = []
    for candidate in candidates:
        results.append(await ranking_api.rerank(query, candidate))  # ~100ms each
    return results  # ~100ms × len(candidates) total

# GOOD: Parallel reranking
async def rerank_parallel(query: str, candidates: List[Dict]):
    tasks = [ranking_api.rerank(query, c) for c in candidates]
    results = await asyncio.gather(*tasks)  # ~100ms total
    return results
```
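In practice you will usually also cap concurrency so the parallel calls don't trip API rate limits; a sketch using `asyncio.Semaphore`, reusing the same placeholder `ranking_api` client (the limit of 20 is an arbitrary example):

```python
import asyncio

async def rerank_bounded(query, candidates, max_concurrency: int = 20):
    # Cap in-flight requests so bursts don't exhaust quota or rate limits.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def rerank_one(candidate):
        async with semaphore:
            return await ranking_api.rerank(query, candidate)

    return await asyncio.gather(*(rerank_one(c) for c in candidates))
```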
The Mistake: Using default ANN parameters without tuning.
Why It Fails:
- **Recall@100**: 70% (default) vs. 95% (tuned)
- **Latency**: 20ms (default) vs. 50ms (tuned)
- **Result**: Missing 25% of relevant documents for a marginal latency gain
The Fix: Tune ANN parameters:
```python
# Vertex AI Vector Search index tuning (tree-AH)
index_config = {
    "approximate_neighbors_count": 150,   # Increase for higher recall
    "leaf_nodes_to_search_percent": 7,    # Increase for higher recall
    "distance_measure_type": "DOT_PRODUCT_DISTANCE",
}

# Evaluate on a validation set
# Target: >90% recall@100 with <50ms latency
```
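A minimal sketch of that evaluation, assuming you have exact (brute-force) top-k results for a held-out query set and an index object with the same `search(query, k=...)` interface used earlier:

```python
def recall_at_k(approx_ids, exact_ids, k: int = 100) -> float:
    """Fraction of the true top-k neighbors that the ANN index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def mean_recall(index, queries, exact_results, k: int = 100) -> float:
    # exact_results[i] holds the ground-truth top-k ids for queries[i]
    per_query = [
        recall_at_k(index.search(q, k=k), exact_results[i], k)
        for i, q in enumerate(queries)
    ]
    return sum(per_query) / len(per_query)

# Tune approximate_neighbors_count / leaf_nodes_to_search_percent until
# mean recall@100 exceeds 0.90 while query latency stays under 50ms.
```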
The Mistake: Using zero-shot LLM prompts for reranking.
Why It Fails:
- **Inconsistent scoring**: LLMs give different scores to semantically similar pairs
- **No calibration**: Without examples, scores aren't comparable across candidates
- **Poor accuracy**: 20-30% worse than few-shot prompting
The Fix: Use few-shot examples:
```python
# Zero-shot (inconsistent)
prompt = f"Rate relevance (0-1): Query='{query}' Doc='{doc}'"

# Few-shot (calibrated)
prompt = f"""Rate the relevance (0-1) of the document to the query.

Query: "What is AI?" Doc: "Artificial intelligence..." → 0.95
Query: "What is AI?" Doc: "Machine learning..." → 0.85
Query: "What is AI?" Doc: "Cloud computing..." → 0.30

Query: "{query}" Doc: "{doc}" →"""
```
The Mistake: New documents can’t be reranked because they lack embeddings.
Why It Fails:
- **Freshness gap**: New items are invisible to the retrieval system
- **Manual intervention**: Requires reprocessing the entire index
- **User frustration**: "Why isn't my new document showing up?"
The Fix: Two-tower architecture with streaming updates:
```python
from google.cloud import aiplatform

# Vertex AI Vector Search index with streaming updates
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag_index",
    contents_delta_uri="gs://bucket/new_embeddings.json",
    dimensions=768,                       # must match the candidate tower's output size
    approximate_neighbors_count=150,
    index_update_method="STREAM_UPDATE",  # Real-time updates
)

async def add_new_document(doc: Dict):
    # 1. Generate the embedding via the candidate tower
    embedding = await candidate_tower.encode(doc)
    # 2. Write it to GCS in the JSON format Vector Search expects
    await write_embeddings_to_gcs([{"id": doc["id"], "embedding": embedding}])
    # 3. Vector Search picks up the change without a full index rebuild
```
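With a `STREAM_UPDATE` index you can also push individual vectors directly instead of staging files in GCS; a sketch using `upsert_datapoints` (the dimensionality and IDs are placeholders):

```python
from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import IndexDatapoint

def upsert_document_embedding(index: aiplatform.MatchingEngineIndex,
                              doc_id: str, embedding: list) -> None:
    # Streams a single datapoint into a STREAM_UPDATE index; it becomes
    # searchable without a batch rebuild.
    index.upsert_datapoints(
        datapoints=[IndexDatapoint(datapoint_id=doc_id, feature_vector=embedding)]
    )
```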
The Mistake: LLM reranking costs explode at scale.
Cost Projection (1M queries/day):
| Reranker | Cost/Query | Daily Cost | Monthly Cost |
|---|---|---|---|
| No reranking | $0.0001 | $100 | $3,000 |
| Semantic API | $0.0003 | $300 | $9,000 |
| LLM reranker | $0.05 | $50,000 | $1,500,000 |
The Fix: Use semantic reranking for high-volume scenarios.
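The projection above is straightforward arithmetic; a small sketch you can adapt to your own query volume and pricing assumptions:

```python
def monthly_rerank_cost(cost_per_query: float, queries_per_day: int = 1_000_000,
                        days: int = 30) -> float:
    return cost_per_query * queries_per_day * days

for name, cost in [("No reranking", 0.0001), ("Semantic API", 0.0003), ("LLM reranker", 0.05)]:
    print(f"{name:>13}: ${monthly_rerank_cost(cost):,.0f}/month")
```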