Context Relevance: Are You Retrieving the Right Content?

Your RAG pipeline is generating confident, fluent responses—but are they accurate? A hidden failure mode in production RAG systems is poor context relevance: the retrieved documents may not actually support the answer, even when they are semantically similar to the query. Studies show that poor retrieval quality accounts for over 60% of RAG hallucinations. This guide provides production-ready techniques to measure and optimize retrieval relevance, from precision/recall tradeoffs to advanced reranking strategies.

In production RAG systems, the retrieval stage sets the ceiling for answer quality. If your retriever returns irrelevant chunks, even the most powerful LLM cannot generate a faithful response. Most teams focus on prompt engineering while overlooking that garbage in, garbage out applies with full force to RAG.

In RAG, precision measures the fraction of retrieved chunks that are relevant, while recall measures the fraction of relevant chunks that were retrieved. The two metrics exist in tension; the tradeoff, and the business impact of getting it wrong, are covered in detail below.

Context precision measures the proportion of retrieved chunks that are relevant to the query: context precision = (number of relevant retrieved chunks) / (total number of retrieved chunks).

Poor retrieval quality is the single largest contributor to RAG failures in production. When retrieved context lacks relevance, even state-of-the-art LLMs generate confident but incorrect answers. This creates a cascade of business impacts:

Cost and Efficiency Degradation

  • You pay for LLM tokens to process irrelevant context, increasing costs by 20-40% according to industry benchmarks
  • Unnecessary token processing adds 150-500ms latency per query
  • Context windows fill with noise, leaving less room for high-quality information

Accuracy and Trust Erosion

  • Studies show 60%+ of RAG hallucinations stem from retrieval quality issues, not generation problems
  • Users lose trust when systems provide authoritative-sounding answers to the wrong questions
  • Recovery requires expensive re-indexing or manual intervention

The Precision/Recall Tradeoff in Practice

High recall without precision buries critical information in noise, while high precision without recall misses key facts. Production systems must balance these competing priorities:

  • Precision > 0.8: At least 80% of retrieved chunks should be relevant
  • Recall > 0.7: At least 70% of relevant chunks should be retrieved
  • Target: Optimal balance occurs when both metrics exceed these thresholds
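
To verify these thresholds hold for your system, score retrieval against a small labeled evaluation set. Below is a minimal sketch, assuming you have per-query lists of retrieved chunk IDs and hand-labeled relevant chunk IDs; the helper names (retrieval_metrics, evaluate_retriever) and the sample data are illustrative, not tied to any particular evaluation framework.

from typing import Dict, List, Set

def retrieval_metrics(retrieved: List[str], relevant: Set[str]) -> Dict[str, float]:
    """Compute per-query precision (context precision) and recall from chunk IDs."""
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set & relevant)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}

def evaluate_retriever(eval_set: List[Dict]) -> dict:
    """Average per-query metrics and check them against the production thresholds above."""
    per_query = [retrieval_metrics(item["retrieved"], set(item["relevant"])) for item in eval_set]
    avg_precision = sum(m["precision"] for m in per_query) / len(per_query)
    avg_recall = sum(m["recall"] for m in per_query) / len(per_query)
    return {
        "precision": round(avg_precision, 3),
        "recall": round(avg_recall, 3),
        "meets_targets": avg_precision > 0.8 and avg_recall > 0.7,  # thresholds from this guide
    }

# Hypothetical labeled evaluation set: retrieved vs. known-relevant chunk IDs
eval_set = [
    {"retrieved": ["c1", "c2", "c7"], "relevant": ["c1", "c2", "c3"]},
    {"retrieved": ["c4", "c5"], "relevant": ["c4"]},
]
print(evaluate_retriever(eval_set))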

1. Reranking with Vertex AI

Vertex AI provides two reranking approaches optimized for different latency/accuracy requirements:

Semantic Reranker (less than 100ms latency, state-of-the-art performance)

from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool
import vertexai

# Initialize once per session
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
RAG_CORPUS_RESOURCE = f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/your-corpus-id"
RANKER_MODEL_NAME = "semantic-ranker-default@latest"
MODEL_NAME = "gemini-2.0-flash"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Configure retrieval with semantic reranking
config = rag.RagRetrievalConfig(
    top_k=10,
    ranking=rag.Ranking(
        rank_service=rag.RankService(model_name=RANKER_MODEL_NAME)
    )
)

# Create retrieval tool
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=RAG_CORPUS_RESOURCE)],
        ),
        rag_retrieval_config=config
    )
)

# Initialize model with retrieval tool
rag_model = GenerativeModel(
    model_name=MODEL_NAME,
    tools=[rag_retrieval_tool]
)

# Generate response with context-aware retrieval
response = rag_model.generate_content("What is the sky color and why?")
print(response.text)

LLM Reranker (1-2 second latency, higher accuracy for complex queries)

from vertexai import rag
import vertexai

# Configuration
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
RAG_CORPUS_RESOURCE = f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/your-corpus-id"
LLM_MODEL_NAME = "gemini-2.0-flash"

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Configure LLM reranker
rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=10,
    ranking=rag.Ranking(
        llm_ranker=rag.LlmRanker(model_name=LLM_MODEL_NAME)
    )
)

# Execute retrieval with LLM reranking
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=RAG_CORPUS_RESOURCE)],
    text="What are the key benefits of semantic search?",
    rag_retrieval_config=rag_retrieval_config
)
print(response)

2. Hybrid Search with Reciprocal Rank Fusion


Combine keyword and semantic search for comprehensive retrieval:

from typing import List, Tuple

def reciprocal_rank_fusion(
    keyword_results: List[Tuple[str, float]],
    semantic_results: List[Tuple[str, float]],
    k: int = 60
) -> List[Tuple[str, float]]:
    """
    Implement Reciprocal Rank Fusion (RRF) for hybrid search results.

    RRF score for a document = sum over result lists of 1 / (k + rank).
    """
    # Map each document to its rank (0-based) in each result list
    keyword_ranks = {doc_id: idx for idx, (doc_id, _) in enumerate(keyword_results)}
    semantic_ranks = {doc_id: idx for idx, (doc_id, _) in enumerate(semantic_results)}

    # Calculate RRF scores across the union of both result sets
    rrf_scores = {}
    all_docs = set(keyword_ranks.keys()) | set(semantic_ranks.keys())
    for doc_id in all_docs:
        score = 0.0
        if doc_id in keyword_ranks:
            score += 1.0 / (k + keyword_ranks[doc_id])
        if doc_id in semantic_ranks:
            score += 1.0 / (k + semantic_ranks[doc_id])
        rrf_scores[doc_id] = score

    # Sort by fused score, highest first
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# Example usage
keyword_search = [("doc1", 0.95), ("doc2", 0.87), ("doc3", 0.76)]
semantic_search = [("doc2", 0.92), ("doc1", 0.88), ("doc4", 0.71)]

fused_results = reciprocal_rank_fusion(keyword_search, semantic_search)
print("Hybrid RRF Results:")
for doc_id, score in fused_results:
    print(f"  {doc_id}: {score:.4f}")

3. Metadata Filtering

Narrow the search scope with metadata filters before semantic retrieval:

from typing import List, Dict, Optional
import re

class MetadataFilter:
    def __init__(self, metadata_fields: List[str]):
        # Metadata fields this filter is allowed to act on
        self.metadata_fields = metadata_fields
        # Patterns ordered longest-first so full dates match before bare years
        self.filter_patterns = {
            'date': r'\b(\d{4}-\d{2}-\d{2}|\d{4}-\d{2}|\d{4})\b',
            'category': r'\b(technical|legal|financial|medical|academic)\b',
            'language': r'\b(english|spanish|french|german|chinese)\b',
        }

    def extract_filters_from_query(self, query: str) -> Dict[str, str]:
        """Extract metadata filters from a natural language query."""
        filters = {}
        for field, pattern in self.filter_patterns.items():
            match = re.search(pattern, query.lower())
            if match:
                filters[field] = match.group(1)
        return filters

    def apply_filters(self, documents: List[Dict],
                      explicit_filters: Optional[Dict] = None,
                      query: Optional[str] = None) -> List[Dict]:
        """Apply metadata filters to narrow the search scope."""
        # Infer filters from the query text if one is provided
        inferred_filters = {}
        if query:
            inferred_filters = self.extract_filters_from_query(query)

        # Combine explicit and inferred filters (inferred values take precedence)
        all_filters = {**(explicit_filters or {}), **inferred_filters}
        if not all_filters:
            return documents

        # Keep only documents whose metadata matches every filter value
        filtered_docs = []
        for doc in documents:
            match = True
            for field, value in all_filters.items():
                if field in doc.get('metadata', {}):
                    doc_value = str(doc['metadata'][field]).lower()
                    if value.lower() not in doc_value:
                        match = False
                        break
            if match:
                filtered_docs.append(doc)
        return filtered_docs

# Example usage
metadata_filter = MetadataFilter(['date', 'category', 'language'])
docs = [
    {
        "id": "doc1",
        "content": "Technical specification for API",
        "metadata": {"category": "technical", "date": "2024-01-15", "language": "english"}
    },
    {
        "id": "doc2",
        "content": "Legal contract terms",
        "metadata": {"category": "legal", "date": "2024-03-20", "language": "english"}
    }
]

# Test 1: Explicit filter
filtered = metadata_filter.apply_filters(docs, explicit_filters={'category': 'technical'})
# Result: [doc1]

# Test 2: Query-based inference
query = "What are the technical specs from 2024?"
filtered = metadata_filter.apply_filters(docs, query=query)
# Result: [doc1]

4. Reordering to Mitigate ‘Lost in the Middle’

Research shows LLMs attend best to information at the beginning or end of the context window. Reorder retrieved documents so the least relevant chunks land in the middle positions:

from typing import List, Dict, Any

def reorder_for_llm_context(
    documents: List[Dict[str, Any]],
    similarity_scores: List[float]
) -> List[Dict[str, Any]]:
    """
    Reorder retrieved documents to mitigate the 'Lost in the Middle' problem.

    Strategy: place the most relevant docs at the start, the next-most relevant
    at the end, and the least relevant in the middle.

    Args:
        documents: List of document content/metadata
        similarity_scores: Corresponding similarity scores

    Returns:
        Reordered list of documents
    """
    # Pair documents with scores and sort by score descending
    scored_docs = list(zip(documents, similarity_scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    if len(scored_docs) <= 2:
        return [doc for doc, _ in scored_docs]

    # Split into high, medium, and low relevance thirds
    n = len(scored_docs)
    high_relevance = scored_docs[:n // 3]              # Top 1/3
    medium_relevance = scored_docs[n // 3:2 * n // 3]  # Middle 1/3
    low_relevance = scored_docs[2 * n // 3:]           # Bottom 1/3

    # Reorder: high -> low -> medium. The strongest chunks sit at the start,
    # the weakest in the middle, and medium-relevance chunks at the end.
    reordered = []
    reordered.extend([doc for doc, _ in high_relevance])
    reordered.extend([doc for doc, _ in low_relevance])
    reordered.extend([doc for doc, _ in medium_relevance])
    return reordered

# Example with 6 documents
sample_docs = [
    {"id": "doc1", "content": "Most relevant information"},
    {"id": "doc2", "content": "Second most relevant"},
    {"id": "doc3", "content": "Third relevant"},
    {"id": "doc4", "content": "Fourth relevant"},
    {"id": "doc5", "content": "Fifth relevant"},
    {"id": "doc6", "content": "Least relevant"}
]
scores = [0.95, 0.88, 0.76, 0.65, 0.54, 0.43]

reordered = reorder_for_llm_context(sample_docs, scores)
print("Reordered for LLM context:")
for i, doc in enumerate(reordered, 1):
    print(f"  Position {i}: {doc['id']} (score: {scores[sample_docs.index(doc)]})")

Avoid these critical mistakes that undermine retrieval quality:

1. Ignoring the ‘Lost in the Middle’ Problem

  • Symptom: High-quality retrieved chunks buried in the middle of context
  • Impact: LLMs ignore 30-50% of middle-positioned information
  • Fix: Implement reordering or limit top-k to 6-8 chunks

2. Static Chunk Sizes Without Context Awareness

  • Symptom: Using 512-token chunks for dense technical documents
  • Impact: Semantic boundaries split, losing critical context
  • Fix: Use adaptive chunking: 256 tokens for Q&A, 1024+ for summarization (see the chunk-size table below and the sketch that follows it)

3. Over-Reliance on Semantic Search Alone

  • Symptom: Missing exact matches for proper names, IDs, or codes
  • Impact: 15-25% recall drop for specific entity queries
  • Fix: Always implement hybrid search with BM25 fallback

4. Metadata Filtering After Retrieval

  • Symptom: Filtering 1000 chunks down to 50 relevant ones post-retrieval
  • Impact: Wasted compute, increased latency, poor user experience
  • Fix: Apply metadata filters before semantic search

5. Not Monitoring Retrieval Quality

  • Symptom: Optimizing generation while retrieval silently fails
  • Impact: 60%+ of hallucinations trace to retrieval issues
  • Fix: Track context precision/recall metrics in production

6. Failing to Re-embed After Model Changes

  • Symptom: Switching embedding models without re-indexing
  • Impact: Semantic mismatch, catastrophic retrieval degradation
  • Fix: Always re-embed the corpus when changing embedding models; a compatibility-check sketch follows this list
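
For the last point, a lightweight safeguard is to record which embedding model (and vector dimension) built the index and refuse to serve queries when the query-time embedder no longer matches. This is a minimal sketch with illustrative file and model names; adapt the manifest to however your vector store stores index metadata.

import json
from pathlib import Path

MANIFEST = Path("index_manifest.json")  # illustrative location

def save_index_manifest(model_name: str, dimension: int) -> None:
    """Record which embedding model (and dimension) built the index."""
    MANIFEST.write_text(json.dumps({"embedding_model": model_name, "dimension": dimension}))

def check_embedding_compatibility(model_name: str, dimension: int) -> None:
    """Raise if the query-time embedder differs from the one that built the index."""
    if not MANIFEST.exists():
        raise RuntimeError("No index manifest found; re-embed the corpus before querying.")
    manifest = json.loads(MANIFEST.read_text())
    if manifest["embedding_model"] != model_name or manifest["dimension"] != dimension:
        raise RuntimeError(
            f"Index was built with {manifest['embedding_model']} (dim={manifest['dimension']}), "
            f"but queries use {model_name} (dim={dimension}). Re-embed the corpus before serving traffic."
        )

# At indexing time (example model name and dimension)
save_index_manifest("text-embedding-004", 768)
# At query time: raises if the embedder was swapped without re-indexing
check_embedding_compatibility("text-embedding-004", 768)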

Reranking strategy comparison:

Strategy          | Latency | Accuracy  | Use Case
Semantic Reranker | <100 ms | High      | Real-time, high-volume
LLM Reranker      | 1-2 s   | Very High | Complex queries, low volume
RRF Hybrid        | <50 ms  | Medium    | Keyword + semantic balance
No Reranking      | <10 ms  | Low       | Prototyping only

Recommended chunk sizes by document type:

Document Type   | Chunk Size       | Overlap | Strategy
Technical Docs  | 1000-1500 tokens | 15-20%  | Semantic boundaries
Legal Contracts | 800-1200 tokens  | 20-25%  | Clause-based
FAQ/Q&A         | 256-512 tokens   | 10-15%  | Question-answer pairs
Research Papers | 512-1000 tokens  | 15-20%  | Section-based
Chat Logs       | 300-600 tokens   | 25-30%  | Turn-based
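
The chunk-size table above can be turned directly into configuration, which also addresses anti-pattern #2. This is a minimal sketch; the ChunkConfig dataclass and lookup keys are illustrative, and the token counts take the upper end of each recommended range.

from dataclasses import dataclass

@dataclass
class ChunkConfig:
    max_tokens: int
    overlap_ratio: float  # fraction of max_tokens shared between adjacent chunks
    strategy: str

# Values derived from the recommended chunk-size table above
CHUNKING_BY_DOC_TYPE = {
    "technical_docs": ChunkConfig(1500, 0.20, "semantic boundaries"),
    "legal_contracts": ChunkConfig(1200, 0.25, "clause-based"),
    "faq": ChunkConfig(512, 0.15, "question-answer pairs"),
    "research_papers": ChunkConfig(1000, 0.20, "section-based"),
    "chat_logs": ChunkConfig(600, 0.30, "turn-based"),
}

def get_chunk_config(doc_type: str) -> ChunkConfig:
    """Return chunking parameters for a document type, with a conservative default."""
    return CHUNKING_BY_DOC_TYPE.get(doc_type, ChunkConfig(512, 0.15, "semantic boundaries"))

config = get_chunk_config("legal_contracts")
overlap_tokens = int(config.max_tokens * config.overlap_ratio)
print(f"{config.max_tokens} tokens per chunk, {overlap_tokens}-token overlap, {config.strategy}")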

Production retrieval quality targets:

  • Precision > 0.8 (at least 80% of retrieved chunks are relevant)
  • Recall > 0.7 (at least 70% of relevant chunks are retrieved)
  • Context Precision > 0.75 (quality of the top-k chunks)
  • Faithfulness > 0.85 (answers grounded in the retrieved context)

The calculator below estimates the cost and quality impact of improving retrieval from a baseline to these targets:

# RAG Retrieval Quality Calculator
# Estimate impact of retrieval improvements on system performance
def calculate_rag_improvement(
    baseline_precision: float,
    baseline_recall: float,
    target_precision: float,
    target_recall: float,
    avg_query_cost: float,
    monthly_queries: int
) -> dict:
    """
    Calculate cost and quality improvements from retrieval optimization.

    Args:
        baseline_precision: Initial precision (0-1)
        baseline_recall: Initial recall (0-1)
        target_precision: Optimized precision (0-1)
        target_recall: Optimized recall (0-1)
        avg_query_cost: Average cost per query ($)
        monthly_queries: Monthly query volume

    Returns:
        Dictionary with improvement metrics
    """
    # Quality score (harmonic mean of precision and recall)
    baseline_quality = 2 * (baseline_precision * baseline_recall) / (baseline_precision + baseline_recall + 1e-6)
    target_quality = 2 * (target_precision * target_recall) / (target_precision + target_recall + 1e-6)

    # Cost savings from reduced token processing
    # (assumes 30% of the relative precision gain translates into fewer tokens)
    precision_improvement = (target_precision - baseline_precision) / baseline_precision
    token_reduction_factor = 1 - (precision_improvement * 0.3)

    baseline_monthly_cost = monthly_queries * avg_query_cost
    optimized_monthly_cost = baseline_monthly_cost * token_reduction_factor

    # Quality improvement
    quality_gain = ((target_quality - baseline_quality) / baseline_quality) * 100

    return {
        "baseline_quality": round(baseline_quality, 3),
        "optimized_quality": round(target_quality, 3),
        "quality_improvement": round(quality_gain, 1),
        "baseline_monthly_cost": round(baseline_monthly_cost, 2),
        "optimized_monthly_cost": round(optimized_monthly_cost, 2),
        "monthly_savings": round(baseline_monthly_cost - optimized_monthly_cost, 2),
        "annual_savings": round((baseline_monthly_cost - optimized_monthly_cost) * 12, 2)
    }

# Example: 100K queries/month, $0.05/query average
result = calculate_rag_improvement(
    baseline_precision=0.60,
    baseline_recall=0.65,
    target_precision=0.85,
    target_recall=0.75,
    avg_query_cost=0.05,
    monthly_queries=100000
)

print("=== RAG Retrieval Optimization Impact ===")
print(f"Quality Score: {result['baseline_quality']} → {result['optimized_quality']} (+{result['quality_improvement']}%)")
print(f"Monthly Cost: ${result['baseline_monthly_cost']:,} → ${result['optimized_monthly_cost']:,}")
print(f"Monthly Savings: ${result['monthly_savings']:,}")
print(f"Annual Savings: ${result['annual_savings']:,}")
