
Reranking Quality: Is Your Reranker Helping?


Most teams add a reranker to their RAG pipeline expecting a magic boost in quality. But without proper evaluation, you might be burning tokens for worse results, or failing to notice when your reranker starts underperforming after a model update. In that case, reranking becomes a hidden cost center disguised as a quality improvement.

Reranking introduces cost and latency. A typical reranker like Cohere’s rerank-v4.0-pro costs $2.00 per million input tokens, and processing 50 documents per query adds up quickly. If your reranker only improves relevance by 2-3%, you’re likely losing money compared to just using a better initial retriever or a larger context window.

Worse, rerankers can degrade. A model fine-tuned on your domain might produce scores that are internally consistent but misaligned with user intent. Without continuous evaluation, you won’t catch this drift. Studies from Cohere show that the gap between “good” and “bad” reranking configurations can be a 20-30 point swing in NDCG@10, which directly translates to user satisfaction and task completion rates.

Understanding Reranking Evaluation Metrics


Reranking evaluation focuses on two primary metrics: NDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank). These metrics tell you if your reranker is actually improving the order of results.

NDCG measures ranking quality by penalizing irrelevant documents that appear high in the results. A document that’s relevant but buried at position 10 is worth less than one at position 1.

Key facts from our research:

  • NDCG is query-dependent: A score of 0.8 for one query doesn’t mean the same as 0.8 for another (Cohere v1 Docs).
  • Scores are normalized: Values range from 0 to 1, but you can’t directly compare a 0.9 score to a 0.45 score as “twice as relevant” (Cohere v1 Docs).

MRR focuses on the position of the first relevant document. If the top result is relevant, the reciprocal rank for that query is 1.0; if the first relevant result is at position 3, it is 1/3 ≈ 0.33. MRR is the mean of these reciprocal ranks across all queries.

MRR is crucial for fact-checking and lookup tasks where users just need one correct answer.
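
For reference, the standard definitions behind these metrics (with rel_i the relevance label of the document at rank i, IDCG@k the DCG of an ideal ordering, and rank_q the position of the first relevant document for query q) are:

\[
\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}, \qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
\]

The code example later in this section implements exactly these definitions with binary relevance (rel_i ∈ {0, 1}).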

Lift analysis compares your reranked results against the baseline retrieval (e.g., pure vector similarity search). You’re looking for:

  • NDCG Lift: NDCG_reranked - NDCG_baseline
  • MRR Lift: MRR_reranked - MRR_baseline

If lift is negative or near zero, your reranker is a token sink.

Practical Implementation: Measuring Reranker Quality

  1. Establish a Baseline: Run your initial retrieval (vector search, keyword search, or hybrid) and capture the top-k results for a representative query set (30-50 queries minimum).
  2. Rerank and Capture: Apply your reranker to the same queries, preserving the document IDs and scores.
  3. Collect Ground Truth: Manually annotate or use existing labels to identify which documents are truly relevant for each query.
  4. Calculate Metrics: Compute NDCG@k and MRR for both baseline and reranked results.
  5. Analyze Lift: Determine if the reranker provides a statistically significant improvement (a minimal significance-test sketch follows this list).
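
For step 5, a paired bootstrap over per-query NDCG deltas is one lightweight option. The sketch below is a minimal illustration, not a library API: paired_bootstrap_lift, its arguments, and the 0.95 win-rate threshold in the docstring are illustrative choices, and the per-query score lists are assumed to come from an evaluation loop like the one in the next section.

import random
from typing import List

def paired_bootstrap_lift(baseline_scores: List[float],
                          reranked_scores: List[float],
                          n_resamples: int = 10_000,
                          seed: int = 0) -> float:
    """Fraction of bootstrap resamples in which the reranker beats the baseline.

    Values near 1.0 (e.g., >= 0.95) suggest the measured lift is unlikely to be
    an artifact of the particular query sample; values near 0.5 suggest no real lift.
    """
    assert len(baseline_scores) == len(reranked_scores), "scores must be paired per query"
    deltas = [r - b for b, r in zip(baseline_scores, reranked_scores)]
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_resamples):
        # Resample queries with replacement and check whether the mean lift stays positive
        sample = [rng.choice(deltas) for _ in deltas]
        if sum(sample) / len(sample) > 0:
            wins += 1
    return wins / n_resamples

With only 30-50 queries the resampled intervals will be wide, so treat a low win rate as "not yet demonstrated" rather than "no effect", and expand the query set before making a final call.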

Code Example: Reranking Evaluation with NDCG and MRR


This Python example implements the core evaluation logic. It calculates NDCG@10 and MRR, then compares baseline vs. reranked performance.

import math
from typing import List, Dict


def calculate_ndcg(relevant_docs: List[str], retrieved_docs: List[str], k: int = 10) -> float:
    """
    Calculate NDCG@k for a single query.

    relevant_docs: List of document IDs that are relevant (ground truth)
    retrieved_docs: Ordered list of document IDs returned by the retriever/reranker
    """
    # DCG: Discounted Cumulative Gain
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_docs[:k]):
        if doc_id in relevant_docs:
            # Relevance is binary here (1 if relevant, 0 otherwise).
            # For graded relevance, replace the constant 1 with the graded score
            # (or 2**grade - 1); the log2 discount stays the same.
            relevance = 1
            dcg += relevance / math.log2(i + 2)  # +2 because position i is 0-indexed

    # IDCG: Ideal DCG (all relevant documents ranked at the top)
    idcg = 0.0
    num_relevant = min(len(relevant_docs), k)
    for i in range(num_relevant):
        idcg += 1 / math.log2(i + 2)

    return dcg / idcg if idcg > 0 else 0.0


def calculate_mrr(relevant_docs: List[str], retrieved_docs: List[str]) -> float:
    """
    Calculate the reciprocal rank of the first relevant document for a single query.
    """
    for i, doc_id in enumerate(retrieved_docs):
        if doc_id in relevant_docs:
            return 1.0 / (i + 1)
    return 0.0


def evaluate_reranking(
    baseline_results: Dict[str, List[str]],
    reranked_results: Dict[str, List[str]],
    ground_truth: Dict[str, List[str]],
    k: int = 10
) -> Dict[str, float]:
    """
    Evaluate reranking performance across multiple queries.

    Args:
        baseline_results: {query_id: [doc_ids in order]}
        reranked_results: {query_id: [doc_ids in order]}
        ground_truth: {query_id: [relevant_doc_ids]}
        k: Cutoff for NDCG@k
    """
    metrics = {
        'ndcg_baseline': [],
        'ndcg_reranked': [],
        'mrr_baseline': [],
        'mrr_reranked': []
    }

    for query_id in ground_truth.keys():
        relevant_docs = ground_truth[query_id]

        # Calculate for baseline
        baseline_docs = baseline_results.get(query_id, [])
        metrics['ndcg_baseline'].append(calculate_ndcg(relevant_docs, baseline_docs, k))
        metrics['mrr_baseline'].append(calculate_mrr(relevant_docs, baseline_docs))

        # Calculate for reranked
        reranked_docs = reranked_results.get(query_id, [])
        metrics['ndcg_reranked'].append(calculate_ndcg(relevant_docs, reranked_docs, k))
        metrics['mrr_reranked'].append(calculate_mrr(relevant_docs, reranked_docs))

    # Average across all queries
    results = {}
    for metric_name, values in metrics.items():
        results[metric_name] = sum(values) / len(values) if values else 0.0

    # Calculate lift
    results['ndcg_lift'] = results['ndcg_reranked'] - results['ndcg_baseline']
    results['mrr_lift'] = results['mrr_reranked'] - results['mrr_baseline']

    return results


# Example usage
if __name__ == "__main__":
    # Sample data
    baseline_results = {
        "q1": ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"],
        "q2": ["doc_x", "doc_y", "doc_z", "doc_a", "doc_b"]
    }
    reranked_results = {
        "q1": ["doc_c", "doc_a", "doc_b", "doc_d", "doc_e"],  # doc_c moved to top
        "q2": ["doc_a", "doc_y", "doc_x", "doc_z", "doc_b"]   # doc_a moved to top
    }
    ground_truth = {
        "q1": ["doc_a", "doc_c"],  # doc_a and doc_c are relevant
        "q2": ["doc_a"]            # only doc_a is relevant
    }

    results = evaluate_reranking(baseline_results, reranked_results, ground_truth)

    print("Evaluation Results:")
    print(f"Baseline NDCG@10: {results['ndcg_baseline']:.3f}")
    print(f"Reranked NDCG@10: {results['ndcg_reranked']:.3f}")
    print(f"NDCG Lift: {results['ndcg_lift']:.3f}")
    print(f"Baseline MRR: {results['mrr_baseline']:.3f}")
    print(f"Reranked MRR: {results['mrr_reranked']:.3f}")
    print(f"MRR Lift: {results['mrr_lift']:.3f}")

Choosing the right metric matters as much as computing it correctly. The table below summarizes what each metric measures, when to reach for it, and how to read typical values:

| Metric | What It Measures | Best For | Interpretation |
| --- | --- | --- | --- |
| NDCG@k | Position-aware relevance with graded scores | Overall ranking quality, research queries | 0.7+ is good, 0.9+ is excellent |
| MRR | Position of first relevant result | Fact lookup, single-answer queries | 0.5+ means the first relevant result is usually in the top 2 |
| Precision@k | % of top-k results that are relevant | When you need all results to be good | 0.8+ means 8/10 top results are relevant |
| Lift | Difference from baseline retrieval | Proving reranker value | >0.05 is significant, <0 is harmful |
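
Precision@k appears in the table but not in the evaluation script above. A minimal sketch in the same style (binary relevance, same argument conventions as calculate_ndcg; the function name is mine, not from any library):

from typing import List

def calculate_precision_at_k(relevant_docs: List[str], retrieved_docs: List[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved_docs[:k] if doc_id in relevant_docs)
    # Divide by k (the standard definition); retrieving fewer than k documents is penalized
    return hits / k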

Use this decision matrix to determine if reranking is worth it:

| Scenario | Baseline NDCG | Rerank Cost | Required Lift | Decision |
| --- | --- | --- | --- | --- |
| High-volume consumer search | 0.65 | $0.01/query | >3% | ✅ Needs 0.67+ NDCG |
| Enterprise knowledge base | 0.55 | $0.005/query | >5% | ✅ Needs 0.60+ NDCG |
| Internal tooling | 0.70 | $0.002/query | >2% | ⚠️ Marginal |
| Real-time chatbot | 0.60 | $0.02/query | >8% | ❌ Too expensive |
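
The same decision logic can be scripted against your measured numbers. A minimal sketch, where reranker_verdict and the choice to express required lift in absolute NDCG points are illustrative assumptions rather than an established convention:

def reranker_verdict(baseline_ndcg: float,
                     reranked_ndcg: float,
                     cost_per_query_usd: float,
                     required_lift: float) -> str:
    """Compare measured NDCG lift against the lift required to justify the cost.

    required_lift is in absolute NDCG points, e.g. 0.03 for the ">3%" row
    in the table above.
    """
    lift = reranked_ndcg - baseline_ndcg
    if lift <= 0:
        return f"Harmful: lift {lift:+.3f}. Remove the reranker or fix initial retrieval."
    if lift < required_lift:
        return (f"Marginal: lift {lift:+.3f} is below the required {required_lift:.3f} "
                f"at ${cost_per_query_usd:.3f}/query.")
    return f"Worth keeping: lift {lift:+.3f} clears the bar at ${cost_per_query_usd:.3f}/query."

# Example: the enterprise knowledge base row (0.55 baseline, 5% required lift)
print(reranker_verdict(baseline_ndcg=0.55, reranked_ndcg=0.62,
                       cost_per_query_usd=0.005, required_lift=0.05))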

Before deploying any reranker, verify:

  • Baseline established: 30-50 representative queries evaluated
  • Ground truth collected: Human annotations or verified labels
  • Lift calculated: NDCG and MRR improvement measured
  • Cost analyzed: Token usage quantified per query
  • Latency measured: End-to-end impact assessed
  • Threshold set: Relevance cutoff determined
  • A/B test planned: Production rollout strategy defined
  • Monitoring: Continuous evaluation pipeline configured


Reranking quality isn’t a “set it and forget it” component. It requires continuous measurement against baselines with clear ground truth. The key metrics—NDCG and MRR—tell you if your reranker is improving ranking order, while lift analysis quantifies the value of that improvement.

Remember: a reranker that costs $0.01 per query needs to deliver at least 3-5% NDCG lift to justify its existence in most production systems. Anything less is likely better addressed through improved initial retrieval or chunking strategies.