
Reranking Quality: Is Your Reranker Helping?


Most teams add a reranker to their RAG pipeline expecting a magic boost in quality. But without proper evaluation, you might be burning tokens for worse results, or failing to notice when your reranker starts underperforming after a model update. In that case, reranking becomes a hidden cost center disguised as a quality improvement.

Reranking introduces cost and latency. A typical reranker like Cohere’s rerank-v4.0-pro costs $2.00 per million input tokens, and processing 50 documents per query adds up quickly. If your reranker only improves relevance by 2-3%, you’re likely losing money compared to just using a better initial retriever or a larger context window.

Worse, rerankers can degrade. A model fine-tuned on your domain might produce scores that are internally consistent but misaligned with user intent. Without continuous evaluation, you won’t catch this drift. Studies from Cohere show that the gap between “good” and “bad” reranking configurations can be a 20-30 point swing in NDCG@10, which directly translates to user satisfaction and task completion rates.

Understanding Reranking Evaluation Metrics


Reranking evaluation focuses on two primary metrics: NDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank). These metrics tell you if your reranker is actually improving the order of results.

NDCG measures ranking quality by penalizing irrelevant documents that appear high in the results. A document that’s relevant but buried at position 10 is worth less than one at position 1.

Key facts from our research:

  • NDCG is query-dependent: A score of 0.8 for one query doesn’t mean the same as 0.8 for another (Cohere v1 Docs).
  • Scores are normalized: Values range from 0 to 1, but you can’t directly compare a 0.9 score to a 0.45 score as “twice as relevant” (Cohere v1 Docs).

MRR focuses on the position of the first relevant document. If the top result is relevant, the reciprocal rank for that query is 1.0; if the first relevant result is at position 3, it is 1/3 ≈ 0.33. MRR is the mean of these reciprocal ranks across all queries.

MRR is crucial for fact-checking and lookup tasks where users just need one correct answer.
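
For reference, the standard definitions behind these metrics (with rel_i the relevance label of the document at rank i, IDCG@k the DCG of an ideal ordering, and rank_q the position of the first relevant document for query q) are:

\[
\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}, \qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
\]

The code example later in this section implements exactly these definitions with binary relevance (rel_i ∈ {0, 1}).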

Lift analysis compares your reranked results against the baseline retrieval (e.g., pure vector similarity search). You’re looking for:

  • NDCG Lift: NDCG_reranked - NDCG_baseline
  • MRR Lift: MRR_reranked - MRR_baseline

If lift is negative or near zero, your reranker is a token sink.

Practical Implementation: Measuring Reranker Quality

  1. Establish a Baseline: Run your initial retrieval (vector search, keyword search, or hybrid) and capture the top-k results for a representative query set (30-50 queries minimum).
  2. Rerank and Capture: Apply your reranker to the same queries, preserving the document IDs and scores.
  3. Collect Ground Truth: Manually annotate or use existing labels to identify which documents are truly relevant for each query.
  4. Calculate Metrics: Compute NDCG@k and MRR for both baseline and reranked results.
  5. Analyze Lift: Determine if the reranker provides a statistically significant improvement (a minimal significance-test sketch follows this list).
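
For step 5, a paired bootstrap over per-query NDCG deltas is one lightweight option. The sketch below is a minimal illustration, not a library API: paired_bootstrap_lift, its arguments, and the 0.95 win-rate threshold in the docstring are illustrative choices, and the per-query score lists are assumed to come from an evaluation loop like the one in the next section.

import random
from typing import List

def paired_bootstrap_lift(baseline_scores: List[float],
                          reranked_scores: List[float],
                          n_resamples: int = 10_000,
                          seed: int = 0) -> float:
    """Fraction of bootstrap resamples in which the reranker beats the baseline.

    Values near 1.0 (e.g., >= 0.95) suggest the measured lift is unlikely to be
    an artifact of the particular query sample; values near 0.5 suggest no real lift.
    """
    assert len(baseline_scores) == len(reranked_scores), "scores must be paired per query"
    deltas = [r - b for b, r in zip(baseline_scores, reranked_scores)]
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_resamples):
        # Resample queries with replacement and check whether the mean lift stays positive
        sample = [rng.choice(deltas) for _ in deltas]
        if sum(sample) / len(sample) > 0:
            wins += 1
    return wins / n_resamples

With only 30-50 queries the resampled intervals will be wide, so treat a low win rate as "not yet demonstrated" rather than "no effect", and expand the query set before making a final call.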

Code Example: Reranking Evaluation with NDCG and MRR


This Python example implements the core evaluation logic. It calculates NDCG@10 and MRR, then compares baseline vs. reranked performance.

import math
from typing import List, Dict


def calculate_ndcg(relevant_docs: List[str], retrieved_docs: List[str], k: int = 10) -> float:
    """
    Calculate NDCG@k for a single query.

    relevant_docs: List of document IDs that are relevant (ground truth)
    retrieved_docs: Ordered list of document IDs returned by the retriever/reranker
    """
    # DCG: Discounted Cumulative Gain
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_docs[:k]):
        if doc_id in relevant_docs:
            # Relevance is binary here (1 if relevant, 0 otherwise).
            # For graded relevance, replace the constant 1 with the graded score
            # (or 2**grade - 1); the log2 discount stays the same.
            relevance = 1
            dcg += relevance / math.log2(i + 2)  # +2 because position i is 0-indexed

    # IDCG: Ideal DCG (all relevant documents ranked at the top)
    idcg = 0.0
    num_relevant = min(len(relevant_docs), k)
    for i in range(num_relevant):
        idcg += 1 / math.log2(i + 2)

    return dcg / idcg if idcg > 0 else 0.0


def calculate_mrr(relevant_docs: List[str], retrieved_docs: List[str]) -> float:
    """
    Calculate the reciprocal rank of the first relevant document for a single query.
    """
    for i, doc_id in enumerate(retrieved_docs):
        if doc_id in relevant_docs:
            return 1.0 / (i + 1)
    return 0.0


def evaluate_reranking(
    baseline_results: Dict[str, List[str]],
    reranked_results: Dict[str, List[str]],
    ground_truth: Dict[str, List[str]],
    k: int = 10
) -> Dict[str, float]:
    """
    Evaluate reranking performance across multiple queries.

    Args:
        baseline_results: {query_id: [doc_ids in order]}
        reranked_results: {query_id: [doc_ids in order]}
        ground_truth: {query_id: [relevant_doc_ids]}
        k: Cutoff for NDCG@k
    """
    metrics = {
        'ndcg_baseline': [],
        'ndcg_reranked': [],
        'mrr_baseline': [],
        'mrr_reranked': []
    }

    for query_id in ground_truth.keys():
        relevant_docs = ground_truth[query_id]

        # Calculate for baseline
        baseline_docs = baseline_results.get(query_id, [])
        metrics['ndcg_baseline'].append(calculate_ndcg(relevant_docs, baseline_docs, k))
        metrics['mrr_baseline'].append(calculate_mrr(relevant_docs, baseline_docs))

        # Calculate for reranked
        reranked_docs = reranked_results.get(query_id, [])
        metrics['ndcg_reranked'].append(calculate_ndcg(relevant_docs, reranked_docs, k))
        metrics['mrr_reranked'].append(calculate_mrr(relevant_docs, reranked_docs))

    # Average across all queries
    results = {}
    for metric_name, values in metrics.items():
        results[metric_name] = sum(values) / len(values) if values else 0.0

    # Calculate lift
    results['ndcg_lift'] = results['ndcg_reranked'] - results['ndcg_baseline']
    results['mrr_lift'] = results['mrr_reranked'] - results['mrr_baseline']

    return results


# Example usage
if __name__ == "__main__":
    # Sample data
    baseline_results = {
        "q1": ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"],
        "q2": ["doc_x", "doc_y", "doc_z", "doc_a", "doc_b"]
    }
    reranked_results = {
        "q1": ["doc_c", "doc_a", "doc_b", "doc_d", "doc_e"],  # doc_c moved to top
        "q2": ["doc_a", "doc_y", "doc_x", "doc_z", "doc_b"]   # doc_a moved to top
    }
    ground_truth = {
        "q1": ["doc_a", "doc_c"],  # doc_a and doc_c are relevant
        "q2": ["doc_a"]            # only doc_a is relevant
    }

    results = evaluate_reranking(baseline_results, reranked_results, ground_truth)

    print("Evaluation Results:")
    print(f"Baseline NDCG@10: {results['ndcg_baseline']:.3f}")
    print(f"Reranked NDCG@10: {results['ndcg_reranked']:.3f}")
    print(f"NDCG Lift: {results['ndcg_lift']:.3f}")
    print(f"Baseline MRR: {results['mrr_baseline']:.3f}")
    print(f"Reranked MRR: {results['mrr_reranked']:.3f}")
    print(f"MRR Lift: {results['mrr_lift']:.3f}")

Choosing the right metric matters as much as computing it correctly. The table below summarizes what each metric measures, when to reach for it, and how to read typical values:

| Metric | What It Measures | Best For | Interpretation |
| --- | --- | --- | --- |
| NDCG@k | Position-aware relevance with graded scores | Overall ranking quality, research queries | 0.7+ is good, 0.9+ is excellent |
| MRR | Position of first relevant result | Fact lookup, single-answer queries | 0.5+ means the first relevant result is usually in the top 2 |
| Precision@k | % of top-k results that are relevant | When you need all results to be good | 0.8+ means 8/10 top results are relevant |
| Lift | Difference from baseline retrieval | Proving reranker value | >0.05 is significant, <0 is harmful |
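
Precision@k appears in the table but not in the evaluation script above. A minimal sketch in the same style (binary relevance, same argument conventions as calculate_ndcg; the function name is mine, not from any library):

from typing import List

def calculate_precision_at_k(relevant_docs: List[str], retrieved_docs: List[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved_docs[:k] if doc_id in relevant_docs)
    # Divide by k (the standard definition); retrieving fewer than k documents is penalized
    return hits / k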

Use this decision matrix to determine if reranking is worth it:

| Scenario | Baseline NDCG | Rerank Cost | Required Lift | Decision |
| --- | --- | --- | --- | --- |
| High-volume consumer search | 0.65 | $0.01/query | >3% | ✅ Needs 0.67+ NDCG |
| Enterprise knowledge base | 0.55 | $0.005/query | >5% | ✅ Needs 0.60+ NDCG |
| Internal tooling | 0.70 | $0.002/query | >2% | ⚠️ Marginal |
| Real-time chatbot | 0.60 | $0.02/query | >8% | ❌ Too expensive |
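
The same decision logic can be scripted against your measured numbers. A minimal sketch, where reranker_verdict and the choice to express required lift in absolute NDCG points are illustrative assumptions rather than an established convention:

def reranker_verdict(baseline_ndcg: float,
                     reranked_ndcg: float,
                     cost_per_query_usd: float,
                     required_lift: float) -> str:
    """Compare measured NDCG lift against the lift required to justify the cost.

    required_lift is in absolute NDCG points, e.g. 0.03 for the ">3%" row
    in the table above.
    """
    lift = reranked_ndcg - baseline_ndcg
    if lift <= 0:
        return f"Harmful: lift {lift:+.3f}. Remove the reranker or fix initial retrieval."
    if lift < required_lift:
        return (f"Marginal: lift {lift:+.3f} is below the required {required_lift:.3f} "
                f"at ${cost_per_query_usd:.3f}/query.")
    return f"Worth keeping: lift {lift:+.3f} clears the bar at ${cost_per_query_usd:.3f}/query."

# Example: the enterprise knowledge base row (0.55 baseline, 5% required lift)
print(reranker_verdict(baseline_ndcg=0.55, reranked_ndcg=0.62,
                       cost_per_query_usd=0.005, required_lift=0.05))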

Before deploying any reranker, verify:

  • Baseline established: 30-50 representative queries evaluated
  • Ground truth collected: Human annotations or verified labels
  • Lift calculated: NDCG and MRR improvement measured
  • Cost analyzed: Token usage quantified per query
  • Latency measured: End-to-end impact assessed
  • Threshold set: Relevance cutoff determined
  • A/B test planned: Production rollout strategy defined
  • Monitoring: Continuous evaluation pipeline configured


Reranking quality isn’t a “set it and forget it” component. It requires continuous measurement against baselines with clear ground truth. The key metrics—NDCG and MRR—tell you if your reranker is improving ranking order, while lift analysis quantifies the value of that improvement.

Remember: a reranker that costs $0.01 per query needs to deliver at least 3-5% NDCG lift to justify its existence in most production systems. Anything less is likely better addressed through improved initial retrieval or chunking strategies.