Most teams add a reranker to their RAG pipeline expecting a magic boost in quality. But without proper evaluation, you might be burning tokens for worse results—or not even noticing when your reranker starts underperforming after a model update. A hidden cost center disguised as a quality improvement.
Reranking introduces cost and latency. A typical reranker like Cohere’s rerank-v4.0-pro costs $2.00 per million input tokens, and processing 50 documents per query adds up quickly. If your reranker only improves relevance by 2-3%, you’re likely losing money compared to just using a better initial retriever or a larger context window.
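To make "adds up quickly" concrete, here is a quick back-of-envelope sketch in Python using the per-token price quoted above. The ~400 tokens per document, 50 candidates per query, and 100k queries per month are illustrative assumptions, not measured figures.

```python
# Back-of-envelope reranking cost, using the per-token price quoted above.
# Token count per document and query volume are illustrative assumptions.
PRICE_PER_MILLION_TOKENS = 2.00
TOKENS_PER_DOC = 400        # assumed average chunk size
DOCS_PER_QUERY = 50
QUERIES_PER_MONTH = 100_000

tokens_per_query = DOCS_PER_QUERY * TOKENS_PER_DOC                        # 20,000
cost_per_query = tokens_per_query / 1_000_000 * PRICE_PER_MILLION_TOKENS  # $0.04
monthly_cost = cost_per_query * QUERIES_PER_MONTH                         # $4,000
print(f"${cost_per_query:.2f} per query, ${monthly_cost:,.0f} per month")
```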
Worse, rerankers can degrade. A model fine-tuned on your domain might produce scores that are internally consistent but misaligned with user intent. Without continuous evaluation, you won't catch this drift. Studies from Cohere show that the gap between "good" and "bad" reranking configurations can be a 20-30 point swing in NDCG@10, which translates directly into differences in user satisfaction and task completion rates.
Reranking evaluation focuses on two primary metrics: NDCG (Normalized Discounted Cumulative Gain) and MRR (Mean Reciprocal Rank). These metrics tell you if your reranker is actually improving the order of results.
NDCG measures ranking quality by weighting each document's relevance by its position: relevant documents near the top of the list count for much more than the same documents further down. A document that's relevant but buried at position 10 contributes far less to the score than one at position 1.
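For reference, one common formulation (the exponential-gain variant), where rel_i is the graded relevance of the document at position i and IDCG@k is the DCG of the ideal ordering of the same documents:

$$
\mathrm{DCG@}k \;=\; \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k \;=\; \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

The logarithmic discount is what makes position 10 worth so much less than position 1.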
Key facts from our research:
NDCG is query-dependent: a score of 0.8 for one query doesn't mean the same as 0.8 for another (Cohere v1 Docs).
Scores are normalized: values range from 0 to 1, but you can't directly compare a 0.9 score to a 0.45 score as "twice as relevant" (Cohere v1 Docs).
MRR focuses on the position of the first relevant document. If the top result is relevant, the reciprocal rank is 1.0; if the first relevant result is at position 3, it is 1/3 ≈ 0.33. MRR is the mean of this value across all queries in your evaluation set.
MRR is crucial for fact-checking and lookup tasks where users just need one correct answer.
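Formally, with rank_i the position of the first relevant document for query i (and the term treated as 0 when no relevant document is retrieved):

$$
\mathrm{MRR} \;=\; \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
$$

With both metrics defined, a basic evaluation workflow looks like this: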
Establish a Baseline: Run your initial retrieval (vector search, keyword search, or hybrid) and capture the top-k results for a representative query set (30-50 queries minimum).
Rerank and Capture: Apply your reranker to the same queries, preserving the document IDs and scores.
Collect Ground Truth: Manually annotate or use existing labels to identify which documents are truly relevant for each query.
Calculate Metrics: Compute NDCG@k and MRR for both baseline and reranked results.
Analyze Lift: Determine if the reranker provides statistically significant improvement.
Code Example: Reranking Evaluation with NDCG and MRR
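Below is a minimal, self-contained sketch of that workflow in plain Python. It computes NDCG@k (exponential-gain form) and reciprocal rank per query, compares a baseline run against a reranked run, and uses a paired bootstrap as one simple way to gauge whether the lift is statistically meaningful. The query IDs, document IDs, and graded labels are toy placeholders; swap in your own retrieval output and ground-truth annotations. The bootstrap over three queries is only there to show the mechanics; a real evaluation needs the 30-50 queries (or more) mentioned above.

```python
import math
import random
from statistics import mean


def dcg_at_k(relevances, k):
    """Discounted cumulative gain (exponential-gain form) over the top k items."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 2)  # i is 0-based, so position = i + 1
        for i, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_doc_ids, ground_truth, k=10):
    """NDCG@k for a single query. ground_truth maps doc_id -> graded relevance."""
    gains = [ground_truth.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(ground_truth.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked_doc_ids, ground_truth):
    """1 / position of the first relevant document (0 if none is retrieved)."""
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if ground_truth.get(doc_id, 0) > 0:
            return 1.0 / position
    return 0.0


def evaluate(runs, labels, k=10):
    """Per-query NDCG@k and reciprocal rank for a dict of query_id -> ranked doc IDs."""
    ndcgs = {q: ndcg_at_k(docs, labels[q], k) for q, docs in runs.items()}
    rrs = {q: reciprocal_rank(docs, labels[q]) for q, docs in runs.items()}
    return ndcgs, rrs


def paired_bootstrap_lift(baseline, reranked, n_resamples=10_000, seed=0):
    """Bootstrap the mean per-query difference (reranked - baseline) with a 95% CI."""
    rng = random.Random(seed)
    diffs = [reranked[q] - baseline[q] for q in baseline]
    resampled = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = resampled[int(0.025 * n_resamples)]
    hi = resampled[int(0.975 * n_resamples)]
    return mean(diffs), (lo, hi)


if __name__ == "__main__":
    # Toy ground truth: query_id -> {doc_id: graded relevance}.
    labels = {
        "q1": {"d3": 3, "d7": 1},
        "q2": {"d2": 2},
        "q3": {"d9": 3, "d4": 2, "d1": 1},
    }
    # Ranked doc IDs from the initial retriever and from the reranker.
    baseline_runs = {
        "q1": ["d5", "d7", "d3", "d8"],
        "q2": ["d6", "d8", "d2", "d5"],
        "q3": ["d4", "d1", "d9", "d2"],
    }
    reranked_runs = {
        "q1": ["d3", "d7", "d5", "d8"],
        "q2": ["d2", "d6", "d8", "d5"],
        "q3": ["d9", "d4", "d1", "d2"],
    }

    base_ndcg, base_rr = evaluate(baseline_runs, labels)
    rer_ndcg, rer_rr = evaluate(reranked_runs, labels)

    ndcg_lift, ndcg_ci = paired_bootstrap_lift(base_ndcg, rer_ndcg)
    mrr_lift, mrr_ci = paired_bootstrap_lift(base_rr, rer_rr)

    print(f"Baseline  NDCG@10={mean(base_ndcg.values()):.3f}  MRR={mean(base_rr.values()):.3f}")
    print(f"Reranked  NDCG@10={mean(rer_ndcg.values()):.3f}  MRR={mean(rer_rr.values()):.3f}")
    print(f"NDCG lift={ndcg_lift:+.3f} (95% CI {ndcg_ci[0]:+.3f} to {ndcg_ci[1]:+.3f})")
    print(f"MRR  lift={mrr_lift:+.3f} (95% CI {mrr_ci[0]:+.3f} to {mrr_ci[1]:+.3f})")
```

If you'd rather not hand-roll the metrics, libraries such as scikit-learn (ndcg_score) or ir_measures provide tested implementations; the structure of the baseline-versus-reranked comparison stays the same.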
Even experienced teams fall into these traps when evaluating rerankers. Here are the most common failure modes that hide the true impact of your reranking layer:
Reranking isn't a "set it and forget it" component. It requires continuous measurement against baselines with clear ground truth. The key metrics, NDCG and MRR, tell you whether your reranker is improving ranking order, while lift analysis quantifies the value of that improvement.
Remember: a reranker that costs $0.01 per query needs to deliver at least 3-5% NDCG lift to justify its existence in most production systems. Anything less is likely better addressed through improved initial retrieval or chunking strategies.