RAG Evaluation Metrics: Faithfulness, Relevance, Precision

RAG systems fail silently. Your pipeline can return confident-sounding answers that are complete fabrications, or retrieve irrelevant documents that waste tokens and money. Without proper evaluation, these failures compound, costing thousands in API fees while eroding user trust. This guide covers the three critical RAG metrics (faithfulness, relevance, precision), with runnable implementations to catch these failures before they reach production.

RAG evaluation isn’t optional—it’s the difference between a system that works and one that appears to work. Consider these real-world impacts:

Cost Impact: A financial reporting RAG system evaluated by Docugami achieved 100% faithfulness using claim-based evaluation. When the same response was corrupted (changing “2022” to “1922”), faithfulness dropped to 50%, correctly flagging the hallucination. Without this metric, an error like that could translate into millions of dollars of bad financial advice (Cohere, 2024).

Retrieval Quality: Microsoft Foundry users discovered their retrieval system scored well on NDCG@3 (0.646) but failed Fidelity (0.019): the ranking looked good, yet critical documents were never retrieved. That finding guided parameter sweeps across vector vs. semantic search, different top-k values, and chunk sizes to improve retrieval (Microsoft Foundry, 2024).

API Costs: Using GPT-4o-mini ($0.15/$0.60 per 1M input/output tokens) for evaluation instead of GPT-4o ($5/$15) cuts evaluation costs by roughly 30x while maintaining quality for most metrics. For a 10,000-query test set, that is about $9 vs. $300 per evaluation run (OpenAI Pricing, 2024).
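
As a sanity check on those figures, here is a back-of-the-envelope cost estimate. The per-query token counts (about 5,000 input tokens for query, contexts, and judge prompt, plus 250 output tokens) are illustrative assumptions, not measured values:

```python
# Rough evaluation-cost estimate from published per-1M-token prices.
# Token counts per query are illustrative assumptions, not measurements.
PRICES = {                      # (input $/1M tokens, output $/1M tokens)
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def eval_cost(model: str, queries: int, in_tokens: int = 5_000, out_tokens: int = 250) -> float:
    in_price, out_price = PRICES[model]
    return queries * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

print(f"gpt-4o-mini: ${eval_cost('gpt-4o-mini', 10_000):.2f}")  # ≈ $9
print(f"gpt-4o:      ${eval_cost('gpt-4o', 10_000):.2f}")       # ≈ $288
```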

Faithfulness measures whether your generated response contains only information supported by retrieved contexts. It’s the primary defense against hallucinations.

How It Works: The evaluator decomposes your response into individual claims (verifiable statements), then checks each claim against the retrieved context. If a claim isn’t supported, faithfulness drops.

Why It Matters: A faithfulness score below 0.9 indicates your model is adding information not present in sources. For legal or financial RAG, this is unacceptable.
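
To make the mechanics concrete, here is a minimal sketch of claim-based faithfulness scoring, assuming an OpenAI API key in the environment. The prompts and the faithfulness_score helper are a simplified illustration, not Ragas's exact implementation:

```python
# Simplified claim-decomposition faithfulness check (illustrative, not Ragas's prompts).
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def faithfulness_score(response: str, contexts: list[str]) -> float:
    # 1. Decompose the response into short, standalone factual claims.
    claims_msg = judge.invoke(
        "Break the following answer into short, standalone factual claims, "
        f"one per line:\n\n{response}"
    )
    claims = [c.strip("- ").strip() for c in claims_msg.content.splitlines() if c.strip()]

    # 2. Verify each claim against the retrieved contexts.
    context_block = "\n".join(contexts)
    supported = 0
    for claim in claims:
        verdict = judge.invoke(
            "Answer strictly 'yes' or 'no'. Is this claim fully supported by the context?"
            f"\n\nContext:\n{context_block}\n\nClaim: {claim}"
        )
        if verdict.content.strip().lower().startswith("yes"):
            supported += 1

    # Faithfulness = supported claims / total claims
    return supported / len(claims) if claims else 0.0
```

With this definition, a two-claim response with one corrupted claim scores exactly 0.5, which is the kind of drop the Docugami example above is designed to catch.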

Answer relevance evaluates whether the response comprehensively addresses the user’s query without missing critical information.

Components:

  • Accuracy: Does the answer match the query intent?
  • Completeness: Are all aspects of the question addressed?
  • Directness: Is the response focused or meandering?

Scoring: High relevance means the response contains all necessary information to satisfy the query. Low relevance indicates gaps that require follow-up questions.
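
One common way to operationalize this, and roughly the approach Ragas takes for AnswerRelevancy, is to ask an LLM which questions the response actually answers and compare them to the original query with embeddings. The sketch below is a simplified illustration with its own prompts and helper names, not the library's implementation:

```python
# Reverse-question answer relevancy (simplified illustration).
import numpy as np
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

def answer_relevancy(query: str, response: str, n_questions: int = 3) -> float:
    # Generate questions that the response would plausibly answer.
    gen = judge.invoke(
        f"Write {n_questions} questions, one per line, that the following answer "
        f"directly addresses:\n\n{response}"
    )
    questions = [q.strip() for q in gen.content.splitlines() if q.strip()]

    # Cosine similarity between the original query and each generated question.
    q_vec = np.array(embedder.embed_query(query))
    sims = []
    for question in questions:
        g_vec = np.array(embedder.embed_query(question))
        sims.append(q_vec @ g_vec / (np.linalg.norm(q_vec) * np.linalg.norm(g_vec)))

    # A focused, complete answer yields questions close to the original query.
    return float(np.mean(sims)) if sims else 0.0
```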

Context Precision: The Retrieval Quality Metric

Context precision measures whether retrieved documents are actually relevant to the query. It’s calculated using Mean Average Precision (mAP) against ground truth labels.

Why Separate Retrieval Evaluation?: If your faithfulness is high but relevance is low, the problem is retrieval—you’re getting irrelevant documents. If faithfulness is low but relevance is high, the problem is generation—your model is hallucinating despite good context.
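
A minimal version of the average-precision calculation behind context precision, assuming binary ground-truth relevance labels for each retrieved chunk in ranked order:

```python
# Average precision over a ranked list of retrieved chunks (binary relevance labels).
def average_precision(relevance: list[int]) -> float:
    """relevance[i] = 1 if the i-th retrieved chunk is relevant, else 0."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # precision@k at each relevant hit
    return sum(precisions) / hits if hits else 0.0

# The same relevant chunks score higher when they are ranked first.
print(average_precision([1, 1, 0]))  # 1.0  -> relevant chunks at the top
print(average_precision([0, 0, 1]))  # 0.33 -> relevant chunk buried at rank 3
```

With the three metrics defined, a practical evaluation workflow looks like this: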

  1. Define your evaluation dataset: Collect 50-100 diverse queries with retrieved contexts, generated responses, and ground truth answers. Include edge cases: ambiguous queries, multi-hop questions, and queries requiring synthesis.

  2. Choose your metrics: Start with Faithfulness, Answer Relevancy, and Context Precision. Add Context Recall if you need to debug retrieval gaps.

  3. Select evaluator LLM: Use a model different from your RAG LLM to avoid bias. For cost-efficiency, use GPT-4o-mini or Haiku-3.5. For maximum accuracy, use GPT-4o or Claude 3.5 Sonnet.

  4. Run baseline evaluation: Execute metrics on your current pipeline to establish baseline scores.

  5. Iterate and optimize: Use results to tune retrieval (chunk size, top-k, search type) and generation (prompt engineering, temperature).

  6. Automate in CI/CD: Run evaluations on every pipeline change to catch regressions.

ragas_evaluation.py

```python
from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
    Faithfulness,
)
from langchain_openai import ChatOpenAI

# Define RAG pipeline results: query, retrieved contexts, generated response,
# and a reference (ground truth) answer for each sample
rag_data = [
    {
        "user_input": "What are the hardware requirements for building Milvus from source?",
        "retrieved_contexts": [
            "Hardware Requirements: 8GB of RAM, 50GB of free disk space",
            "Building Milvus on Linux requires Ubuntu or CentOS systems",
        ],
        "response": "The hardware requirements are 8GB of RAM and 50GB of free disk space.",
        "reference": "For building Milvus from source, you need 8GB of RAM and 50GB of free disk space.",
    },
    {
        "user_input": "What programming language is used for Knowhere?",
        "retrieved_contexts": [
            "Knowhere is the algorithm library of Milvus",
            "The library is written in C++ for performance",
        ],
        "response": "Knowhere is written in C++.",
        "reference": "The programming language used to write Knowhere is C++.",
    },
]

# Convert to a Ragas EvaluationDataset
dataset = EvaluationDataset.from_list(rag_data)

# Initialize evaluator LLM (different from the RAG LLM to avoid self-preference bias)
# Use GPT-4o-mini for cost-efficiency or GPT-4o for maximum accuracy
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator_llm = LangchainLLMWrapper(llm)

# Define metrics
# Faithfulness:      are the response's claims supported by the retrieved contexts?
# Answer Relevancy:  does the response address the query?
# Context Precision: are the retrieved contexts relevant to the query?
# Context Recall:    do the contexts cover everything in the reference answer?
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
]

# Execute evaluation
print("Starting RAG evaluation with Ragas...")
results = evaluate(dataset=dataset, metrics=metrics)

# Display aggregate results (one averaged score per metric)
print("\nEvaluation Results:")
print(results)

# Interpretation guide
print("\nInterpretation:")
print("- Faithfulness > 0.9: response is well-grounded in retrieved contexts")
print("- Answer Relevancy > 0.85: response directly addresses the query")
print("- Context Precision > 0.8: retrieved contexts are relevant")
print("- Context Recall > 0.9: all necessary information was retrieved")

# For production: save per-sample scores for tracking
# results_df = results.to_pandas()
# results_df.to_csv("rag_evaluation_results.csv", index=False)
```
ragas_evaluation.ts

```typescript
// NOTE: Ragas is primarily a Python framework. The 'ragas' imports below mirror
// the Python example as an illustrative sketch of the same workflow; check your
// evaluation library's actual JS/TS API before relying on these signatures.
import { evaluate } from 'ragas';
import { LangchainLLMWrapper } from 'ragas/dist/llms';
import { ChatOpenAI } from '@langchain/openai';

// Define RAG pipeline results
const ragData = [
  {
    user_input: "What are the hardware requirements for building Milvus from source?",
    retrieved_contexts: [
      "Hardware Requirements: 8GB of RAM, 50GB of free disk space",
      "Building Milvus on Linux requires Ubuntu or CentOS systems"
    ],
    response: "The hardware requirements are 8GB of RAM and 50GB of free disk space.",
    reference: "For building Milvus from source, you need 8GB of RAM and 50GB of free disk space."
  }
];

// Initialize evaluator LLM (separate from the RAG LLM to avoid self-preference bias)
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0
});
const evaluatorLLM = new LangchainLLMWrapper(llm);

// Define metrics by name
const metrics = [
  "faithfulness",
  "answer_relevancy",
  "context_precision",
  "context_recall"
];

// Execute evaluation
async function evaluateRAG() {
  console.log("Starting RAG evaluation...");
  const results = await evaluate({
    dataset: ragData,
    metrics: metrics,
    llm: evaluatorLLM
  });

  console.log("\nEvaluation Results:");
  for (const [metric, score] of Object.entries(results)) {
    console.log(`${metric}: ${score.toFixed(4)}`);
  }

  // Interpretation
  console.log("\nInterpretation:");
  console.log("- Faithfulness > 0.9: response is well-grounded");
  console.log("- Answer Relevancy > 0.85: response directly addresses the query");
}

evaluateRAG().catch(console.error);
```
  • Self-Preference Bias: Using the same LLM for both RAG generation and evaluation introduces bias. The model favors its own writing style and may overlook factual errors. Solution: Always use a separate, strong evaluator model (GPT-4o, Claude 3.5 Sonnet).

  • Single-Score Obsession: Relying on aggregate metrics (e.g., “faithfulness = 0.85”) without examining claim-level breakdowns. You miss specific failure modes like numeric hallucinations or temporal errors. Solution: Analyze individual claims and failure patterns.

  • Ignoring Position Sensitivity: Critical evidence in the middle of long contexts often gets overlooked (the “Lost in the Middle” problem). Solution: Test retrieval quality across different context positions and chunk orders.

  • Metadata Bias: Evaluator LLMs can be swayed by source prestige or author names in contexts. Solution: Run counterfactual tests—swap high-prestige and low-prestige sources while keeping content identical.

  • Weak Evaluator Models: Using GPT-3.5 or Haiku for evaluation saves money but produces unreliable judgments. Evaluation is a complex reasoning task. Solution: Use frontier models (GPT-4o, Claude 3.5) for critical systems; validate cheaper models against human judgments.

  • No Versioning: Not tracking evaluation dataset versions, rubrics, and prompts makes it impossible to measure improvement. Solution: Git-track all evaluation artifacts and log every run.

  • Production Disconnect: Evaluating on synthetic data that doesn’t match production query distribution. Solution: Continuously sample production queries (with privacy safeguards) and add them to your evaluation set.

  • Skipping Human Audits: Blindly trusting LLM-judge outputs without periodic human review. Solution: For high-impact scenarios, have domain experts audit 5-10% of evaluations monthly.

RAG evaluation costs scale with dataset size and metric complexity. Here’s the current pricing landscape:

| Model | Input Cost | Output Cost | Context Window | Use Case |
|---|---|---|---|---|
| GPT-4o | $5.00/1M | $15.00/1M | 128K | High-accuracy evaluation |
| GPT-4o-mini | $0.15/1M | $0.60/1M | 128K | Cost-efficient evaluation |
| Claude 3.5 Sonnet | $3.00/1M | $15.00/1M | 200K | Balanced accuracy/cost |
| Haiku-3.5 | $1.25/1M | $5.00/1M | 200K | Fast, cheap evaluation |

Cost Optimization Strategies:

  1. Use GPT-4o-mini for 90% of evaluations: Only use GPT-4o for critical failure analysis.
  2. Batch processing: Many frameworks support batch evaluation to reduce API overhead.
  3. Early stopping: If faithfulness drops below 0.5, stop the evaluation and fix the pipeline.
  4. Sampling: For large datasets, evaluate on a statistically significant sample (e.g., 100-200 queries); a sketch combining sampling with early stopping follows this list.
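
As a rough illustration of strategies 3 and 4, the sketch below evaluates a random sample and stops early when running faithfulness collapses. The evaluate_one callable is a stand-in for whatever scores a single query end-to-end in your own pipeline:

```python
# Sampled evaluation with early stopping (illustrative sketch).
import random
from typing import Callable

def sampled_eval(
    queries: list[str],
    evaluate_one: Callable[[str], float],  # runs RAG + judge for one query, returns a score
    sample_size: int = 150,
    floor: float = 0.5,
) -> float:
    sample = random.sample(queries, min(sample_size, len(queries)))
    scores: list[float] = []
    for query in sample:
        scores.append(evaluate_one(query))
        running = sum(scores) / len(scores)
        # Strategy 3: stop burning tokens once the pipeline is clearly broken.
        if len(scores) >= 20 and running < floor:
            print(f"Early stop: running faithfulness {running:.2f} < {floor}")
            break
    return sum(scores) / len(scores)
```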

| Metric | What It Measures | Good Score | Red Flag | Action If Low |
|---|---|---|---|---|
| Faithfulness | Claims supported by context | > 0.9 | < 0.7 | Reduce hallucinations: add context, lower temp, better prompts |
| Answer Relevancy | Response addresses query | > 0.85 | < 0.7 | Improve retrieval: better embeddings, increase top-k, re-rank |
| Context Precision | Retrieved docs are relevant | > 0.8 | < 0.6 | Tune search: adjust chunk size, try hybrid search |
| Context Recall | All necessary docs retrieved | > 0.9 | < 0.75 | Expand search: more sources, better indexing |
| NDCG@3 | Ranking quality | > 0.7 | < 0.5 | Re-rank results, tune similarity thresholds |
| Fidelity | Query requirements met | > 0.7 | < 0.5 | Increase top-k, improve query understanding |
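
The “Good Score” column doubles as a set of CI thresholds. A minimal gate might look like the sketch below; the threshold values mirror the table, and the gate itself is an illustrative pattern rather than a Ragas feature:

```python
# Fail the build when aggregate scores drop below the target thresholds.
THRESHOLDS = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.85,
    "context_precision": 0.8,
    "context_recall": 0.9,
}

def check_gate(scores: dict[str, float]) -> bool:
    failures = {m: s for m, s in scores.items() if s < THRESHOLDS.get(m, 0.0)}
    for metric, score in failures.items():
        print(f"FAIL {metric}: {score:.3f} < {THRESHOLDS[metric]}")
    return not failures

# Example: plug in the aggregate scores from the evaluation run above.
# Here answer_relevancy (0.81) misses its 0.85 threshold, so the gate fails.
if not check_gate({"faithfulness": 0.95, "answer_relevancy": 0.81,
                   "context_precision": 0.88, "context_recall": 0.92}):
    raise SystemExit(1)  # non-zero exit fails the CI job
```

Key takeaways:
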
  • Faithfulness is your precision guardrail—measure it first to catch hallucinations.
  • Answer Relevance ensures you’re not missing critical information (recall).
  • Context Precision diagnoses retrieval problems before they contaminate generation.
  • Claim-based evaluation provides granular insight into specific failure modes.
  • Separate evaluator LLM is non-negotiable to avoid bias.
  • Cost-efficient evaluation is possible with GPT-4o-mini or Haiku-3.5.
  • Automate evaluation in CI/CD to catch regressions before production.