RAG systems fail silently. Your pipeline can return confident-sounding answers that are complete fabrications, or retrieve irrelevant documents that waste tokens and money. Without proper evaluation, these failures compound—costing thousands in API fees while eroding user trust. This guide covers the three critical RAG metrics (faithfulness, relevance, precision) with production-ready implementations to catch these failures before they reach production.
Key Takeaway
Faithfulness acts as a precision check on generation (no hallucinated claims), answer relevance acts as a recall check (no missing information), and context precision measures retrieval quality (no noise). You need all three to diagnose RAG failures.
RAG evaluation isn’t optional—it’s the difference between a system that works and one that appears to work. Consider these real-world impacts:
Cost Impact : A financial reporting RAG system evaluated by Docugami achieved 100% faithfulness using claim-based evaluation. When the same response was deliberately corrupted (changing “2022” to “1922”), faithfulness dropped to 50%, correctly flagging the hallucination. Without this metric, an error like that could flow straight into costly, bad financial advice (Cohere, 2024).
Retrieval Quality : Microsoft Foundry users discovered their retrieval system passed NDCG@3 (0.646) but failed Fidelity (0.019): the ranking looked good, but critical documents were missing. That finding guided parameter sweeps across vector vs. semantic search, different top-k values, and chunk sizes (Microsoft Foundry, 2024).
API Costs : Using GPT-4o-mini ($0.15/$0.60 per 1M input/output tokens) for evaluation instead of GPT-4o ($5/$15) cuts evaluation costs by roughly 30x while maintaining quality for most metrics. For a 10,000-query test set, that’s about $9 vs. $300 per evaluation run (OpenAI Pricing, 2024).
Faithfulness measures whether your generated response contains only information supported by retrieved contexts. It’s the primary defense against hallucinations.
How It Works : The evaluator decomposes your response into individual claims (verifiable statements), then checks each claim against the retrieved context. If a claim isn’t supported, faithfulness drops.
Why It Matters : A faithfulness score below 0.9 indicates your model is adding information not present in sources. For legal or financial RAG, this is unacceptable.
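To make the arithmetic concrete, here is a minimal sketch with hand-labeled claims; the full LLM-driven version appears in the claim-based example later in this guide.

# Illustrative only: claim labels here are hand-assigned, not produced by an LLM judge
claims = {
    "Milvus requires 8GB of RAM to build from source": True,    # supported by retrieved context
    "Milvus requires 50GB of free disk space": True,            # supported by retrieved context
    "Milvus builds in under 10 minutes": False,                 # not present in any retrieved context
}
faithfulness = sum(claims.values()) / len(claims)
print(f"Faithfulness: {faithfulness:.2f}")  # 0.67 -> below the 0.9 threshold, flag for review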
Answer relevance evaluates whether the response comprehensively addresses the user’s query without missing critical information.
Components :
Accuracy : Does the answer match the query intent?
Completeness : Are all aspects of the question addressed?
Directness : Is the response focused or meandering?
Scoring : High relevance means the response contains all necessary information to satisfy the query. Low relevance indicates gaps that require follow-up questions.
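One common way to score this (roughly how Ragas computes its answer relevancy metric) is to have an LLM generate the questions a response would answer, then measure how close those questions are to the original query. A hedged sketch, assuming an OpenAI API key and the text-embedding-3-small embedding model:

import os
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def answer_relevancy(query: str, response: str, n_questions: int = 3) -> float:
    """Approximate answer relevancy: how well does the response 'point back' to the query?"""
    # 1. Ask an LLM to reverse-engineer questions that the response answers
    gen = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write {n_questions} short questions that the following text answers, "
                       f"one per line, no numbering:\n\n{response}",
        }],
        temperature=0,
    )
    questions = [q.strip() for q in gen.choices[0].message.content.splitlines() if q.strip()]

    # 2. Embed the original query and the generated questions
    emb = client.embeddings.create(model="text-embedding-3-small", input=[query] + questions)
    vectors = [np.array(item.embedding) for item in emb.data]
    query_vec, question_vecs = vectors[0], vectors[1:]

    # 3. Mean cosine similarity between query and generated questions = relevancy estimate
    sims = [
        float(np.dot(query_vec, qv) / (np.linalg.norm(query_vec) * np.linalg.norm(qv)))
        for qv in question_vecs
    ]
    return sum(sims) / len(sims) if sims else 0.0

print(answer_relevancy(
    "What programming language is used for Knowhere?",
    "Knowhere is written in C++.",
))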
Context precision measures whether retrieved documents are actually relevant to the query. It’s calculated using Mean Average Precision (mAP) against ground truth labels.
Why Separate Retrieval Evaluation? : If your faithfulness is high but relevance is low, the problem is retrieval—you’re getting irrelevant documents. If faithfulness is low but relevance is high, the problem is generation—your model is hallucinating despite good context.
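A minimal sketch of that calculation, assuming binary ground-truth relevance labels for each retrieved chunk in ranked order:

def average_precision(relevance: list[int]) -> float:
    """Average precision for one query: relevance[i] is 1 if the i-th retrieved chunk is relevant."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k   # precision@k, counted only at relevant positions
    return precision_sum / hits if hits else 0.0

# One query where the relevant chunks sit at ranks 1 and 3 of five retrieved chunks
print(average_precision([1, 0, 1, 0, 0]))   # 0.83 -> relevant chunks ranked near the top
print(average_precision([0, 0, 0, 1, 1]))   # 0.33 -> same recall, but buried under noise

# Context precision over a test set is the mean of these per-query values (mAP)
queries = [[1, 0, 1, 0, 0], [0, 0, 0, 1, 1]]
print(sum(average_precision(q) for q in queries) / len(queries))   # 0.58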
Define your evaluation dataset : Collect 50-100 diverse queries with retrieved contexts, generated responses, and ground truth answers. Include edge cases: ambiguous queries, multi-hop questions, and queries requiring synthesis.
Choose your metrics : Start with Faithfulness, Answer Relevancy, and Context Precision. Add Context Recall if you need to debug retrieval gaps.
Select evaluator LLM : Use a model different from your RAG LLM to avoid bias. For cost-efficiency, use GPT-4o-mini or Haiku-3.5. For maximum accuracy, use GPT-4o or Claude 3.5 Sonnet.
Run baseline evaluation : Execute metrics on your current pipeline to establish baseline scores.
Iterate and optimize : Use results to tune retrieval (chunk size, top-k, search type) and generation (prompt engineering, temperature).
Automate in CI/CD : Run evaluations on every pipeline change to catch regressions (a minimal gating sketch follows the Ragas example below).
# RAG evaluation with Ragas: faithfulness, answer relevancy, context precision, context recall
from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from datasets import Dataset
import pandas as pd

# Define RAG pipeline results
rag_data = [
    {
        "user_input": "What are the hardware requirements for building Milvus from source?",
        "retrieved_contexts": [
            "Hardware Requirements: 8GB of RAM, 50GB of free disk space",
            "Building Milvus on Linux requires Ubuntu or CentOS systems",
        ],
        "response": "The hardware requirements are 8GB of RAM and 50GB of free disk space.",
        "reference": "For building Milvus from source, you need 8GB of RAM and 50GB of free disk space.",
    },
    {
        "user_input": "What programming language is used for Knowhere?",
        "retrieved_contexts": [
            "Knowhere is the algorithm library of Milvus",
            "The library is written in C++ for performance",
        ],
        "response": "Knowhere is written in C++.",
        "reference": "The programming language used to write Knowhere is C++.",
    },
]

# Convert to a dataset Ragas can evaluate
df = pd.DataFrame(rag_data)
dataset = Dataset.from_pandas(df)

# Initialize evaluator LLM (different from RAG LLM to avoid bias)
# Use GPT-4o-mini for cost-efficiency or GPT-4o for accuracy
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator_llm = LangchainLLMWrapper(llm)

# Faithfulness: measures if response claims are supported by contexts
# Answer Relevancy: measures relevance of the response to the query
# Context Precision: measures if retrieved contexts are relevant
# Context Recall: measures if contexts contain all ground-truth information
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
]

print("Starting RAG evaluation with Ragas...")
results = evaluate(dataset=dataset, metrics=metrics)

print("\nEvaluation Results:")
for metric_name, score in results.items():
    print(f"{metric_name}: {score:.4f}")

print("\nInterpretation:")
print("- Faithfulness > 0.9: Response is well-grounded in retrieved contexts")
print("- Answer Relevancy > 0.85: Response directly addresses the query")
print("- Context Precision > 0.8: Retrieved contexts are relevant")
print("- Context Recall > 0.9: All necessary information was retrieved")

# For production: save results for tracking
# results_df = results.to_pandas()
# results_df.to_csv("rag_evaluation_results.csv", index=False)
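For step 6 (automating evaluation in CI/CD), a minimal gating sketch; it assumes the evaluation job saved the per-sample scores to rag_evaluation_results.csv as sketched above, and that the column names match the metric names used here:

# Minimal regression gate: fail the CI job if any metric drops below its threshold
import sys
import pandas as pd

THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.90,
}

scores = pd.read_csv("rag_evaluation_results.csv").mean(numeric_only=True).to_dict()
failures = [
    f"{metric}: {scores.get(metric, 0.0):.3f} < {minimum:.2f}"
    for metric, minimum in THRESHOLDS.items()
    if scores.get(metric, 0.0) < minimum
]
if failures:
    print("RAG evaluation regression detected:\n" + "\n".join(failures))
    sys.exit(1)   # non-zero exit fails the pipeline
print("All RAG metrics are above their thresholds.")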
# Cloud-based evaluation with Azure AI Foundry built-in evaluators
# (groundedness ~ faithfulness, relevance, retrieval)
import os
import time
from pprint import pprint

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileContent,
    SourceFileContentContent,
)
from dotenv import load_dotenv

load_dotenv()

# Configuration for Azure AI Project
endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "gpt-4o-mini")

# Define test data for RAG evaluation
query = "What is the cheapest available tent of Contoso Outdoor?"
context = (
    "Contoso Outdoor is a leading retailer specializing in outdoor gear and equipment. "
    "Contoso Product Catalog: 1. tent A - $99.99, lightweight 2-person tent; "
    "2. tent B - $149.99, 4-person family tent; 3. tent C - $199.99, durable 6-person expedition tent."
)
response = "The cheapest available tent is tent A, priced at $99.99."
ground_truth = "The cheapest available tent is tent A, priced at $99.99."

# Define evaluation criteria for system and process evaluation
testing_criteria = [
    # System evaluation: Groundedness (faithfulness)
    {
        "type": "azure_ai_evaluator",
        "name": "groundedness",
        "evaluator_name": "builtin.groundedness",
        "initialization_parameters": {"deployment_name": model_deployment_name},
        "data_mapping": {
            "context": "{{item.context}}",
            "query": "{{item.query}}",
            "response": "{{item.response}}",
        },
    },
    # System evaluation: Relevance
    {
        "type": "azure_ai_evaluator",
        "name": "relevance",
        "evaluator_name": "builtin.relevance",
        "initialization_parameters": {"deployment_name": model_deployment_name},
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}",
        },
    },
    # Process evaluation: Retrieval
    {
        "type": "azure_ai_evaluator",
        "name": "retrieval",
        "evaluator_name": "builtin.retrieval",
        "initialization_parameters": {"deployment_name": model_deployment_name},
        "data_mapping": {
            "context": "{{item.context}}",
            "query": "{{item.query}}",
        },
    },
]

try:
    with DefaultAzureCredential() as credential:
        with AIProjectClient(endpoint=endpoint, credential=credential) as project_client:
            client = project_client.get_openai_client()

            # Create evaluation group
            data_source_config = {
                "type": "custom",
                "item_schema": {
                    "type": "object",
                    "properties": {
                        "context": {"type": "string"},
                        "query": {"type": "string"},
                        "response": {"type": "string"},
                        "ground_truth": {"type": "string"},
                    },
                },
                "include_sample_schema": True,
            }
            eval_object = client.evals.create(
                name="RAG Evaluation: Faithfulness and Relevance",
                data_source_config=data_source_config,
                testing_criteria=testing_criteria,
            )

            # Create evaluation run with inline data
            eval_run = client.evals.runs.create(
                eval_id=eval_object.id,
                metadata={"scenario": "rag-metrics"},
                data_source=CreateEvalJSONLRunDataSourceParam(
                    type="jsonl",
                    source=SourceFileContent(
                        type="file_content",
                        content=[
                            SourceFileContentContent(
                                item={
                                    "context": context,
                                    "query": query,
                                    "response": response,
                                    "ground_truth": ground_truth,
                                }
                            )
                        ],
                    ),
                ),
            )

            # Poll until the run finishes, then print results
            while True:
                run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
                if run.status in ["completed", "failed"]:
                    output_items = list(
                        client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id)
                    )
                    print("Evaluation Results:")
                    print(f"Status: {run.status}")
                    print(f"Report URL: {run.report_url}")
                    pprint(output_items)
                    break
                print("Waiting for evaluation to complete...")
                time.sleep(10)
except Exception as e:
    print(f"Error during evaluation: {e}")
# Custom claim-based faithfulness evaluation with the OpenAI API
import os
import re

from openai import Client

# Initialize OpenAI client for evaluation
client = Client(api_key=os.environ.get("OPENAI_API_KEY"))


def extract_claims(query: str, response: str, model: str = "gpt-4o-mini") -> str:
    """
    Extract verifiable claims from a RAG response.
    A claim is any sentence or part expressing a verifiable fact.
    """
    preamble = (
        "You are shown a prompt and a completion. Identify the main claims "
        "in the completion. A claim is any sentence or part that expresses "
        "a verifiable fact. Return a bullet list, one claim per line. "
        "No explanations, just the claims."
    )
    prompt = f"""{preamble}

PROMPT: {query}
COMPLETION: {response}"""
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(f"Error extracting claims: {e}")
        return ""


def assess_claims(query: str, claims: str, context: str, model: str = "gpt-4o-mini") -> str:
    """
    Assess which claims are supported by the context.
    Returns claims with SUPPORTED=1 or SUPPORTED=0 tags.
    """
    preamble = (
        "You are shown a prompt, context, and list of claims. "
        "Check which claims are supported by the context. "
        "Return the list exactly as is, appending SUPPORTED=1 if supported, "
        "SUPPORTED=0 if not. No explanations."
    )
    prompt = f"""{preamble}

PROMPT: {query}
CONTEXT: {context}
CLAIMS: {claims}"""
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(f"Error assessing claims: {e}")
        return ""


def calculate_faithfulness(assessed_claims: str) -> float:
    """Calculate faithfulness score: proportion of supported claims."""
    supported = len(re.findall(r"SUPPORTED=1", assessed_claims))
    total = supported + len(re.findall(r"SUPPORTED=0", assessed_claims))
    return supported / total if total > 0 else 0.0


def evaluate_rag_response(query: str, response: str, retrieved_contexts: list) -> dict:
    """
    Complete RAG evaluation using the claim-based approach.
    Returns the faithfulness score plus the intermediate claim lists.
    """
    # Step 1: Extract claims from the response
    claims = extract_claims(query, response)
    print(f"Extracted claims:\n{claims}\n")

    # Step 2: Assess claims against retrieved contexts (faithfulness)
    context_str = "\n".join(retrieved_contexts)
    assessed_faithfulness = assess_claims(query, claims, context_str)
    faithfulness_score = calculate_faithfulness(assessed_faithfulness)
    print(f"Faithfulness assessment:\n{assessed_faithfulness}")
    print(f"Faithfulness Score: {faithfulness_score:.2f}\n")

    return {
        "faithfulness": faithfulness_score,
        "claims_extracted": claims,
        "claims_assessed": assessed_faithfulness,
    }


if __name__ == "__main__":
    query = "How has Apple's total net sales changed over time?"
    response = (
        "Apple's total net sales experienced a decline over the last year. "
        "The three-month period ended July 1, 2023, saw total net sales of $81,797 million, "
        "which was a 1% decrease from the same period in 2022."
    )
    contexts = [
        "Products and Services Performance: Total net sales $81,797 million for July 1, 2023.",
        "Comparison: Total net sales $82,959 million for June 25, 2022.",
    ]

    results = evaluate_rag_response(query, response, contexts)

    if results["faithfulness"] >= 0.9:
        print("✓ High faithfulness: Response is well-grounded")
    elif results["faithfulness"] >= 0.7:
        print("⚠ Moderate faithfulness: Some claims may need verification")
    else:
        print("✗ Low faithfulness: Significant hallucination risk")
# Retrieval-quality evaluation with azure-ai-evaluation's DocumentRetrievalEvaluator
from azure.ai.evaluation import DocumentRetrievalEvaluator

# Define ground truth relevance labels for documents
# Labels typically come from human or LLM judges
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},  # Highly relevant
    {"document_id": "2", "query_relevance_label": 2},  # Moderately relevant
    {"document_id": "3", "query_relevance_label": 3},  # Relevant
    {"document_id": "4", "query_relevance_label": 1},  # Slightly relevant
    {"document_id": "5", "query_relevance_label": 0},  # Not relevant
]
ground_truth_label_min = 0
ground_truth_label_max = 4

# Retrieved documents from your search system
# These include relevance scores from your retriever
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},  # Correctly ranked high
    {"document_id": "6", "relevance_score": 35.8},  # Unknown document (a "hole")
    {"document_id": "3", "relevance_score": 29.2},  # Correctly ranked
    {"document_id": "5", "relevance_score": 25.4},  # Should be low relevance
    {"document_id": "7", "relevance_score": 18.8},  # Unknown document (a "hole")
]

# Initialize evaluator with custom thresholds
evaluator = DocumentRetrievalEvaluator(
    ground_truth_label_min=ground_truth_label_min,
    ground_truth_label_max=ground_truth_label_max,
    # Override default thresholds for pass/fail
    top1_relevance_threshold=50.0,
    top3_max_relevance_threshold=50.0,
    total_retrieved_documents_threshold=50,
    total_ground_truth_documents_threshold=50,
)

try:
    results = evaluator(
        retrieval_ground_truth=retrieval_ground_truth,
        retrieved_documents=retrieved_documents,
    )

    print("Document Retrieval Evaluation Results:")
    print(f"NDCG@3: {results.get('ndcg@3', 'N/A'):.4f} - {results.get('ndcg@3_result', 'N/A')}")
    print(f"Fidelity: {results.get('fidelity', 'N/A'):.4f} - {results.get('fidelity_result', 'N/A')}")
    print(f"XDCG@3: {results.get('xdcg@3', 'N/A'):.4f} - {results.get('xdcg@3_result', 'N/A')}")
    print(f"Top-1 Relevance: {results.get('top1_relevance', 'N/A')} - {results.get('top1_relevance_result', 'N/A')}")
    print(f"Holes: {results.get('holes', 'N/A')} (lower is better)")

    print("\nInterpretation:")
    if results.get('ndcg@3_result') == 'pass':
        print("✓ Ranking quality is good (NDCG passed)")
    else:
        print("✗ Ranking quality needs improvement")
    if results.get('fidelity_result') == 'pass':
        print("✓ Retrieved documents reflect query requirements well")
    else:
        print("✗ Fidelity issues: may need better top-k or chunking")
except Exception as e:
    print(f"Evaluation error: {e}")

# Parameter sweep recommendation:
# Use these metrics to optimize search parameters:
# - Try different top_k values (5, 10, 15, 20)
# - Test vector vs. semantic search
# - Adjust chunk sizes (256, 512, 1024 tokens)
# - Compare embedding models
# Run the evaluator on each configuration and select the parameters with the highest NDCG and Fidelity.
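A sketch of that sweep, reusing the evaluator and ground-truth labels above; run_search is a hypothetical stand-in for your own retrieval call:

# Hypothetical sweep over top_k: run_search must return documents in the
# {"document_id", "relevance_score"} shape used above.
best_k, best_score = None, -1.0
for top_k in [5, 10, 15, 20]:
    retrieved = run_search("your test query", top_k=top_k)   # hypothetical retriever
    results = evaluator(
        retrieval_ground_truth=retrieval_ground_truth,
        retrieved_documents=retrieved,
    )
    score = results["ndcg@3"] + results["fidelity"]   # combine ranking quality and coverage
    if score > best_score:
        best_k, best_score = top_k, score

print(f"Best top_k: {best_k} (NDCG@3 + Fidelity = {best_score:.3f})")
# Sweep chunk size, vector vs. semantic search, and embedding models the same way,
# regenerating ground-truth labels whenever the chunking changes.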
// RAG evaluation with Ragas (TypeScript variant of the Python example above)
import { evaluate } from 'ragas';
import { ChatOpenAI } from '@langchain/openai';
import { LangchainLLMWrapper } from 'ragas/dist/llms';
import { Dataset } from 'datasets';

// Define RAG pipeline results
const ragData = [
  {
    user_input: "What are the hardware requirements for building Milvus from source?",
    retrieved_contexts: [
      "Hardware Requirements: 8GB of RAM, 50GB of free disk space",
      "Building Milvus on Linux requires Ubuntu or CentOS systems"
    ],
    response: "The hardware requirements are 8GB of RAM and 50GB of free disk space.",
    reference: "For building Milvus from source, you need 8GB of RAM and 50GB of free disk space."
  }
];

// Initialize evaluator LLM
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const evaluatorLLM = new LangchainLLMWrapper(llm);

async function evaluateRAG() {
  console.log("Starting RAG evaluation...");
  // Wrap ragData in a Dataset if your ragas build requires it; options mirror the Python API
  const results = await evaluate({ dataset: ragData, llm: evaluatorLLM });

  console.log("\nEvaluation Results:");
  for (const [metric, score] of Object.entries(results)) {
    console.log(`${metric}: ${score.toFixed(4)}`);
  }

  console.log("\nInterpretation:");
  console.log("- Faithfulness > 0.9: Response is well-grounded");
  console.log("- Answer Relevancy > 0.85: Response directly addresses query");
}

evaluateRAG().catch(console.error);
// Claim-based faithfulness evaluation with the OpenAI Node SDK (TypeScript variant of the Python example above)
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractClaims(query: string, response: string, model = "gpt-4o-mini"): Promise<string> {
  const preamble = `You are shown a prompt and a completion. Identify the main claims
in the completion. A claim is any sentence or part that expresses
a verifiable fact. Return a bullet list, one claim per line.
No explanations, just the claims.`;
  const prompt = `${preamble}

PROMPT: ${query}
COMPLETION: ${response}`;

  const completion = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0
  });
  return completion.choices[0].message.content || "";
}

async function assessClaims(query: string, claims: string, context: string, model = "gpt-4o-mini"): Promise<string> {
  const preamble = `You are shown a prompt, context, and list of claims.
Check which claims are supported by the context.
Return the list exactly as is, appending SUPPORTED=1 if supported,
SUPPORTED=0 if not. No explanations.`;
  const prompt = `${preamble}

PROMPT: ${query}
CONTEXT: ${context}
CLAIMS: ${claims}`;

  const completion = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0
  });
  return completion.choices[0].message.content || "";
}

function calculateFaithfulness(assessedClaims: string): number {
  const supported = (assessedClaims.match(/SUPPORTED=1/g) || []).length;
  const total = supported + (assessedClaims.match(/SUPPORTED=0/g) || []).length;
  return total > 0 ? supported / total : 0;
}

async function evaluateRAGResponse(query: string, response: string, retrievedContexts: string[]) {
  const claims = await extractClaims(query, response);
  console.log(`Extracted claims:\n${claims}\n`);

  const contextStr = retrievedContexts.join('\n');
  const assessed = await assessClaims(query, claims, contextStr);
  const faithfulness = calculateFaithfulness(assessed);
  console.log(`Faithfulness assessment:\n${assessed}`);
  console.log(`Faithfulness Score: ${faithfulness.toFixed(2)}\n`);

  return { faithfulness, claims, assessed };
}

const query = "How has Apple's total net sales changed over time?";
const response = `Apple's total net sales experienced a decline over the last year.
The three-month period ended July 1, 2023, saw total net sales of $81,797 million,
which was a 1% decrease from the same period in 2022.`;
const contexts = [
  "Products and Services Performance: Total net sales $81,797 million for July 1, 2023.",
  "Comparison: Total net sales $82,959 million for June 25, 2022."
];

evaluateRAGResponse(query, response, contexts)
  .then((results) => {
    if (results.faithfulness >= 0.9) {
      console.log("✓ High faithfulness: Response is well-grounded");
    } else if (results.faithfulness >= 0.7) {
      console.log("⚠ Moderate faithfulness: Some claims may need verification");
    } else {
      console.log("✗ Low faithfulness: Significant hallucination risk");
    }
  })
  .catch(console.error);
Critical Evaluation Mistakes
These pitfalls can invalidate your entire evaluation process, leading to false confidence in broken systems.
Self-Preference Bias : Using the same LLM for both RAG generation and evaluation introduces bias. The model favors its own writing style and may overlook factual errors. Solution : Always use a separate, strong evaluator model (GPT-4o, Claude 3.5 Sonnet).
Single-Score Obsession : Relying on aggregate metrics (e.g., “faithfulness = 0.85”) without examining claim-level breakdowns. You miss specific failure modes like numeric hallucinations or temporal errors. Solution : Analyze individual claims and failure patterns.
Ignoring Position Sensitivity : Critical evidence in the middle of long contexts often gets overlooked (the “Lost in the Middle” problem). Solution : Test retrieval quality across different context positions and chunk orders.
Metadata Bias : Evaluator LLMs can be swayed by source prestige or author names in contexts. Solution : Run counterfactual tests—swap high-prestige and low-prestige sources while keeping content identical.
Weak Evaluator Models : Using GPT-3.5 or Haiku for evaluation saves money but produces unreliable judgments. Evaluation is a complex reasoning task. Solution : Use frontier models (GPT-4o, Claude 3.5) for critical systems; validate cheaper models against human judgments.
No Versioning : Not tracking evaluation dataset versions, rubrics, and prompts makes it impossible to measure improvement. Solution : Git-track all evaluation artifacts and log every run.
Production Disconnect : Evaluating on synthetic data that doesn’t match production query distribution. Solution : Continuously sample production queries (with privacy safeguards) and add them to your evaluation set.
Skipping Human Audits : Blindly trusting LLM-judge outputs without periodic human review. Solution : For high-impact scenarios, have domain experts audit 5-10% of evaluations monthly.
RAG evaluation costs scale with dataset size and metric complexity. Here’s the current pricing landscape:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Use Case |
| --- | --- | --- | --- | --- |
| GPT-4o | $5.00 | $15.00 | 128K | High-accuracy evaluation |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Cost-efficient evaluation |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | Balanced accuracy/cost |
| Haiku-3.5 | $1.25 | $5.00 | 200K | Fast, cheap evaluation |
Evaluation Cost Formula
Claim-based evaluation requires 2-3 LLM calls per query: extract claims, assess faithfulness, and optionally verify against gold answers. For a 1,000-query dataset:
GPT-4o-mini: ~$0.003 per query = $3 total
GPT-4o: ~$0.10 per query = $100 total
Claude 3.5 Sonnet: ~$0.06 per query = $60 total
Pricing data sourced from OpenAI and Anthropic (October-November 2024).
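To sanity-check these per-query figures, a small sketch; the token counts are assumptions chosen to roughly match the numbers above (three calls per query with about 5,000 input and 500 output tokens each):

# Back-of-the-envelope evaluation cost, using the per-1M-token prices from the table above
def eval_cost(queries: int, input_price: float, output_price: float,
              calls: int = 3, in_tokens: int = 5000, out_tokens: int = 500) -> float:
    per_call = (in_tokens * input_price + out_tokens * output_price) / 1_000_000
    return queries * calls * per_call

for name, inp, out in [("GPT-4o-mini", 0.15, 0.60), ("GPT-4o", 5.00, 15.00),
                       ("Claude 3.5 Sonnet", 3.00, 15.00)]:
    print(f"{name}: ${eval_cost(1000, inp, out):.2f} per 1,000 queries")
# GPT-4o-mini: ~$3.15, GPT-4o: ~$97.50, Claude 3.5 Sonnet: ~$67.50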
Cost Optimization Strategies :
Use GPT-4o-mini for 90% of evaluations : Only use GPT-4o for critical failure analysis.
Batch processing : Many frameworks support batch evaluation to reduce API overhead.
Early stopping : If faithfulness drops below 0.5, stop the evaluation and fix the pipeline.
Sampling : For large datasets, evaluate on a statistically significant sample (e.g., 100-200 queries).
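A minimal sketch combining the sampling and early-stopping strategies; test_set is a placeholder for your own query/response/context records, and evaluate_rag_response is the claim-based helper shown earlier in this guide:

import random

sample = random.sample(test_set, k=min(150, len(test_set)))   # evaluate a manageable subset
scores = []
for i, case in enumerate(sample, start=1):
    result = evaluate_rag_response(case["query"], case["response"], case["contexts"])
    scores.append(result["faithfulness"])
    if i >= 20 and sum(scores) / len(scores) < 0.5:   # early stop: pipeline is clearly broken
        print(f"Stopping early after {i} queries; fix the pipeline before spending more.")
        break

print(f"Mean faithfulness over {len(scores)} sampled queries: {sum(scores) / len(scores):.3f}")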
| Metric | What It Measures | Good Score | Red Flag | Action If Low |
| --- | --- | --- | --- | --- |
| Faithfulness | Claims supported by context | > 0.9 | < 0.7 | Reduce hallucinations: add context, lower temperature, better prompts |
| Answer Relevancy | Response addresses query | > 0.85 | < 0.7 | Improve retrieval: better embeddings, increase top-k, re-rank |
| Context Precision | Retrieved docs are relevant | > 0.8 | < 0.6 | Tune search: adjust chunk size, try hybrid search |
| Context Recall | All necessary docs retrieved | > 0.9 | < 0.75 | Expand search: more sources, better indexing |
| NDCG@3 | Ranking quality | > 0.7 | < 0.5 | Re-rank results, tune similarity thresholds |
| Fidelity | Query requirements met | > 0.7 | < 0.5 | Increase top-k, improve query understanding |
Faithfulness is your precision guardrail—measure it first to catch hallucinations.
Answer Relevance ensures you’re not missing critical information (recall).
Context Precision diagnoses retrieval problems before they contaminate generation.
Claim-based evaluation provides granular insight into specific failure modes.
Separate evaluator LLM is non-negotiable to avoid bias.
Cost-efficient evaluation is possible with GPT-4o-mini or Haiku-3.5.
Automate evaluation in CI/CD to catch regressions before production.
Further reading: Evals Hub — a complete guide to LLM evaluation strategies and frameworks.