RAG systems fail silently. Your pipeline can return confident-sounding answers that are complete fabrications, or retrieve irrelevant documents that waste tokens and money. Without proper evaluation, these failures compound—costing thousands in API fees while eroding user trust. This guide covers the three critical RAG metrics (faithfulness, relevance, precision) with production-ready implementations to catch these failures before they reach production.
Key Takeaway
Faithfulness acts as a precision check on generation (no hallucinated claims), answer relevance acts as a recall check (no missing information), and context precision measures retrieval quality (no noise). You need all three to diagnose RAG failures.
RAG evaluation isn’t optional—it’s the difference between a system that works and one that appears to work. Consider these real-world impacts:
Cost Impact : A financial reporting RAG system evaluated by Docugami achieved 100% faithfulness using claim-based evaluation. When the same response was deliberately corrupted (changing “2022” to “1922”), faithfulness dropped to 50%, correctly flagging the hallucination. Without this metric, an error like that could flow straight into costly, bad financial advice (Cohere, 2024).
Retrieval Quality : Microsoft Foundry users discovered their retrieval system passed NDCG@3 (0.646) but failed Fidelity (0.019): the ranking looked good, but critical documents were missing. That finding guided parameter sweeps across vector vs. semantic search, different top-k values, and chunk sizes (Microsoft Foundry, 2024).
API Costs : Using GPT-4o-mini ($0.15/$0.60 per 1M input/output tokens) for evaluation instead of GPT-4o ($5/$15) cuts evaluation costs by roughly 30x while maintaining quality for most metrics. For a 10,000-query test set, that’s about $9 vs. $300 per evaluation run (OpenAI Pricing, 2024).
Faithfulness measures whether your generated response contains only information supported by retrieved contexts. It’s the primary defense against hallucinations.
How It Works : The evaluator decomposes your response into individual claims (verifiable statements), then checks each claim against the retrieved context. If a claim isn’t supported, faithfulness drops.
Why It Matters : A faithfulness score below 0.9 indicates your model is adding information not present in sources. For legal or financial RAG, this is unacceptable.
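To make the arithmetic concrete, here is a minimal sketch with hand-labeled claims; the full LLM-driven version appears in the claim-based example later in this guide.

# Illustrative only: claim labels here are hand-assigned, not produced by an LLM judge
claims = {
    "Milvus requires 8GB of RAM to build from source": True,    # supported by retrieved context
    "Milvus requires 50GB of free disk space": True,            # supported by retrieved context
    "Milvus builds in under 10 minutes": False,                 # not present in any retrieved context
}
faithfulness = sum(claims.values()) / len(claims)
print(f"Faithfulness: {faithfulness:.2f}")  # 0.67 -> below the 0.9 threshold, flag for review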
Answer relevance evaluates whether the response comprehensively addresses the user’s query without missing critical information.
Components :
Accuracy : Does the answer match the query intent?
Completeness : Are all aspects of the question addressed?
Directness : Is the response focused or meandering?
Scoring : High relevance means the response contains all necessary information to satisfy the query. Low relevance indicates gaps that require follow-up questions.
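One common way to score this (roughly how Ragas computes its answer relevancy metric) is to have an LLM generate the questions a response would answer, then measure how close those questions are to the original query. A hedged sketch, assuming an OpenAI API key and the text-embedding-3-small embedding model:

import os
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def answer_relevancy(query: str, response: str, n_questions: int = 3) -> float:
    """Approximate answer relevancy: how well does the response 'point back' to the query?"""
    # 1. Ask an LLM to reverse-engineer questions that the response answers
    gen = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write {n_questions} short questions that the following text answers, "
                       f"one per line, no numbering:\n\n{response}",
        }],
        temperature=0,
    )
    questions = [q.strip() for q in gen.choices[0].message.content.splitlines() if q.strip()]

    # 2. Embed the original query and the generated questions
    emb = client.embeddings.create(model="text-embedding-3-small", input=[query] + questions)
    vectors = [np.array(item.embedding) for item in emb.data]
    query_vec, question_vecs = vectors[0], vectors[1:]

    # 3. Mean cosine similarity between query and generated questions = relevancy estimate
    sims = [
        float(np.dot(query_vec, qv) / (np.linalg.norm(query_vec) * np.linalg.norm(qv)))
        for qv in question_vecs
    ]
    return sum(sims) / len(sims) if sims else 0.0

print(answer_relevancy(
    "What programming language is used for Knowhere?",
    "Knowhere is written in C++.",
))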
Context precision measures whether retrieved documents are actually relevant to the query. It’s calculated using Mean Average Precision (mAP) against ground truth labels.
Why Separate Retrieval Evaluation? : If your faithfulness is high but relevance is low, the problem is retrieval—you’re getting irrelevant documents. If faithfulness is low but relevance is high, the problem is generation—your model is hallucinating despite good context.
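A minimal sketch of that calculation, assuming binary ground-truth relevance labels for each retrieved chunk in ranked order:

def average_precision(relevance: list[int]) -> float:
    """Average precision for one query: relevance[i] is 1 if the i-th retrieved chunk is relevant."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k   # precision@k, counted only at relevant positions
    return precision_sum / hits if hits else 0.0

# One query where the relevant chunks sit at ranks 1 and 3 of five retrieved chunks
print(average_precision([1, 0, 1, 0, 0]))   # 0.83 -> relevant chunks ranked near the top
print(average_precision([0, 0, 0, 1, 1]))   # 0.33 -> same recall, but buried under noise

# Context precision over a test set is the mean of these per-query values (mAP)
queries = [[1, 0, 1, 0, 0], [0, 0, 0, 1, 1]]
print(sum(average_precision(q) for q in queries) / len(queries))   # 0.58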
Define your evaluation dataset : Collect 50-100 diverse queries with retrieved contexts, generated responses, and ground truth answers. Include edge cases: ambiguous queries, multi-hop questions, and queries requiring synthesis.
Choose your metrics : Start with Faithfulness, Answer Relevancy, and Context Precision. Add Context Recall if you need to debug retrieval gaps.
Select evaluator LLM : Use a model different from your RAG LLM to avoid bias. For cost-efficiency, use GPT-4o-mini or Haiku-3.5. For maximum accuracy, use GPT-4o or Claude 3.5 Sonnet.
Run baseline evaluation : Execute metrics on your current pipeline to establish baseline scores.
Iterate and optimize : Use results to tune retrieval (chunk size, top-k, search type) and generation (prompt engineering, temperature).
Automate in CI/CD : Run evaluations on every pipeline change to catch regressions (a minimal gating sketch follows the Ragas example below).
# RAG evaluation with Ragas: faithfulness, answer relevancy, context precision, context recall
from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from datasets import Dataset
import pandas as pd

# Define RAG pipeline results
rag_data = [
    {
        "user_input": "What are the hardware requirements for building Milvus from source?",
        "retrieved_contexts": [
            "Hardware Requirements: 8GB of RAM, 50GB of free disk space",
            "Building Milvus on Linux requires Ubuntu or CentOS systems",
        ],
        "response": "The hardware requirements are 8GB of RAM and 50GB of free disk space.",
        "reference": "For building Milvus from source, you need 8GB of RAM and 50GB of free disk space.",
    },
    {
        "user_input": "What programming language is used for Knowhere?",
        "retrieved_contexts": [
            "Knowhere is the algorithm library of Milvus",
            "The library is written in C++ for performance",
        ],
        "response": "Knowhere is written in C++.",
        "reference": "The programming language used to write Knowhere is C++.",
    },
]

# Convert to a dataset Ragas can evaluate
df = pd.DataFrame(rag_data)
dataset = Dataset.from_pandas(df)

# Initialize evaluator LLM (different from RAG LLM to avoid bias)
# Use GPT-4o-mini for cost-efficiency or GPT-4o for accuracy
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator_llm = LangchainLLMWrapper(llm)

# Faithfulness: measures if response claims are supported by contexts
# Answer Relevancy: measures relevance of the response to the query
# Context Precision: measures if retrieved contexts are relevant
# Context Recall: measures if contexts contain all ground-truth information
metrics = [
    Faithfulness(llm=evaluator_llm),
    AnswerRelevancy(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
]

print("Starting RAG evaluation with Ragas...")
results = evaluate(dataset=dataset, metrics=metrics)

print("\nEvaluation Results:")
for metric_name, score in results.items():
    print(f"{metric_name}: {score:.4f}")

print("\nInterpretation:")
print("- Faithfulness > 0.9: Response is well-grounded in retrieved contexts")
print("- Answer Relevancy > 0.85: Response directly addresses the query")
print("- Context Precision > 0.8: Retrieved contexts are relevant")
print("- Context Recall > 0.9: All necessary information was retrieved")

# For production: save results for tracking
# results_df = results.to_pandas()
# results_df.to_csv("rag_evaluation_results.csv", index=False)
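For step 6 (automating evaluation in CI/CD), a minimal gating sketch; it assumes the evaluation job saved the per-sample scores to rag_evaluation_results.csv as sketched above, and that the column names match the metric names used here:

# Minimal regression gate: fail the CI job if any metric drops below its threshold
import sys
import pandas as pd

THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.90,
}

scores = pd.read_csv("rag_evaluation_results.csv").mean(numeric_only=True).to_dict()
failures = [
    f"{metric}: {scores.get(metric, 0.0):.3f} < {minimum:.2f}"
    for metric, minimum in THRESHOLDS.items()
    if scores.get(metric, 0.0) < minimum
]
if failures:
    print("RAG evaluation regression detected:\n" + "\n".join(failures))
    sys.exit(1)   # non-zero exit fails the pipeline
print("All RAG metrics are above their thresholds.")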
# Cloud-based evaluation with Azure AI Foundry built-in evaluators
# (groundedness ~ faithfulness, relevance, retrieval)
import os
import time
from pprint import pprint

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileContent,
    SourceFileContentContent,
)
from dotenv import load_dotenv

load_dotenv()

# Configuration for Azure AI Project
endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
model_deployment_name = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "gpt-4o-mini")

# Define test data for RAG evaluation
query = "What is the cheapest available tent of Contoso Outdoor?"
context = (
    "Contoso Outdoor is a leading retailer specializing in outdoor gear and equipment. "
    "Contoso Product Catalog: 1. tent A - $99.99, lightweight 2-person tent; "
    "2. tent B - $149.99, 4-person family tent; 3. tent C - $199.99, durable 6-person expedition tent."
)
response = "The cheapest available tent is tent A, priced at $99.99."
ground_truth = "The cheapest available tent is tent A, priced at $99.99."

# Define evaluation criteria for system and process evaluation
testing_criteria = [
    # System evaluation: Groundedness (faithfulness)
    {
        "type": "azure_ai_evaluator",
        "name": "groundedness",
        "evaluator_name": "builtin.groundedness",
        "initialization_parameters": {"deployment_name": model_deployment_name},
        "data_mapping": {
            "context": "{{item.context}}",
            "query": "{{item.query}}",
            "response": "{{item.response}}",
        },
    },
    # System evaluation: Relevance
    {
        "type": "azure_ai_evaluator",
        "name": "relevance",
        "evaluator_name": "builtin.relevance",
        "initialization_parameters": {"deployment_name": model_deployment_name},
        "data_mapping": {
            "query": "{{item.query}}",
            "response": "{{item.response}}",
        },
    },
    # Process evaluation: Retrieval
    {
        "type": "azure_ai_evaluator",
        "name": "retrieval",
        "evaluator_name": "builtin.retrieval",
        "initialization_parameters": {"deployment_name": model_deployment_name},
        "data_mapping": {
            "context": "{{item.context}}",
            "query": "{{item.query}}",
        },
    },
]

try:
    with DefaultAzureCredential() as credential:
        with AIProjectClient(endpoint=endpoint, credential=credential) as project_client:
            client = project_client.get_openai_client()

            # Create evaluation group
            data_source_config = {
                "type": "custom",
                "item_schema": {
                    "type": "object",
                    "properties": {
                        "context": {"type": "string"},
                        "query": {"type": "string"},
                        "response": {"type": "string"},
                        "ground_truth": {"type": "string"},
                    },
                },
                "include_sample_schema": True,
            }
            eval_object = client.evals.create(
                name="RAG Evaluation: Faithfulness and Relevance",
                data_source_config=data_source_config,
                testing_criteria=testing_criteria,
            )

            # Create evaluation run with inline data
            eval_run = client.evals.runs.create(
                eval_id=eval_object.id,
                metadata={"scenario": "rag-metrics"},
                data_source=CreateEvalJSONLRunDataSourceParam(
                    type="jsonl",
                    source=SourceFileContent(
                        type="file_content",
                        content=[
                            SourceFileContentContent(
                                item={
                                    "context": context,
                                    "query": query,
                                    "response": response,
                                    "ground_truth": ground_truth,
                                }
                            )
                        ],
                    ),
                ),
            )

            # Poll until the run finishes, then print results
            while True:
                run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)
                if run.status in ["completed", "failed"]:
                    output_items = list(
                        client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id)
                    )
                    print("Evaluation Results:")
                    print(f"Status: {run.status}")
                    print(f"Report URL: {run.report_url}")
                    pprint(output_items)
                    break
                print("Waiting for evaluation to complete...")
                time.sleep(10)
except Exception as e:
    print(f"Error during evaluation: {e}")
# Custom claim-based faithfulness evaluation with the OpenAI API
import os
import re

from openai import Client

# Initialize OpenAI client for evaluation
client = Client(api_key=os.environ.get("OPENAI_API_KEY"))


def extract_claims(query: str, response: str, model: str = "gpt-4o-mini") -> str:
    """
    Extract verifiable claims from a RAG response.
    A claim is any sentence or part expressing a verifiable fact.
    """
    preamble = (
        "You are shown a prompt and a completion. Identify the main claims "
        "in the completion. A claim is any sentence or part that expresses "
        "a verifiable fact. Return a bullet list, one claim per line. "
        "No explanations, just the claims."
    )
    prompt = f"""{preamble}

PROMPT: {query}
COMPLETION: {response}"""
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(f"Error extracting claims: {e}")
        return ""


def assess_claims(query: str, claims: str, context: str, model: str = "gpt-4o-mini") -> str:
    """
    Assess which claims are supported by the context.
    Returns claims with SUPPORTED=1 or SUPPORTED=0 tags.
    """
    preamble = (
        "You are shown a prompt, context, and list of claims. "
        "Check which claims are supported by the context. "
        "Return the list exactly as is, appending SUPPORTED=1 if supported, "
        "SUPPORTED=0 if not. No explanations."
    )
    prompt = f"""{preamble}

PROMPT: {query}
CONTEXT: {context}
CLAIMS: {claims}"""
    try:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(f"Error assessing claims: {e}")
        return ""


def calculate_faithfulness(assessed_claims: str) -> float:
    """Calculate faithfulness score: proportion of supported claims."""
    supported = len(re.findall(r"SUPPORTED=1", assessed_claims))
    total = supported + len(re.findall(r"SUPPORTED=0", assessed_claims))
    return supported / total if total > 0 else 0.0


def evaluate_rag_response(query: str, response: str, retrieved_contexts: list) -> dict:
    """
    Complete RAG evaluation using the claim-based approach.
    Returns the faithfulness score plus the intermediate claim lists.
    """
    # Step 1: Extract claims from the response
    claims = extract_claims(query, response)
    print(f"Extracted claims:\n{claims}\n")

    # Step 2: Assess claims against retrieved contexts (faithfulness)
    context_str = "\n".join(retrieved_contexts)
    assessed_faithfulness = assess_claims(query, claims, context_str)
    faithfulness_score = calculate_faithfulness(assessed_faithfulness)
    print(f"Faithfulness assessment:\n{assessed_faithfulness}")
    print(f"Faithfulness Score: {faithfulness_score:.2f}\n")

    return {
        "faithfulness": faithfulness_score,
        "claims_extracted": claims,
        "claims_assessed": assessed_faithfulness,
    }


if __name__ == "__main__":
    query = "How has Apple's total net sales changed over time?"
    response = (
        "Apple's total net sales experienced a decline over the last year. "
        "The three-month period ended July 1, 2023, saw total net sales of $81,797 million, "
        "which was a 1% decrease from the same period in 2022."
    )
    contexts = [
        "Products and Services Performance: Total net sales $81,797 million for July 1, 2023.",
        "Comparison: Total net sales $82,959 million for June 25, 2022.",
    ]

    results = evaluate_rag_response(query, response, contexts)

    if results["faithfulness"] >= 0.9:
        print("✓ High faithfulness: Response is well-grounded")
    elif results["faithfulness"] >= 0.7:
        print("⚠ Moderate faithfulness: Some claims may need verification")
    else:
        print("✗ Low faithfulness: Significant hallucination risk")
# Retrieval-quality evaluation with azure-ai-evaluation's DocumentRetrievalEvaluator
from azure.ai.evaluation import DocumentRetrievalEvaluator

# Define ground truth relevance labels for documents
# Labels typically come from human or LLM judges
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},  # Highly relevant
    {"document_id": "2", "query_relevance_label": 2},  # Moderately relevant
    {"document_id": "3", "query_relevance_label": 3},  # Relevant
    {"document_id": "4", "query_relevance_label": 1},  # Slightly relevant
    {"document_id": "5", "query_relevance_label": 0},  # Not relevant
]
ground_truth_label_min = 0
ground_truth_label_max = 4

# Retrieved documents from your search system
# These include relevance scores from your retriever
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},  # Correctly ranked high
    {"document_id": "6", "relevance_score": 35.8},  # Unknown document (a "hole")
    {"document_id": "3", "relevance_score": 29.2},  # Correctly ranked
    {"document_id": "5", "relevance_score": 25.4},  # Should be low relevance
    {"document_id": "7", "relevance_score": 18.8},  # Unknown document (a "hole")
]

# Initialize evaluator with custom thresholds
evaluator = DocumentRetrievalEvaluator(
    ground_truth_label_min=ground_truth_label_min,
    ground_truth_label_max=ground_truth_label_max,
    # Override default thresholds for pass/fail
    top1_relevance_threshold=50.0,
    top3_max_relevance_threshold=50.0,
    total_retrieved_documents_threshold=50,
    total_ground_truth_documents_threshold=50,
)

try:
    results = evaluator(
        retrieval_ground_truth=retrieval_ground_truth,
        retrieved_documents=retrieved_documents,
    )

    print("Document Retrieval Evaluation Results:")
    print(f"NDCG@3: {results.get('ndcg@3', 'N/A'):.4f} - {results.get('ndcg@3_result', 'N/A')}")
    print(f"Fidelity: {results.get('fidelity', 'N/A'):.4f} - {results.get('fidelity_result', 'N/A')}")
    print(f"XDCG@3: {results.get('xdcg@3', 'N/A'):.4f} - {results.get('xdcg@3_result', 'N/A')}")
    print(f"Top-1 Relevance: {results.get('top1_relevance', 'N/A')} - {results.get('top1_relevance_result', 'N/A')}")
    print(f"Holes: {results.get('holes', 'N/A')} (lower is better)")

    print("\nInterpretation:")
    if results.get('ndcg@3_result') == 'pass':
        print("✓ Ranking quality is good (NDCG passed)")
    else:
        print("✗ Ranking quality needs improvement")
    if results.get('fidelity_result') == 'pass':
        print("✓ Retrieved documents reflect query requirements well")
    else:
        print("✗ Fidelity issues: may need better top-k or chunking")
except Exception as e:
    print(f"Evaluation error: {e}")

# Parameter sweep recommendation:
# Use these metrics to optimize search parameters:
# - Try different top_k values (5, 10, 15, 20)
# - Test vector vs. semantic search
# - Adjust chunk sizes (256, 512, 1024 tokens)
# - Compare embedding models
# Run the evaluator on each configuration and select the parameters with the highest NDCG and Fidelity.
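A sketch of that sweep, reusing the evaluator and ground-truth labels above; run_search is a hypothetical stand-in for your own retrieval call:

# Hypothetical sweep over top_k: run_search must return documents in the
# {"document_id", "relevance_score"} shape used above.
best_k, best_score = None, -1.0
for top_k in [5, 10, 15, 20]:
    retrieved = run_search("your test query", top_k=top_k)   # hypothetical retriever
    results = evaluator(
        retrieval_ground_truth=retrieval_ground_truth,
        retrieved_documents=retrieved,
    )
    score = results["ndcg@3"] + results["fidelity"]   # combine ranking quality and coverage
    if score > best_score:
        best_k, best_score = top_k, score

print(f"Best top_k: {best_k} (NDCG@3 + Fidelity = {best_score:.3f})")
# Sweep chunk size, vector vs. semantic search, and embedding models the same way,
# regenerating ground-truth labels whenever the chunking changes.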
// RAG evaluation with Ragas (TypeScript variant of the Python example above)
import { evaluate } from 'ragas';
import { ChatOpenAI } from '@langchain/openai';
import { LangchainLLMWrapper } from 'ragas/dist/llms';
import { Dataset } from 'datasets';

// Define RAG pipeline results
const ragData = [
  {
    user_input: "What are the hardware requirements for building Milvus from source?",
    retrieved_contexts: [
      "Hardware Requirements: 8GB of RAM, 50GB of free disk space",
      "Building Milvus on Linux requires Ubuntu or CentOS systems"
    ],
    response: "The hardware requirements are 8GB of RAM and 50GB of free disk space.",
    reference: "For building Milvus from source, you need 8GB of RAM and 50GB of free disk space."
  }
];

// Initialize evaluator LLM
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const evaluatorLLM = new LangchainLLMWrapper(llm);

async function evaluateRAG() {
  console.log("Starting RAG evaluation...");
  // Wrap ragData in a Dataset if your ragas build requires it; options mirror the Python API
  const results = await evaluate({ dataset: ragData, llm: evaluatorLLM });

  console.log("\nEvaluation Results:");
  for (const [metric, score] of Object.entries(results)) {
    console.log(`${metric}: ${score.toFixed(4)}`);
  }

  console.log("\nInterpretation:");
  console.log("- Faithfulness > 0.9: Response is well-grounded");
  console.log("- Answer Relevancy > 0.85: Response directly addresses query");
}

evaluateRAG().catch(console.error);
// Claim-based faithfulness evaluation with the OpenAI Node SDK (TypeScript variant of the Python example above)
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractClaims(query: string, response: string, model = "gpt-4o-mini"): Promise<string> {
  const preamble = `You are shown a prompt and a completion. Identify the main claims
in the completion. A claim is any sentence or part that expresses
a verifiable fact. Return a bullet list, one claim per line.
No explanations, just the claims.`;
  const prompt = `${preamble}

PROMPT: ${query}
COMPLETION: ${response}`;

  const completion = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0
  });
  return completion.choices[0].message.content || "";
}

async function assessClaims(query: string, claims: string, context: string, model = "gpt-4o-mini"): Promise<string> {
  const preamble = `You are shown a prompt, context, and list of claims.
Check which claims are supported by the context.
Return the list exactly as is, appending SUPPORTED=1 if supported,
SUPPORTED=0 if not. No explanations.`;
  const prompt = `${preamble}

PROMPT: ${query}
CONTEXT: ${context}
CLAIMS: ${claims}`;

  const completion = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0
  });
  return completion.choices[0].message.content || "";
}

function calculateFaithfulness(assessedClaims: string): number {
  const supported = (assessedClaims.match(/SUPPORTED=1/g) || []).length;
  const total = supported + (assessedClaims.match(/SUPPORTED=0/g) || []).length;
  return total > 0 ? supported / total : 0;
}

async function evaluateRAGResponse(query: string, response: string, retrievedContexts: string[]) {
  const claims = await extractClaims(query, response);
  console.log(`Extracted claims:\n${claims}\n`);

  const contextStr = retrievedContexts.join('\n');
  const assessed = await assessClaims(query, claims, contextStr);
  const faithfulness = calculateFaithfulness(assessed);
  console.log(`Faithfulness assessment:\n${assessed}`);
  console.log(`Faithfulness Score: ${faithfulness.toFixed(2)}\n`);

  return { faithfulness, claims, assessed };
}

const query = "How has Apple's total net sales changed over time?";
const response = `Apple's total net sales experienced a decline over the last year.
The three-month period ended July 1, 2023, saw total net sales of $81,797 million,
which was a 1% decrease from the same period in 2022.`;
const contexts = [
  "Products and Services Performance: Total net sales $81,797 million for July 1, 2023.",
  "Comparison: Total net sales $82,959 million for June 25, 2022."
];

evaluateRAGResponse(query, response, contexts)
  .then((results) => {
    if (results.faithfulness >= 0.9) {
      console.log("✓ High faithfulness: Response is well-grounded");
    } else if (results.faithfulness >= 0.7) {
      console.log("⚠ Moderate faithfulness: Some claims may need verification");
    } else {
      console.log("✗ Low faithfulness: Significant hallucination risk");
    }
  })
  .catch(console.error);
Critical Evaluation Mistakes
These pitfalls can invalidate your entire evaluation process, leading to false confidence in broken systems.
Self-Preference Bias : Using the same LLM for both RAG generation and evaluation introduces bias. The model favors its own writing style and may overlook factual errors. Solution : Always use a separate, strong evaluator model (GPT-4o, Claude 3.5 Sonnet).
Single-Score Obsession : Relying on aggregate metrics (e.g., “faithfulness = 0.85”) without examining claim-level breakdowns. You miss specific failure modes like numeric hallucinations or temporal errors. Solution : Analyze individual claims and failure patterns.
Ignoring Position Sensitivity : Critical evidence in the middle of long contexts often gets overlooked (the “Lost in the Middle” problem). Solution : Test retrieval quality across different context positions and chunk orders.
Metadata Bias : Evaluator LLMs can be swayed by source prestige or author names in contexts. Solution : Run counterfactual tests—swap high-prestige and low-prestige sources while keeping content identical.
Weak Evaluator Models : Using GPT-3.5 or Haiku for evaluation saves money but produces unreliable judgments. Evaluation is a complex reasoning task. Solution : Use frontier models (GPT-4o, Claude 3.5) for critical systems; validate cheaper models against human judgments.
No Versioning : Not tracking evaluation dataset versions, rubrics, and prompts makes it impossible to measure improvement. Solution : Git-track all evaluation artifacts and log every run.
Production Disconnect : Evaluating on synthetic data that doesn’t match production query distribution. Solution : Continuously sample production queries (with privacy safeguards) and add them to your evaluation set.
Skipping Human Audits : Blindly trusting LLM-judge outputs without periodic human review. Solution : For high-impact scenarios, have domain experts audit 5-10% of evaluations monthly.
RAG evaluation costs scale with dataset size and metric complexity. Here’s the current pricing landscape:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Use Case |
| --- | --- | --- | --- | --- |
| GPT-4o | $5.00 | $15.00 | 128K | High-accuracy evaluation |
| GPT-4o-mini | $0.15 | $0.60 | 128K | Cost-efficient evaluation |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | Balanced accuracy/cost |
| Haiku-3.5 | $1.25 | $5.00 | 200K | Fast, cheap evaluation |
Evaluation Cost Formula
Claim-based evaluation requires 2-3 LLM calls per query: extract claims, assess faithfulness, and optionally verify against gold answers. For a 1,000-query dataset:
GPT-4o-mini: ~$0.003 per query = $3 total
GPT-4o: ~$0.10 per query = $100 total
Claude 3.5 Sonnet: ~$0.06 per query = $60 total
Pricing data sourced from OpenAI and Anthropic (October-November 2024).
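To sanity-check these per-query figures, a small sketch; the token counts are assumptions chosen to roughly match the numbers above (three calls per query with about 5,000 input and 500 output tokens each):

# Back-of-the-envelope evaluation cost, using the per-1M-token prices from the table above
def eval_cost(queries: int, input_price: float, output_price: float,
              calls: int = 3, in_tokens: int = 5000, out_tokens: int = 500) -> float:
    per_call = (in_tokens * input_price + out_tokens * output_price) / 1_000_000
    return queries * calls * per_call

for name, inp, out in [("GPT-4o-mini", 0.15, 0.60), ("GPT-4o", 5.00, 15.00),
                       ("Claude 3.5 Sonnet", 3.00, 15.00)]:
    print(f"{name}: ${eval_cost(1000, inp, out):.2f} per 1,000 queries")
# GPT-4o-mini: ~$3.15, GPT-4o: ~$97.50, Claude 3.5 Sonnet: ~$67.50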
Cost Optimization Strategies :
Use GPT-4o-mini for 90% of evaluations : Only use GPT-4o for critical failure analysis.
Batch processing : Many frameworks support batch evaluation to reduce API overhead.
Early stopping : If faithfulness drops below 0.5, stop the evaluation and fix the pipeline.
Sampling : For large datasets, evaluate on a statistically significant sample (e.g., 100-200 queries).
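A minimal sketch combining the sampling and early-stopping strategies; test_set is a placeholder for your own query/response/context records, and evaluate_rag_response is the claim-based helper shown earlier in this guide:

import random

sample = random.sample(test_set, k=min(150, len(test_set)))   # evaluate a manageable subset
scores = []
for i, case in enumerate(sample, start=1):
    result = evaluate_rag_response(case["query"], case["response"], case["contexts"])
    scores.append(result["faithfulness"])
    if i >= 20 and sum(scores) / len(scores) < 0.5:   # early stop: pipeline is clearly broken
        print(f"Stopping early after {i} queries; fix the pipeline before spending more.")
        break

print(f"Mean faithfulness over {len(scores)} sampled queries: {sum(scores) / len(scores):.3f}")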
| Metric | What It Measures | Good Score | Red Flag | Action If Low |
| --- | --- | --- | --- | --- |
| Faithfulness | Claims supported by context | > 0.9 | < 0.7 | Reduce hallucinations: add context, lower temperature, better prompts |
| Answer Relevancy | Response addresses query | > 0.85 | < 0.7 | Improve retrieval: better embeddings, increase top-k, re-rank |
| Context Precision | Retrieved docs are relevant | > 0.8 | < 0.6 | Tune search: adjust chunk size, try hybrid search |
| Context Recall | All necessary docs retrieved | > 0.9 | < 0.75 | Expand search: more sources, better indexing |
| NDCG@3 | Ranking quality | > 0.7 | < 0.5 | Re-rank results, tune similarity thresholds |
| Fidelity | Query requirements met | > 0.7 | < 0.5 | Increase top-k, improve query understanding |
Faithfulness is your precision guardrail—measure it first to catch hallucinations.
Answer Relevance ensures you’re not missing critical information (recall).
Context Precision diagnoses retrieval problems before they contaminate generation.
Claim-based evaluation provides granular insight into specific failure modes.
Separate evaluator LLM is non-negotiable to avoid bias.
Cost-efficient evaluation is possible with GPT-4o-mini or Haiku-3.5.
Automate evaluation in CI/CD to catch regressions before production.
Further reading: Evals Hub — a complete guide to LLM evaluation strategies and frameworks.