Regression Testing for LLMs: Preventing Quality Drops

A major e-commerce company deployed a new version of their product description generator and saw a 15% drop in customer engagement within 48 hours. The culprit? A subtle change in their prompt template that caused the model to generate overly verbose descriptions. Their system had no regression tests to catch the issue before it hit production. This guide provides a battle-tested framework for building automated regression suites that prevent quality drops.

LLM applications are uniquely vulnerable to silent regressions. Unlike traditional software, where bugs typically cause crashes, LLM regressions manifest as gradual quality degradation: slightly less accurate answers, slower responses, or small formatting errors. The OpenAI Model Spec explicitly states that models should “Avoid factual, reasoning, and formatting errors” (OpenAI Model Spec), but without systematic testing you can’t verify that this holds across model versions.

The cost implications are significant. Running a full regression suite on GPT-4o can cost $50-200 per run depending on dataset size. However, catching a regression before production deployment is far cheaper than the customer churn and engineering time required to fix a live issue. Companies using automated evals report catching 85-95% of quality regressions before deployment (OpenAI Cookbook).

When a regression reaches production, the true cost extends beyond immediate fixes:

  • Customer trust erosion: Users lose confidence after repeated poor experiences
  • Emergency rollback overhead: Engineering teams scramble to revert changes
  • Opportunity cost: Time spent firefighting instead of building features
  • Model provider changes: Silent updates to model behavior can break your application overnight

LLM regression testing differs fundamentally from traditional software testing. You’re not checking for deterministic outputs, but rather for consistency across key quality dimensions.

A golden dataset is a curated collection of test cases that represent your application’s core functionality. Each entry typically contains:

  • Input: The messages/prompt sent to the model
  • Expected output: The ideal response (or criteria for evaluation)
  • Metadata: Tags for categorization, difficulty, or business criticality

There are three primary evaluation strategies for regression testing:

  1. Exact Match: For deterministic tasks (JSON generation, SQL queries, classification)
  2. Semantic Similarity: For creative tasks where phrasing matters less than meaning
  3. Model-Graded: For subjective quality (coherence, relevance, tone)

The OpenAI Cookbook demonstrates that model-graded evaluations using GPT-4 correlate strongly with human evaluators when using structured prompts (OpenAI Cookbook).
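Semantic similarity, the second strategy, is usually implemented with embeddings rather than a judge model: embed the expected and actual answers and compare them with cosine similarity. A minimal sketch using OpenAI’s text-embedding-3-small; the 0.85 threshold is an assumption you should tune against a small labeled sample of your own data:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def semantically_equivalent(expected: str, actual: str, threshold: float = 0.85) -> bool:
    """Treat two answers as equivalent when their cosine similarity clears the threshold."""
    a, b = embed(expected), embed(actual)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold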

Effective regression testing integrates at multiple stages:

  • Pre-commit: Fast, cheap tests on a small subset
  • Pre-deployment: Full suite on all models
  • Post-deployment: Continuous monitoring with canary traffic
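The pre-commit tier stays cheap if you run only a tagged slice of the goldens. A minimal sketch, assuming a hypothetical metadata.criticality field on each golden (one possible shape for the metadata column described above):

import json

def load_critical_goldens(path: str) -> list[dict]:
    """Load only the goldens tagged as business-critical for the fast pre-commit tier."""
    subset = []
    with open(path) as f:
        for line in f:
            golden = json.loads(line)
            # "metadata" and "criticality" are assumed field names, not part of the format shown later.
            if golden.get("metadata", {}).get("criticality") == "high":
                subset.append(golden)
    return subset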

Building an Automated Regression Test Suite


Let’s build a production-ready regression testing framework that you can integrate into your CI/CD pipeline immediately.

Create a golden_dataset.jsonl file with your core test cases:

{"id": "sql_001", "input": [{"role": "system", "content": "You are a SQL assistant. Generate valid SQL queries."}, {"role": "user", "content": "How many users signed up in the last 7 days?"}], "ideal": "SELECT COUNT(*) FROM users WHERE created_at >= NOW() - INTERVAL '7 days'"}
{"id": "json_002", "input": [{"role": "user", "content": "Extract product info from: 'New iPhone 15 Pro, $999, in stock'"}], "ideal": "{\"product\": \"iPhone 15 Pro\", \"price\": 999, \"in_stock\": true}"}
{"id": "classification_003", "input": [{"role": "user", "content": "Classify sentiment: 'This product is amazing!'"}], "ideal": "positive"}

Use the following Python framework to run your tests:

import json
import time
from typing import List, Dict, Any

from openai import OpenAI


class RegressionTester:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.results = []

    def load_goldens(self, path: str) -> List[Dict[str, Any]]:
        """Load golden dataset from JSONL file."""
        goldens = []
        with open(path, 'r') as f:
            for line in f:
                goldens.append(json.loads(line))
        return goldens

    def query_model(self, messages: List[Dict[str, str]]) -> tuple[str, float]:
        """Query model and return output with latency."""
        start_time = time.time()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.0
        )
        latency = (time.time() - start_time) * 1000  # ms
        return response.choices[0].message.content, latency

    def evaluate(self, golden: Dict[str, Any]) -> Dict[str, Any]:
        """Evaluate a single golden test case."""
        try:
            output, latency = self.query_model(golden["input"])
            # Exact match evaluation
            passed = output.strip() == golden["ideal"].strip()
            return {
                "id": golden["id"],
                "passed": passed,
                "latency_ms": latency,
                "expected": golden["ideal"],
                "actual": output,
                "error": None
            }
        except Exception as e:
            return {
                "id": golden["id"],
                "passed": False,
                "latency_ms": 0,
                "expected": golden["ideal"],
                "actual": None,
                "error": str(e)
            }

    def run(self, dataset_path: str) -> Dict[str, Any]:
        """Run full regression suite."""
        goldens = self.load_goldens(dataset_path)
        for golden in goldens:
            result = self.evaluate(golden)
            self.results.append(result)

        # Summary
        passed = sum(1 for r in self.results if r["passed"])
        total = len(self.results)
        avg_latency = sum(r["latency_ms"] for r in self.results) / total if total > 0 else 0
        summary = {
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "model": self.model,
            "total_tests": total,
            "passed": passed,
            "failed": total - passed,
            "accuracy": round((passed / total) * 100, 2),
            "avg_latency_ms": round(avg_latency, 2),
            "details": self.results
        }
        return summary

    def print_report(self, summary: Dict[str, Any]) -> None:
        """Print formatted report."""
        print(f"\n{'='*60}")
        print(f"Regression Test Report - {summary['timestamp']}")
        print(f"{'='*60}")
        print(f"Model: {summary['model']}")
        print(f"Total Tests: {summary['total_tests']}")
        print(f"Passed: {summary['passed']}")
        print(f"Failed: {summary['failed']}")
        print(f"Accuracy: {summary['accuracy']}%")
        print(f"Avg Latency: {summary['avg_latency_ms']}ms")
        print(f"{'='*60}\n")
        if summary['failed'] > 0:
            print("Failed Tests:")
            for result in self.results:
                if not result['passed']:
                    print(f"  - {result['id']}: {result['error'] or 'Mismatch'}")
            print()


if __name__ == "__main__":
    tester = RegressionTester()
    summary = tester.run("golden_dataset.jsonl")
    tester.print_report(summary)
    # Exit with error code if tests failed
    exit(0 if summary['failed'] == 0 else 1)

This implementation provides:

  • Error handling for API failures
  • Latency tracking to catch performance regressions
  • CI/CD-friendly output with exit codes
  • Detailed logging for debugging failures

Add this to your GitHub Actions workflow:

name: LLM Regression Tests
on: [push, pull_request]
jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install openai
      - name: Run regression tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          if ! python regression_tester.py; then
            echo "::error::Regression tests failed"
            exit 1
          fi
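Because every triggered run calls the paid API, many teams narrow the workflow trigger to pull requests against the main branch or a nightly schedule instead of every push.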

For subjective quality metrics, implement model-graded evaluation:

import json
from typing import List, Dict, Any

from openai import OpenAI


class ModelGradedEvaluator:
    def __init__(self, judge_model: str = "gpt-4o", target_model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.judge_model = judge_model
        self.target_model = target_model

    def grade_response(self, input_messages: List[Dict], expected: str, actual: str) -> Dict[str, Any]:
        """Use a judge model to grade the response quality."""
        grading_prompt = [
            {"role": "system", "content": "You are an expert evaluator. Grade the assistant's response based on correctness and quality. Return JSON with 'score' (0-100) and 'reasoning'."},
            {"role": "user", "content": f"Input: {json.dumps(input_messages)}\n\nExpected: {expected}\n\nActual: {actual}\n\nGrade this response."}
        ]
        response = self.client.chat.completions.create(
            model=self.judge_model,
            messages=grading_prompt,
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        return {
            "score": result.get("score", 0),
            "reasoning": result.get("reasoning", ""),
            "passed": result.get("score", 0) >= 80
        }

    def run_evaluation(self, dataset_path: str) -> Dict[str, Any]:
        """Run model-graded evaluation suite."""
        with open(dataset_path, 'r') as f:
            goldens = [json.loads(line) for line in f]

        results = []
        for golden in goldens:
            # Get actual output from target model
            actual_response = self.client.chat.completions.create(
                model=self.target_model,
                messages=golden["input"],
                temperature=0.0
            )
            actual = actual_response.choices[0].message.content

            # Grade with judge model
            grade = self.grade_response(golden["input"], golden["ideal"], actual)
            results.append({
                "id": golden["id"],
                "score": grade["score"],
                "passed": grade["passed"],
                "reasoning": grade["reasoning"],
                "actual": actual
            })

        avg_score = sum(r["score"] for r in results) / len(results)
        passed = sum(1 for r in results if r["passed"])
        return {
            "average_score": avg_score,
            "pass_rate": (passed / len(results)) * 100,
            "results": results
        }


if __name__ == "__main__":
    evaluator = ModelGradedEvaluator()
    summary = evaluator.run_evaluation("golden_dataset.jsonl")
    print(f"Average Score: {summary['average_score']:.2f}")
    print(f"Pass Rate: {summary['pass_rate']:.2f}%")

This approach uses GPT-4o to grade responses from GPT-4o-mini, providing a more nuanced evaluation than exact matching.

Track metrics over time to detect drift:

# Add to your regression tester
import pandas as pd

def generate_report(self, summary: Dict[str, Any]) -> None:
    """Generate a markdown report for tracking trends."""
    df = pd.DataFrame(summary['details'])
    failed_ids = df.loc[~df['passed'], 'id'].tolist()
    report = f"""
## Regression Test Report
- **Date**: {summary['timestamp']}
- **Model**: {summary['model']}
- **Accuracy**: {summary['accuracy']}%
- **Avg Latency**: {summary['avg_latency_ms']}ms
- **Total Tests**: {summary['total_tests']}
- **Failed Tests**: {', '.join(failed_ids) or 'none'}
"""
    with open('regression_report.md', 'w') as f:
        f.write(report)
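To turn these reports into an actual drift signal, one option is to keep the previous run’s summary as a baseline and fail when accuracy drops by more than a chosen margin. A minimal sketch, assuming the baseline lives in a hypothetical baseline_summary.json committed alongside the golden dataset, with a 2-point tolerance you should tune:

import json
import os

def check_against_baseline(summary: dict, baseline_path: str = "baseline_summary.json",
                           max_accuracy_drop: float = 2.0) -> bool:
    """Compare the current run to the stored baseline; return False on a regression."""
    if not os.path.exists(baseline_path):
        # First run: record the baseline and pass.
        with open(baseline_path, "w") as f:
            json.dump(summary, f, indent=2)
        return True

    with open(baseline_path) as f:
        baseline = json.load(f)

    drop = baseline["accuracy"] - summary["accuracy"]
    if drop > max_accuracy_drop:
        print(f"Accuracy dropped {drop:.2f} points from the baseline ({baseline['accuracy']}%).")
        return False
    return True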

For teams using TypeScript-based pipelines, here’s a production-ready implementation:

import OpenAI from 'openai';
import * as fs from 'fs';

// Reuse the SDK's message type so goldens can be passed straight to the API.
type ChatMessage = OpenAI.Chat.Completions.ChatCompletionMessageParam;

interface Golden {
  id: string;
  input: ChatMessage[];
  ideal: string;
}

interface TestResult {
  id: string;
  passed: boolean;
  latencyMs: number;
  expected: string;
  actual: string | null;
  error?: string;
}

interface TestSummary {
  timestamp: string;
  model: string;
  totalTests: number;
  passed: number;
  failed: number;
  accuracy: number;
  avgLatencyMs: number;
  details: TestResult[];
}

class RegressionTesterTS {
  private client: OpenAI;
  private model: string;
  private results: TestResult[];

  constructor(model: string = 'gpt-4o-mini') {
    this.client = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
    });
    this.model = model;
    this.results = [];
  }

  async loadGoldens(path: string): Promise<Golden[]> {
    const content = fs.readFileSync(path, 'utf-8');
    return content
      .split('\n')
      .filter((line) => line.trim())
      .map((line) => JSON.parse(line));
  }

  async queryModel(messages: ChatMessage[]): Promise<[string, number]> {
    const startTime = Date.now();
    const response = await this.client.chat.completions.create({
      model: this.model,
      messages: messages,
      temperature: 0.0,
    });
    const latency = Date.now() - startTime;
    return [response.choices[0].message.content || '', latency];
  }

  async evaluate(golden: Golden): Promise<TestResult> {
    try {
      const [output, latency] = await this.queryModel(golden.input);
      const passed = output.trim() === golden.ideal.trim();
      return {
        id: golden.id,
        passed,
        latencyMs: latency,
        expected: golden.ideal,
        actual: output,
      };
    } catch (error) {
      return {
        id: golden.id,
        passed: false,
        latencyMs: 0,
        expected: golden.ideal,
        actual: null,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }
  }

  async run(datasetPath: string): Promise<TestSummary> {
    const goldens = await this.loadGoldens(datasetPath);
    for (const golden of goldens) {
      const result = await this.evaluate(golden);
      this.results.push(result);
    }
    const passed = this.results.filter((r) => r.passed).length;
    const total = this.results.length;
    const avgLatency = this.results.reduce((sum, r) => sum + r.latencyMs, 0) / total;
    return {
      timestamp: new Date().toISOString(),
      model: this.model,
      totalTests: total,
      passed,
      failed: total - passed,
      accuracy: Math.round((passed / total) * 100),
      avgLatencyMs: Math.round(avgLatency),
      details: this.results,
    };
  }

  printReport(summary: TestSummary): void {
    console.log(`\n${'='.repeat(60)}`);
    console.log(`Regression Test Report - ${summary.timestamp}`);
    console.log(`${'='.repeat(60)}`);
    console.log(`Model: ${summary.model}`);
    console.log(`Total Tests: ${summary.totalTests}`);
    console.log(`Passed: ${summary.passed}`);
    console.log(`Failed: ${summary.failed}`);
    console.log(`Accuracy: ${summary.accuracy}%`);
    console.log(`Avg Latency: ${summary.avgLatencyMs}ms`);
    console.log(`${'='.repeat(60)}\n`);
    if (summary.failed > 0) {
      console.log('Failed Tests:');
      for (const result of this.results) {
        if (!result.passed) {
          console.log(`  - ${result.id}: ${result.error || 'Mismatch'}`);
        }
      }
      console.log();
    }
  }
}

async function main() {
  const tester = new RegressionTesterTS();
  const summary = await tester.run('golden_dataset.jsonl');
  tester.printReport(summary);
  process.exit(summary.failed === 0 ? 0 : 1);
}

main().catch(console.error);

Key features:

  • Type safety with TypeScript interfaces
  • Async/await for efficient API calls
  • Proper exit codes for CI/CD gates
  • Error boundaries with try/catch blocks

Usage in CI:

# Compile and run
npx ts-node regression-tester.ts
# Or compile to JS and run
tsc regression-tester.ts && node regression-tester.js

1. Using Only Exact Match for Subjective Tasks


Problem: Exact match fails when models produce semantically correct but differently phrased answers.
Solution: Use model-graded evaluation for creative tasks, exact match only for deterministic outputs.

Problem: A model that’s 10% less accurate but 5x faster might be preferable, but you can’t optimize what you don’t measure.
Solution: Always track latency alongside accuracy in your test suite.

Problem: Without version control, you can’t distinguish between model performance changes and test data changes.
Solution: Store golden datasets in git alongside your code and review changes in pull requests.

Problem: Model providers silently update model behavior, causing production regressions.
Solution: Implement post-deployment monitoring with canary traffic and continuous evaluation.
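One lightweight way to implement the canary piece is to sample a small fraction of live responses into a queue and grade them asynchronously with the same judge used offline. A sketch of the sampling side only; the 2% rate and the in-process queue are assumptions, and a real deployment would persist samples to durable storage:

import random
from queue import Queue

CANARY_RATE = 0.02  # grade roughly 2% of live traffic (assumed sampling rate)
evaluation_queue: Queue = Queue()

def maybe_queue_for_evaluation(request_id: str, messages: list, output: str) -> None:
    """Sample a small fraction of production responses for asynchronous grading."""
    if random.random() < CANARY_RATE:
        # A background worker can drain this queue and score items with ModelGradedEvaluator.
        evaluation_queue.put({"request_id": request_id, "input": messages, "output": output})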

5. Using Same Model for Generation and Evaluation


Problem: Introduces bias; the model may grade leniently on its own outputs.
Solution: Use a stronger model (e.g., GPT-4o) to grade weaker models (e.g., GPT-4o-mini).

Problem: Running full suites on expensive models can cost hundreds per run.
Solution: Use tiered testing:

  • Pre-commit: 10% sample on cheapest model
  • Pre-deployment: 100% on mid-tier model
  • Weekly: Full suite on best model
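A minimal sketch of the pre-commit sampling tier; the fixed seed is an assumption that keeps the subset stable between runs:

import random

def sample_goldens(goldens: list[dict], ratio: float = 0.10, seed: int = 42) -> list[dict]:
    """Draw a stable random subset of the golden dataset for the cheap pre-commit tier."""
    if not goldens:
        return []
    rng = random.Random(seed)
    k = max(1, int(len(goldens) * ratio))
    return rng.sample(goldens, k)

# Example wiring: fast tier on the cheapest model, full dataset reserved for pre-deployment.
# tester = RegressionTester(model="gpt-4o-mini")
# subset = sample_goldens(tester.load_goldens("golden_dataset.jsonl"))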

Problem: Without automated gates, teams ignore test results to ship faster.
Solution: Block deployments when accuracy drops below threshold (e.g., 95%) or latency exceeds SLA.

Problem: Golden datasets often miss adversarial inputs or rare scenarios.
Solution: Include 10-20% edge cases in your dataset: empty inputs, very long inputs, malformed JSON, etc.
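For example, a few edge-case entries in the same JSONL format as the golden dataset above; the ideal values here are illustrative rather than prescriptive:

{"id": "edge_001", "input": [{"role": "user", "content": ""}], "ideal": "Please provide a question or request so I can help."}
{"id": "edge_002", "input": [{"role": "user", "content": "Extract product info from: 'iPhone 15 Pro $999 {unclosed json"}], "ideal": "{\"error\": \"malformed input\"}"}

Entries like these usually need the model-graded evaluator rather than exact match, since there is no single correct response to an empty or malformed input.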

Problem: Prompt and model changes get tuned to pass specific test cases without generalizing.
Solution: Rotate 20% of test cases monthly and maintain a separate holdout validation set.

Problem: OpenAI/Anthropic may update models without notice, breaking your application.
Solution: Run regression tests weekly against production models to catch silent changes.

| Task Type | Example | Recommended Method | Why |
|---|---|---|---|
| SQL Generation | SELECT * FROM users | Exact Match | Syntax must be perfect |
| JSON Output | {"key": "value"} | Exact Match + Schema Validation | Structure is critical |
| Classification | sentiment: positive | Exact Match | Labels are discrete |
| Summarization | Article summary | Model-Graded (GPT-4) | Many valid summaries exist |
| Creative Writing | Marketing copy | Model-Graded + Human Review | Subjective quality |
| Code Generation | Python function | Exact Match (syntax) + Unit Tests | Must run correctly |
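For the “Exact Match + Schema Validation” row, one option is to validate the parsed output against a JSON Schema before comparing values. A sketch using the jsonschema package (an extra dependency not used elsewhere in this guide), with a schema matching the product-extraction golden above:

import json
from jsonschema import ValidationError, validate

PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["product", "price", "in_stock"],
}

def json_output_passes(actual: str, ideal: str) -> bool:
    """Check structure with a schema first, then compare the parsed values."""
    try:
        parsed = json.loads(actual)
        validate(parsed, PRODUCT_SCHEMA)
    except (json.JSONDecodeError, ValidationError):
        return False
    return parsed == json.loads(ideal)

Cost is the other major input when choosing a strategy; the script below estimates what a full-suite run costs across models.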
# Run this calculation before choosing your testing strategy
models = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},  # $ per 1M tokens
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "claude-3-5-haiku": {"input": 0.80, "output": 4.00},
}

# Example: 1000 tests, 500 tokens input, 200 tokens output each
tests = 1000
input_tokens = 500 * tests   # 500K tokens
output_tokens = 200 * tests  # 200K tokens

for model, pricing in models.items():
    cost = (input_tokens * pricing["input"] / 1e6) + (output_tokens * pricing["output"] / 1e6)
    print(f"{model}: ${cost:.2f} per full suite")
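With the assumptions in the script (1,000 tests, 500 input and 200 output tokens each), this works out to roughly $0.20 per run on gpt-4o-mini, $5.50 on gpt-4o, and $1.20 on claude-3-5-haiku at the listed prices. Whichever model you test on, set explicit warning and failure thresholds: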
| Metric | Warning Threshold | Failure Threshold |
|---|---|---|
| Accuracy | < 97% | < 95% |
| Latency (p95) | > 2000ms | > 5000ms |
| Cost per 1K requests | > $0.50 | > $1.00 |
| Error Rate | > 2% | > 5% |
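These thresholds are easy to enforce as a deployment gate on top of the RegressionTester summary. A minimal sketch that applies only the accuracy and latency failure thresholds; it uses average latency because the summary does not track p95, and the cost and error-rate rows need data the tester does not collect:

ACCURACY_FAIL = 95.0      # percent, failure threshold from the table above
LATENCY_FAIL_MS = 5000.0  # average latency stands in for p95 here

def deployment_gate(summary: dict) -> bool:
    """Return True only if the regression summary clears the failure thresholds."""
    if summary["accuracy"] < ACCURACY_FAIL:
        print(f"Gate failed: accuracy {summary['accuracy']}% is below {ACCURACY_FAIL}%")
        return False
    if summary["avg_latency_ms"] > LATENCY_FAIL_MS:
        print(f"Gate failed: avg latency {summary['avg_latency_ms']}ms exceeds {LATENCY_FAIL_MS}ms")
        return False
    return True

How many goldens you need before such gates are meaningful depends on where the application is in its lifecycle: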
| Application Stage | Test Cases | Rationale |
|---|---|---|
| Prototype | 10-20 | Quick feedback |
| Beta | 50-100 | Moderate coverage |
| Production | 200-500 | Comprehensive testing |
| Enterprise | 500+ | Full business logic coverage |
