
SLOs for AI Systems: Reliability Targets That Actually Matter

Your AI system returned 99.9% successful responses last month—but your customers were furious. The difference between technical uptime and user-perceived reliability is where most AI SLOs fail. This guide will teach you how to design SLOs that capture what users actually care about, select SLIs that predict problems before they escalate, and manage error budgets that balance reliability with cost.

AI systems introduce failure modes that traditional SLOs cannot capture. A 200ms response with a hallucinated answer is worse than a 500ms response with accurate information. Your SLOs must reflect this reality.

Most teams default to “99.9% uptime” and call it done. This creates three critical problems:

  1. False confidence: Your dashboards show green while users experience degraded quality
  2. Budget misallocation: You optimize for latency when the real problem is accuracy
  3. Escalating costs: You throw compute at problems that aren’t actually compute-related

Consider a real-world scenario: A customer support chatbot achieved 99.95% uptime last quarter. However, 3% of its responses contained factual errors. The engineering team spent $12,000/month on additional GPU capacity to reduce latency from 800ms to 600ms—while the accuracy problem drove a 15% increase in human agent escalations, costing $45,000 in support overhead.

Traditional SLOs measure:

  • Request success rate
  • Response latency
  • System availability

AI systems also need:

  • Output quality: Accuracy, relevance, hallucination rate
  • Consistency: Same input produces same quality output
  • Cost efficiency: Token burn per successful outcome
  • Fairness: Consistent performance across user segments

Service Level Indicators vs. Objectives vs. Agreements

SLI (Service Level Indicator): What you measure

  • Example: Percentage of responses with factual accuracy greater than 95%

SLO (Service Level Objective): Your target for that metric

  • Example: 99% of responses must have factual accuracy greater than 95% over a 28-day window

SLA (Service Level Agreement): Contractual promise to customers with penalties

  • Example: 95% accuracy SLA with 10% service credit if breached
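As a minimal sketch, the three layers can be captured in a single spec object; the dataclass below is a hypothetical structure for illustration, not a standard schema:

    from dataclasses import dataclass

    @dataclass
    class ServiceLevelSpec:
        # SLI: the measurement itself (fraction of responses passing the accuracy check)
        sli_name: str = "factual_accuracy_pass_rate"
        accuracy_threshold: float = 0.95   # a response "passes" at >= 95% factual accuracy
        # SLO: the internal target over a window
        slo_target: float = 0.99           # 99% of responses must pass
        slo_window_days: int = 28
        # SLA: the external, contractual promise (looser than the SLO on purpose)
        sla_target: float = 0.95
        sla_credit_pct: int = 10           # service credit owed if the SLA is breached

    spec = ServiceLevelSpec()
    assert spec.sla_target < spec.slo_target, "Keep the SLA looser than the SLO"

Keeping the SLA looser than the SLO gives you room to detect and fix problems internally before they become contractual breaches.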

Your error budget is the maximum allowable SLO violation before you must take action. For a 99% availability SLO, you have 1% error budget (7.2 hours of downtime per month).

For AI systems, you need multiple error budgets:

| Error Budget Type | Calculation | Action Trigger |
| --- | --- | --- |
| Quality | 1% of responses can be inaccurate | Escalate to model tuning |
| Latency | 5% of requests above the P95 target | Scale infrastructure |
| Cost | 10% over token budget | Implement caching |
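To make the budgets concrete, here is a minimal sketch of tracking each one independently; the class and the example numbers are illustrative, not part of any monitoring product:

    from dataclasses import dataclass

    @dataclass
    class ErrorBudget:
        name: str
        allowed_bad_fraction: float   # e.g. 0.01 means 1% of events may violate the SLI

        def consumed(self, bad_events: int, total_events: int) -> float:
            """Fraction of the budget used so far (1.0 = fully spent)."""
            if total_events == 0:
                return 0.0
            allowed_bad = self.allowed_bad_fraction * total_events
            return bad_events / allowed_bad

    budgets = {
        "quality": ErrorBudget("quality", 0.01),   # 1% of responses may be inaccurate
        "latency": ErrorBudget("latency", 0.05),   # 5% of requests may exceed the P95 target
        "cost":    ErrorBudget("cost", 0.10),      # 10% of requests may exceed the token budget
    }

    # Example: 1,000,000 requests, 4,000 failed the quality check
    # -> 4,000 / 10,000 allowed = 40% of the quality budget consumed
    print(f"{budgets['quality'].consumed(4_000, 1_000_000):.0%}")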

Step 1: Map User Journeys to Reliability Needs


Not all requests deserve the same SLO. Segment by user journey:

High-stakes queries (financial advice, medical diagnosis)

  • Accuracy: 99.9%
  • Latency: P95 less than 2s
  • Cost: Secondary concern

Low-stakes queries (summarization, brainstorming)

  • Accuracy: 90%
  • Latency: P95 less than 1s
  • Cost: Primary constraint

Interactive queries (chat, real-time assistance)

  • Accuracy: 95%
  • Latency: P95 less than 500ms
  • Cost: Moderate concern
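A simple way to encode these tiers is a lookup from journey type to targets; the sketch below uses the numbers above and a hypothetical JourneySLO type:

    from typing import TypedDict

    class JourneySLO(TypedDict):
        accuracy_target: float   # fraction of responses that must pass quality checks
        p95_latency_ms: int      # P95 latency budget for the journey
        cost_weight: str         # how strongly cost should constrain model choice

    JOURNEY_SLOS: dict[str, JourneySLO] = {
        "high_stakes": {"accuracy_target": 0.999, "p95_latency_ms": 2000, "cost_weight": "secondary"},
        "low_stakes":  {"accuracy_target": 0.90,  "p95_latency_ms": 1000, "cost_weight": "primary"},
        "interactive": {"accuracy_target": 0.95,  "p95_latency_ms": 500,  "cost_weight": "moderate"},
    }

    def slo_for(journey: str) -> JourneySLO:
        # Fall back to the strictest tier when the journey is unknown
        return JOURNEY_SLOS.get(journey, JOURNEY_SLOS["high_stakes"])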

Choose SLIs that are:

  1. Actionable: You can fix problems they detect
  2. Measurable: You can collect data without 100% sampling
  3. User-centric: They reflect actual user experience


AI reliability targets directly impact both user trust and operational costs. When your AI system fails to meet quality SLOs, users abandon the product, support costs spike, and you waste compute on responses that don’t solve problems. Conversely, over-provisioning for reliability that users don’t value burns through budget unnecessarily.

The pricing data reveals the stakes: OpenAI gpt-4o costs $5.00/$15.00 per 1M input/output tokens, while gpt-4o-mini costs just $0.150/$0.600 per 1M tokens, a cost gap of roughly 25x to 33x. If your SLOs don’t distinguish between high-value and low-value queries, you’ll either overspend on simple tasks or under-serve critical ones.

Start with these three foundational SLIs that work across most AI systems:

1. Quality Score SLI

SLI: quality_score
Description: Percentage of responses that pass automated quality checks
Calculation: (good_responses / total_responses) × 100
Good response: Score ≥ threshold on accuracy, relevance, and safety checks
Target: 95% over 7-day rolling window
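A back-of-the-envelope version of this calculation, assuming you already log a pass/fail flag and timestamp per response:

    from datetime import datetime, timedelta

    def quality_sli(results: list[tuple[datetime, bool]], window_days: int = 7) -> float:
        """Percentage of responses in the rolling window that passed quality checks."""
        cutoff = datetime.utcnow() - timedelta(days=window_days)  # timestamps assumed naive UTC
        recent = [passed for ts, passed in results if ts >= cutoff]
        if not recent:
            return 100.0
        return 100.0 * sum(recent) / len(recent)

    # 950 good responses out of 1,000 in the window -> 95.0, right at the target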

2. Latency SLI

SLI: response_latency
Description: Time from request to first token delivered
Calculation: P95 of end-to-end latency
Target: P95 less than 2 seconds as a general baseline; tighten per journey (for example, less than 500ms for interactive chat, per the tiers above)
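Computing the P95 from recorded time-to-first-token samples is straightforward with the nearest-rank method; this sketch assumes latencies are already collected in milliseconds:

    import math

    def p95_latency_ms(latencies_ms: list[float]) -> float:
        """Nearest-rank P95 of time-to-first-token, in milliseconds."""
        if not latencies_ms:
            return 0.0
        ordered = sorted(latencies_ms)
        rank = math.ceil(0.95 * len(ordered)) - 1
        return ordered[rank]

    # p95_latency_ms([420, 510, 380, 2900, 460]) -> compare against the 2,000 ms baseline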

3. Cost Efficiency SLI

SLI: cost_per_outcome
Description: Token cost per successful user outcome
Calculation: total_tokens_used / successful_resolutions
Target: less than $0.01 per successful interaction
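And for cost efficiency, a sketch that plugs in the gpt-4o-mini prices from the pricing table at the end of this guide; what counts as a "successful resolution" is up to your product:

    # gpt-4o-mini pricing per 1M tokens (see the pricing table below)
    INPUT_PRICE = 0.150 / 1_000_000
    OUTPUT_PRICE = 0.600 / 1_000_000

    def cost_per_outcome(input_tokens: int, output_tokens: int, successful_resolutions: int) -> float:
        """Dollar cost per successful user outcome; returns inf when nothing succeeded."""
        total_cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
        if successful_resolutions == 0:
            return float("inf")
        return total_cost / successful_resolutions

    # 10M input + 2M output tokens across 5,000 resolved conversations:
    # ($1.50 + $1.20) / 5,000 = $0.00054 per outcome, well under the $0.01 target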

Use these metrics to populate your SLIs:

| Metric | Source | Collection Method |
| --- | --- | --- |
| ai.response.quality_score | Model output evaluation | Automated LLM-as-judge |
| ai.response.time_to_first_token | Application logs | Instrumentation |
| ai.usage.input_tokens | API usage logs | Provider webhooks |
| ai.usage.output_tokens | API usage logs | Provider webhooks |
| ai.errors.hallucination_rate | Human review sample | Manual audit pipeline |
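For the LLM-as-judge row, a minimal grader might look like the sketch below; it assumes the OpenAI Python client, and the judge prompt, model choice, and 0-1 scoring scale are placeholders to adapt:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = (
        "You are grading an AI assistant's answer. "
        "Reply with only a number from 0 to 1 for factual accuracy and relevance."
    )

    def judge_quality(question: str, answer: str) -> float:
        """Ask a judge model for a 0-1 quality score; prompt and model are hypothetical choices."""
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # cheap judge model; swap for your own
            messages=[
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        content = completion.choices[0].message.content or ""
        try:
            return max(0.0, min(1.0, float(content.strip())))
        except ValueError:
            return 0.0  # an unparsable grade counts as a failed check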

Define clear thresholds for each budget:

| Budget Type | Warning (80% consumed) | Critical (100% consumed) | Emergency (120% consumed) |
| --- | --- | --- | --- |
| Quality | Review evaluation prompts | Rollback model version | Disable feature |
| Latency | Add capacity | Enable caching | Rate limit |
| Cost | Optimize prompts | Switch to cheaper model | Disable non-critical features |
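Wiring those thresholds into code can be as simple as mapping budget consumption to an escalation level; the action strings below mirror the table and are illustrative, not a standard policy:

    def budget_action(budget_name: str, consumed: float) -> str:
        """Map budget consumption (1.0 = fully spent) to an escalation level."""
        actions = {
            "quality": ("review evaluation prompts", "rollback model version", "disable feature"),
            "latency": ("add capacity", "enable caching", "rate limit"),
            "cost":    ("optimize prompts", "switch to cheaper model", "disable non-critical features"),
        }
        warning, critical, emergency = actions[budget_name]
        if consumed >= 1.2:
            return f"EMERGENCY: {emergency}"
        if consumed >= 1.0:
            return f"CRITICAL: {critical}"
        if consumed >= 0.8:
            return f"WARNING: {warning}"
        return "within budget"

    # budget_action("cost", 1.05) -> "CRITICAL: switch to cheaper model"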

Here’s a reference SLO monitoring implementation using Datadog’s Python client; the quality evaluation is deliberately simplified:

import time
from typing import Dict, List

from datadog import initialize, api


class AISLOMonitor:
    def __init__(self, api_key: str, app_key: str):
        initialize(api_key=api_key, app_key=app_key)
        self.quality_threshold = 0.95
        self.latency_threshold_ms = 2000
        self.max_cost_per_query = 0.01

    def evaluate_response_quality(self, response: Dict) -> float:
        """
        Evaluate response quality using automated checks.
        Returns score between 0 and 1.
        """
        score = 1.0
        text = response.get("text", "")
        # Check for hallucinations (simplified)
        if "I don't know" in text:
            score -= 0.3
        # Check for relevance
        if len(text) < 10:
            score -= 0.2
        # Check for safety violations
        if any(word in text.lower() for word in ["harm", "illegal", "dangerous"]):
            score = 0
        return max(0, score)

    def record_metrics(self, response: Dict, tokens_used: Dict):
        """Record SLI metrics to Datadog."""
        # Quality metric
        quality_score = self.evaluate_response_quality(response)
        api.Metric.send(
            metric="ai.response.quality_score",
            points=[(time.time(), quality_score)],
            tags=[f"model:{response['model']}", f"endpoint:{response['endpoint']}"],
        )

        # Latency metric
        latency = response.get("latency_ms", 0)
        api.Metric.send(
            metric="ai.response.latency",
            points=[(time.time(), latency)],
            tags=[f"model:{response['model']}"],
        )

        # Cost metric (gpt-4o pricing: $5.00 input / $15.00 output per 1M tokens)
        input_cost = (tokens_used["input"] / 1_000_000) * 5.00
        output_cost = (tokens_used["output"] / 1_000_000) * 15.00
        total_cost = input_cost + output_cost
        api.Metric.send(
            metric="ai.usage.cost_per_query",
            points=[(time.time(), total_cost)],
            tags=[f"model:{response['model']}"],
        )

        return {
            "quality": quality_score,
            "latency": latency,
            "cost": total_cost,
        }

    def check_slo_breaches(self, metrics: Dict) -> List[str]:
        """Check if any SLOs are being breached."""
        breaches = []
        if metrics["quality"] < self.quality_threshold:
            breaches.append(
                f"Quality SLO breached: {metrics['quality']:.2%} < {self.quality_threshold:.2%}"
            )
        if metrics["latency"] > self.latency_threshold_ms:
            breaches.append(
                f"Latency SLO breached: {metrics['latency']}ms > {self.latency_threshold_ms}ms"
            )
        if metrics["cost"] > self.max_cost_per_query:
            breaches.append(
                f"Cost SLO breached: ${metrics['cost']:.4f} > ${self.max_cost_per_query:.4f}"
            )
        return breaches


# Usage example
if __name__ == "__main__":
    monitor = AISLOMonitor(api_key="your_api_key", app_key="your_app_key")

    # Simulate a response
    response = {
        "model": "gpt-4o",
        "endpoint": "/chat",
        "text": "The capital of France is Paris.",
        "latency_ms": 450,
    }
    tokens = {"input": 150, "output": 25}

    # Record and check
    metrics = monitor.record_metrics(response, tokens)
    breaches = monitor.check_slo_breaches(metrics)

    if breaches:
        print("SLO Breaches detected:")
        for breach in breaches:
            print(f"  - {breach}")
    else:
        print("All SLOs within targets")

Pitfall: Tracking requests_per_second and error_rate while ignoring quality. Impact: System shows “green” while users experience failures. Fix: Always pair technical metrics with quality indicators.

Pitfall: One error budget for latency, quality, and cost combined. Impact: Can’t identify which dimension is failing. Fix: Separate budgets with independent action triggers.

Pitfall: SLOs that don’t account for variable pricing across models. Impact: Costs spiral as usage grows. Fix: Include cost_per_successful_outcome as a primary SLI.

Pitfall: Fixed latency targets regardless of query complexity. Impact: Impossible to meet for complex reasoning tasks. Fix: Use dynamic thresholds based on input complexity.

Pitfall: Only evaluating responses that complete successfully. Impact: Missing failures in slow or rejected responses. Fix: Include all requests in SLI calculations, even timeouts.
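For the dynamic-threshold fix above, one approach is to scale the latency budget with input size; the tier boundaries in this sketch are illustrative assumptions, not recommendations:

    def latency_budget_ms(input_tokens: int, base_ms: int = 500) -> int:
        """Scale the P95 latency budget with prompt size; constants are illustrative."""
        if input_tokens <= 500:
            return base_ms          # short prompts: interactive budget
        if input_tokens <= 4_000:
            return base_ms * 4      # medium prompts: 2s budget
        return base_ms * 10         # long-context or multi-step reasoning: 5s budget

    # latency_budget_ms(120) -> 500, latency_budget_ms(12_000) -> 5000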

Example SLO configurations for common AI workloads:

# Customer Support Chatbot
quality: 95% accuracy over 7 days
latency: P95 < 1.5s
cost: < $0.005 per conversation
error_budget: 5% quality, 10% latency, 15% cost

# Code Generation Assistant
quality: 90% compilable output
latency: P95 < 3s
cost: < $0.02 per generation
error_budget: 10% quality, 5% latency, 10% cost

# Content Moderation
quality: 99.5% accuracy
latency: P95 < 500ms
cost: < $0.001 per check
error_budget: 0.5% quality, 20% latency, 5% cost

Before deploying your SLOs, verify that they are:

  • Measurable: Can you collect data without 100% sampling?
  • Actionable: Can you fix problems they detect?
  • User-centric: Do they reflect actual user experience?
  • Independent: Do you have separate budgets for quality, latency, and cost?
  • Cost-aware: Do you track token costs per successful outcome?

SLO calculator (requirements → targets)

An interactive widget accompanying this guide lets readers turn reliability requirements into concrete SLO targets and compare estimated monthly costs across models such as Anthropic claude-3-5-sonnet, OpenAI gpt-4o-mini, and Anthropic haiku-3.5.

Effective AI SLOs bridge the gap between technical metrics and user-perceived reliability. This guide demonstrated how to:

  1. Design user-centric SLOs that measure quality alongside uptime
  2. Select actionable SLIs for accuracy, latency, and cost efficiency
  3. Implement multi-budget error tracking with independent thresholds
  4. Avoid common pitfalls like measuring only technical metrics or ignoring token costs

The core principle: AI reliability targets must reflect what users actually experience. A system that responds 100% of the time with incorrect information is functionally down. By implementing separate quality, latency, and cost error budgets, you gain the visibility needed to prioritize engineering effort where it matters most.

Reference pricing (per 1M tokens) for the models discussed in this guide:

| Model | Provider | Input Cost/1M | Output Cost/1M | Context Window |
| --- | --- | --- | --- | --- |
| claude-3-5-sonnet | Anthropic | $3.00 | $15.00 | 200,000 tokens |
| haiku-3.5 | Anthropic | $1.25 | $5.00 | 200,000 tokens |
| gpt-4o | OpenAI | $5.00 | $15.00 | 128,000 tokens |
| gpt-4o-mini | OpenAI | $0.150 | $0.600 | 128,000 tokens |
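To make the table concrete, here is a rough sketch of estimating monthly spend across these models for a hypothetical workload:

    # Prices per 1M tokens, taken from the table above
    PRICING = {
        "claude-3-5-sonnet": (3.00, 15.00),
        "haiku-3.5":         (1.25, 5.00),
        "gpt-4o":            (5.00, 15.00),
        "gpt-4o-mini":       (0.150, 0.600),
    }

    def monthly_cost(model: str, input_millions: float, output_millions: float) -> float:
        """Estimated monthly spend in dollars for a given token volume."""
        input_price, output_price = PRICING[model]
        return input_millions * input_price + output_millions * output_price

    # A workload of 50M input and 10M output tokens per month:
    for model in PRICING:
        print(f"{model}: ${monthly_cost(model, 50, 10):,.2f}")
    # gpt-4o: $400.00 vs gpt-4o-mini: $13.50, the roughly 30x gap in practice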