Alerting Strategies for AI Systems: When to Page

A Series A startup discovered their production LLM service had been returning 500 errors for six hours on a Saturday night. Their dashboards looked healthy and nobody was paged: they were alerting on raw error counts rather than error rates, and weekend traffic was low enough that the counts never crossed the threshold. By Monday morning, they had lost 12,000 customer interactions and faced a wave of churn. This guide provides battle-tested alerting strategies designed specifically for AI systems, so you never face a similar surprise.

Traditional application monitoring fails for AI systems because LLMs have unique failure modes: hallucinations, rate limiting, token budget exhaustion, and queue-based latency spikes. A 2% error rate in a standard web API might be acceptable, but in an LLM service with a 99.9% SLO it is a 20x burn rate, consuming about 2% of your monthly error budget every 43 minutes and exhausting the entire budget in roughly 36 hours.

The cost implications are severe. According to Google Cloud's alerting documentation, alerting policies are billed per condition, so a poorly designed setup with 50 separate alerts can cost thousands per month, while a consolidated approach cuts that bill by roughly 70% and improves detection accuracy. Token spend compounds the problem: with Claude 3.5 Sonnet at $3.00/$15.00 per 1M input/output tokens, a single misconfigured alert that lets a retry storm run unnoticed can burn through hundreds of dollars before engineers respond.
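
To make that concrete, here is a rough back-of-envelope sketch of the retry-storm scenario. The request rate, retry count, and token counts per call are illustrative assumptions, not measurements; only the Claude 3.5 Sonnet prices come from the paragraph above.

```python
# Back-of-envelope cost of a retry storm, using the Claude 3.5 Sonnet
# pricing quoted above ($3 / $15 per 1M input / output tokens).
# Request sizes and retry counts are illustrative assumptions.

INPUT_PRICE_PER_M = 3.00     # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 15.00   # USD per 1M output tokens

def retry_storm_cost(requests_per_min: float, retries_per_request: int,
                     input_tokens: int, output_tokens: int,
                     duration_min: float) -> float:
    """Estimate the extra spend caused by automatic retries alone."""
    extra_calls = requests_per_min * retries_per_request * duration_min
    cost_per_call = (input_tokens * INPUT_PRICE_PER_M
                     + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return extra_calls * cost_per_call

# 200 requests/min retried 3 times each, 2k input / 500 output tokens per call,
# for the 45 minutes it takes someone to respond to a page:
print(f"${retry_storm_cost(200, 3, 2_000, 500, 45):,.2f}")  # ~= $364.50
```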

SLO-based alerting transforms how you detect AI system issues by focusing on user impact rather than internal metrics. Instead of asking “is the error count high?”, you ask “are we consuming our error budget faster than expected?”

The Multi-Window, Multi-Burn-Rate Strategy

This approach, pioneered by Google SRE, validates every alert against two time windows: a long window that confirms a meaningful share of the error budget has actually been burned, and a short window that confirms the burn is still happening right now. Requiring both to exceed the threshold prevents false positives from transient spikes while keeping detection fast for real incidents.

For a 99.9% SLO (a 0.1% error budget over a 30-day window), here's how burn rates translate to budget consumption:

| Budget consumed | Time to exhaust budget | Burn rate | Alert severity |
| --- | --- | --- | --- |
| 2% in 1 hour | 50 hours | 14.4x | Page (immediate) |
| 5% in 6 hours | 120 hours | 6x | Page (urgent) |
| 10% in 3 days | 30 days | 1x | Ticket (business hours) |

These thresholds ensure you page on-call engineers only when the incident poses immediate budget risk, while lower-severity issues generate tickets for business-hours response.
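If you want to verify these numbers yourself, the arithmetic is short. The sketch below assumes the standard 30-day budget window used throughout this guide:

```python
# Arithmetic behind the table above: 99.9% SLO, 30-day (720-hour) budget window.
SLO = 0.999
ERROR_BUDGET = 1 - SLO              # 0.001
WINDOW_HOURS = 30 * 24              # 720

def burn_rate(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate needed to consume `budget_fraction` of the budget in `alert_window_hours`."""
    return budget_fraction * WINDOW_HOURS / alert_window_hours

def hours_to_exhaust(rate: float) -> float:
    """How long the full monthly budget lasts at a given burn rate."""
    return WINDOW_HOURS / rate

for budget, hours, severity in [(0.02, 1, "page"), (0.05, 6, "page"), (0.10, 72, "ticket")]:
    rate = burn_rate(budget, hours)
    print(f"{budget:.0%} of budget in {hours}h -> {rate:.1f}x burn, "
          f"exhausted in {hours_to_exhaust(rate):.0f}h ({severity})")

# A 14.4x burn corresponds to a 1.44% error rate against the 0.1% budget:
print(f"{14.4 * ERROR_BUDGET:.2%}")
```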

Alerting on raw, unaggregated metrics creates three critical problems:

  1. Cardinality explosion: Individual model endpoints generate millions of unique metric series, multiplying alerting costs exponentially
  2. False positives: Single request failures trigger alerts even when automatic retries succeed
  3. Poor detection characteristics: Fixed-duration clauses (e.g., "for: 1h") ignore severity, so a catastrophic 50% error rate must persist just as long as a marginal 1% error rate before anyone is paged, and a severe but short-lived spike may never alert at all

The recommended approach aggregates metrics by service and uses burn rate calculations that account for both severity and duration.
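As a sketch of what service-level aggregation can look like before any alert rule is evaluated, the snippet below sums per-endpoint counters into one error rate per service; the endpoint names and counts are made up for illustration:

```python
# Minimal sketch of service-level aggregation: sum request and error counts
# across all endpoints before computing a single error rate per service,
# rather than alerting on each endpoint individually.
from collections import defaultdict
from typing import Dict, Tuple

def aggregate_error_rates(
    endpoint_counts: Dict[str, Tuple[int, int]]  # endpoint -> (errors, requests)
) -> Dict[str, float]:
    """Collapse per-endpoint counters into one error rate per service."""
    totals = defaultdict(lambda: [0, 0])
    for endpoint, (errors, requests) in endpoint_counts.items():
        service = endpoint.split("/")[0]   # e.g. "chat/claude-3-5-sonnet" -> "chat"
        totals[service][0] += errors
        totals[service][1] += requests
    return {svc: errs / reqs if reqs else 0.0 for svc, (errs, reqs) in totals.items()}

# Illustrative per-endpoint counters:
counts = {
    "chat/claude-3-5-sonnet": (12, 9_000),
    "chat/gpt-4o-mini": (3, 6_000),
    "embeddings/text-embed": (0, 20_000),
}
print(aggregate_error_rates(counts))  # {'chat': 0.001, 'embeddings': 0.0}
```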

These alerts detect when your LLM service is violating SLO targets. The key is measuring error rates against your error budget, not absolute thresholds.

Recommended thresholds for 99.9% SLO:

  • Page: 2% budget consumed in 1 hour (14.4x burn rate)
  • Page: 5% budget consumed in 6 hours (6x burn rate)
  • Ticket: 10% budget consumed in 3 days (1x burn rate)

For systems using inference servers like JetStream, queue metrics provide early warning before latency degrades. Google Cloud’s best practices recommend monitoring:

  • jetstream_prefill_backlog_size: Number of requests waiting for prefill (latency signal)
  • jetstream_slots_used_percentage: Percentage of decode slots in use (throughput signal)

A prefill backlog above 5 requests that keeps growing indicates impending latency problems, while slot utilization above 95% means you are at capacity and should scale immediately.
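
A minimal sketch of how those two signals might be combined into an early-warning check is shown below. It assumes the metric values have already been scraped from your monitoring backend; the function and thresholds mirror the guidance above and are not an official JetStream API:

```python
# Sketch of an early-warning check built on the two JetStream signals above.
# The metric values are assumed to have been scraped already; how you fetch
# them depends on your monitoring backend.
from typing import List

def queue_pressure_alerts(backlog_samples: List[float],
                          slots_used_pct: float) -> List[str]:
    """Return warnings based on prefill backlog trend and decode slot usage."""
    alerts = []
    if len(backlog_samples) >= 2:
        latest = backlog_samples[-1]
        growing = backlog_samples[-1] > backlog_samples[0]
        # Backlog > 5 requests and still growing: latency will degrade soon.
        if latest > 5 and growing:
            alerts.append("jetstream_prefill_backlog_size > 5 and rising: "
                          "latency degradation imminent")
    # Decode slots > 95% used: at capacity, scale now.
    if slots_used_pct > 95:
        alerts.append("jetstream_slots_used_percentage > 95%: scale immediately")
    return alerts

# Example: backlog grew from 2 to 8 over the last few scrapes, slots at 97%.
print(queue_pressure_alerts([2, 4, 6, 8], 97.0))
```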

TPU High Bandwidth Memory (HBM) usage is the primary bottleneck for LLM inference. Monitor:

  • HBM usage greater than 85% for 10+ minutes: Scale up soon
  • HBM usage greater than 95% for 5 minutes: Scale up immediately
  • Token rate spike greater than 1.5x 7-day average: Investigate for runaway requests or traffic anomalies

Given LLM costs ($3-15 per 1M tokens), sudden traffic spikes can create bill shock. Alert on token consumption rates that exceed historical baselines by significant margins.
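
One way to express that baseline comparison, sketched under the assumption that you track hourly token totals, is shown below. The 1.5x spike factor matches the threshold above; the pricing default is an assumption you should replace with your own model mix:

```python
# Sketch of a token-consumption anomaly check against a trailing 7-day
# baseline, per the 1.5x threshold above. Spike factor and pricing are
# assumptions; adjust for your own traffic and model mix.
from statistics import mean
from typing import List, Optional

def token_rate_anomaly(hourly_tokens_7d: List[float],
                       current_hour_tokens: float,
                       spike_factor: float = 1.5,
                       cost_per_m_tokens: float = 15.0) -> Optional[str]:
    """Flag token throughput that exceeds the 7-day hourly average by `spike_factor`."""
    if not hourly_tokens_7d:
        return None
    baseline = mean(hourly_tokens_7d)
    if current_hour_tokens > spike_factor * baseline:
        excess_cost = (current_hour_tokens - baseline) / 1_000_000 * cost_per_m_tokens
        return (f"Token rate {current_hour_tokens / baseline:.1f}x baseline; "
                f"~${excess_cost:,.2f}/hour of unexpected spend")
    return None

# 7-day average of ~4M tokens/hour, suddenly seeing 10M this hour:
print(token_rate_anomaly([4e6] * 168, 10e6))
```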

Practical Implementation: Building Your Alerting System

  1. Define your SLO targets

    Choose realistic availability targets based on user expectations. For consumer-facing chatbots, 99.9% is common. For critical financial applications, 99.95% might be required. Document your error budget: 99.9% = 43 minutes of downtime per month.

  2. Calculate burn rate thresholds

    For your chosen SLO, compute the burn rates that consume 2%, 5%, and 10% of budget in your target windows. Use the formula: burn_rate = (error_rate / error_budget). For 99.9% SLO (0.001 budget), a 1.44% error rate = 14.4x burn rate.

  3. Consolidate alerts using aggregation

    Instead of separate policies per model endpoint, create one policy per service that aggregates across all endpoints. This reduces per-condition costs by 70% and provides a unified view.

  4. Implement dual-window validation

    Configure both a long and a short window for each threshold. The long window (1 hour to 3 days) confirms that a meaningful share of the budget has burned, while the short window (typically 1/12 of the long window, 5 minutes to 6 hours) confirms the burn is still happening, so alerts clear quickly once the issue is resolved.

  5. Set up metamonitoring

    Monitor your monitoring system. Ensure alert delivery is working and that alert silence windows are properly configured. Test your on-call rotation monthly.

  6. Tune thresholds with historical data

    Run your alerting rules in “monitoring mode” for 2-4 weeks before enabling paging. Adjust thresholds based on your actual traffic patterns and error rates.
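
For step 6, "monitoring mode" can be as simple as replaying historical error-rate samples against candidate thresholds and counting how many pages would have fired, without routing anything to on-call. The sketch below assumes hourly samples and the page thresholds from the table earlier; the sample data is illustrative:

```python
# Dry run for threshold tuning: count how many samples *would* have paged
# at each candidate burn-rate threshold. Thresholds and data are illustrative.
from typing import Dict, List, Optional

def dry_run_pages(error_rates: List[float],
                  error_budget: float = 0.001,
                  thresholds: Optional[Dict[str, float]] = None) -> Dict[str, int]:
    """Count samples whose burn rate exceeds each candidate threshold."""
    thresholds = thresholds or {"page_1h": 14.4, "page_6h": 6.0}
    counts = {name: 0 for name in thresholds}
    for rate in error_rates:
        burn = rate / error_budget
        for name, threshold in thresholds.items():
            if burn >= threshold:
                counts[name] += 1
    return counts

# Two weeks of hourly error-rate samples: mostly healthy, a few spikes.
history = [0.0004] * 300 + [0.02] * 3 + [0.008] * 33
print(dry_run_pages(history))  # {'page_1h': 3, 'page_6h': 36}
```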

The following production-ready example demonstrates how to implement the multi-window, multi-burn-rate strategy for an LLM service.

import time
from typing import Any, Dict, List, Tuple

import numpy as np


class SLOAlertEngine:
    """
    Production-ready SLO-based alerting engine implementing multi-window,
    multi-burn-rate alerting as recommended by Google SRE best practices.

    Key design decisions:
    - Uses burn rate calculations to detect significant budget consumption
    - Implements dual-window validation to reduce false positives
    - Supports both page and ticket severity levels
    """

    def __init__(self, slo_target: float = 0.999, budget_window_days: int = 30):
        self.slo_target = slo_target
        self.error_budget = 1.0 - slo_target  # e.g., 0.001 for 99.9% SLO
        self.budget_window_seconds = budget_window_days * 24 * 3600
        # Recommended thresholds from the Google SRE workbook
        self.page_thresholds = [
            (3600, 300, 14.4),   # 1h long, 5m short, 14.4x burn rate (2% budget)
            (21600, 1800, 6.0),  # 6h long, 30m short, 6x burn rate (5% budget)
        ]
        self.ticket_threshold = (259200, 21600, 1.0)  # 3d long, 6h short, 1x burn rate (10% budget)

    def calculate_burn_rate(self, error_rate: float) -> float:
        """Calculate burn rate from current error rate."""
        return error_rate / self.error_budget

    def evaluate_alert(self, metrics_history: List[Tuple[float, float]],
                       current_time: float) -> Dict[str, Any]:
        """
        Evaluate whether to trigger an alert based on historical metrics.

        Args:
            metrics_history: List of (timestamp, error_rate) tuples
            current_time: Current timestamp

        Returns:
            Dictionary with alert decision and details
        """
        if not metrics_history:
            return {"alert": False, "reason": "No data"}

        # Filter metrics within our maximum window (3 days)
        max_window = max(self.page_thresholds[0][0], self.page_thresholds[1][0],
                         self.ticket_threshold[0])
        cutoff_time = current_time - max_window
        recent_metrics = [(t, e) for t, e in metrics_history if t >= cutoff_time]
        if not recent_metrics:
            return {"alert": False, "reason": "Insufficient recent data"}

        # Check each page threshold
        for long_window, short_window, burn_rate in self.page_thresholds:
            if self._check_dual_window(recent_metrics, current_time,
                                       long_window, short_window, burn_rate):
                return {
                    "alert": True,
                    "severity": "page",
                    "burn_rate": burn_rate,
                    "budget_consumed": f"{(burn_rate * long_window / self.budget_window_seconds * 100):.2f}%",
                }

        # Check the ticket threshold
        long_w, short_w, burn_rate = self.ticket_threshold
        if self._check_dual_window(recent_metrics, current_time,
                                   long_w, short_w, burn_rate):
            return {
                "alert": True,
                "severity": "ticket",
                "burn_rate": burn_rate,
                "budget_consumed": f"{(burn_rate * long_w / self.budget_window_seconds * 100):.2f}%",
            }

        return {"alert": False, "reason": "Within SLO thresholds"}

    def _check_dual_window(self, metrics: List[Tuple[float, float]],
                           current_time: float, long_window: int,
                           short_window: int, burn_rate_threshold: float) -> bool:
        """
        Check if both long and short windows exceed the burn rate threshold.
        This prevents false positives from transient spikes.
        """
        # Calculate average error rate for the long window
        long_cutoff = current_time - long_window
        long_metrics = [(t, e) for t, e in metrics if t >= long_cutoff]
        if not long_metrics:
            return False
        long_error_rate = np.mean([e for _, e in long_metrics])
        long_burn_rate = self.calculate_burn_rate(long_error_rate)
        if long_burn_rate < burn_rate_threshold:
            return False

        # Calculate average error rate for the short window
        short_cutoff = current_time - short_window
        short_metrics = [(t, e) for t, e in metrics if t >= short_cutoff]
        if not short_metrics:
            return False
        short_error_rate = np.mean([e for _, e in short_metrics])
        short_burn_rate = self.calculate_burn_rate(short_error_rate)
        return short_burn_rate >= burn_rate_threshold


# Example usage
if __name__ == "__main__":
    engine = SLOAlertEngine(slo_target=0.999)

    # Simulate metrics over time
    current_time = time.time()
    metrics = []

    # Generate an hour of normal traffic (0.05% error rate), one sample every 36 seconds
    for i in range(100):
        metrics.append((current_time - (3600 - i * 36), 0.0005))

    # Inject an incident: 1% error rate over the last 10 minutes
    for i in range(10):
        metrics.append((current_time - (600 - i * 60), 0.01))

    result = engine.evaluate_alert(metrics, current_time)
    print(f"Alert Decision: {result}")  # Slow burn only: expect a ticket, not a page

Beyond the engine itself, three practices keep alerting costs and noise under control:

  1. Service-level aggregation: Create one alert policy per service that aggregates across all model endpoints
  2. Dynamic thresholds: Use burn rate calculations instead of fixed thresholds to reduce alert noise
  3. Metamonitoring: Monitor your monitoring system to ensure alerts are delivered correctly

Based on Google Cloud’s pricing model, consolidating 50 individual endpoint alerts into 5 service-level alerts reduces costs by approximately 70% while improving detection accuracy through better signal-to-noise ratio.

Before enabling production paging, validate your alerting rules:

  1. Historical replay: Run rules against past incidents to verify detection
  2. Synthetic testing: Generate controlled error rates to trigger alerts (a sketch follows this list)
  3. Chaos engineering: Introduce real failures in staging environments
  4. On-call drills: Test alert delivery and escalation monthly
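
As a starting point for the synthetic-testing item above, here is a hedged sketch that drives the SLOAlertEngine from the earlier example with a sustained 2% error rate and asserts that a page fires. It assumes that class is importable or defined in the same module; the sample spacing is arbitrary, only the window coverage matters:

```python
# Synthetic alert test: feed the SLOAlertEngine from the earlier example a
# controlled, sustained 2% error rate and assert that it pages.
import time

def test_sustained_errors_trigger_page():
    engine = SLOAlertEngine(slo_target=0.999)
    now = time.time()
    # One sample per minute for the past hour, all at a 2% error rate
    # (a 20x burn rate, above the 14.4x page threshold).
    metrics = [(now - 60 * i, 0.02) for i in range(60)]
    result = engine.evaluate_alert(metrics, now)
    assert result["alert"] and result["severity"] == "page", result

test_sustained_errors_trigger_page()
print("synthetic page test passed")
```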


Effective AI alerting requires moving beyond simple threshold monitoring to sophisticated burn-rate calculations that measure user impact. By implementing multi-window, multi-burn-rate alerting, you can detect incidents in minutes rather than hours while controlling costs through smart aggregation.

The key principles are:

  • Alert on budget consumption, not absolute errors
  • Use dual windows to prevent false positives
  • Aggregate metrics at the service level
  • Test thoroughly before enabling production paging

Start with the Python SLO engine example to build your alerting foundation, then expand with JetStream monitoring and cost anomaly detection as your system grows.