Alerting Strategies for AI Systems: When to Page

A Series A startup discovered their production LLM service had been returning 500 errors for six hours on a Saturday night. Their dashboards looked healthy and nobody was paged: they were alerting on raw error counts rather than error rates, and weekend traffic was low enough that the counts never crossed the threshold. By Monday morning, they had lost 12,000 customer interactions and faced a wave of churn. This guide provides battle-tested alerting strategies designed specifically for AI systems, so you never face a similar surprise.

Traditional application monitoring fails for AI systems because LLMs have unique failure modes: hallucinations, rate limiting, token budget exhaustion, and queue-based latency spikes. A 2% error rate in a standard web API might be acceptable, but in an LLM service with a 99.9% SLO it is a 20x burn rate, consuming about 2% of your monthly error budget every 43 minutes and exhausting the entire budget in roughly 36 hours.

The cost implications are severe. According to Google Cloud's alerting documentation, alerting policies are billed per condition, so a poorly designed setup with 50 separate alerts can cost thousands per month, while a consolidated approach cuts that bill by roughly 70% and improves detection accuracy. Token spend compounds the problem: with Claude 3.5 Sonnet at $3.00/$15.00 per 1M input/output tokens, a single misconfigured alert that lets a retry storm run unnoticed can burn through hundreds of dollars before engineers respond.
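
To make that concrete, here is a rough back-of-envelope sketch of the retry-storm scenario. The request rate, retry count, and token counts per call are illustrative assumptions, not measurements; only the Claude 3.5 Sonnet prices come from the paragraph above.

```python
# Back-of-envelope cost of a retry storm, using the Claude 3.5 Sonnet
# pricing quoted above ($3 / $15 per 1M input / output tokens).
# Request sizes and retry counts are illustrative assumptions.

INPUT_PRICE_PER_M = 3.00     # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 15.00   # USD per 1M output tokens

def retry_storm_cost(requests_per_min: float, retries_per_request: int,
                     input_tokens: int, output_tokens: int,
                     duration_min: float) -> float:
    """Estimate the extra spend caused by automatic retries alone."""
    extra_calls = requests_per_min * retries_per_request * duration_min
    cost_per_call = (input_tokens * INPUT_PRICE_PER_M
                     + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return extra_calls * cost_per_call

# 200 requests/min retried 3 times each, 2k input / 500 output tokens per call,
# for the 45 minutes it takes someone to respond to a page:
print(f"${retry_storm_cost(200, 3, 2_000, 500, 45):,.2f}")  # ~= $364.50
```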

SLO-based alerting transforms how you detect AI system issues by focusing on user impact rather than internal metrics. Instead of asking “is the error count high?”, you ask “are we consuming our error budget faster than expected?”

The Multi-Window, Multi-Burn-Rate Strategy

This approach, pioneered by Google SRE, validates every alert against two time windows: a long window that confirms a meaningful share of the error budget has actually been burned, and a short window that confirms the burn is still happening right now. Requiring both to exceed the threshold prevents false positives from transient spikes while keeping detection fast for real incidents.

For a 99.9% SLO (a 0.1% error budget over a 30-day window), here's how burn rates translate to budget consumption:

| Budget consumed | Time to exhaust budget | Burn rate | Alert severity |
| --- | --- | --- | --- |
| 2% in 1 hour | 50 hours | 14.4x | Page (immediate) |
| 5% in 6 hours | 120 hours | 6x | Page (urgent) |
| 10% in 3 days | 30 days | 1x | Ticket (business hours) |

These thresholds ensure you page on-call engineers only when the incident poses immediate budget risk, while lower-severity issues generate tickets for business-hours response.
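If you want to verify these numbers yourself, the arithmetic is short. The sketch below assumes the standard 30-day budget window used throughout this guide:

```python
# Arithmetic behind the table above: 99.9% SLO, 30-day (720-hour) budget window.
SLO = 0.999
ERROR_BUDGET = 1 - SLO              # 0.001
WINDOW_HOURS = 30 * 24              # 720

def burn_rate(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate needed to consume `budget_fraction` of the budget in `alert_window_hours`."""
    return budget_fraction * WINDOW_HOURS / alert_window_hours

def hours_to_exhaust(rate: float) -> float:
    """How long the full monthly budget lasts at a given burn rate."""
    return WINDOW_HOURS / rate

for budget, hours, severity in [(0.02, 1, "page"), (0.05, 6, "page"), (0.10, 72, "ticket")]:
    rate = burn_rate(budget, hours)
    print(f"{budget:.0%} of budget in {hours}h -> {rate:.1f}x burn, "
          f"exhausted in {hours_to_exhaust(rate):.0f}h ({severity})")

# A 14.4x burn corresponds to a 1.44% error rate against the 0.1% budget:
print(f"{14.4 * ERROR_BUDGET:.2%}")
```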

Alerting on raw, unaggregated metrics creates three critical problems:

  1. Cardinality explosion: Individual model endpoints generate millions of unique metric series, multiplying alerting costs exponentially
  2. False positives: Single request failures trigger alerts even when automatic retries succeed
  3. Poor detection characteristics: Fixed-duration clauses (e.g., "for: 1h") ignore severity, so a catastrophic 50% error rate must persist just as long as a marginal 1% error rate before anyone is paged, and a severe but short-lived spike may never alert at all

The recommended approach aggregates metrics by service and uses burn rate calculations that account for both severity and duration.
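As a sketch of what service-level aggregation can look like before any alert rule is evaluated, the snippet below sums per-endpoint counters into one error rate per service; the endpoint names and counts are made up for illustration:

```python
# Minimal sketch of service-level aggregation: sum request and error counts
# across all endpoints before computing a single error rate per service,
# rather than alerting on each endpoint individually.
from collections import defaultdict
from typing import Dict, Tuple

def aggregate_error_rates(
    endpoint_counts: Dict[str, Tuple[int, int]]  # endpoint -> (errors, requests)
) -> Dict[str, float]:
    """Collapse per-endpoint counters into one error rate per service."""
    totals = defaultdict(lambda: [0, 0])
    for endpoint, (errors, requests) in endpoint_counts.items():
        service = endpoint.split("/")[0]   # e.g. "chat/claude-3-5-sonnet" -> "chat"
        totals[service][0] += errors
        totals[service][1] += requests
    return {svc: errs / reqs if reqs else 0.0 for svc, (errs, reqs) in totals.items()}

# Illustrative per-endpoint counters:
counts = {
    "chat/claude-3-5-sonnet": (12, 9_000),
    "chat/gpt-4o-mini": (3, 6_000),
    "embeddings/text-embed": (0, 20_000),
}
print(aggregate_error_rates(counts))  # {'chat': 0.001, 'embeddings': 0.0}
```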

These alerts detect when your LLM service is violating SLO targets. The key is measuring error rates against your error budget, not absolute thresholds.

Recommended thresholds for 99.9% SLO:

  • Page: 2% budget consumed in 1 hour (14.4x burn rate)
  • Page: 5% budget consumed in 6 hours (6x burn rate)
  • Ticket: 10% budget consumed in 3 days (1x burn rate)

For systems using inference servers like JetStream, queue metrics provide early warning before latency degrades. Google Cloud’s best practices recommend monitoring:

  • jetstream_prefill_backlog_size: Number of requests waiting for prefill (latency signal)
  • jetstream_slots_used_percentage: Percentage of decode slots in use (throughput signal)

A prefill backlog above 5 requests that keeps growing indicates impending latency problems, while slot utilization above 95% means you are at capacity and should scale immediately.
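
A minimal sketch of how those two signals might be combined into an early-warning check is shown below. It assumes the metric values have already been scraped from your monitoring backend; the function and thresholds mirror the guidance above and are not an official JetStream API:

```python
# Sketch of an early-warning check built on the two JetStream signals above.
# The metric values are assumed to have been scraped already; how you fetch
# them depends on your monitoring backend.
from typing import List

def queue_pressure_alerts(backlog_samples: List[float],
                          slots_used_pct: float) -> List[str]:
    """Return warnings based on prefill backlog trend and decode slot usage."""
    alerts = []
    if len(backlog_samples) >= 2:
        latest = backlog_samples[-1]
        growing = backlog_samples[-1] > backlog_samples[0]
        # Backlog > 5 requests and still growing: latency will degrade soon.
        if latest > 5 and growing:
            alerts.append("jetstream_prefill_backlog_size > 5 and rising: "
                          "latency degradation imminent")
    # Decode slots > 95% used: at capacity, scale now.
    if slots_used_pct > 95:
        alerts.append("jetstream_slots_used_percentage > 95%: scale immediately")
    return alerts

# Example: backlog grew from 2 to 8 over the last few scrapes, slots at 97%.
print(queue_pressure_alerts([2, 4, 6, 8], 97.0))
```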

TPU High Bandwidth Memory (HBM) usage is the primary bottleneck for LLM inference. Monitor:

  • HBM usage greater than 85% for 10+ minutes: Scale up soon
  • HBM usage greater than 95% for 5 minutes: Scale up immediately
  • Token rate spike greater than 1.5x 7-day average: Investigate for runaway requests or traffic anomalies

Given LLM costs ($3-15 per 1M tokens), sudden traffic spikes can create bill shock. Alert on token consumption rates that exceed historical baselines by significant margins.
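
One way to express that baseline comparison, sketched under the assumption that you track hourly token totals, is shown below. The 1.5x spike factor matches the threshold above; the pricing default is an assumption you should replace with your own model mix:

```python
# Sketch of a token-consumption anomaly check against a trailing 7-day
# baseline, per the 1.5x threshold above. Spike factor and pricing are
# assumptions; adjust for your own traffic and model mix.
from statistics import mean
from typing import List, Optional

def token_rate_anomaly(hourly_tokens_7d: List[float],
                       current_hour_tokens: float,
                       spike_factor: float = 1.5,
                       cost_per_m_tokens: float = 15.0) -> Optional[str]:
    """Flag token throughput that exceeds the 7-day hourly average by `spike_factor`."""
    if not hourly_tokens_7d:
        return None
    baseline = mean(hourly_tokens_7d)
    if current_hour_tokens > spike_factor * baseline:
        excess_cost = (current_hour_tokens - baseline) / 1_000_000 * cost_per_m_tokens
        return (f"Token rate {current_hour_tokens / baseline:.1f}x baseline; "
                f"~${excess_cost:,.2f}/hour of unexpected spend")
    return None

# 7-day average of ~4M tokens/hour, suddenly seeing 10M this hour:
print(token_rate_anomaly([4e6] * 168, 10e6))
```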

Practical Implementation: Building Your Alerting System

  1. Define your SLO targets

    Choose realistic availability targets based on user expectations. For consumer-facing chatbots, 99.9% is common. For critical financial applications, 99.95% might be required. Document your error budget: 99.9% = 43 minutes of downtime per month.

  2. Calculate burn rate thresholds

    For your chosen SLO, compute the burn rates that consume 2%, 5%, and 10% of budget in your target windows. Use the formula: burn_rate = (error_rate / error_budget). For 99.9% SLO (0.001 budget), a 1.44% error rate = 14.4x burn rate.

  3. Consolidate alerts using aggregation

    Instead of separate policies per model endpoint, create one policy per service that aggregates across all endpoints. This reduces per-condition costs by 70% and provides a unified view.

  4. Implement dual-window validation

    Configure both a long and a short window for each threshold. The long window (1 hour to 3 days) confirms that a meaningful share of the budget has burned, while the short window (typically 1/12 of the long window, 5 minutes to 6 hours) confirms the burn is still happening, so alerts clear quickly once the issue is resolved.

  5. Set up metamonitoring

    Monitor your monitoring system. Ensure alert delivery is working and that alert silence windows are properly configured. Test your on-call rotation monthly.

  6. Tune thresholds with historical data

    Run your alerting rules in “monitoring mode” for 2-4 weeks before enabling paging. Adjust thresholds based on your actual traffic patterns and error rates.
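
For step 6, "monitoring mode" can be as simple as replaying historical error-rate samples against candidate thresholds and counting how many pages would have fired, without routing anything to on-call. The sketch below assumes hourly samples and the page thresholds from the table earlier; the sample data is illustrative:

```python
# Dry run for threshold tuning: count how many samples *would* have paged
# at each candidate burn-rate threshold. Thresholds and data are illustrative.
from typing import Dict, List, Optional

def dry_run_pages(error_rates: List[float],
                  error_budget: float = 0.001,
                  thresholds: Optional[Dict[str, float]] = None) -> Dict[str, int]:
    """Count samples whose burn rate exceeds each candidate threshold."""
    thresholds = thresholds or {"page_1h": 14.4, "page_6h": 6.0}
    counts = {name: 0 for name in thresholds}
    for rate in error_rates:
        burn = rate / error_budget
        for name, threshold in thresholds.items():
            if burn >= threshold:
                counts[name] += 1
    return counts

# Two weeks of hourly error-rate samples: mostly healthy, a few spikes.
history = [0.0004] * 300 + [0.02] * 3 + [0.008] * 33
print(dry_run_pages(history))  # {'page_1h': 3, 'page_6h': 36}
```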

The following production-ready example demonstrates how to implement the multi-window, multi-burn-rate strategy for an LLM service.

import time
from typing import Any, Dict, List, Tuple

import numpy as np


class SLOAlertEngine:
    """
    Production-ready SLO-based alerting engine implementing multi-window,
    multi-burn-rate alerting as recommended by Google SRE best practices.

    Key design decisions:
    - Uses burn rate calculations to detect significant budget consumption
    - Implements dual-window validation to reduce false positives
    - Supports both page and ticket severity levels
    """

    def __init__(self, slo_target: float = 0.999, budget_window_days: int = 30):
        self.slo_target = slo_target
        self.error_budget = 1.0 - slo_target  # e.g., 0.001 for 99.9% SLO
        self.budget_window_seconds = budget_window_days * 24 * 3600
        # Recommended thresholds from the Google SRE workbook
        self.page_thresholds = [
            (3600, 300, 14.4),   # 1h long, 5m short, 14.4x burn rate (2% budget)
            (21600, 1800, 6.0),  # 6h long, 30m short, 6x burn rate (5% budget)
        ]
        self.ticket_threshold = (259200, 21600, 1.0)  # 3d long, 6h short, 1x burn rate (10% budget)

    def calculate_burn_rate(self, error_rate: float) -> float:
        """Calculate burn rate from current error rate."""
        return error_rate / self.error_budget

    def evaluate_alert(self, metrics_history: List[Tuple[float, float]],
                       current_time: float) -> Dict[str, Any]:
        """
        Evaluate whether to trigger an alert based on historical metrics.

        Args:
            metrics_history: List of (timestamp, error_rate) tuples
            current_time: Current timestamp

        Returns:
            Dictionary with alert decision and details
        """
        if not metrics_history:
            return {"alert": False, "reason": "No data"}

        # Filter metrics within our maximum window (3 days)
        max_window = max(self.page_thresholds[0][0], self.page_thresholds[1][0],
                         self.ticket_threshold[0])
        cutoff_time = current_time - max_window
        recent_metrics = [(t, e) for t, e in metrics_history if t >= cutoff_time]
        if not recent_metrics:
            return {"alert": False, "reason": "Insufficient recent data"}

        # Check each page threshold
        for long_window, short_window, burn_rate in self.page_thresholds:
            if self._check_dual_window(recent_metrics, current_time,
                                       long_window, short_window, burn_rate):
                return {
                    "alert": True,
                    "severity": "page",
                    "burn_rate": burn_rate,
                    "budget_consumed": f"{(burn_rate * long_window / self.budget_window_seconds * 100):.2f}%",
                }

        # Check the ticket threshold
        long_w, short_w, burn_rate = self.ticket_threshold
        if self._check_dual_window(recent_metrics, current_time,
                                   long_w, short_w, burn_rate):
            return {
                "alert": True,
                "severity": "ticket",
                "burn_rate": burn_rate,
                "budget_consumed": f"{(burn_rate * long_w / self.budget_window_seconds * 100):.2f}%",
            }

        return {"alert": False, "reason": "Within SLO thresholds"}

    def _check_dual_window(self, metrics: List[Tuple[float, float]],
                           current_time: float, long_window: int,
                           short_window: int, burn_rate_threshold: float) -> bool:
        """
        Check if both long and short windows exceed the burn rate threshold.
        This prevents false positives from transient spikes.
        """
        # Calculate average error rate for the long window
        long_cutoff = current_time - long_window
        long_metrics = [(t, e) for t, e in metrics if t >= long_cutoff]
        if not long_metrics:
            return False
        long_error_rate = np.mean([e for _, e in long_metrics])
        long_burn_rate = self.calculate_burn_rate(long_error_rate)
        if long_burn_rate < burn_rate_threshold:
            return False

        # Calculate average error rate for the short window
        short_cutoff = current_time - short_window
        short_metrics = [(t, e) for t, e in metrics if t >= short_cutoff]
        if not short_metrics:
            return False
        short_error_rate = np.mean([e for _, e in short_metrics])
        short_burn_rate = self.calculate_burn_rate(short_error_rate)
        return short_burn_rate >= burn_rate_threshold


# Example usage
if __name__ == "__main__":
    engine = SLOAlertEngine(slo_target=0.999)

    # Simulate metrics over time
    current_time = time.time()
    metrics = []

    # Generate an hour of normal traffic (0.05% error rate), one sample every 36 seconds
    for i in range(100):
        metrics.append((current_time - (3600 - i * 36), 0.0005))

    # Inject an incident: 1% error rate over the last 10 minutes
    for i in range(10):
        metrics.append((current_time - (600 - i * 60), 0.01))

    result = engine.evaluate_alert(metrics, current_time)
    print(f"Alert Decision: {result}")  # Slow burn only: expect a ticket, not a page

Beyond the engine itself, three practices keep alerting costs and noise under control:

  1. Service-level aggregation: Create one alert policy per service that aggregates across all model endpoints
  2. Dynamic thresholds: Use burn rate calculations instead of fixed thresholds to reduce alert noise
  3. Metamonitoring: Monitor your monitoring system to ensure alerts are delivered correctly

Based on Google Cloud’s pricing model, consolidating 50 individual endpoint alerts into 5 service-level alerts reduces costs by approximately 70% while improving detection accuracy through better signal-to-noise ratio.

Before enabling production paging, validate your alerting rules:

  1. Historical replay: Run rules against past incidents to verify detection
  2. Synthetic testing: Generate controlled error rates to trigger alerts (a sketch follows this list)
  3. Chaos engineering: Introduce real failures in staging environments
  4. On-call drills: Test alert delivery and escalation monthly
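
As a starting point for the synthetic-testing item above, here is a hedged sketch that drives the SLOAlertEngine from the earlier example with a sustained 2% error rate and asserts that a page fires. It assumes that class is importable or defined in the same module; the sample spacing is arbitrary, only the window coverage matters:

```python
# Synthetic alert test: feed the SLOAlertEngine from the earlier example a
# controlled, sustained 2% error rate and assert that it pages.
import time

def test_sustained_errors_trigger_page():
    engine = SLOAlertEngine(slo_target=0.999)
    now = time.time()
    # One sample per minute for the past hour, all at a 2% error rate
    # (a 20x burn rate, above the 14.4x page threshold).
    metrics = [(now - 60 * i, 0.02) for i in range(60)]
    result = engine.evaluate_alert(metrics, now)
    assert result["alert"] and result["severity"] == "page", result

test_sustained_errors_trigger_page()
print("synthetic page test passed")
```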


Effective AI alerting requires moving beyond simple threshold monitoring to sophisticated burn-rate calculations that measure user impact. By implementing multi-window, multi-burn-rate alerting, you can detect incidents in minutes rather than hours while controlling costs through smart aggregation.

The key principles are:

  • Alert on budget consumption, not absolute errors
  • Use dual windows to prevent false positives
  • Aggregate metrics at the service level
  • Test thoroughly before enabling production paging

Start with the Python SLO engine example to build your alerting foundation, then expand with JetStream monitoring and cost anomaly detection as your system grows.