A Series A startup discovered their production LLM service had been returning 500 errors for six hours on a Saturday night. Their monitoring dashboard showed perfect metrics because they were alerting on raw error counts, not error rates. By Monday morning, they had lost 12,000 customer interactions and faced a wave of churn. This guide provides battle-tested alerting strategies specifically designed for AI systems, so you never face a similar surprise.
Traditional application monitoring fails for AI systems because LLMs have unique failure modes: hallucinations, rate limiting, token budget exhaustion, and queue-based latency spikes. A 2% error rate in a standard web API might be acceptable, but against a 99.9% SLO it is a 20x burn rate, enough to exhaust an entire month's error budget in roughly 36 hours.
The cost implications are severe. According to Google Cloud's alerting documentation, alerting policies are billed per condition. A poorly designed setup with 50 separate alert policies can cost thousands of dollars per month, while a consolidated approach cuts those costs by roughly 70% and improves detection accuracy. For context, with Claude 3.5 Sonnet priced at $3.00 per 1M input tokens and $15.00 per 1M output tokens, a single misconfigured alert that triggers a retry storm can burn through hundreds of dollars before engineers respond.
SLO-based alerting transforms how you detect AI system issues by focusing on user impact rather than internal metrics. Instead of asking “is the error count high?”, you ask “are we consuming our error budget faster than expected?”
This approach, pioneered by Google SRE, uses two time windows to validate alerts: a short window for rapid detection and a long window to confirm sustained problems. The dual-window validation prevents false positives from transient spikes while maintaining fast detection for real incidents.
For a 99.9% SLO (0.1% error budget), here’s how burn rates translate to budget consumption:
| Budget Consumed | Time to Exhaust | Burn Rate | Alert Severity |
|---|---|---|---|
| 2% in 1 hour | 50 hours | 14.4x | Page (immediate) |
| 5% in 6 hours | 120 hours | 6x | Page (urgent) |
| 10% in 3 days | 30 days | 1x | Ticket (business hours) |
These thresholds ensure you page on-call engineers only when the incident poses immediate budget risk, while lower-severity issues generate tickets for business-hours response.
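As a sanity check, the numbers in the table fall straight out of the arithmetic. A minimal sketch, assuming a 30-day (720-hour) SLO window:

```python
# Sanity-check the table: burn rate = budget fraction consumed, scaled by
# how much of the 30-day SLO window the alerting window covers.
SLO_WINDOW_HOURS = 30 * 24  # 720-hour (30-day) SLO window

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    """Burn rate implied by consuming `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * SLO_WINDOW_HOURS / window_hours

def hours_to_exhaust(rate: float) -> float:
    """Hours until the full monthly budget is gone at a constant burn rate."""
    return SLO_WINDOW_HOURS / rate

for fraction, window in [(0.02, 1), (0.05, 6), (0.10, 72)]:
    rate = burn_rate(fraction, window)
    print(f"{fraction:.0%} in {window}h -> {rate:.1f}x burn, exhausted in {hours_to_exhaust(rate):.0f}h")
# 2% in 1h   -> 14.4x burn, exhausted in 50h
# 5% in 6h   -> 6.0x burn, exhausted in 120h
# 10% in 72h -> 1.0x burn, exhausted in 720h (30 days)
```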
Alerting on raw, unaggregated metrics creates three critical problems:
Cardinality explosion: Individual model endpoints can generate millions of unique metric series, and every additional series inflates alerting costs
False positives: Single request failures trigger alerts even when automatic retries succeed
Poor detection time: Fixed-duration clauses (e.g., “for: 1h”) cannot account for severity — a marginal 1% error rate and a catastrophic 50% error rate both wait out the full hour before alerting, while a 50% error spike lasting only 2 minutes never alerts at all
The recommended approach aggregates metrics by service and uses burn rate calculations that account for both severity and duration.
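A minimal sketch of that aggregation, assuming you can already pull per-endpoint error and request counts from your metrics backend (the endpoint names and counts below are purely illustrative):

```python
from typing import Dict, Tuple

ERROR_BUDGET = 0.001  # 99.9% SLO

def service_burn_rate(per_endpoint_counts: Dict[str, Tuple[int, int]]) -> float:
    """Roll per-endpoint (errors, total requests) up into one service-level burn rate."""
    errors = sum(e for e, _ in per_endpoint_counts.values())
    total = sum(t for _, t in per_endpoint_counts.values())
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET

# Illustrative counts for three endpoints of the same service.
counts = {"chat-v1": (12, 9_500), "chat-v2": (3, 18_000), "embeddings": (0, 40_000)}
print(f"service-level burn rate: {service_burn_rate(counts):.2f}x")  # ~0.22x
```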
These alerts detect when your LLM service is violating SLO targets. The key is measuring error rates against your error budget, not absolute thresholds.
Recommended thresholds for 99.9% SLO:
Page: 2% budget consumed in 1 hour (14.4x burn rate)
Page: 5% budget consumed in 6 hours (6x burn rate)
Ticket: 10% budget consumed in 3 days (1x burn rate)
For systems using inference servers like JetStream, queue metrics provide early warning before latency degrades. Google Cloud’s best practices recommend monitoring:
jetstream_prefill_backlog_size: Number of requests waiting for prefill (latency signal)
jetstream_slots_used_percentage: Percentage of decode slots in use (throughput signal)
A prefill backlog of more than 5 requests with a positive growth rate indicates impending latency issues, while slot utilization above 95% means you are at capacity and should scale immediately.
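A rough sketch of those two checks, assuming the JetStream metrics above have already been scraped into simple lists of recent samples (in practice you would read them from Cloud Monitoring or Prometheus; the thresholds mirror the ones just described):

```python
from typing import List

PREFILL_BACKLOG_THRESHOLD = 5       # requests waiting for prefill
SLOT_UTILIZATION_THRESHOLD = 95.0   # percent of decode slots in use

def queue_pressure_alerts(backlog_samples: List[float],
                          slot_pct_samples: List[float]) -> List[str]:
    """Early-warning checks over recent samples (oldest first) of
    jetstream_prefill_backlog_size and jetstream_slots_used_percentage."""
    alerts = []
    backlog = backlog_samples[-1]
    growing = len(backlog_samples) > 1 and backlog > backlog_samples[0]
    if backlog > PREFILL_BACKLOG_THRESHOLD and growing:
        alerts.append(f"latency risk: prefill backlog at {backlog:.0f} and growing")
    if slot_pct_samples[-1] > SLOT_UTILIZATION_THRESHOLD:
        alerts.append(f"at capacity: {slot_pct_samples[-1]:.1f}% of decode slots in use - scale out")
    return alerts

print(queue_pressure_alerts([2, 4, 7], [88.0, 93.5, 96.2]))
```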
Given LLM costs ($3-15 per 1M tokens), sudden traffic spikes can create bill shock. Alert on token consumption rates that exceed historical baselines by significant margins.
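One minimal way to implement that, assuming hourly input/output token counts are available; the 3x-over-baseline trigger and trailing-hours baseline are illustrative assumptions, not recommendations from any provider:

```python
from statistics import mean
from typing import List, Optional

INPUT_PRICE_PER_M = 3.00    # $ per 1M input tokens (Claude 3.5 Sonnet, per the pricing above)
OUTPUT_PRICE_PER_M = 15.00  # $ per 1M output tokens

def token_spend_anomaly(hourly_input_tokens: List[int],
                        hourly_output_tokens: List[int],
                        spike_multiplier: float = 3.0) -> Optional[str]:
    """Flag the most recent hour if token spend far exceeds the trailing baseline."""
    def spend(inp: int, out: int) -> float:
        return inp / 1e6 * INPUT_PRICE_PER_M + out / 1e6 * OUTPUT_PRICE_PER_M

    baseline = mean(spend(i, o) for i, o in
                    zip(hourly_input_tokens[:-1], hourly_output_tokens[:-1]))
    latest = spend(hourly_input_tokens[-1], hourly_output_tokens[-1])
    if latest > spike_multiplier * baseline:
        return (f"token spend ${latest:.2f}/h is {latest / baseline:.1f}x "
                f"the ${baseline:.2f}/h baseline")
    return None

# 24 hours of steady traffic, then a spike in the final hour.
inputs = [400_000] * 23 + [2_500_000]
outputs = [150_000] * 23 + [900_000]
print(token_spend_anomaly(inputs, outputs))
```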
Practical Implementation: Building Your Alerting System
Choose realistic availability targets based on user expectations. For consumer-facing chatbots, 99.9% is common. For critical financial applications, 99.95% might be required. Document your error budget: 99.9% availability allows roughly 43 minutes of downtime per month.
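The budget arithmetic, spelled out:

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime allowed per month at a given availability SLO."""
    return (1 - slo) * days * 24 * 60

print(monthly_error_budget_minutes(0.999))   # ~43.2 minutes
print(monthly_error_budget_minutes(0.9995))  # ~21.6 minutes
```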
Calculate burn rate thresholds
For your chosen SLO, compute the burn rates that consume 2%, 5%, and 10% of budget in your target windows. Use the formula: burn_rate = (error_rate / error_budget). For 99.9% SLO (0.001 budget), a 1.44% error rate = 14.4x burn rate.
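A small sketch of that formula, including the inverse mapping from alerting burn rate back to the error-rate threshold it implies:

```python
ERROR_BUDGET = 0.001  # 99.9% SLO

def burn_rate_from_error_rate(error_rate: float) -> float:
    return error_rate / ERROR_BUDGET

def error_rate_threshold(burn_rate: float) -> float:
    """Error rate above which an alert at this burn rate should fire."""
    return burn_rate * ERROR_BUDGET

print(f"{burn_rate_from_error_rate(0.0144):.1f}x")  # 14.4x, as in the example above
for rate in (14.4, 6.0, 1.0):
    print(f"{rate:>4.1f}x burn rate -> fire above a {error_rate_threshold(rate):.2%} error rate")
# 14.4x -> 1.44%, 6.0x -> 0.60%, 1.0x -> 0.10%
```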
Consolidate alerts using aggregation
Instead of separate policies per model endpoint, create one policy per service that aggregates across all endpoints. This reduces per-condition costs by 70% and provides a unified view.
Implement dual-window validation
Configure both a long and a short window for each threshold. The short window (typically about one-twelfth of the long window, e.g. 5-30 minutes for the paging tiers) detects rapid changes, while the long window (1 hour to 3 days, matching the table above) confirms sustained issues.
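A sketch of the resulting window pairs; the one-twelfth pairing is an assumed but common convention consistent with the table above, not a prescription from this guide:

```python
# One tier per row of the burn-rate table: (long_window, short_window, burn_rate, severity).
ALERT_TIERS = [
    ("1h", "5m",  14.4, "page"),    # 2% of budget in 1 hour
    ("6h", "30m",  6.0, "page"),    # 5% of budget in 6 hours
    ("3d", "6h",   1.0, "ticket"),  # 10% of budget in 3 days
]
```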
Set up metamonitoring
Monitor your monitoring system. Ensure alert delivery is working and that alert silence windows are properly configured. Test your on-call rotation monthly.
Tune thresholds with historical data
Run your alerting rules in “monitoring mode” for 2-4 weeks before enabling paging. Adjust thresholds based on your actual traffic patterns and error rates.
The following example demonstrates how to implement the multi-burn-rate strategy for an LLM service; together with the JetStream and token-cost sketches above, it covers the main aspects of AI system monitoring.
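A sketch of such an SLO engine, assuming a `fetch_error_ratio(window)` callable that returns the service-level error ratio over a given lookback window (a hypothetical hook; wire it to whatever metrics backend you use). The tiers mirror the thresholds table above:

```python
from dataclasses import dataclass
from typing import Callable, List

ERROR_BUDGET = 0.001  # 99.9% SLO

@dataclass
class BurnRateTier:
    long_window: str   # primary detection window, e.g. "1h"
    short_window: str  # confirmation window, ~1/12 of the long window
    burn_rate: float   # fire when BOTH windows exceed this burn rate
    severity: str      # "page" or "ticket"

TIERS = [
    BurnRateTier("1h", "5m", 14.4, "page"),   # 2% of budget in 1 hour
    BurnRateTier("6h", "30m", 6.0, "page"),   # 5% of budget in 6 hours
    BurnRateTier("3d", "6h", 1.0, "ticket"),  # 10% of budget in 3 days
]

def evaluate(fetch_error_ratio: Callable[[str], float],
             tiers: List[BurnRateTier] = TIERS) -> List[str]:
    """Multi-window, multi-burn-rate evaluation.

    An alert fires only when both the long and the short window burn budget
    faster than the tier's threshold, filtering out transient spikes while
    still paging quickly on severe, sustained error rates.
    """
    alerts = []
    for tier in tiers:
        long_burn = fetch_error_ratio(tier.long_window) / ERROR_BUDGET
        short_burn = fetch_error_ratio(tier.short_window) / ERROR_BUDGET
        if long_burn > tier.burn_rate and short_burn > tier.burn_rate:
            alerts.append(
                f"{tier.severity}: {long_burn:.1f}x burn over {tier.long_window}, "
                f"confirmed at {short_burn:.1f}x over {tier.short_window}"
            )
    return alerts

# Usage with a stubbed backend: a flat 1.8% error ratio is an 18x burn rate,
# which trips every tier, including the 14.4x immediate page.
print(evaluate(lambda window: 0.018))
```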
Based on Google Cloud’s pricing model, consolidating 50 individual endpoint alerts into 5 service-level alerts reduces costs by approximately 70% while improving detection accuracy through better signal-to-noise ratio.
Effective AI alerting requires moving beyond simple threshold monitoring to sophisticated burn-rate calculations that measure user impact. By implementing multi-window, multi-burn-rate alerting, you can detect incidents in minutes rather than hours while controlling costs through smart aggregation.
The key principles are:
Alert on budget consumption, not absolute errors
Use dual windows to prevent false positives
Aggregate metrics at the service level
Test thoroughly before enabling production paging
Start with the Python SLO engine example to build your alerting foundation, then expand with JetStream monitoring and cost anomaly detection as your system grows.