A 99.9% SLO means your error budget is just 0.1%, which for 60,480 requests over 7 days is only about 60 failed requests. But here’s what most teams miss: alerting on raw error rates without burn rate logic will either miss incidents or drown you in false positives. This guide covers production-ready latency monitoring, SLA enforcement, and multi-burn-rate alerting that scales from low-traffic APIs to high-volume LLM inference pipelines.
Latency SLAs aren’t just about keeping users happy; they’re about managing risk and cost. For LLM services, latency directly impacts token consumption and inference costs. A service that degrades from p99 200ms to 2s might trigger retries, doubling your token spend. For a service processing 1M requests/day at $0.001/request, a 24-hour incident at 2x token usage doubles the daily bill from $1,000 to $2,000, roughly $1,000 of it avoidable spend.
Traditional monitoring fails for LLM services because:
Token generation is non-deterministic: the same prompt can produce different latencies
Context windows vary: a 200K-token context processes very differently from an 8K-token one
Cost scales with latency: Longer processing = more tokens = higher bills
The solution: SLIs that capture user experience, SLOs that define acceptable risk, and alerting that fires in proportion to how fast the error budget is burning.
An SLI is a measurable metric that reflects user experience. For latency, the gold standard is request-based percentile latency.
Key Latency SLIs:
p50 (Median): Typical user experience
p95: The latency 95% of requests complete within
p99: The latency the slowest 1% of requests exceed (the tail)
Why Histograms?
Histograms bucket latency values, enabling percentile calculations. You can’t calculate p99 from a simple counter or gauge—you need bucketed observations.
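As a concrete sketch (assuming the prometheus_client library; adapt it to whatever metrics stack you already run), here is how request latency can be recorded into a histogram whose buckets cluster around a hypothetical 200ms SLO threshold:

```python
# Minimal sketch using prometheus_client; bucket boundaries assume a 200ms p99 SLO.
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["route", "method"],
    # Dense buckets around the 200ms threshold, sparser toward the tail.
    buckets=[0.05, 0.10, 0.15, 0.20, 0.25, 0.5, 1.0, 2.5, 5.0],
)

def handle_request(route: str, method: str) -> None:
    start = time.perf_counter()
    try:
        ...  # actual request handling goes here
    finally:
        REQUEST_LATENCY.labels(route=route, method=method).observe(
            time.perf_counter() - start
        )

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```

Each bucket is exposed as a cumulative counter, which is exactly what percentile estimation and burn-rate queries operate on.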
Traditional alerting: “Alert if error rate > 0.1% for 5 minutes.”
Problem: During low traffic, a single failure can trip the alert; during a real incident, the fixed threshold and duration tell you nothing about how fast your error budget is actually burning, so brief blips and severe outages look identical.
Multi-burn-rate alerting (from Google SRE) uses burn rate to scale alert sensitivity:
| Alert Severity | Budget Consumed | Time Window | Burn Rate (for 99.9% SLO) |
|----------------|-----------------|-------------|---------------------------|
| PAGE           | 2%              | 1 hour      | 14.4x                     |
| PAGE           | 5%              | 6 hours     | 6.0x                      |
| TICKET         | 10%             | 3 days      | 1.0x                      |
How it works:
Fast burn (14.4x over 1 hour): Page immediately; 2% of the budget disappears in an hour, and at this rate the full 30-day budget is gone in about two days
Medium burn (6x over 6 hours): Page; the budget exhausts in roughly five days if the burn continues
Slow burn (1x over 3 days): Create a ticket; the budget exhausts right at the end of the 30-day window
This prevents false positives while catching real incidents quickly.
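The arithmetic behind those thresholds is worth making explicit. The sketch below assumes a 30-day SLO window; the Window class, thresholds, and evaluate helper are illustrative, not a library API, and it is a single-window simplification (the full recipe also pairs each long window with a shorter companion window so alerts reset promptly):

```python
from dataclasses import dataclass
from typing import Optional

SLO_TARGET = 0.999               # 99.9% latency SLO
ERROR_BUDGET = 1 - SLO_TARGET    # 0.1% of requests may breach the SLO
SLO_PERIOD_HOURS = 30 * 24       # 30-day budget window

def burn_rate(bad_events: int, total_events: int) -> float:
    """How fast the budget is being spent relative to the allowed rate (1.0 = on pace)."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / ERROR_BUDGET

def hours_to_exhaustion(rate: float) -> float:
    """Time until the whole budget is gone if the current burn rate continues."""
    return float("inf") if rate <= 0 else SLO_PERIOD_HOURS / rate

@dataclass
class Window:
    name: str
    burn_threshold: float
    severity: str

# The policy from the table above.
WINDOWS = [
    Window("1h", 14.4, "PAGE"),
    Window("6h", 6.0, "PAGE"),
    Window("3d", 1.0, "TICKET"),
]

def evaluate(window: Window, bad: int, total: int) -> Optional[str]:
    rate = burn_rate(bad, total)
    if rate >= window.burn_threshold:
        return (f"{window.severity}: {window.name} window burning at {rate:.1f}x, "
                f"budget gone in ~{hours_to_exhaustion(rate):.0f}h")
    return None

# 2% of the last hour's 10,000 requests breached the SLO: 20x burn, page now.
print(evaluate(WINDOWS[0], bad=200, total=10_000))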
Choose histogram buckets that align with your SLO thresholds. For a p99 SLO of 200ms, use buckets like: 50ms, 100ms, 150ms, 200ms, 250ms, 500ms, 1000ms. This ensures accurate percentile calculations around your critical threshold.
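To see why resolution near the threshold matters, here is a simplified sketch of percentile estimation from cumulative bucket counts, using the same linear-interpolation idea as Prometheus's histogram_quantile (the counts below are made up, and +Inf bucket handling is glossed over):

```python
def estimate_quantile(q: float, bounds: list[float], cum_counts: list[int]) -> float:
    """Estimate the q-th quantile (0 < q < 1) from cumulative histogram buckets.

    bounds:     bucket upper bounds in seconds, ascending (the +Inf bucket is implied).
    cum_counts: observations <= each bound; the final entry is treated as the total.
    """
    rank = q * cum_counts[-1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cum_counts):
        if count >= rank:
            # Interpolate linearly inside the bucket where the rank falls.
            fraction = (rank - prev_count) / max(count - prev_count, 1)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return bounds[-1]  # rank falls in the +Inf bucket: report the largest finite bound

# Buckets aligned to a 200ms SLO: note how much resolution sits near the threshold.
bounds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.50, 1.00]
cum_counts = [4000, 7000, 8600, 9400, 9700, 9950, 10000]
print(f"p99 ~ {estimate_quantile(0.99, bounds, cum_counts) * 1000:.0f} ms")
```

If the buckets jumped straight from 100ms to 500ms, every estimate near the 200ms boundary would be smeared across that 400ms gap.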
Effective latency SLA enforcement requires three interconnected components: proper SLI instrumentation, realistic SLO targets, and burn-rate-based alerting.
Core Principles:
Use percentiles, not averages - p99 latency reveals user impact that averages hide
Set SLOs below 100% - Your error budget is your safety valve for deployments and innovation
Alert on burn rate, not raw errors - Multi-window, multi-burn-rate alerting scales with traffic and incident severity
Align buckets to SLOs - Histogram buckets must capture values around your latency thresholds
For LLM Services:
Traditional latency monitoring must be adapted for token-based workloads. Monitor time-to-first-token (TTFT), token throughput, and cost-per-token alongside request latency. A service degrading from 200ms to 2s p99 can double token costs through retries and extended processing.
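A minimal sketch of that adaptation, again assuming prometheus_client and a streaming generation API (the generate iterator and metric names here are hypothetical):

```python
import time
from prometheus_client import Histogram, Counter

# Metric names and buckets are illustrative, not a standard.
TTFT_SECONDS = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request start to the first streamed token",
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)
TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Completion tokens produced")

def stream_completion(prompt: str, generate):
    """Wrap a hypothetical streaming generate(prompt) iterator with SLI instrumentation."""
    start = time.perf_counter()
    first_token_seen = False
    for token in generate(prompt):
        if not first_token_seen:
            TTFT_SECONDS.observe(time.perf_counter() - start)
            first_token_seen = True
        TOKENS_GENERATED.inc()
        yield token
```

TTFT and token throughput can then feed the same burn-rate machinery as request latency.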
Implementation Checklist:
✅ Instrument with histograms (Python/TypeScript examples provided)
✅ Calculate p50/p95/p99 from bucket data
✅ Define SLO targets based on user experience, not current performance
Avoid:
❌ Raw error rate alerts (noise during low/high traffic)
❌ Fixed duration clauses (don’t scale with severity)
❌ Over-achieving SLOs (creates hidden dependencies)
The TrackAI widget and code examples provided give you production-ready patterns for implementing these principles. Start with the burn rate calculator and latency histogram examples, then layer in the multi-window alerting strategy for comprehensive coverage.
Specific latency benchmarks for LLM inference (tokens/sec, p99) vary by hardware (GPU type, model size) and deployment configuration. Authoritative benchmarks were not available from approved sources.
Multi-burn-rate alerting effectiveness for token-generation latency (vs. request-based latency) requires empirical validation in production environments.
Vertex AI observability dashboard capabilities for self-hosted models need verification against your specific deployment pattern.