
Monitoring and Alerting on Latency: SLA Enforcement


A 99.9% SLO means your error budget is just 0.1%: for 60,480 requests over 7 days, that's only about 60 requests allowed to breach the threshold. But here's what most teams miss: alerting on raw error rates without burn rate logic will either miss incidents or drown you in false positives. This guide covers production-ready latency monitoring, SLA enforcement, and multi-burn-rate alerting that scales from low-traffic APIs to high-volume LLM inference pipelines.

Latency SLAs aren't just about keeping users happy; they're about managing risk and cost. For LLM services, latency directly impacts token consumption and inference costs. A service that degrades from a p99 of 200ms to 2s can trigger retries, doubling your token spend. For a service processing 1M requests/day at $0.001/request, a 24-hour incident at 2x token usage adds roughly $1,000 of avoidable spend on top of the normal $1,000/day.
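
The arithmetic behind that estimate, as a quick sketch:

requests_per_day = 1_000_000
cost_per_request = 0.001                                   # dollars
baseline_daily_cost = requests_per_day * cost_per_request  # $1,000/day under normal load
incident_daily_cost = baseline_daily_cost * 2              # retries roughly double token spend
print(f"avoidable spend: ${incident_daily_cost - baseline_daily_cost:,.0f}")  # $1,000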

Traditional monitoring fails for LLM services because:

  • Token generation is non-deterministic: Same prompt can produce different latencies
  • Context windows vary: 200K token contexts process differently than 8K
  • Cost scales with latency: Longer processing = more tokens = higher bills

The solution: SLIs that capture user experience, SLOs that define acceptable risk, and alerting that fires in proportion to how fast you are burning error budget.

An SLI is a measurable metric that reflects user experience. For latency, the gold standard is request-based percentile latency.

Key Latency SLIs:

  • p50 (Median): Typical user experience
  • p95: What 95% of users experience
  • p99: Worst-case for 1% of users (the tail)

Why Histograms? Histograms bucket latency values, enabling percentile calculations. You can’t calculate p99 from a simple counter or gauge—you need bucketed observations.
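
As a minimal sketch of why bucketed observations are needed (the bucket bounds and counts below are illustrative, not from a real service): a percentile is estimated by walking the cumulative bucket counts, which is essentially what Prometheus's histogram_quantile() does, with linear interpolation inside the matching bucket.

# Cumulative ("le"-style) buckets: upper bound in seconds -> observations at or below it.
buckets = [
    (0.100, 9200),   # 9,200 of 10,000 requests finished in <= 100ms
    (0.200, 9910),   # 9,910 in <= 200ms
    (0.500, 9990),
    (1.000, 10000),  # total observations
]

def estimate_quantile(q, buckets):
    """Return the upper bound of the first bucket that covers quantile q."""
    total = buckets[-1][1]
    target = q * total           # e.g. 0.99 * 10,000 = 9,900 observations
    for upper_bound, cumulative in buckets:
        if cumulative >= target:
            return upper_bound
    return float("inf")

print(estimate_quantile(0.99, buckets))   # 0.2 -> p99 is roughly 200ms
print(estimate_quantile(0.50, buckets))   # 0.1 -> p50 is at most 100ms

A plain counter only tells you how many requests happened, not how they were distributed, which is why it cannot yield a p99.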

An SLO is a target for your SLI. For a 99.9% latency SLO over 30 days:

  • Error budget = 0.1% of requests can exceed your latency threshold
  • Burn rate = actual error rate / SLO error rate

Example:

  • SLO: 99.9% of requests < 200ms p99
  • Error rate: 0.2% of requests > 200ms
  • Burn rate = 0.2% / 0.1% = 2x (burning budget 2x faster than SLO allows)
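
The same numbers as a quick calculation (a sketch, using the 30-day SLO window from above):

slo_target = 0.999                       # 99.9% of requests under the latency threshold
error_budget = 1 - slo_target            # 0.001 -> 0.1% of requests may exceed it
observed_error_rate = 0.002              # 0.2% of requests over the threshold

burn_rate = observed_error_rate / error_budget
print(f"burn rate: {burn_rate:.1f}x")                     # 2.0x

# At a constant burn rate, a 30-day budget lasts 30 / burn_rate days.
print(f"budget exhausted in ~{30 / burn_rate:.0f} days")  # ~15 days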

SLAs are external contracts with penalties. SLOs are internal targets. Never promise 100% uptime in an SLA—your error budget is your safety valve.

Setting Thresholds and Alerting Strategies

Traditional alerting: “Alert if error rate > 0.1% for 5 minutes.”
Problem: During low traffic, one failure triggers; during high traffic, you miss incidents.

Multi-burn-rate alerting (from Google SRE) uses burn rate to scale alert sensitivity:

Alert Severity | Budget Consumed | Time Window | Burn Rate (for 99.9% SLO)
PAGE | 2% | 1 hour | 14.4x
PAGE | 5% | 6 hours | 6.0x
TICKET | 10% | 3 days | 1.0x

How it works:

  • Fast burn (14.4x over 1 hour): Page immediately; at that rate a 30-day budget is gone in about 2 days
  • Medium burn (6x over 6 hours): Page; budget exhausts in about 5 days
  • Slow burn (1x over 3 days): Create a ticket; budget exhausts in 30 days

This prevents false positives while catching real incidents quickly.
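
A sketch of that logic in plain Python. error_rate() here is a hypothetical lookup standing in for a query against your metrics store (in Prometheus it would be a ratio built from the bucketed latency counters). Each long window is paired with a short confirmation window of roughly 1/12 its length, as in the Google SRE multiwindow approach, so alerts clear quickly once the incident ends.

ERROR_BUDGET = 0.001   # 99.9% SLO

# (severity, long window, short confirmation window, burn-rate threshold)
ALERT_RULES = [
    ("page",   "1h", "5m",  14.4),
    ("page",   "6h", "30m", 6.0),
    ("ticket", "3d", "6h",  1.0),
]

def error_rate(window):
    """Hypothetical: fraction of requests over the latency threshold during `window`."""
    raise NotImplementedError

def evaluate_alerts():
    firing = []
    for severity, long_window, short_window, burn in ALERT_RULES:
        threshold = burn * ERROR_BUDGET   # e.g. 14.4 x 0.1% = 1.44% error rate
        # Require both windows to exceed the threshold: the long window proves enough
        # budget has burned, the short window proves it is still burning right now.
        if error_rate(long_window) > threshold and error_rate(short_window) > threshold:
            firing.append((severity, long_window))
    return firing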

Common LLM Service Targets:

  • Chatbots: p99 < 2s (human conversation pace)
  • RAG pipelines: p99 < 5s (acceptable for knowledge retrieval)
  • Code generation: p99 < 10s (complex tasks tolerate longer waits)

Rule of thumb: Set your p99 SLO at roughly 2–4x your median (p50) latency. If p50 is 500ms, a p99 SLO of 1–2s is realistic.

Low-traffic services (< 1,000 requests/day): Single failures can consume budget. Solutions:

  • Generate synthetic traffic for monitoring
  • Use client-side retries to smooth spikes
  • Implement minimum request count thresholds in alerts

High-traffic services (> 100K requests/day): Tail latency becomes critical. Focus on p99 and p99.9.
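
For the low-traffic case above, a minimum-request-count guard might look like the following sketch; request_count() and error_rate() are hypothetical lookups against your metrics store.

MIN_REQUESTS = 100   # below this, the observed error rate is too noisy to act on

def request_count(window):
    """Hypothetical: total requests observed during `window`."""
    raise NotImplementedError

def error_rate(window):
    """Hypothetical: fraction of requests over the latency threshold during `window`."""
    raise NotImplementedError

def should_alert(window, burn_threshold, error_budget=0.001):
    if request_count(window) < MIN_REQUESTS:
        return False   # a single slow request shouldn't page anyone
    return error_rate(window) > burn_threshold * error_budget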

  1. Instrument your service with histograms

    Choose buckets aligned to your SLO. If p99 SLO is 200ms, ensure buckets capture values around 200ms (e.g., 100ms, 150ms, 200ms, 250ms, 500ms).

  2. Calculate SLIs from histogram data

    Use histogram_quantile() in Prometheus or equivalent functions in your metrics store to extract p50, p95, p99 from buckets.

  3. Define burn rate alerting rules

    Create alerts that trigger based on budget consumption over multiple windows, not raw error rates.

  4. Test alerting with simulated incidents

    Inject controlled latency spikes or errors to verify alerts fire correctly without false positives.

  5. Review and adjust quarterly

    SLOs should reflect user needs, not current performance. If you’re over-achieving, lower the SLO to free error budget for innovation.

import time
import random
from prometheus_client import Histogram, start_http_server, CollectorRegistry

# Define a histogram with buckets aligned to the SLO target.
# For a p99 SLO of 200ms, include buckets around that threshold.
registry = CollectorRegistry()
latency_histogram = Histogram(
    'request_latency_seconds',
    'Latency of HTTP requests',
    buckets=[.05, .1, .15, .2, .25, .5, 1.0, 2.5],
    registry=registry,
)

# Start the metrics endpoint
start_http_server(8000, registry=registry)
print("Metrics available at http://localhost:8000/metrics")

# Simulate realistic request patterns
for i in range(1000):
    start = time.time()
    # An exponential distribution (mean 100ms here) mimics real-world latency
    processing_time = random.expovariate(10)
    time.sleep(processing_time)
    # Record the observation
    latency_histogram.observe(time.time() - start)
    if i % 100 == 0:
        print(f"Processed {i} requests")

# Calculate p99 using PromQL:
# histogram_quantile(0.99, rate(request_latency_seconds_bucket[5m]))
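
For step 4 above (testing alerts with simulated incidents), the loop can be extended with a controlled spike; this sketch reuses the latency_histogram defined above and assumes the 200ms p99 SLO.

# Inject a controlled latency spike so p99 clearly crosses the 200ms threshold.
SPIKE_REQUESTS = 200
EXTRA_LATENCY = 0.3   # seconds added to every request during the spike

for _ in range(SPIKE_REQUESTS):
    start = time.time()
    time.sleep(random.expovariate(10) + EXTRA_LATENCY)
    latency_histogram.observe(time.time() - start)

# While the spike runs, verify end to end that
#   histogram_quantile(0.99, rate(request_latency_seconds_bucket[5m]))
# rises above 0.2 and that the fast-burn (1h/5m) rule pages within a few minutes.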

For quick reference, here are the two common ways to define a latency SLI, the burn-rate alert thresholds, and typical latency targets:

Request-based p99 SLI:

SLI = (requests with latency < threshold) / (total requests)

Window-based p99 SLI:

SLI = (time windows where p99 < threshold) / (total windows)

Alert Type | Budget | Window | Burn Rate | Error Rate Threshold
PAGE | 2% | 1 hour | 14.4x | 14.4 × 0.1% = 1.44%
PAGE | 5% | 6 hours | 6.0x | 6.0 × 0.1% = 0.6%
TICKET | 10% | 3 days | 1.0x | 1.0 × 0.1% = 0.1%

Service | p50 Target | p99 Target | Use Case
Chatbots | ≤ 500ms | ≤ 2s | Human conversation pace
RAG Pipelines | ≤ 1s | ≤ 5s | Knowledge retrieval
Code Generation | ≤ 2s | ≤ 10s | Complex tasks tolerate longer waits

Choose histogram buckets that align with your SLO thresholds. For a p99 SLO of 200ms, use buckets like: 50ms, 100ms, 150ms, 200ms, 250ms, 500ms, 1000ms. This ensures accurate percentile calculations around your critical threshold.


Effective latency SLA enforcement requires three interconnected components: proper SLI instrumentation, realistic SLO targets, and burn-rate-based alerting.

Core Principles:

  1. Use percentiles, not averages - p99 latency reveals user impact that averages hide
  2. Set SLOs below 100% - Your error budget is your safety valve for deployments and innovation
  3. Alert on burn rate, not raw errors - Multi-window, multi-burn-rate alerting scales with traffic and incident severity
  4. Align buckets to SLOs - Histogram buckets must capture values around your latency thresholds

For LLM Services: Traditional latency monitoring must be adapted for token-based workloads. Monitor time-to-first-token (TTFT), token throughput, and cost-per-token alongside request latency. A service degrading from 200ms to 2s p99 can double token costs through retries and extended processing.
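
Time-to-first-token can be tracked with the same histogram machinery; in this sketch, stream_completion is a hypothetical streaming client supplied by the caller that yields tokens as they arrive.

import time
from prometheus_client import Histogram

# Separate histograms: chat UX cares about time-to-first-token, batch pipelines
# care about total generation time.
ttft_histogram = Histogram(
    'llm_time_to_first_token_seconds',
    'Time from request start to first streamed token',
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)
total_histogram = Histogram(
    'llm_generation_seconds',
    'Total time to stream the full response',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

def timed_generation(prompt, stream_completion):
    """stream_completion: hypothetical callable yielding response tokens as they arrive."""
    start = time.time()
    first_token_seen = False
    tokens = []
    for token in stream_completion(prompt):
        if not first_token_seen:
            ttft_histogram.observe(time.time() - start)
            first_token_seen = True
        tokens.append(token)
    total_histogram.observe(time.time() - start)
    return "".join(tokens)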

Implementation Checklist:

  • ✅ Instrument with histograms (Python/TypeScript examples provided)
  • ✅ Calculate p50/p95/p99 from bucket data
  • ✅ Define SLO targets based on user experience, not current performance
  • ✅ Configure multi-burn-rate alerts (1h, 6h, 72h windows)
  • ✅ Test alerts with simulated incidents
  • ✅ Review and adjust quarterly

Common Pitfalls to Avoid:

  • 100% SLO targets (no error budget)
  • Averages for latency (hides tail)
  • Raw error rate alerts (noise during low/high traffic)
  • Duration clauses (don’t scale with severity)
  • Over-achieving SLOs (creates hidden dependencies)

The code examples above give you production-ready patterns for implementing these principles. Start with the burn rate calculation and latency histogram examples, then layer in the multi-window alerting strategy for comprehensive coverage.


Pricing Context (Verified): When calculating cost impact of latency incidents, reference current LLM pricing:

  • Claude 3.5 Sonnet: $3.00/$15.00 per 1M tokens (input/output) - Anthropic
  • GPT-4o: $5.00/$15.00 per 1M tokens - OpenAI
  • GPT-4o-mini: $0.15/$0.60 per 1M tokens - OpenAI
  • Haiku 3.5: $1.25/$5.00 per 1M tokens - Anthropic

Validation Required:

  • Specific latency benchmarks for LLM inference (tokens/sec, p99) vary by hardware (GPU type, model size) and deployment configuration. Authoritative benchmarks were not available from approved sources.
  • Multi-burn-rate alerting effectiveness for token-generation latency (vs. request-based latency) requires empirical validation in production environments.
  • Vertex AI observability dashboard capabilities for self-hosted models need verification against your specific deployment pattern.