A 99.9% SLO means your error budget is just 0.1%, which for 60,480 requests over 7 days is only about 60 failed requests. But here’s what most teams miss: alerting on raw error rates without burn rate logic will either miss incidents or drown you in false positives. This guide covers production-ready latency monitoring, SLA enforcement, and multi-burn-rate alerting that scales from low-traffic APIs to high-volume LLM inference pipelines.
Latency SLAs aren’t just about keeping users happy; they’re about managing risk and cost. For LLM services, latency directly impacts token consumption and inference costs. A service that degrades from p99 200ms to 2s might trigger retries, doubling your token spend. For a service processing 1M requests/day at $0.001/request, a 24-hour incident at 2x token usage doubles the daily bill from $1,000 to $2,000, roughly $1,000 of it avoidable spend.
Traditional monitoring fails for LLM services because:
Token generation is non-deterministic: the same prompt can produce different latencies
Context windows vary: a 200K-token context processes very differently from an 8K-token one
Cost scales with latency: Longer processing = more tokens = higher bills
The solution: SLIs that capture user experience, SLOs that define acceptable risk, and alerting that fires in proportion to how fast the error budget is burning.
An SLI is a measurable metric that reflects user experience. For latency, the gold standard is request-based percentile latency.
Key Latency SLIs:
p50 (Median): Typical user experience
p95: The latency 95% of requests complete within
p99: The latency the slowest 1% of requests exceed (the tail)
Why Histograms?
Histograms bucket latency values, enabling percentile calculations. You can’t calculate p99 from a simple counter or gauge—you need bucketed observations.
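As a concrete sketch (assuming the prometheus_client library; adapt it to whatever metrics stack you already run), here is how request latency can be recorded into a histogram whose buckets cluster around a hypothetical 200ms SLO threshold:

```python
# Minimal sketch using prometheus_client; bucket boundaries assume a 200ms p99 SLO.
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["route", "method"],
    # Dense buckets around the 200ms threshold, sparser toward the tail.
    buckets=[0.05, 0.10, 0.15, 0.20, 0.25, 0.5, 1.0, 2.5, 5.0],
)

def handle_request(route: str, method: str) -> None:
    start = time.perf_counter()
    try:
        ...  # actual request handling goes here
    finally:
        REQUEST_LATENCY.labels(route=route, method=method).observe(
            time.perf_counter() - start
        )

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```

Each bucket is exposed as a cumulative counter, which is exactly what percentile estimation and burn-rate queries operate on.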
Traditional alerting: “Alert if error rate > 0.1% for 5 minutes.”
Problem: During low traffic, a single failure can trip the alert; during a real incident, the fixed threshold and duration tell you nothing about how fast your error budget is actually burning, so brief blips and severe outages look identical.
Multi-burn-rate alerting (from Google SRE) uses burn rate to scale alert sensitivity:
| Alert Severity | Budget Consumed | Time Window | Burn Rate (for 99.9% SLO) |
|----------------|-----------------|-------------|---------------------------|
| PAGE           | 2%              | 1 hour      | 14.4x                     |
| PAGE           | 5%              | 6 hours     | 6.0x                      |
| TICKET         | 10%             | 3 days      | 1.0x                      |
How it works:
Fast burn (14.4x over 1 hour): Page immediately; 2% of the budget disappears in an hour, and at this rate the full 30-day budget is gone in about two days
Medium burn (6x over 6 hours): Page; the budget exhausts in roughly five days if the burn continues
Slow burn (1x over 3 days): Create a ticket; the budget exhausts right at the end of the 30-day window
This prevents false positives while catching real incidents quickly.
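The arithmetic behind those thresholds is worth making explicit. The sketch below assumes a 30-day SLO window; the Window class, thresholds, and evaluate helper are illustrative, not a library API, and it is a single-window simplification (the full recipe also pairs each long window with a shorter companion window so alerts reset promptly):

```python
from dataclasses import dataclass
from typing import Optional

SLO_TARGET = 0.999               # 99.9% latency SLO
ERROR_BUDGET = 1 - SLO_TARGET    # 0.1% of requests may breach the SLO
SLO_PERIOD_HOURS = 30 * 24       # 30-day budget window

def burn_rate(bad_events: int, total_events: int) -> float:
    """How fast the budget is being spent relative to the allowed rate (1.0 = on pace)."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / ERROR_BUDGET

def hours_to_exhaustion(rate: float) -> float:
    """Time until the whole budget is gone if the current burn rate continues."""
    return float("inf") if rate <= 0 else SLO_PERIOD_HOURS / rate

@dataclass
class Window:
    name: str
    burn_threshold: float
    severity: str

# The policy from the table above.
WINDOWS = [
    Window("1h", 14.4, "PAGE"),
    Window("6h", 6.0, "PAGE"),
    Window("3d", 1.0, "TICKET"),
]

def evaluate(window: Window, bad: int, total: int) -> Optional[str]:
    rate = burn_rate(bad, total)
    if rate >= window.burn_threshold:
        return (f"{window.severity}: {window.name} window burning at {rate:.1f}x, "
                f"budget gone in ~{hours_to_exhaustion(rate):.0f}h")
    return None

# 2% of the last hour's 10,000 requests breached the SLO: 20x burn, page now.
print(evaluate(WINDOWS[0], bad=200, total=10_000))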
Choose histogram buckets that align with your SLO thresholds. For a p99 SLO of 200ms, use buckets like: 50ms, 100ms, 150ms, 200ms, 250ms, 500ms, 1000ms. This ensures accurate percentile calculations around your critical threshold.
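To see why resolution near the threshold matters, here is a simplified sketch of percentile estimation from cumulative bucket counts, using the same linear-interpolation idea as Prometheus's histogram_quantile (the counts below are made up, and +Inf bucket handling is glossed over):

```python
def estimate_quantile(q: float, bounds: list[float], cum_counts: list[int]) -> float:
    """Estimate the q-th quantile (0 < q < 1) from cumulative histogram buckets.

    bounds:     bucket upper bounds in seconds, ascending (the +Inf bucket is implied).
    cum_counts: observations <= each bound; the final entry is treated as the total.
    """
    rank = q * cum_counts[-1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cum_counts):
        if count >= rank:
            # Interpolate linearly inside the bucket where the rank falls.
            fraction = (rank - prev_count) / max(count - prev_count, 1)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return bounds[-1]  # rank falls in the +Inf bucket: report the largest finite bound

# Buckets aligned to a 200ms SLO: note how much resolution sits near the threshold.
bounds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.50, 1.00]
cum_counts = [4000, 7000, 8600, 9400, 9700, 9950, 10000]
print(f"p99 ~ {estimate_quantile(0.99, bounds, cum_counts) * 1000:.0f} ms")
```

If the buckets jumped straight from 100ms to 500ms, every estimate near the 200ms boundary would be smeared across that 400ms gap.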
Effective latency SLA enforcement requires three interconnected components: proper SLI instrumentation, realistic SLO targets, and burn-rate-based alerting.
Core Principles:
Use percentiles, not averages - p99 latency reveals user impact that averages hide
Set SLOs below 100% - Your error budget is your safety valve for deployments and innovation
Alert on burn rate, not raw errors - Multi-window, multi-burn-rate alerting scales with traffic and incident severity
Align buckets to SLOs - Histogram buckets must capture values around your latency thresholds
For LLM Services:
Traditional latency monitoring must be adapted for token-based workloads. Monitor time-to-first-token (TTFT), token throughput, and cost-per-token alongside request latency. A service degrading from 200ms to 2s p99 can double token costs through retries and extended processing.
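A minimal sketch of that adaptation, again assuming prometheus_client and a streaming generation API (the generate iterator and metric names here are hypothetical):

```python
import time
from prometheus_client import Histogram, Counter

# Metric names and buckets are illustrative, not a standard.
TTFT_SECONDS = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request start to the first streamed token",
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)
TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Completion tokens produced")

def stream_completion(prompt: str, generate):
    """Wrap a hypothetical streaming generate(prompt) iterator with SLI instrumentation."""
    start = time.perf_counter()
    first_token_seen = False
    for token in generate(prompt):
        if not first_token_seen:
            TTFT_SECONDS.observe(time.perf_counter() - start)
            first_token_seen = True
        TOKENS_GENERATED.inc()
        yield token
```

TTFT and token throughput can then feed the same burn-rate machinery as request latency.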
Implementation Checklist:
✅ Instrument with histograms (Python/TypeScript examples provided)
✅ Calculate p50/p95/p99 from bucket data
✅ Define SLO targets based on user experience, not current performance
Avoid:
❌ Raw error rate alerts (noise during low/high traffic)
❌ Fixed duration clauses (don’t scale with severity)
❌ Over-achieving SLOs (creates hidden dependencies)
The TrackAI widget and code examples provided give you production-ready patterns for implementing these principles. Start with the burn rate calculator and latency histogram examples, then layer in the multi-window alerting strategy for comprehensive coverage.
Specific latency benchmarks for LLM inference (tokens/sec, p99) vary by hardware (GPU type, model size) and deployment configuration. Authoritative benchmarks were not available from approved sources.
Multi-burn-rate alerting effectiveness for token-generation latency (vs. request-based latency) requires empirical validation in production environments.
Vertex AI observability dashboard capabilities for self-hosted models need verification against your specific deployment pattern.