
Monitoring Token Spend in Real-Time: Building Cost Observability


A single unmonitored production feature can burn through $10,000 in a weekend. One engineering team discovered this when their new “smart reply” feature—launched on Friday—generated 4.2 million output tokens by Monday morning. Without real-time monitoring, they had no warning until the invoice arrived. This guide will show you how to build cost observability that catches these surprises before they become budget disasters.

Traditional monitoring focuses on latency, uptime, and error rates—metrics that impact user experience. But in the age of LLMs, cost per token is equally critical. A 50ms latency improvement means nothing if it costs you $50,000 more per month.

The challenge is that token costs are invisible until billed. Unlike database queries where you can see row counts and execution plans, LLM calls are black boxes. You get a response, but the token burn happens behind the API curtain. This invisibility creates several risks:

  1. Feature creep: A prompt that starts at 500 tokens can balloon to 2,000 tokens as engineers add context, examples, and instructions
  2. User behavior spikes: A viral feature or bot traffic can multiply your volume 10x overnight
  3. Model upgrades: Switching from GPT-4o-mini to GPT-4o raises input-token costs roughly 33x (and output-token costs 25x) for the same request volume
  4. Retry storms: Poor error handling can multiply costs by 3-5x through redundant calls

Real-time observability transforms token spend from an accounting surprise into a manageable engineering metric. You can attribute costs to features, detect anomalies instantly, and make informed tradeoffs between capability and cost.

Building effective monitoring requires instrumenting four key layers of your LLM stack: per-call tracking, feature-level aggregation, anomaly detection, and alerting.

The first layer is per-call tracking: every LLM request must be wrapped with instrumentation. This is your ground truth—the raw data that feeds all aggregation and analysis.

What to capture per call (a minimal record sketch follows the list):

  • Provider (OpenAI, Anthropic, etc.)
  • Model used
  • Input token count (prompt + context)
  • Output token count (completion)
  • Cache hit/miss status (if using prompt caching)
  • Request timestamp
  • Cost in USD
  • Feature/endpoint identifier
  • User/team ID (for chargeback)
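
A minimal sketch of such a record as a Python dataclass. The field names here are illustrative, not a required schema; the full tracking wrapper appears later in the Implementation Guides.

```python
from dataclasses import dataclass, field
from typing import Optional
import time
import uuid


@dataclass
class LLMCallRecord:
    """One row of ground-truth telemetry per LLM call (illustrative schema)."""
    provider: str                    # "openai", "anthropic", ...
    model: str
    input_tokens: int                # prompt + context
    output_tokens: int               # completion
    cost_usd: float
    feature: str                     # feature/endpoint identifier for attribution
    user_id: Optional[str] = None    # for chargeback
    cache_hit: bool = False
    timestamp: float = field(default_factory=time.time)
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```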

The second layer is feature-level aggregation. Raw API logs are too granular for business decisions; you need to aggregate by feature or product surface area to understand which capabilities are driving spend.

Aggregation dimensions:

  • Feature name (e.g., “summarization”, “code-review”, “chat-assistant”)
  • Endpoint or route (e.g., /api/v1/chat, /api/v1/summarize)
  • Environment (dev, staging, prod)
  • Model family (GPT-4, Claude 3.5, etc.)

This allows you to answer questions like: “Is our new RAG feature costing us more than it’s worth?” or “Which customer segment is burning the most tokens?”
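
As a minimal sketch, assuming per-call records shaped like the dataclass above, a feature/model rollup could be done in memory like this; a production pipeline would run the same rollup in a warehouse (see the SQL queries later in this guide).

```python
from collections import defaultdict


def aggregate_costs(records: list[dict]) -> dict[tuple[str, str], dict]:
    """Roll per-call telemetry up to (feature, model) totals."""
    totals: dict[tuple[str, str], dict] = defaultdict(lambda: {"cost_usd": 0.0, "calls": 0})
    for record in records:
        key = (record["feature"], record["model"])
        totals[key]["cost_usd"] += record["cost_usd"]
        totals[key]["calls"] += 1
    return dict(totals)
```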

The third layer is anomaly detection. Static dashboards are reactive; you need proactive anomaly detection that alerts you when spending deviates from expected patterns.

Key anomaly types (a model-drift sketch follows the list):

  • Volume spikes: 2x increase in requests within an hour
  • Cost per request increase: Average cost per call jumps 50%+
  • Model drift: Unexpected model usage (e.g., production falling back to expensive models)
  • Cache efficiency drops: Cache hit rate drops below expected threshold
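
Volume, cost-per-request, and cache-efficiency checks appear as code in the Implementation Guides later. Model drift is simpler still; a minimal sketch, assuming a per-environment allowlist of expected models (the allowlist values here are hypothetical):

```python
# Hypothetical allowlist of expected models per environment.
ALLOWED_MODELS = {
    "prod": {"gpt-4o-mini", "claude-3-5-sonnet"},
    "staging": {"gpt-4o-mini", "gpt-4o", "claude-3-5-sonnet"},
}


def detect_model_drift(environment: str, model: str) -> bool:
    """Return True if a call used a model we do not expect in this environment."""
    return model not in ALLOWED_MODELS.get(environment, set())
```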

The fourth layer is alerting. Detection without action is useless; alerts must be:

  • Timely: Delivered within minutes, not hours
  • Actionable: Include enough context to investigate immediately
  • Escalating: Different thresholds for different severity levels (see the routing sketch below)
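
A minimal sketch of severity-based routing, assuming `send_slack_alert` and `trigger_pagerduty` helpers like the ones shown later in this guide:

```python
# Hypothetical severity-to-channel mapping: warnings go to Slack,
# critical alerts also page on-call via PagerDuty.
ALERT_ROUTES = {
    "warning": ["slack"],
    "critical": ["slack", "pagerduty"],
}


def route_alert(severity: str, message: str, senders: dict) -> None:
    """Dispatch the message to every channel configured for its severity."""
    for channel in ALERT_ROUTES.get(severity, ["slack"]):
        senders[channel](message)  # senders maps channel name -> callable

```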
With those four layers in place, a practical rollout looks like this:

  1. Instrument your LLM calls with a wrapper that captures token counts and costs
  2. Stream data to your observability platform (e.g., Datadog, CloudWatch, or a custom data warehouse)
  3. Define cost baselines by aggregating historical data by feature and model
  4. Set up anomaly detection rules with thresholds for volume, cost-per-request, and cache efficiency
  5. Configure alerting channels (Slack, PagerDuty, email) with severity-based routing
  6. Build dashboards that visualize spend trends and top cost drivers
  7. Create runbooks for common alert scenarios

The difference between a manageable AI feature and a budget disaster often comes down to visibility. When you can see token burn in real-time, you shift from reactive cost accounting to proactive cost engineering.

Consider the economics: a single gpt-4o request processing 1,000 input tokens and generating 500 output tokens costs approximately $0.0125. Scale that to 100,000 requests per day and you’re spending $1,250 daily, or about $37,500 per month. A 2x spike in output tokens (to 1,000) raises the per-request cost to $0.02, or $60,000 per month. Without monitoring, you only discover this when the invoice arrives.
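
A small helper makes the same projection explicit, using the per-1M-token rates from the pricing table below:

```python
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 input_rate_per_m: float, output_rate_per_m: float, days: int = 30) -> float:
    """Project monthly spend from per-request token counts and per-1M-token rates."""
    per_request = (input_tokens / 1_000_000) * input_rate_per_m \
        + (output_tokens / 1_000_000) * output_rate_per_m
    return per_request * requests_per_day * days


# gpt-4o at $5 input / $15 output per 1M tokens, 100,000 requests per day
print(monthly_cost(100_000, 1_000, 500, 5.00, 15.00))    # ~37,500
print(monthly_cost(100_000, 1_000, 1_000, 5.00, 15.00))  # ~60,000
```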

Real-time monitoring enables three critical capabilities:

1. Immediate Cost Attribution

When your “smart reply” feature starts burning through tokens, you need to know which feature, which team, and which prompt version is responsible. This requires tracking at the API level with feature tags, not just provider invoices.

2. Automated Anomaly Detection

A sudden 3x increase in token usage at 2 AM on Saturday should trigger an alert before Monday’s standup. Effective monitoring compares current spend against historical baselines and flags deviations immediately.

3. Informed Model Selection

The pricing gap between models is dramatic: gpt-4o-mini input tokens cost roughly 1/33rd as much as gpt-4o’s ($0.15 vs $5 per 1M). Real-time cost tracking lets you validate whether the quality improvement justifies the expense for each use case.

Even well-intentioned teams make mistakes that undermine cost observability:

Pitfall 1: Relying on Provider Billing Dashboards

Provider dashboards show total spend but lack granularity. You can’t see which feature drove cost, which user triggered it, or whether it was a retry storm. By the time you see the spike, the damage is done.

Pitfall 2: Sampling Instead of Full Instrumentation

Some teams log only 1% of requests to “save on logging costs.” This cripples anomaly detection and attribution: short-lived or feature-specific spikes can hide in the unsampled 99%, and you lose the per-request detail needed to trace them. Every LLM call must be tracked.

Pitfall 3: Ignoring Input Token Costs

Output tokens get attention because they’re visible in responses. But input tokens—especially with large context windows or RAG systems—can dominate costs. A 10,000-token system prompt multiplied across thousands of requests adds up fast.
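
A quick back-of-the-envelope check makes the point; the daily request volume here is hypothetical:

```python
# Cost of a 10,000-token system prompt repeated on every request,
# at gpt-4o's $5 per 1M input tokens. The daily volume is hypothetical.
system_prompt_tokens = 10_000
input_rate_per_m = 5.00
requests_per_day = 50_000

daily_prompt_cost = (system_prompt_tokens / 1_000_000) * input_rate_per_m * requests_per_day
print(daily_prompt_cost)  # $2,500 per day on the system prompt alone, before any output tokens
```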

Pitfall 4: Static Thresholds

Setting a fixed daily budget alert ($1,000/day) ignores natural traffic patterns. Tuesday might be 3x higher than Sunday. Effective monitoring uses dynamic baselines that account for time-of-day and day-of-week patterns.
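
A minimal sketch of an hour-of-week baseline, assuming you can pull historical (timestamp, hourly cost) pairs from your telemetry store:

```python
from collections import defaultdict
from datetime import datetime


def hour_of_week(ts: datetime) -> int:
    """Bucket a timestamp into one of 168 hour-of-week slots (Monday 00:00 = 0)."""
    return ts.weekday() * 24 + ts.hour


def build_baselines(history: list[tuple[datetime, float]]) -> dict[int, float]:
    """Average historical hourly spend per hour-of-week bucket."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, cost in history:
        buckets[hour_of_week(ts)].append(cost)
    return {slot: sum(values) / len(values) for slot, values in buckets.items()}


def is_anomalous(now: datetime, current_cost: float, baselines: dict[int, float],
                 threshold: float = 1.5) -> bool:
    """Flag spend that exceeds the matching hour-of-week baseline by the threshold."""
    baseline = baselines.get(hour_of_week(now))
    return baseline is not None and current_cost > baseline * threshold
```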

Pitfall 5: No Cache Hit Tracking

Prompt caching can reduce costs by 50-90%, but only if you measure it. Teams that don’t track cache hit rates can’t optimize their cache strategy or verify they’re getting the expected savings.
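
A minimal sketch of hit-rate tracking over a sliding window of recent calls:

```python
from collections import deque


class CacheHitRateTracker:
    """Sliding-window cache hit rate over the most recent calls."""

    def __init__(self, window: int = 1000):
        self.events: deque[bool] = deque(maxlen=window)  # True = hit, False = miss

    def record(self, cache_hit: bool) -> None:
        self.events.append(cache_hit)

    def hit_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0
```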

For reference, the pricing assumed throughout this guide:

| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Context Window |
| --- | --- | --- | --- |
| OpenAI gpt-4o | $5.00 | $15.00 | 128,000 |
| OpenAI gpt-4o-mini | $0.15 | $0.60 | 128,000 |
| Anthropic claude-3-5-sonnet | $3.00 | $15.00 | 200,000 |
| Anthropic haiku-3.5 | $1.25 | $5.00 | 200,000 |

Source: OpenAI Pricing, Anthropic Models

Cost per Request = (Input Tokens × Input Rate) + (Output Tokens × Output Rate)
Where rates are per-token (divide per-1M rates by 1,000,000)

Example: 1,000 input tokens + 500 output tokens with gpt-4o:

(1,000 × $5/1M) + (500 × $15/1M) = $0.005 + $0.0075 = $0.0125 per request

Suggested alert thresholds:

| Metric | Warning Threshold | Critical Threshold | Action |
| --- | --- | --- | --- |
| Volume Spike | 150% of baseline | 250% of baseline | Investigate immediately |
| Cost per Request | +50% vs average | +100% vs average | Check prompt changes |
| Cache Hit Rate | Less than 60% | Less than 40% | Review cache strategy |
| Model Drift | Any unexpected model usage | Fallback to expensive model | Audit routing logic |
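
These thresholds can be encoded directly as alert rules; a minimal sketch, with values copied from the table above:

```python
# Alert rules encoding the threshold table above.
THRESHOLDS = {
    "volume_spike":     {"warning": 1.50, "critical": 2.50},  # multiple of baseline
    "cost_per_request": {"warning": 1.50, "critical": 2.00},  # multiple of historical average
    "cache_hit_rate":   {"warning": 0.60, "critical": 0.40},  # alert when rate falls below
}


def classify_volume_spike(current: float, baseline: float) -> str | None:
    """Return 'critical', 'warning', or None for the current volume vs its baseline."""
    if baseline <= 0:
        return None
    ratio = current / baseline
    if ratio >= THRESHOLDS["volume_spike"]["critical"]:
        return "critical"
    if ratio >= THRESHOLDS["volume_spike"]["warning"]:
        return "warning"
    return None
```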


Effective cost observability transforms token spend from an unpredictable expense into a controlled engineering metric. The four-layer approach (instrumentation, aggregation, anomaly detection, and alerting) provides the visibility needed to prevent budget disasters and optimize spend.

Key takeaways:

  • Instrument every call: Missing data cannot be reconstructed
  • Track input and output: Both contribute significantly to costs
  • Use dynamic baselines: Static thresholds miss pattern-based anomalies
  • Attribute to features: Know which capabilities drive spend
  • Alert immediately: Hours matter when costs are compounding

The pricing data shows dramatic differences between models: gpt-4o-mini input tokens cost roughly 1/33rd as much as gpt-4o’s. Real-time monitoring validates whether premium models deliver proportional value for each use case.

Start with basic instrumentation today. You can’t optimize what you can’t measure, and in the world of LLMs, measurement must happen in real-time, not after the invoice arrives.

Implementation Guides:

The following examples show how to instrument LLM calls with cost tracking. This wrapper captures all necessary metrics and streams them to an observability backend.

```python
import time
import uuid
from typing import Dict, Any, Optional

import requests


class LLMCostTracker:
    """
    Tracks token usage and costs for LLM API calls.
    Streams data to observability backend.
    """

    def __init__(self, observability_endpoint: str, api_key: str):
        self.endpoint = observability_endpoint
        self.api_key = api_key
        self.session = requests.Session()

    def track_call(
        self,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        feature: str,
        user_id: Optional[str] = None,
        cache_hit: bool = False,
        cache_tokens: int = 0,
    ) -> Dict[str, Any]:
        """
        Track a single LLM call with cost calculation.
        """
        # Pricing per 1M tokens (as of Dec 2025)
        PRICING = {
            "openai": {
                "gpt-4o": {"input": 5.00, "output": 15.00},
                "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            },
            "anthropic": {
                "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
                "haiku-3.5": {"input": 1.25, "output": 5.00},
            },
        }

        # Calculate costs
        provider_pricing = PRICING.get(provider, {})
        model_pricing = provider_pricing.get(model, {"input": 0.0, "output": 0.0})
        input_cost = (input_tokens / 1_000_000) * model_pricing["input"]
        output_cost = (output_tokens / 1_000_000) * model_pricing["output"]

        # Cache savings calculation
        cache_savings = 0
        if cache_hit and cache_tokens > 0:
            cache_savings = (cache_tokens / 1_000_000) * model_pricing["input"]

        total_cost = input_cost + output_cost - cache_savings

        # Build telemetry payload
        telemetry_data = {
            "trace_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "provider": provider,
            "model": model,
            "feature": feature,
            "user_id": user_id,
            "tokens": {
                "input": input_tokens,
                "output": output_tokens,
                "cache_hit": cache_hit,
                "cache_tokens": cache_tokens,
            },
            "costs": {
                "input_usd": round(input_cost, 6),
                "output_usd": round(output_cost, 6),
                "cache_savings_usd": round(cache_savings, 6),
                "total_usd": round(total_cost, 6),
            },
            "metadata": {
                "environment": "production",
                "version": "1.0.0",
            },
        }

        # Stream to observability backend (async in production)
        self._send_to_observability(telemetry_data)

        return telemetry_data

    def _send_to_observability(self, data: Dict[str, Any]) -> None:
        """
        Send telemetry data to observability platform.
        In production, use async queue or batch processing.
        """
        try:
            response = self.session.post(
                f"{self.endpoint}/metrics/llm-costs",
                json=data,
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=2,
            )
            if response.status_code != 200:
                print(f"Warning: Failed to send telemetry: {response.status_code}")
        except Exception as e:
            # Fail silently in production - don't break the app
            print(f"Telemetry error: {e}")


# Usage example
if __name__ == "__main__":
    tracker = LLMCostTracker(
        observability_endpoint="https://observability.example.com",
        api_key="your-api-key",
    )

    # Simulate an LLM call
    result = tracker.track_call(
        provider="openai",
        model="gpt-4o-mini",
        input_tokens=1500,
        output_tokens=450,
        feature="smart-reply",
        user_id="user_12345",
        cache_hit=False,
    )
    print(f"Tracked call: ${result['costs']['total_usd']:.6f}")
```
To stream these metrics to Datadog:

```python
import time

from datadog import initialize, api


class DatadogCostReporter:
    def __init__(self, api_key: str, app_key: str):
        initialize(api_key=api_key, app_key=app_key)

    def report_cost(self, feature: str, cost_usd: float, tokens: int):
        """Send cost metrics to Datadog"""
        api.Metric.send([
            {
                "metric": "llm.cost_usd",
                "points": [(time.time(), cost_usd)],
                "tags": [f"feature:{feature}", "env:production"],
            },
            {
                "metric": "llm.tokens",
                "points": [(time.time(), tokens)],
                "tags": [f"feature:{feature}", "env:production"],
            },
        ])
```
Or to CloudWatch Logs:

```python
import json
import time

import boto3


class CloudWatchCostReporter:
    def __init__(self, region: str = "us-east-1"):
        self.logs = boto3.client("logs", region_name=region)
        self.log_group = "/aws/llm/cost-tracking"

    def log_call(self, data: dict):
        """Stream cost data to CloudWatch Logs"""
        self.logs.put_log_events(
            logGroupName=self.log_group,
            logStreamName="llm-metrics",
            logEvents=[{
                "timestamp": int(time.time() * 1000),
                "message": json.dumps(data),
            }],
        )
```
Detection rules can start simple. The `send_alert` helper used in the examples below is assumed to be provided by your alerting layer (see the Slack and PagerDuty snippets that follow).

Volume spikes:

```python
def detect_volume_spike(current_volume: int, baseline: int, threshold: float = 1.5) -> bool:
    """
    Detect if current token volume exceeds baseline by threshold.
    Returns True if spike detected.
    """
    return current_volume > (baseline * threshold)


# Example usage
baseline_tokens = 50000   # Historical average per hour
current_tokens = 125000   # Current hour

if detect_volume_spike(current_tokens, baseline_tokens):
    send_alert("Volume spike detected",
               f"Token usage at {current_tokens} vs baseline {baseline_tokens}")
```

Cost-per-request anomalies (z-score):

```python
def detect_cost_anomaly(
    current_avg_cost: float,
    historical_avg: float,
    std_dev: float,
    sensitivity: float = 2.0,
) -> bool:
    """
    Detect if average cost per request is anomalous.
    Uses z-score approach.
    """
    if std_dev == 0:
        return False
    z_score = (current_avg_cost - historical_avg) / std_dev
    return abs(z_score) > sensitivity


# Example
historical_avg = 0.0125  # $0.0125 per request
current_avg = 0.0250     # $0.0250 per request
std_dev = 0.002

if detect_cost_anomaly(current_avg, historical_avg, std_dev):
    send_alert("Cost anomaly", f"Average cost doubled: ${current_avg:.4f}")
```

Cache efficiency:

```python
def evaluate_cache_strategy(
    hit_rate: float,
    expected_rate: float = 0.70,
    min_acceptable: float = 0.60,
) -> str:
    """
    Evaluate if cache is performing as expected.
    Returns status: "optimal", "acceptable", "needs_review"
    """
    if hit_rate >= expected_rate:
        return "optimal"
    elif hit_rate >= min_acceptable:
        return "acceptable"
    else:
        return "needs_review"


# Example
cache_hit_rate = 0.55  # 55% hit rate
status = evaluate_cache_strategy(cache_hit_rate)
if status == "needs_review":
    send_alert(
        "Cache efficiency low",
        f"Hit rate {cache_hit_rate:.1%} below threshold. Review cache strategy.",
    )
```
Alerting integrations route warnings to Slack and page on-call for critical anomalies:

```python
import time

import requests


def send_slack_alert(webhook_url: str, message: str, severity: str = "warning"):
    """
    Send formatted alert to Slack.
    """
    color = "danger" if severity == "critical" else "warning"
    payload = {
        "attachments": [{
            "color": color,
            "title": f"LLM Cost Alert - {severity.upper()}",
            "text": message,
            "fields": [
                {"title": "Severity", "value": severity, "short": True},
                {"title": "Time", "value": time.strftime("%Y-%m-%d %H:%M:%S"), "short": True},
            ],
            "footer": "Cost Monitoring System",
        }]
    }
    requests.post(webhook_url, json=payload)
```

```python
import pdpyras


def trigger_pagerduty(
    routing_key: str,
    incident_title: str,
    severity: str = "error",
):
    """
    Trigger PagerDuty incident for critical cost anomalies.
    The Events API session is keyed by the integration (routing) key.
    """
    session = pdpyras.EventsAPISession(routing_key)
    session.trigger(
        summary=incident_title,
        source="cost-monitoring",
        severity=severity,
        custom_details={
            "alert_type": "cost_anomaly",
            "requires_investigation": True,
        },
    )
```
For aggregation and spike detection in the warehouse:

```sql
-- Aggregate costs by feature and model
SELECT
    feature,
    model,
    DATE_TRUNC('hour', timestamp) AS hour,
    SUM(cost_usd) AS total_cost,
    SUM(input_tokens) AS input_tokens,
    SUM(output_tokens) AS output_tokens,
    COUNT(*) AS call_count
FROM llm_telemetry
WHERE timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY feature, model, hour
ORDER BY total_cost DESC;

-- Identify features with cost spikes
WITH hourly_costs AS (
    SELECT
        feature,
        DATE_TRUNC('hour', timestamp) AS hour,
        SUM(cost_usd) AS hourly_cost,
        AVG(cost_usd) AS avg_per_call
    FROM llm_telemetry
    WHERE timestamp >= NOW() - INTERVAL '7 days'
    GROUP BY feature, hour
),
baselines AS (
    SELECT
        feature,
        AVG(hourly_cost) AS baseline_cost,
        STDDEV(hourly_cost) AS cost_stddev
    FROM hourly_costs
    GROUP BY feature
)
SELECT
    hc.feature,
    hc.hour,
    hc.hourly_cost,
    b.baseline_cost,
    (hc.hourly_cost - b.baseline_cost) / NULLIF(b.cost_stddev, 0) AS z_score
FROM hourly_costs hc
JOIN baselines b ON hc.feature = b.feature
WHERE hc.hourly_cost > b.baseline_cost + (2 * b.cost_stddev)
ORDER BY z_score DESC;
```

Runbook: Volume Spike

Symptoms: Token usage 2x above baseline for feature X

Immediate Actions:

  1. Check feature dashboard for recent deployments
  2. Review error logs for retry storms
  3. Verify cache hit rates haven’t dropped
  4. Check for bot traffic or unusual user patterns

Escalation: If volume continues to grow, consider rate limiting or feature flag deactivation

Runbook: Cost-per-Request Increase

Symptoms: Average cost per call increased from $0.0125 to $0.0250

Immediate Actions:

  1. Compare current prompt version with previous
  2. Check if context window size increased
  3. Verify model hasn’t changed (e.g., accidental fallback to gpt-4o)
  4. Review recent code changes to prompt construction

Escalation: Roll back to previous version if cause isn’t identified within 15 minutes

Runbook: Cache Hit Rate Drop

Symptoms: Cache efficiency dropped from 70% to 35%

Immediate Actions:

  1. Check if cache TTL was modified
  2. Verify cache key generation logic
  3. Review if prompt patterns changed
  4. Check cache infrastructure health

Escalation: Disable caching optimization if hit rate stays below 50% for 1 hour

Real-time cost observability is not optional—it’s essential infrastructure for any production LLM application. The combination of instrumentation, aggregation, anomaly detection, and alerting transforms token spend from a budget risk into a controlled engineering metric.

Start today:

  1. Wrap every LLM call with cost tracking
  2. Stream metrics to your observability platform
  3. Set up basic alerts for volume and cost spikes
  4. Build dashboards to visualize trends
  5. Create runbooks for common scenarios

The cost of monitoring is negligible compared to the cost of a single unmonitored weekend.