
Monitoring Token Spend in Real-Time: Building Cost Observability


A single unmonitored production feature can burn through $10,000 in a weekend. One engineering team discovered this when their new “smart reply” feature—launched on Friday—generated 4.2 million output tokens by Monday morning. Without real-time monitoring, they had no warning until the invoice arrived. This guide will show you how to build cost observability that catches these surprises before they become budget disasters.

Traditional monitoring focuses on latency, uptime, and error rates—metrics that impact user experience. But in the age of LLMs, cost per token is equally critical. A 50ms latency improvement means nothing if it costs you $50,000 more per month.

The challenge is that token costs are invisible until billed. Unlike database queries where you can see row counts and execution plans, LLM calls are black boxes. You get a response, but the token burn happens behind the API curtain. This invisibility creates several risks:

  1. Feature creep: A prompt that starts at 500 tokens can balloon to 2,000 tokens as engineers add context, examples, and instructions
  2. User behavior spikes: A viral feature or bot traffic can multiply your volume 10x overnight
  3. Model upgrades: Switching from GPT-4o-mini to GPT-4o raises input-token costs roughly 33x (and output-token costs 25x) for the same request volume
  4. Retry storms: Poor error handling can multiply costs by 3-5x through redundant calls

Real-time observability transforms token spend from an accounting surprise into a manageable engineering metric. You can attribute costs to features, detect anomalies instantly, and make informed tradeoffs between capability and cost.

Building effective monitoring requires instrumenting four key layers of your LLM stack: per-call tracking, feature-level aggregation, anomaly detection, and alerting.

The first layer is per-call tracking: every LLM request must be wrapped with instrumentation. This is your ground truth—the raw data that feeds all aggregation and analysis.

What to capture per call (a minimal record sketch follows the list):

  • Provider (OpenAI, Anthropic, etc.)
  • Model used
  • Input token count (prompt + context)
  • Output token count (completion)
  • Cache hit/miss status (if using prompt caching)
  • Request timestamp
  • Cost in USD
  • Feature/endpoint identifier
  • User/team ID (for chargeback)
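
A minimal sketch of such a record as a Python dataclass. The field names here are illustrative, not a required schema; the full tracking wrapper appears later in the Implementation Guides.

```python
from dataclasses import dataclass, field
from typing import Optional
import time
import uuid


@dataclass
class LLMCallRecord:
    """One row of ground-truth telemetry per LLM call (illustrative schema)."""
    provider: str                    # "openai", "anthropic", ...
    model: str
    input_tokens: int                # prompt + context
    output_tokens: int               # completion
    cost_usd: float
    feature: str                     # feature/endpoint identifier for attribution
    user_id: Optional[str] = None    # for chargeback
    cache_hit: bool = False
    timestamp: float = field(default_factory=time.time)
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```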

The second layer is feature-level aggregation. Raw API logs are too granular for business decisions; you need to aggregate by feature or product surface area to understand which capabilities are driving spend.

Aggregation dimensions:

  • Feature name (e.g., “summarization”, “code-review”, “chat-assistant”)
  • Endpoint or route (e.g., /api/v1/chat, /api/v1/summarize)
  • Environment (dev, staging, prod)
  • Model family (GPT-4, Claude 3.5, etc.)

This allows you to answer questions like: “Is our new RAG feature costing us more than it’s worth?” or “Which customer segment is burning the most tokens?”
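
As a minimal sketch, assuming per-call records shaped like the dataclass above, a feature/model rollup could be done in memory like this; a production pipeline would run the same rollup in a warehouse (see the SQL queries later in this guide).

```python
from collections import defaultdict


def aggregate_costs(records: list[dict]) -> dict[tuple[str, str], dict]:
    """Roll per-call telemetry up to (feature, model) totals."""
    totals: dict[tuple[str, str], dict] = defaultdict(lambda: {"cost_usd": 0.0, "calls": 0})
    for record in records:
        key = (record["feature"], record["model"])
        totals[key]["cost_usd"] += record["cost_usd"]
        totals[key]["calls"] += 1
    return dict(totals)
```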

The third layer is anomaly detection. Static dashboards are reactive; you need proactive anomaly detection that alerts you when spending deviates from expected patterns.

Key anomaly types (a model-drift sketch follows the list):

  • Volume spikes: 2x increase in requests within an hour
  • Cost per request increase: Average cost per call jumps 50%+
  • Model drift: Unexpected model usage (e.g., production falling back to expensive models)
  • Cache efficiency drops: Cache hit rate drops below expected threshold
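
Volume, cost-per-request, and cache-efficiency checks appear as code in the Implementation Guides later. Model drift is simpler still; a minimal sketch, assuming a per-environment allowlist of expected models (the allowlist values here are hypothetical):

```python
# Hypothetical allowlist of expected models per environment.
ALLOWED_MODELS = {
    "prod": {"gpt-4o-mini", "claude-3-5-sonnet"},
    "staging": {"gpt-4o-mini", "gpt-4o", "claude-3-5-sonnet"},
}


def detect_model_drift(environment: str, model: str) -> bool:
    """Return True if a call used a model we do not expect in this environment."""
    return model not in ALLOWED_MODELS.get(environment, set())
```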

The fourth layer is alerting. Detection without action is useless; alerts must be:

  • Timely: Delivered within minutes, not hours
  • Actionable: Include enough context to investigate immediately
  • Escalating: Different thresholds for different severity levels (see the routing sketch below)
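
A minimal sketch of severity-based routing, assuming `send_slack_alert` and `trigger_pagerduty` helpers like the ones shown later in this guide:

```python
# Hypothetical severity-to-channel mapping: warnings go to Slack,
# critical alerts also page on-call via PagerDuty.
ALERT_ROUTES = {
    "warning": ["slack"],
    "critical": ["slack", "pagerduty"],
}


def route_alert(severity: str, message: str, senders: dict) -> None:
    """Dispatch the message to every channel configured for its severity."""
    for channel in ALERT_ROUTES.get(severity, ["slack"]):
        senders[channel](message)  # senders maps channel name -> callable

```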
With those four layers in place, a practical rollout looks like this:

  1. Instrument your LLM calls with a wrapper that captures token counts and costs
  2. Stream data to your observability platform (e.g., Datadog, CloudWatch, or a custom data warehouse)
  3. Define cost baselines by aggregating historical data by feature and model
  4. Set up anomaly detection rules with thresholds for volume, cost-per-request, and cache efficiency
  5. Configure alerting channels (Slack, PagerDuty, email) with severity-based routing
  6. Build dashboards that visualize spend trends and top cost drivers
  7. Create runbooks for common alert scenarios

The difference between a manageable AI feature and a budget disaster often comes down to visibility. When you can see token burn in real-time, you shift from reactive cost accounting to proactive cost engineering.

Consider the economics: a single gpt-4o request processing 1,000 input tokens and generating 500 output tokens costs approximately $0.0125. Scale that to 100,000 requests per day and you’re spending $1,250 daily, or about $37,500 per month. A 2x spike in output tokens (to 1,000) raises the per-request cost to $0.02, or $60,000 per month. Without monitoring, you only discover this when the invoice arrives.
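
A small helper makes the same projection explicit, using the per-1M-token rates from the pricing table below:

```python
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 input_rate_per_m: float, output_rate_per_m: float, days: int = 30) -> float:
    """Project monthly spend from per-request token counts and per-1M-token rates."""
    per_request = (input_tokens / 1_000_000) * input_rate_per_m \
        + (output_tokens / 1_000_000) * output_rate_per_m
    return per_request * requests_per_day * days


# gpt-4o at $5 input / $15 output per 1M tokens, 100,000 requests per day
print(monthly_cost(100_000, 1_000, 500, 5.00, 15.00))    # ~37,500
print(monthly_cost(100_000, 1_000, 1_000, 5.00, 15.00))  # ~60,000
```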

Real-time monitoring enables three critical capabilities:

1. Immediate Cost Attribution

When your “smart reply” feature starts burning through tokens, you need to know which feature, which team, and which prompt version is responsible. This requires tracking at the API level with feature tags, not just provider invoices.

2. Automated Anomaly Detection

A sudden 3x increase in token usage at 2 AM on Saturday should trigger an alert before Monday’s standup. Effective monitoring compares current spend against historical baselines and flags deviations immediately.

3. Informed Model Selection

The pricing gap between models is dramatic: gpt-4o-mini input tokens cost roughly 1/33rd as much as gpt-4o’s ($0.15 vs $5 per 1M). Real-time cost tracking lets you validate whether the quality improvement justifies the expense for each use case.

Even well-intentioned teams make mistakes that undermine cost observability:

Pitfall 1: Relying on Provider Billing Dashboards

Provider dashboards show total spend but lack granularity. You can’t see which feature drove cost, which user triggered it, or whether it was a retry storm. By the time you see the spike, the damage is done.

Pitfall 2: Sampling Instead of Full Instrumentation

Some teams log only 1% of requests to “save on logging costs.” This cripples anomaly detection and attribution: short-lived or feature-specific spikes can hide in the unsampled 99%, and you lose the per-request detail needed to trace them. Every LLM call must be tracked.

Pitfall 3: Ignoring Input Token Costs

Output tokens get attention because they’re visible in responses. But input tokens—especially with large context windows or RAG systems—can dominate costs. A 10,000-token system prompt multiplied across thousands of requests adds up fast.
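
A quick back-of-the-envelope check makes the point; the daily request volume here is hypothetical:

```python
# Cost of a 10,000-token system prompt repeated on every request,
# at gpt-4o's $5 per 1M input tokens. The daily volume is hypothetical.
system_prompt_tokens = 10_000
input_rate_per_m = 5.00
requests_per_day = 50_000

daily_prompt_cost = (system_prompt_tokens / 1_000_000) * input_rate_per_m * requests_per_day
print(daily_prompt_cost)  # $2,500 per day on the system prompt alone, before any output tokens
```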

Pitfall 4: Static Thresholds

Setting a fixed daily budget alert ($1,000/day) ignores natural traffic patterns. Tuesday might be 3x higher than Sunday. Effective monitoring uses dynamic baselines that account for time-of-day and day-of-week patterns.
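
A minimal sketch of an hour-of-week baseline, assuming you can pull historical (timestamp, hourly cost) pairs from your telemetry store:

```python
from collections import defaultdict
from datetime import datetime


def hour_of_week(ts: datetime) -> int:
    """Bucket a timestamp into one of 168 hour-of-week slots (Monday 00:00 = 0)."""
    return ts.weekday() * 24 + ts.hour


def build_baselines(history: list[tuple[datetime, float]]) -> dict[int, float]:
    """Average historical hourly spend per hour-of-week bucket."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, cost in history:
        buckets[hour_of_week(ts)].append(cost)
    return {slot: sum(values) / len(values) for slot, values in buckets.items()}


def is_anomalous(now: datetime, current_cost: float, baselines: dict[int, float],
                 threshold: float = 1.5) -> bool:
    """Flag spend that exceeds the matching hour-of-week baseline by the threshold."""
    baseline = baselines.get(hour_of_week(now))
    return baseline is not None and current_cost > baseline * threshold
```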

Pitfall 5: No Cache Hit Tracking

Prompt caching can reduce costs by 50-90%, but only if you measure it. Teams that don’t track cache hit rates can’t optimize their cache strategy or verify they’re getting the expected savings.
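
A minimal sketch of hit-rate tracking over a sliding window of recent calls:

```python
from collections import deque


class CacheHitRateTracker:
    """Sliding-window cache hit rate over the most recent calls."""

    def __init__(self, window: int = 1000):
        self.events: deque[bool] = deque(maxlen=window)  # True = hit, False = miss

    def record(self, cache_hit: bool) -> None:
        self.events.append(cache_hit)

    def hit_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0
```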

For reference, the pricing assumed throughout this guide:

| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Context Window |
| --- | --- | --- | --- |
| OpenAI gpt-4o | $5.00 | $15.00 | 128,000 |
| OpenAI gpt-4o-mini | $0.15 | $0.60 | 128,000 |
| Anthropic claude-3-5-sonnet | $3.00 | $15.00 | 200,000 |
| Anthropic haiku-3.5 | $1.25 | $5.00 | 200,000 |

Source: OpenAI Pricing, Anthropic Models

Cost per Request = (Input Tokens × Input Rate) + (Output Tokens × Output Rate)
Where rates are per-token (divide per-1M rates by 1,000,000)

Example: 1,000 input tokens + 500 output tokens with gpt-4o:

(1,000 × $5/1M) + (500 × $15/1M) = $0.005 + $0.0075 = $0.0125 per request

Suggested alert thresholds:

| Metric | Warning Threshold | Critical Threshold | Action |
| --- | --- | --- | --- |
| Volume Spike | 150% of baseline | 250% of baseline | Investigate immediately |
| Cost per Request | +50% vs average | +100% vs average | Check prompt changes |
| Cache Hit Rate | Less than 60% | Less than 40% | Review cache strategy |
| Model Drift | Any unexpected model usage | Fallback to expensive model | Audit routing logic |
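
These thresholds can be encoded directly as alert rules; a minimal sketch, with values copied from the table above:

```python
# Alert rules encoding the threshold table above.
THRESHOLDS = {
    "volume_spike":     {"warning": 1.50, "critical": 2.50},  # multiple of baseline
    "cost_per_request": {"warning": 1.50, "critical": 2.00},  # multiple of historical average
    "cache_hit_rate":   {"warning": 0.60, "critical": 0.40},  # alert when rate falls below
}


def classify_volume_spike(current: float, baseline: float) -> str | None:
    """Return 'critical', 'warning', or None for the current volume vs its baseline."""
    if baseline <= 0:
        return None
    ratio = current / baseline
    if ratio >= THRESHOLDS["volume_spike"]["critical"]:
        return "critical"
    if ratio >= THRESHOLDS["volume_spike"]["warning"]:
        return "warning"
    return None
```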


Effective cost observability transforms token spend from an unpredictable expense into a controlled engineering metric. The four-layer approach (instrumentation, aggregation, anomaly detection, and alerting) provides the visibility needed to prevent budget disasters and optimize spend.

Key takeaways:

  • Instrument every call: Missing data cannot be reconstructed
  • Track input and output: Both contribute significantly to costs
  • Use dynamic baselines: Static thresholds miss pattern-based anomalies
  • Attribute to features: Know which capabilities drive spend
  • Alert immediately: Hours matter when costs are compounding

The pricing data shows dramatic differences between models: gpt-4o-mini input tokens cost roughly 1/33rd as much as gpt-4o’s. Real-time monitoring validates whether premium models deliver proportional value for each use case.

Start with basic instrumentation today. You can’t optimize what you can’t measure, and in the world of LLMs, measurement must happen in real-time, not after the invoice arrives.

Implementation Guides:

The following examples show how to instrument LLM calls with cost tracking. This wrapper captures all necessary metrics and streams them to an observability backend.

```python
import time
import uuid
from typing import Dict, Any, Optional

import requests


class LLMCostTracker:
    """
    Tracks token usage and costs for LLM API calls.
    Streams data to observability backend.
    """

    def __init__(self, observability_endpoint: str, api_key: str):
        self.endpoint = observability_endpoint
        self.api_key = api_key
        self.session = requests.Session()

    def track_call(
        self,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        feature: str,
        user_id: Optional[str] = None,
        cache_hit: bool = False,
        cache_tokens: int = 0,
    ) -> Dict[str, Any]:
        """
        Track a single LLM call with cost calculation.
        """
        # Pricing per 1M tokens (as of Dec 2025)
        PRICING = {
            "openai": {
                "gpt-4o": {"input": 5.00, "output": 15.00},
                "gpt-4o-mini": {"input": 0.15, "output": 0.60},
            },
            "anthropic": {
                "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
                "haiku-3.5": {"input": 1.25, "output": 5.00},
            },
        }

        # Calculate costs
        provider_pricing = PRICING.get(provider, {})
        model_pricing = provider_pricing.get(model, {"input": 0.0, "output": 0.0})
        input_cost = (input_tokens / 1_000_000) * model_pricing["input"]
        output_cost = (output_tokens / 1_000_000) * model_pricing["output"]

        # Cache savings calculation
        cache_savings = 0
        if cache_hit and cache_tokens > 0:
            cache_savings = (cache_tokens / 1_000_000) * model_pricing["input"]

        total_cost = input_cost + output_cost - cache_savings

        # Build telemetry payload
        telemetry_data = {
            "trace_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "provider": provider,
            "model": model,
            "feature": feature,
            "user_id": user_id,
            "tokens": {
                "input": input_tokens,
                "output": output_tokens,
                "cache_hit": cache_hit,
                "cache_tokens": cache_tokens,
            },
            "costs": {
                "input_usd": round(input_cost, 6),
                "output_usd": round(output_cost, 6),
                "cache_savings_usd": round(cache_savings, 6),
                "total_usd": round(total_cost, 6),
            },
            "metadata": {
                "environment": "production",
                "version": "1.0.0",
            },
        }

        # Stream to observability backend (async in production)
        self._send_to_observability(telemetry_data)

        return telemetry_data

    def _send_to_observability(self, data: Dict[str, Any]) -> None:
        """
        Send telemetry data to observability platform.
        In production, use async queue or batch processing.
        """
        try:
            response = self.session.post(
                f"{self.endpoint}/metrics/llm-costs",
                json=data,
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=2,
            )
            if response.status_code != 200:
                print(f"Warning: Failed to send telemetry: {response.status_code}")
        except Exception as e:
            # Fail silently in production - don't break the app
            print(f"Telemetry error: {e}")


# Usage example
if __name__ == "__main__":
    tracker = LLMCostTracker(
        observability_endpoint="https://observability.example.com",
        api_key="your-api-key",
    )

    # Simulate an LLM call
    result = tracker.track_call(
        provider="openai",
        model="gpt-4o-mini",
        input_tokens=1500,
        output_tokens=450,
        feature="smart-reply",
        user_id="user_12345",
        cache_hit=False,
    )
    print(f"Tracked call: ${result['costs']['total_usd']:.6f}")
```
To stream these metrics to Datadog:

```python
import time

from datadog import initialize, api


class DatadogCostReporter:
    def __init__(self, api_key: str, app_key: str):
        initialize(api_key=api_key, app_key=app_key)

    def report_cost(self, feature: str, cost_usd: float, tokens: int):
        """Send cost metrics to Datadog"""
        api.Metric.send([
            {
                "metric": "llm.cost_usd",
                "points": [(time.time(), cost_usd)],
                "tags": [f"feature:{feature}", "env:production"],
            },
            {
                "metric": "llm.tokens",
                "points": [(time.time(), tokens)],
                "tags": [f"feature:{feature}", "env:production"],
            },
        ])
```
Or to CloudWatch Logs:

```python
import json
import time

import boto3


class CloudWatchCostReporter:
    def __init__(self, region: str = "us-east-1"):
        self.logs = boto3.client("logs", region_name=region)
        self.log_group = "/aws/llm/cost-tracking"

    def log_call(self, data: dict):
        """Stream cost data to CloudWatch Logs"""
        self.logs.put_log_events(
            logGroupName=self.log_group,
            logStreamName="llm-metrics",
            logEvents=[{
                "timestamp": int(time.time() * 1000),
                "message": json.dumps(data),
            }],
        )
```
Detection rules can start simple. The `send_alert` helper used in the examples below is assumed to be provided by your alerting layer (see the Slack and PagerDuty snippets that follow).

Volume spikes:

```python
def detect_volume_spike(current_volume: int, baseline: int, threshold: float = 1.5) -> bool:
    """
    Detect if current token volume exceeds baseline by threshold.
    Returns True if spike detected.
    """
    return current_volume > (baseline * threshold)


# Example usage
baseline_tokens = 50000   # Historical average per hour
current_tokens = 125000   # Current hour

if detect_volume_spike(current_tokens, baseline_tokens):
    send_alert("Volume spike detected",
               f"Token usage at {current_tokens} vs baseline {baseline_tokens}")
```

Cost-per-request anomalies (z-score):

```python
def detect_cost_anomaly(
    current_avg_cost: float,
    historical_avg: float,
    std_dev: float,
    sensitivity: float = 2.0,
) -> bool:
    """
    Detect if average cost per request is anomalous.
    Uses z-score approach.
    """
    if std_dev == 0:
        return False
    z_score = (current_avg_cost - historical_avg) / std_dev
    return abs(z_score) > sensitivity


# Example
historical_avg = 0.0125  # $0.0125 per request
current_avg = 0.0250     # $0.0250 per request
std_dev = 0.002

if detect_cost_anomaly(current_avg, historical_avg, std_dev):
    send_alert("Cost anomaly", f"Average cost doubled: ${current_avg:.4f}")
```

Cache efficiency:

```python
def evaluate_cache_strategy(
    hit_rate: float,
    expected_rate: float = 0.70,
    min_acceptable: float = 0.60,
) -> str:
    """
    Evaluate if cache is performing as expected.
    Returns status: "optimal", "acceptable", "needs_review"
    """
    if hit_rate >= expected_rate:
        return "optimal"
    elif hit_rate >= min_acceptable:
        return "acceptable"
    else:
        return "needs_review"


# Example
cache_hit_rate = 0.55  # 55% hit rate
status = evaluate_cache_strategy(cache_hit_rate)
if status == "needs_review":
    send_alert(
        "Cache efficiency low",
        f"Hit rate {cache_hit_rate:.1%} below threshold. Review cache strategy.",
    )
```
Alerting integrations route warnings to Slack and page on-call for critical anomalies:

```python
import time

import requests


def send_slack_alert(webhook_url: str, message: str, severity: str = "warning"):
    """
    Send formatted alert to Slack.
    """
    color = "danger" if severity == "critical" else "warning"
    payload = {
        "attachments": [{
            "color": color,
            "title": f"LLM Cost Alert - {severity.upper()}",
            "text": message,
            "fields": [
                {"title": "Severity", "value": severity, "short": True},
                {"title": "Time", "value": time.strftime("%Y-%m-%d %H:%M:%S"), "short": True},
            ],
            "footer": "Cost Monitoring System",
        }]
    }
    requests.post(webhook_url, json=payload)
```

```python
import pdpyras


def trigger_pagerduty(
    routing_key: str,
    incident_title: str,
    severity: str = "error",
):
    """
    Trigger PagerDuty incident for critical cost anomalies.
    The Events API session is keyed by the integration (routing) key.
    """
    session = pdpyras.EventsAPISession(routing_key)
    session.trigger(
        summary=incident_title,
        source="cost-monitoring",
        severity=severity,
        custom_details={
            "alert_type": "cost_anomaly",
            "requires_investigation": True,
        },
    )
```
For aggregation and spike detection in the warehouse:

```sql
-- Aggregate costs by feature and model
SELECT
    feature,
    model,
    DATE_TRUNC('hour', timestamp) AS hour,
    SUM(cost_usd) AS total_cost,
    SUM(input_tokens) AS input_tokens,
    SUM(output_tokens) AS output_tokens,
    COUNT(*) AS call_count
FROM llm_telemetry
WHERE timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY feature, model, hour
ORDER BY total_cost DESC;

-- Identify features with cost spikes
WITH hourly_costs AS (
    SELECT
        feature,
        DATE_TRUNC('hour', timestamp) AS hour,
        SUM(cost_usd) AS hourly_cost,
        AVG(cost_usd) AS avg_per_call
    FROM llm_telemetry
    WHERE timestamp >= NOW() - INTERVAL '7 days'
    GROUP BY feature, hour
),
baselines AS (
    SELECT
        feature,
        AVG(hourly_cost) AS baseline_cost,
        STDDEV(hourly_cost) AS cost_stddev
    FROM hourly_costs
    GROUP BY feature
)
SELECT
    hc.feature,
    hc.hour,
    hc.hourly_cost,
    b.baseline_cost,
    (hc.hourly_cost - b.baseline_cost) / NULLIF(b.cost_stddev, 0) AS z_score
FROM hourly_costs hc
JOIN baselines b ON hc.feature = b.feature
WHERE hc.hourly_cost > b.baseline_cost + (2 * b.cost_stddev)
ORDER BY z_score DESC;
```

Runbook: Volume Spike

Symptoms: Token usage 2x above baseline for feature X

Immediate Actions:

  1. Check feature dashboard for recent deployments
  2. Review error logs for retry storms
  3. Verify cache hit rates haven’t dropped
  4. Check for bot traffic or unusual user patterns

Escalation: If volume continues to grow, consider rate limiting or feature flag deactivation

Runbook: Cost-per-Request Increase

Symptoms: Average cost per call increased from $0.0125 to $0.0250

Immediate Actions:

  1. Compare current prompt version with previous
  2. Check if context window size increased
  3. Verify model hasn’t changed (e.g., accidental fallback to gpt-4o)
  4. Review recent code changes to prompt construction

Escalation: Roll back to previous version if cause isn’t identified within 15 minutes

Runbook: Cache Hit Rate Drop

Symptoms: Cache efficiency dropped from 70% to 35%

Immediate Actions:

  1. Check if cache TTL was modified
  2. Verify cache key generation logic
  3. Review if prompt patterns changed
  4. Check cache infrastructure health

Escalation: Disable caching optimization if hit rate stays below 50% for 1 hour

Real-time cost observability is not optional—it’s essential infrastructure for any production LLM application. The combination of instrumentation, aggregation, anomaly detection, and alerting transforms token spend from a budget risk into a controlled engineering metric.

Start today:

  1. Wrap every LLM call with cost tracking
  2. Stream metrics to your observability platform
  3. Set up basic alerts for volume and cost spikes
  4. Build dashboards to visualize trends
  5. Create runbooks for common scenarios

The cost of monitoring is negligible compared to the cost of a single unmonitored weekend.