
AI Observability Metrics Glossary

The complete reference for metrics that matter in production AI systems.

Cost Per Request (CPR)

Definition: The total cost of a single LLM API call.

Formula:

CPR = (Input Tokens × Input Price) + (Output Tokens × Output Price)

Target: Depends on use case. Support bot: $0.01-0.05. Complex analysis: $0.10-0.50.
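
A minimal sketch of the calculation in Python, assuming prices quoted per million tokens (the rates below are placeholders, not any provider's actual pricing):

    # Cost Per Request (CPR) sketch. Prices are hypothetical placeholders,
    # in dollars per million tokens.
    INPUT_PRICE_PER_M = 3.00    # $/1M input tokens (assumed)
    OUTPUT_PRICE_PER_M = 15.00  # $/1M output tokens (assumed)

    def cost_per_request(input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * INPUT_PRICE_PER_M / 1_000_000
                + output_tokens * OUTPUT_PRICE_PER_M / 1_000_000)

    # Example: 2,000 input tokens, 500 output tokens
    print(f"${cost_per_request(2_000, 500):.4f}")  # $0.0135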


Cost Per Conversation (CPC)

Definition: Total cost across all turns of a conversation.

Formula:

CPC = Σ(CPR for each turn)

Note: Grows roughly quadratically with turn count, because each turn re-sends the accumulated context as input tokens.
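
The sketch below makes the quadratic effect concrete; all token counts and prices are illustrative assumptions:

    # Cost Per Conversation (CPC) sketch: each turn re-sends all prior
    # context, so per-turn input grows linearly and the total grows
    # roughly quadratically with turn count. Numbers are illustrative.
    INPUT_PRICE = 3.00 / 1_000_000    # $/input token (assumed)
    OUTPUT_PRICE = 15.00 / 1_000_000  # $/output token (assumed)
    TOKENS_PER_TURN = 300             # assumed tokens added to history per turn

    def conversation_cost(turns: int) -> float:
        total, context = 0.0, 0
        for _ in range(turns):
            context += TOKENS_PER_TURN            # history re-sent as input
            total += context * INPUT_PRICE + 150 * OUTPUT_PRICE
        return total

    print(f"10 turns: ${conversation_cost(10):.4f}")  # ~$0.07
    print(f"20 turns: ${conversation_cost(20):.4f}")  # ~$0.23, >3x the 10-turn cost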


Cost Per Successful Outcome (CPSO)

Definition: Cost attributed to successful task completions.

Formula:

CPSO = Total Cost / Number of Successful Outcomes

Note: This is your ROI metric. Lower is better.


Token Efficiency Ratio (TER)

Definition: Ratio of useful output to total tokens consumed.

Formula:

TER = Output Tokens / Total Tokens

Target: 0.2-0.4 for typical applications. Higher indicates efficient prompts.


Burn Rate

Definition: Rate of spending over time.

Formula:

Burn Rate = Total Cost / Time Period

Use: Set alerts for abnormal increases.
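
A hedged sketch of a burn-rate alert that compares the current window against a trailing baseline; the 2x threshold and hourly windows are assumptions to tune:

    # Burn-rate alert sketch: flag when spend in the current window
    # exceeds a multiple of the trailing baseline. Threshold and window
    # sizes are assumptions.
    def burn_rate_alert(current_hour_cost: float,
                        trailing_hourly_costs: list[float],
                        threshold: float = 2.0) -> bool:
        baseline = sum(trailing_hourly_costs) / len(trailing_hourly_costs)
        return current_hour_cost > threshold * baseline

    # Example: $14 this hour against a ~$5/hour trailing average -> alert
    print(burn_rate_alert(14.0, [4.8, 5.1, 5.3, 4.9]))  # True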


Time to First Token (TTFT)

Definition: Duration from request submission to first token received.

Formula:

TTFT = Timestamp(First Token) - Timestamp(Request Sent)

Target:

  • Interactive: <500ms
  • Search: <1000ms
  • Background: <3000ms

Tokens Per Second (TPS)

Definition: Sustained generation speed after the first token arrives.

Formula:

TPS = Output Tokens / (Total Time - TTFT)

Target: 30-50 TPS for good streaming experience.
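
TTFT and TPS fall out of the same streaming loop. A minimal sketch, assuming a hypothetical chunk iterator standing in for your provider's streaming API:

    import time

    # TTFT/TPS measurement sketch. `chunks` is any iterable of streamed
    # text chunks, a hypothetical stand-in for the real streaming API.
    # Counting one token per chunk is an approximation; adjust if your
    # provider batches tokens per chunk.
    def measure_stream(chunks):
        start = time.monotonic()
        ttft, n_tokens = None, 0
        for _ in chunks:
            if ttft is None:
                ttft = time.monotonic() - start   # time to first token
            n_tokens += 1
        if ttft is None:                          # empty stream
            return None, 0.0
        gen_time = (time.monotonic() - start) - ttft
        tps = n_tokens / gen_time if gen_time > 0 else float("inf")
        return ttft, tps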


End-to-End Latency (E2E)

Definition: Total time from user action to complete response.

Formula:

E2E = TTFT + Generation Time + Client Render Time

Latency Percentiles (P50/P95/P99)

Definition: Latency at a given percentile of the request distribution.

Formula:

P95 = Value at 95th percentile of latency distribution

Use: P95 is the latency your slowest 5% of requests experience. Optimize for this, not the average.
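
A minimal sketch computing percentiles from raw latency samples using the nearest-rank method:

    # Percentile sketch (nearest-rank method) over raw latency samples.
    def percentile(samples: list[float], p: float) -> float:
        ordered = sorted(samples)
        rank = max(1, round(p / 100 * len(ordered)))  # 1-indexed rank
        return ordered[rank - 1]

    latencies_ms = [120, 95, 400, 110, 2300, 130, 105, 98, 115, 125]
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p)} ms")
    # A single 2300 ms outlier dominates P95/P99 even though the median is 115 ms.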


Cold Start Latency

Definition: Additional latency on the first request after a period of idleness.

Formula:

Cold Start = TTFT(First Request) - TTFT(Warm Request)

Target: <500ms for serverless deployments.


Answer Relevance

Definition: How well the response addresses the query.

Measurement: Semantic similarity or LLM-as-judge scoring.

Scale: 0-1 (higher is better)

Target: >0.8 for production quality.


Groundedness

Definition: Degree to which the response is supported by the provided context (for RAG).

Measurement:

groundedness = claims_supported_by_context / total_claims

Target: >0.9 for factual applications.
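
A hedged sketch of the computation, assuming two hypothetical helpers: extract_claims() (splits a response into atomic claims) and is_supported() (checks one claim against the context, e.g. via an NLI model or an LLM-as-judge prompt):

    # Groundedness sketch. extract_claims() and is_supported() are
    # hypothetical helpers you would back with an NLI model or an
    # LLM-as-judge call.
    def groundedness(response: str, context: str,
                     extract_claims, is_supported) -> float:
        claims = extract_claims(response)
        if not claims:
            return 1.0  # vacuously grounded; pick the policy that fits you
        supported = sum(1 for c in claims if is_supported(c, context))
        return supported / len(claims)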


Contextual Consistency

Definition: Whether the response contradicts the provided context.

Measurement: NLI model or LLM-as-judge.

Scale: 0-1 (1 = no contradictions)

Target: >0.95 for production quality.


Hallucination Rate

Definition: Percentage of responses containing fabricated information.

Formula:

Hallucination Rate = Responses with Hallucinations / Total Responses

Target: <5% for general applications, <1% for critical applications.


Evaluation Pass Rate

Definition: Percentage of test cases that pass quality thresholds.

Formula:

Pass Rate = Passed Test Cases / Total Test Cases

Target: >90% for production readiness. Track trends over time.


Retrieval Precision

Definition: Proportion of retrieved documents that are relevant.

Formula:

Precision = Relevant Retrieved / Total Retrieved

Target: >0.7 (higher means less noise).


Retrieval Recall

Definition: Proportion of relevant documents that were retrieved.

Formula:

Recall = Relevant Retrieved / Total Relevant

Target: >0.8 (higher means less missed information).


Mean Reciprocal Rank (MRR)

Definition: Average across queries of the reciprocal rank of the first relevant result.

Formula:

MRR = (1/N) × Σ(1/rank_i)

Target: >0.5 (higher means relevant docs appear earlier).
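
Precision, recall, and MRR can all be computed from the same labeled results. A sketch, assuming each query comes with its set of known-relevant document IDs:

    # Retrieval metrics sketch: precision and recall for one query, MRR
    # across queries. Assumes labeled relevant-document IDs per query.
    def precision_recall(retrieved: list[str], relevant: set[str]):
        hits = sum(1 for doc in retrieved if doc in relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def mrr(ranked_results: list[list[str]], relevant_sets: list[set[str]]) -> float:
        total = 0.0
        for retrieved, relevant in zip(ranked_results, relevant_sets):
            for rank, doc in enumerate(retrieved, start=1):
                if doc in relevant:
                    total += 1 / rank  # reciprocal rank of first relevant hit
                    break
        return total / len(ranked_results)

    # Two queries: relevant doc ranked 1st, then 3rd -> (1 + 1/3) / 2
    print(mrr([["d1", "d2"], ["d5", "d6", "d3"]], [{"d1"}, {"d3"}]))  # 0.666...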


Context Utilization

Definition: How much of the retrieved context is actually used in the response.

Measurement: Compare response to context overlap.

Target: >0.4 (low utilization suggests over-retrieval).


Prompt Injection Attempt Rate

Definition: Percentage of requests that appear to be injection attempts.

Measurement: Pattern matching + anomaly detection.

Target: Track baseline, alert on increases.


Prompt Injection Success Rate

Definition: Percentage of injection attempts that succeed.

Formula:

Success Rate = Successful Injections / Detected Attempts

Target: 0% (any success is a vulnerability).


PII Leakage Rate

Definition: Percentage of responses containing PII.

Measurement: PII detection on outputs.

Target: 0% for customer-facing applications.


Safety Filter Bypass Rate

Definition: Percentage of requests that bypass safety filters.

Measurement: Safety classifier on outputs.

Target: <0.1% after filtering.


Agent Nesting Depth

Definition: Maximum nesting level of agent spawning.

Target: Set hard limits (typically 2-3 max).


Tool Call Count

Definition: Number of tool invocations per request.

Target: Set limits based on use case.


Loop Rate

Definition: Percentage of agent runs that entered a loop.

Formula:

Loop Rate = Runs with Loops / Total Runs

Target: <1% with proper circuit breakers.
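
One way to keep Loop Rate near zero is a circuit breaker that aborts a run when the same step repeats. A hedged sketch; using the tool name plus serialized arguments as the state signature is an assumption, so pick whatever uniquely identifies a repeated step in your agent:

    # Loop circuit-breaker sketch. The (tool, args) signature is an
    # assumed stand-in for whatever identifies a repeated agent step.
    class LoopBreaker:
        def __init__(self, max_repeats: int = 3):
            self.max_repeats = max_repeats
            self.seen: dict[tuple[str, str], int] = {}

        def check(self, tool_name: str, args: str) -> None:
            key = (tool_name, args)
            self.seen[key] = self.seen.get(key, 0) + 1
            if self.seen[key] > self.max_repeats:
                raise RuntimeError(f"Loop detected: {key} repeated "
                                   f"{self.seen[key]} times; aborting run")

    breaker = LoopBreaker(max_repeats=2)
    breaker.check("search", '{"q": "order status"}')
    breaker.check("search", '{"q": "order status"}')
    # A third identical call raises RuntimeError and ends the run.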


Agent Task Success Rate

Definition: Percentage of agent tasks completed successfully.

Formula:

Success Rate = Successful Completions / Total Attempts

Target: >90% for production quality.


Cost Per Agent Run

Definition: Total cost of an agent execution, including all sub-agents.

Formula:

Cost = Σ(All LLM calls + Tool costs)

Target: Set budget limits per run.


Request Success Rate

Definition: Percentage of requests that complete without error.

Formula:

Success Rate = (Total - Errors) / Total

Target: >99.5% for production systems.


Rate Limit Hit Rate

Definition: Percentage of requests that are rate limited.

Formula:

Hit Rate = Rate Limited Requests / Total Requests

Target: <1% (higher indicates capacity issues).


Cache Hit Rate

Definition: Percentage of requests served from cache.

Formula:

Hit Rate = Cached Responses / Total Requests

Target: 20-50% for typical applications.


Error Distribution

Definition: Breakdown of errors by category.

Categories:

  • rate_limit - Provider throttling
  • timeout - Request timeout
  • invalid_request - Bad input
  • model_error - Model failure
  • internal - Your application error

Target: Track each type separately. Alert on anomalies.
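
A hedged sketch of tagging raw error messages into these categories; the substring rules are illustrative stand-ins for real provider error codes and exception types:

    from collections import Counter

    # Error-taxonomy sketch. Replace the substring rules with your
    # provider's actual error codes / exception classes.
    def classify_error(message: str) -> str:
        msg = message.lower()
        if "429" in msg or "rate limit" in msg:
            return "rate_limit"
        if "timeout" in msg or "timed out" in msg:
            return "timeout"
        if "400" in msg or "invalid" in msg:
            return "invalid_request"
        if "500" in msg or "model" in msg:
            return "model_error"
        return "internal"

    errors = ["429 Too Many Requests", "Read timed out", "invalid JSON in request"]
    print(Counter(classify_error(e) for e in errors))
    # Counter({'rate_limit': 1, 'timeout': 1, 'invalid_request': 1})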


Aggregation windows by metric type:

  Metric Type    Typical Window
  Cost           Daily, Weekly, Monthly
  Latency        Real-time, Hourly
  Quality        Per-release, Weekly
  Security       Real-time
  Operational    Real-time, Hourly
Segment every metric by:

  • Model
  • Feature / endpoint
  • User segment
  • Geographic region
  • Time of day
