AI Observability Metrics Glossary
The complete reference for metrics that matter in production AI systems.
Cost Metrics
Cost Per Request (CPR)
Definition: The total cost of a single LLM API call.
Formula:
CPR = (Input Tokens × Input Price) + (Output Tokens × Output Price)
Target: Depends on use case. Support bot: $0.01-0.05. Complex analysis: $0.10-0.50.
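Example: a minimal Python sketch of the calculation; the per-token prices are illustrative assumptions, not any provider's actual pricing.
```python
# Illustrative prices only; providers typically quote per 1M tokens.
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000    # assumed $3.00 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 15.00 / 1_000_000  # assumed $15.00 per 1M output tokens

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """CPR = input tokens x input price + output tokens x output price."""
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

print(f"${cost_per_request(2_000, 500):.4f}")  # -> $0.0135 at the assumed prices
```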
Cost Per Conversation (CPC)
Definition: Total cost across all turns of a conversation.
Formula:
CPC = Σ(CPR for each turn)
Note: Grows roughly quadratically with turn count, because each turn re-sends the accumulated context as input tokens.
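Example: a sketch summing per-turn cost for a hypothetical four-turn conversation, reusing cost_per_request from the CPR sketch above; the token counts are made up but show input tokens growing as context accumulates.
```python
# Hypothetical conversation: each turn re-sends prior context, so
# input tokens grow every turn and cumulative cost grows superlinearly.
turns = [
    {"input_tokens": 500,  "output_tokens": 200},
    {"input_tokens": 900,  "output_tokens": 250},
    {"input_tokens": 1400, "output_tokens": 300},
    {"input_tokens": 2000, "output_tokens": 300},
]

cpc = sum(cost_per_request(t["input_tokens"], t["output_tokens"]) for t in turns)
print(f"CPC = ${cpc:.4f}")
```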
Cost Per Successful Outcome (CPSO)
Definition: Cost attributed to successful task completions.
Formula:
CPSO = Total Cost / Number of Successful Outcomes
Target: This is your ROI metric. Lower is better.
Token Efficiency Ratio (TER)
Definition: Ratio of useful output to total tokens consumed.
Formula:
TER = Output Tokens / Total Tokens
Target: 0.2-0.4 for typical applications. Higher indicates efficient prompts.
Burn Rate
Definition: Rate of spending over time.
Formula:
Burn Rate = Total Cost / Time Period
Use: Set alerts for abnormal increases.
Latency Metrics
Time to First Token (TTFT)
Definition: Duration from request submission to first token received.
Formula:
TTFT = Timestamp(First Token) - Timestamp(Request Sent)
Target:
- Interactive: <500ms
- Search: <1000ms
- Background: <3000ms
Tokens Per Second (TPS)
Definition: Generation speed after first token.
Formula:
TPS = Output Tokens / (Total Time - TTFT)
Target: 30-50 TPS for good streaming experience.
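Example: a provider-agnostic sketch that measures both TTFT and TPS from a streaming response. The `stream` argument stands for any iterable of text chunks from your client's streaming API (an assumption), and chunk count is used as a rough proxy for output tokens.
```python
import time
from typing import Iterable, Tuple

def measure_stream(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed response.

    Chunks are counted as a rough proxy for tokens; swap in a real
    tokenizer if you need exact counts.
    """
    start = time.monotonic()
    first_token_at = None
    chunks = 0
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first token arrived
        chunks += 1
    end = time.monotonic()

    if first_token_at is None:
        first_token_at = end  # empty stream: no tokens received
    ttft = first_token_at - start
    tps = chunks / max(end - first_token_at, 1e-9)
    return ttft, tps
```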
End-to-End Latency (E2E)
Definition: Total time from user action to complete response.
Formula:
E2E = TTFT + Generation Time + Client Render Time
P50 / P95 / P99 Latency
Definition: Percentile latency distributions.
Formula:
P95 = Value at the 95th percentile of the latency distribution
Use: P95 is what roughly 1 in 20 requests experience, i.e. your slowest users. Optimize for this.
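Example: a minimal nearest-rank percentile sketch over recorded latencies; the sample values are illustrative.
```python
import math

def percentile(latencies_ms: list, p: float) -> float:
    """Nearest-rank percentile: percentile(samples, 95) -> P95."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

samples = [120, 135, 150, 180, 210, 260, 340, 520, 900, 1400]  # ms, illustrative
print(percentile(samples, 50), percentile(samples, 95), percentile(samples, 99))
```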
Cold Start Time
Definition: Additional latency on first request after idle.
Formula:
Cold Start = TTFT(First Request) - TTFT(Warm Request)
Target: <500ms for serverless deployments.
Quality Metrics
Relevance Score
Definition: How well the response addresses the query.
Measurement: Semantic similarity or LLM-as-judge scoring.
Scale: 0-1 (higher is better)
Target: >0.8 for production quality.
Groundedness Score
Definition: Degree to which response is supported by provided context (for RAG).
Measurement:
groundedness = claims_supported_by_context / total_claims
Target: >0.9 for factual applications.
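Example: a sketch of the ratio above; `extract_claims` and `is_supported` are hypothetical placeholders for whatever claim splitter and NLI or LLM-as-judge check you use.
```python
from typing import Callable, List

def groundedness_score(
    response: str,
    context: str,
    extract_claims: Callable[[str], List[str]],  # hypothetical: splits response into claims
    is_supported: Callable[[str, str], bool],    # hypothetical: NLI or LLM-as-judge check
) -> float:
    """groundedness = claims supported by context / total claims."""
    claims = extract_claims(response)
    if not claims:
        return 1.0  # no factual claims made, nothing to ground
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)
```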
Faithfulness Score
Definition: Whether response contradicts provided context.
Measurement: NLI model or LLM-as-judge.
Scale: 0-1 (1 = no contradictions)
Target: >0.95 for production quality.
Hallucination Rate
Definition: Percentage of responses containing fabricated information.
Formula:
Hallucination Rate = Responses with Hallucinations / Total Responses
Target: <5% for general applications, <1% for critical applications.
Eval Pass Rate
Definition: Percentage of test cases that pass quality thresholds.
Formula:
Pass Rate = Passed Test Cases / Total Test Cases
Target: >90% for production readiness. Track trends over time.
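Example: a sketch of computing pass rate from eval results; the metric names and thresholds are assumed examples, not fixed requirements.
```python
# Hypothetical eval results: one dict of scores per test case.
results = [
    {"relevance": 0.91, "faithfulness": 0.98},
    {"relevance": 0.76, "faithfulness": 0.99},
    {"relevance": 0.88, "faithfulness": 0.93},
]
thresholds = {"relevance": 0.8, "faithfulness": 0.95}  # assumed quality bars

passed = sum(
    all(case[metric] >= bar for metric, bar in thresholds.items())
    for case in results
)
pass_rate = passed / len(results)
print(f"Pass rate: {pass_rate:.0%}")  # 1 of 3 cases passes both bars here
```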
RAG-Specific Metrics
Retrieval Precision
Definition: Proportion of retrieved documents that are relevant.
Formula:
Precision = Relevant Retrieved / Total Retrieved
Target: >0.7 (higher means less noise).
Retrieval Recall
Definition: Proportion of relevant documents that were retrieved.
Formula:
Recall = Relevant Retrieved / Total Relevant
Target: >0.8 (higher means less missed information).
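Example: a sketch computing both precision (above) and recall from sets of document IDs; the IDs are illustrative.
```python
def retrieval_precision_recall(retrieved_ids: set, relevant_ids: set) -> tuple:
    """Precision = relevant retrieved / total retrieved; Recall = relevant retrieved / total relevant."""
    hits = len(retrieved_ids & relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

print(retrieval_precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d3", "d9"}))
# -> (0.5, 0.666...): half the retrieved docs were relevant,
#    and two thirds of the relevant docs were found
```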
Mean Reciprocal Rank (MRR)
Definition: Average across queries of the reciprocal rank of the first relevant result.
Formula:
MRR = (1/N) × Σ(1/rank_i)
Target: >0.5 (higher means relevant docs appear earlier).
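Example: a sketch of MRR over per-query ranks; using `None` to mark a query where no relevant document was retrieved is an assumption about how misses are recorded.
```python
from typing import Optional, List

def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """MRR = average of 1/rank of the first relevant result; None counts as 0."""
    if not first_relevant_ranks:
        return 0.0
    return sum(1 / r if r else 0.0 for r in first_relevant_ranks) / len(first_relevant_ranks)

# Query 1 hit at rank 1, query 2 at rank 3, query 3 retrieved nothing relevant.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.44
```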
Context Utilization
Definition: How much of retrieved context is actually used in response.
Measurement: Compare response to context overlap.
Target: >0.4 (low utilization suggests over-retrieval).
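Example: one crude way to approximate this is token overlap between the retrieved context and the response; production systems often use sentence-level attribution instead. A rough sketch:
```python
import re

def context_utilization(response: str, retrieved_context: str) -> float:
    """Rough proxy: share of retrieved-context tokens that also appear in the response."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    context_tokens = tokenize(retrieved_context)
    if not context_tokens:
        return 0.0
    return len(context_tokens & tokenize(response)) / len(context_tokens)
```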
Security Metrics
Injection Attempt Rate
Definition: Percentage of requests that appear to be injection attempts.
Measurement: Pattern matching + anomaly detection.
Target: Track baseline, alert on increases.
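Example: a sketch of the pattern-matching half only; the patterns are illustrative, and a real detector would add anomaly detection and a trained classifier.
```python
import re

# Illustrative patterns only; a production detector combines many more
# signals with anomaly detection and model-based classification.
INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"disregard (the )?system prompt",
    r"you are now (dan|developer mode)",
]

def looks_like_injection(user_input: str) -> bool:
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in INJECTION_PATTERNS)

def injection_attempt_rate(requests: list) -> float:
    flagged = sum(looks_like_injection(r) for r in requests)
    return flagged / len(requests) if requests else 0.0
```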
Injection Success Rate
Definition: Percentage of injection attempts that succeed.
Formula:
Success Rate = Successful Injections / Detected Attempts
Target: 0% (any success is a vulnerability).
PII Leak Rate
Definition: Percentage of responses containing PII.
Measurement: PII detection on outputs.
Target: 0% for customer-facing applications.
Jailbreak Rate
Definition: Percentage of requests that bypass safety filters.
Measurement: Safety classifier on outputs.
Target: <0.1% after filtering.
Agent Metrics
Agent Depth
Definition: Maximum nesting level of agent spawning.
Target: Set hard limits (typically 2-3 max).
Tool Call Count
Definition: Number of tool invocations per request.
Target: Set limits based on use case.
Loop Detection Rate
Definition: Percentage of agent runs that entered loops.
Formula:
Loop Rate = Runs with Loops / Total Runs
Target: <1% with proper circuit breakers.
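Example: one possible circuit breaker trips when the same (tool, arguments) signature repeats too often within a run; a minimal sketch, with an assumed repeat limit.
```python
from collections import Counter

class LoopCircuitBreaker:
    """Trips when an agent repeats the same (tool, arguments) call too many times."""

    def __init__(self, max_repeats: int = 3):  # assumed default limit
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, tool_name: str, arguments: str) -> None:
        signature = (tool_name, arguments)
        self.seen[signature] += 1
        if self.seen[signature] > self.max_repeats:
            raise RuntimeError(f"Loop detected: {tool_name} repeated {self.seen[signature]} times")
```
Call record() before each tool invocation and count any run that trips as a loop in the numerator above.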
Agent Success Rate
Definition: Percentage of agent tasks completed successfully.
Formula:
Success Rate = Successful Completions / Total Attempts
Target: >90% for production quality.
Cost Per Agent Run
Definition: Total cost of an agent execution including all sub-agents.
Formula:
Cost = Σ(All LLM calls + Tool costs)
Target: Set budget limits per run.
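Example: a sketch of per-run budget enforcement, assuming every LLM call and tool invocation (including sub-agents) reports its cost to a shared budget object.
```python
class RunBudget:
    """Accumulates cost across all LLM calls and tools in an agent run and enforces a cap."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        """Record one step's cost; raise once the run exceeds its budget."""
        self.spent_usd += amount_usd
        if self.spent_usd > self.limit_usd:
            raise RuntimeError(
                f"Agent run budget exceeded: ${self.spent_usd:.2f} of ${self.limit_usd:.2f}"
            )
```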
Operational Metrics
Request Success Rate
Definition: Percentage of requests that complete without error.
Formula:
Success Rate = (Total - Errors) / Total
Target: >99.5% for production systems.
Rate Limit Hit Rate
Definition: Percentage of requests that are rate limited.
Formula:
Hit Rate = Rate Limited Requests / Total Requests
Target: <1% (higher indicates capacity issues).
Cache Hit Rate
Definition: Percentage of requests served from cache.
Formula:
Hit Rate = Cached Responses / Total Requests
Target: 20-50% for typical applications.
Error Rate by Type
Definition: Breakdown of errors by category.
Categories:
- rate_limit - Provider throttling
- timeout - Request timeout
- invalid_request - Bad input
- model_error - Model failure
- internal - Your application error
Target: Track each type separately. Alert on anomalies.
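Example: a minimal sketch of computing per-category error rates from a hypothetical error log that uses the categories above.
```python
from collections import Counter

# Hypothetical error log: one category string per failed request.
errors = ["rate_limit", "timeout", "rate_limit", "internal", "model_error"]
total_requests = 1_000

rates = {category: count / total_requests for category, count in Counter(errors).items()}
print(rates)  # e.g. {'rate_limit': 0.002, 'timeout': 0.001, ...}
```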
Aggregation Guidance
Time Windows
| Metric Type | Typical Window |
|---|---|
| Cost | Daily, Weekly, Monthly |
| Latency | Real-time, Hourly |
| Quality | Per-release, Weekly |
| Security | Real-time |
| Operational | Real-time, Hourly |
Dimensions to Slice By
- Model
- Feature / endpoint
- User segment
- Geographic region
- Time of day