AI Observability Metrics Glossary
The complete reference for metrics that matter in production AI systems.
Cost Metrics
Cost Per Request (CPR)
Definition: The total cost of a single LLM API call.
Formula:
CPR = (Input Tokens × Input Price) + (Output Tokens × Output Price)
Target: Depends on use case. Support bot: $0.01-0.05. Complex analysis: $0.10-0.50.
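Example: a minimal Python sketch of the calculation; the per-token prices are illustrative assumptions, not any provider's actual pricing.
```python
# Illustrative prices only; providers typically quote per 1M tokens.
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000    # assumed $3.00 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 15.00 / 1_000_000  # assumed $15.00 per 1M output tokens

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """CPR = input tokens x input price + output tokens x output price."""
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

print(f"${cost_per_request(2_000, 500):.4f}")  # -> $0.0135 at the assumed prices
```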
Cost Per Conversation (CPC)
Definition: Total cost across all turns of a conversation.
Formula:
CPC = Σ(CPR for each turn)
Note: Grows roughly quadratically with turn count, because each turn re-sends the accumulated context as input tokens.
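Example: a sketch summing per-turn cost for a hypothetical four-turn conversation, reusing cost_per_request from the CPR sketch above; the token counts are made up but show input tokens growing as context accumulates.
```python
# Hypothetical conversation: each turn re-sends prior context, so
# input tokens grow every turn and cumulative cost grows superlinearly.
turns = [
    {"input_tokens": 500,  "output_tokens": 200},
    {"input_tokens": 900,  "output_tokens": 250},
    {"input_tokens": 1400, "output_tokens": 300},
    {"input_tokens": 2000, "output_tokens": 300},
]

cpc = sum(cost_per_request(t["input_tokens"], t["output_tokens"]) for t in turns)
print(f"CPC = ${cpc:.4f}")
```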
Cost Per Successful Outcome (CPSO)
Definition: Cost attributed to successful task completions.
Formula:
CPSO = Total Cost / Number of Successful Outcomes
Target: This is your ROI metric. Lower is better.
Token Efficiency Ratio (TER)
Definition: Ratio of useful output to total tokens consumed.
Formula:
TER = Output Tokens / Total Tokens
Target: 0.2-0.4 for typical applications. Higher indicates efficient prompts.
Burn Rate
Definition: Rate of spending over time.
Formula:
Burn Rate = Total Cost / Time Period
Use: Set alerts for abnormal increases.
Latency Metrics
Time to First Token (TTFT)
Definition: Duration from request submission to first token received.
Formula:
TTFT = Timestamp(First Token) - Timestamp(Request Sent)
Target:
- Interactive: <500ms
- Search: <1000ms
- Background: <3000ms
Tokens Per Second (TPS)
Definition: Generation speed after first token.
Formula:
TPS = Output Tokens / (Total Time - TTFT)
Target: 30-50 TPS for good streaming experience.
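Example: a provider-agnostic sketch that measures both TTFT and TPS from a streaming response. The `stream` argument stands for any iterable of text chunks from your client's streaming API (an assumption), and chunk count is used as a rough proxy for output tokens.
```python
import time
from typing import Iterable, Tuple

def measure_stream(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed response.

    Chunks are counted as a rough proxy for tokens; swap in a real
    tokenizer if you need exact counts.
    """
    start = time.monotonic()
    first_token_at = None
    chunks = 0
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first token arrived
        chunks += 1
    end = time.monotonic()

    if first_token_at is None:
        first_token_at = end  # empty stream: no tokens received
    ttft = first_token_at - start
    tps = chunks / max(end - first_token_at, 1e-9)
    return ttft, tps
```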
End-to-End Latency (E2E)
Definition: Total time from user action to complete response.
Formula:
E2E = TTFT + Generation Time + Client Render Time
P50 / P95 / P99 Latency
Definition: Percentile latency distributions.
Formula:
P95 = Value at the 95th percentile of the latency distribution
Use: P95 is what roughly 1 in 20 requests experience, i.e. your slowest users. Optimize for this.
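Example: a minimal nearest-rank percentile sketch over recorded latencies; the sample values are illustrative.
```python
import math

def percentile(latencies_ms: list, p: float) -> float:
    """Nearest-rank percentile: percentile(samples, 95) -> P95."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

samples = [120, 135, 150, 180, 210, 260, 340, 520, 900, 1400]  # ms, illustrative
print(percentile(samples, 50), percentile(samples, 95), percentile(samples, 99))
```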
Cold Start Time
Definition: Additional latency on first request after idle.
Formula:
Cold Start = TTFT(First Request) - TTFT(Warm Request)
Target: <500ms for serverless deployments.
Quality Metrics
Relevance Score
Definition: How well the response addresses the query.
Measurement: Semantic similarity or LLM-as-judge scoring.
Scale: 0-1 (higher is better)
Target: >0.8 for production quality.
Groundedness Score
Definition: Degree to which response is supported by provided context (for RAG).
Measurement:
groundedness = claims_supported_by_context / total_claims
Target: >0.9 for factual applications.
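Example: a sketch of the ratio above; `extract_claims` and `is_supported` are hypothetical placeholders for whatever claim splitter and NLI or LLM-as-judge check you use.
```python
from typing import Callable, List

def groundedness_score(
    response: str,
    context: str,
    extract_claims: Callable[[str], List[str]],  # hypothetical: splits response into claims
    is_supported: Callable[[str, str], bool],    # hypothetical: NLI or LLM-as-judge check
) -> float:
    """groundedness = claims supported by context / total claims."""
    claims = extract_claims(response)
    if not claims:
        return 1.0  # no factual claims made, nothing to ground
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)
```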
Faithfulness Score
Definition: Whether response contradicts provided context.
Measurement: NLI model or LLM-as-judge.
Scale: 0-1 (1 = no contradictions)
Target: >0.95 for production quality.
Hallucination Rate
Definition: Percentage of responses containing fabricated information.
Formula:
Hallucination Rate = Responses with Hallucinations / Total Responses
Target: <5% for general applications, <1% for critical applications.
Eval Pass Rate
Definition: Percentage of test cases that pass quality thresholds.
Formula:
Pass Rate = Passed Test Cases / Total Test Cases
Target: >90% for production readiness. Track trends over time.
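Example: a sketch of computing pass rate from eval results; the metric names and thresholds are assumed examples, not fixed requirements.
```python
# Hypothetical eval results: one dict of scores per test case.
results = [
    {"relevance": 0.91, "faithfulness": 0.98},
    {"relevance": 0.76, "faithfulness": 0.99},
    {"relevance": 0.88, "faithfulness": 0.93},
]
thresholds = {"relevance": 0.8, "faithfulness": 0.95}  # assumed quality bars

passed = sum(
    all(case[metric] >= bar for metric, bar in thresholds.items())
    for case in results
)
pass_rate = passed / len(results)
print(f"Pass rate: {pass_rate:.0%}")  # 1 of 3 cases passes both bars here
```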
RAG-Specific Metrics
Retrieval Precision
Definition: Proportion of retrieved documents that are relevant.
Formula:
Precision = Relevant Retrieved / Total Retrieved
Target: >0.7 (higher means less noise).
Retrieval Recall
Definition: Proportion of relevant documents that were retrieved.
Formula:
Recall = Relevant Retrieved / Total Relevant
Target: >0.8 (higher means less missed information).
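Example: a sketch computing both precision (above) and recall from sets of document IDs; the IDs are illustrative.
```python
def retrieval_precision_recall(retrieved_ids: set, relevant_ids: set) -> tuple:
    """Precision = relevant retrieved / total retrieved; Recall = relevant retrieved / total relevant."""
    hits = len(retrieved_ids & relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

print(retrieval_precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d3", "d9"}))
# -> (0.5, 0.666...): half the retrieved docs were relevant,
#    and two thirds of the relevant docs were found
```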
Mean Reciprocal Rank (MRR)
Definition: Average across queries of the reciprocal rank of the first relevant result.
Formula:
MRR = (1/N) × Σ(1/rank_i)
Target: >0.5 (higher means relevant docs appear earlier).
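Example: a sketch of MRR over per-query ranks; using `None` to mark a query where no relevant document was retrieved is an assumption about how misses are recorded.
```python
from typing import Optional, List

def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """MRR = average of 1/rank of the first relevant result; None counts as 0."""
    if not first_relevant_ranks:
        return 0.0
    return sum(1 / r if r else 0.0 for r in first_relevant_ranks) / len(first_relevant_ranks)

# Query 1 hit at rank 1, query 2 at rank 3, query 3 retrieved nothing relevant.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.44
```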
Context Utilization
Definition: How much of retrieved context is actually used in response.
Measurement: Compare response to context overlap.
Target: >0.4 (low utilization suggests over-retrieval).
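Example: one crude way to approximate this is token overlap between the retrieved context and the response; production systems often use sentence-level attribution instead. A rough sketch:
```python
import re

def context_utilization(response: str, retrieved_context: str) -> float:
    """Rough proxy: share of retrieved-context tokens that also appear in the response."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    context_tokens = tokenize(retrieved_context)
    if not context_tokens:
        return 0.0
    return len(context_tokens & tokenize(response)) / len(context_tokens)
```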
Security Metrics
Injection Attempt Rate
Definition: Percentage of requests that appear to be injection attempts.
Measurement: Pattern matching + anomaly detection.
Target: Track baseline, alert on increases.
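Example: a sketch of the pattern-matching half only; the patterns are illustrative, and a real detector would add anomaly detection and a trained classifier.
```python
import re

# Illustrative patterns only; a production detector combines many more
# signals with anomaly detection and model-based classification.
INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"disregard (the )?system prompt",
    r"you are now (dan|developer mode)",
]

def looks_like_injection(user_input: str) -> bool:
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in INJECTION_PATTERNS)

def injection_attempt_rate(requests: list) -> float:
    flagged = sum(looks_like_injection(r) for r in requests)
    return flagged / len(requests) if requests else 0.0
```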
Injection Success Rate
Definition: Percentage of injection attempts that succeed.
Formula:
Success Rate = Successful Injections / Detected Attempts
Target: 0% (any success is a vulnerability).
PII Leak Rate
Definition: Percentage of responses containing PII.
Measurement: PII detection on outputs.
Target: 0% for customer-facing applications.
Jailbreak Rate
Definition: Percentage of requests that bypass safety filters.
Measurement: Safety classifier on outputs.
Target: <0.1% after filtering.
Agent Metrics
Agent Depth
Definition: Maximum nesting level of agent spawning.
Target: Set hard limits (typically 2-3 max).
Tool Call Count
Definition: Number of tool invocations per request.
Target: Set limits based on use case.
Loop Detection Rate
Definition: Percentage of agent runs that entered loops.
Formula:
Loop Rate = Runs with Loops / Total Runs
Target: <1% with proper circuit breakers.
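Example: one possible circuit breaker trips when the same (tool, arguments) signature repeats too often within a run; a minimal sketch, with an assumed repeat limit.
```python
from collections import Counter

class LoopCircuitBreaker:
    """Trips when an agent repeats the same (tool, arguments) call too many times."""

    def __init__(self, max_repeats: int = 3):  # assumed default limit
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, tool_name: str, arguments: str) -> None:
        signature = (tool_name, arguments)
        self.seen[signature] += 1
        if self.seen[signature] > self.max_repeats:
            raise RuntimeError(f"Loop detected: {tool_name} repeated {self.seen[signature]} times")
```
Call record() before each tool invocation and count any run that trips as a loop in the numerator above.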
Agent Success Rate
Definition: Percentage of agent tasks completed successfully.
Formula:
Success Rate = Successful Completions / Total Attempts
Target: >90% for production quality.
Cost Per Agent Run
Definition: Total cost of an agent execution including all sub-agents.
Formula:
Cost = Σ(All LLM calls + Tool costs)
Target: Set budget limits per run.
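Example: a sketch of per-run budget enforcement, assuming every LLM call and tool invocation (including sub-agents) reports its cost to a shared budget object.
```python
class RunBudget:
    """Accumulates cost across all LLM calls and tools in an agent run and enforces a cap."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        """Record one step's cost; raise once the run exceeds its budget."""
        self.spent_usd += amount_usd
        if self.spent_usd > self.limit_usd:
            raise RuntimeError(
                f"Agent run budget exceeded: ${self.spent_usd:.2f} of ${self.limit_usd:.2f}"
            )
```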
Operational Metrics
Request Success Rate
Definition: Percentage of requests that complete without error.
Formula:
Success Rate = (Total - Errors) / Total
Target: >99.5% for production systems.
Rate Limit Hit Rate
Definition: Percentage of requests that are rate limited.
Formula:
Hit Rate = Rate Limited Requests / Total Requests
Target: <1% (higher indicates capacity issues).
Cache Hit Rate
Definition: Percentage of requests served from cache.
Formula:
Hit Rate = Cached Responses / Total Requests
Target: 20-50% for typical applications.
Error Rate by Type
Definition: Breakdown of errors by category.
Categories:
- rate_limit - Provider throttling
- timeout - Request timeout
- invalid_request - Bad input
- model_error - Model failure
- internal - Your application error
Target: Track each type separately. Alert on anomalies.
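Example: a minimal sketch of computing per-category error rates from a hypothetical error log that uses the categories above.
```python
from collections import Counter

# Hypothetical error log: one category string per failed request.
errors = ["rate_limit", "timeout", "rate_limit", "internal", "model_error"]
total_requests = 1_000

rates = {category: count / total_requests for category, count in Counter(errors).items()}
print(rates)  # e.g. {'rate_limit': 0.002, 'timeout': 0.001, ...}
```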
Aggregation Guidance
Time Windows
| Metric Type | Typical Window |
|---|---|
| Cost | Daily, Weekly, Monthly |
| Latency | Real-time, Hourly |
| Quality | Per-release, Weekly |
| Security | Real-time |
| Operational | Real-time, Hourly |
Dimensions to Slice By
- Model
- Feature / endpoint
- User segment
- Geographic region
- Time of day