
LLM Call Logging: What to Capture and How

A healthcare AI startup faced a $120,000 surprise bill after enabling 100% request-response logging for their HIPAA-compliant LLM application. Their logs grew to 2TB in a single month—costing more in storage than the actual model inference. This guide shows you how to capture everything you need for observability and compliance without destroying your budget.

LLM call logging is the foundation of production observability, but it’s a double-edged sword. Without logs, you can’t debug failures, optimize prompts, or prove compliance. With poorly implemented logging, costs spiral out of control.

The core challenge is token economics: every logged request adds storage costs that scale with your token usage. For a system processing 1M requests/day at an average of 1,000 tokens per request, storing full request and response payloads can consume the equivalent of 20-30% of your total token budget if left unmanaged (cloud.google.com/vertex-ai/generative-ai/pricing).
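As a quick back-of-the-envelope check (a sketch; the per-GB price below is an assumed placeholder for whatever your log platform actually charges for ingest and retention):

```python
# Back-of-the-envelope estimate of full-payload log volume and storage cost.
REQUESTS_PER_DAY = 1_000_000
TOKENS_PER_REQUEST = 1_000          # prompt + response combined
BYTES_PER_TOKEN = 4                 # ~4 characters of English text per token
RETENTION_DAYS = 90
LOG_PRICE_PER_GB = 2.50             # assumed all-in ingest/index/retention price, USD

daily_gb = REQUESTS_PER_DAY * TOKENS_PER_REQUEST * BYTES_PER_TOKEN / 1e9
retained_gb = daily_gb * RETENTION_DAYS
monthly_cost = daily_gb * 30 * LOG_PRICE_PER_GB

print(f"{daily_gb:.1f} GB/day, {retained_gb:.0f} GB retained over {RETENTION_DAYS} days, "
      f"~${monthly_cost:,.0f}/month before sampling")
```

Run the same arithmetic with your own vendor's pricing before deciding how much payload to keep.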

The three pillars of logging value are:

  1. Debuggability: When a response goes wrong, you need the exact prompt, context, and model parameters that caused it
  2. Cost Attribution: Break down spend by user, team, feature, or use case
  3. Compliance: For regulated industries, immutable audit trails are non-negotiable

A production-ready LLM log entry should capture the complete lifecycle of each request. Here’s the canonical schema:

| Field | Type | Purpose | Required |
| --- | --- | --- | --- |
| request_id | UUID | Unique identifier for correlation | Yes |
| timestamp | ISO 8601 | When the request was made | Yes |
| model | String | Model name (e.g., "claude-3-5-sonnet") | Yes |
| user_id | String | Attribution for cost and abuse detection | Yes |
| input_tokens | Integer | Token count for billing and optimization | Yes |
| output_tokens | Integer | Token count for billing and optimization | Yes |
| latency_ms | Integer | Performance monitoring | Yes |
| prompt | String | Full system + user prompt (may be truncated) | Yes |
| response | String | Model output (may be truncated) | Yes |
| status | Enum | success / error / timeout / rate_limited | Yes |

Optional fields worth capturing when available:

| Field | Type | Purpose |
| --- | --- | --- |
| temperature | Float | Reproducibility and debugging |
| max_tokens | Integer | Request configuration |
| tools_used | Array | Tool call tracking for agents |
| context_window | Integer | Model context size |
| cache_hit | Boolean | Prompt caching efficiency |
| streaming | Boolean | Streaming vs. synchronous |
| region | String | Multi-region deployment tracking |
| version | String | Prompt template version |
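Concretely, a single entry following this schema might look like the record below (all values are illustrative):

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# Illustrative log entry following the schema above (values are made up).
entry = {
    "request_id": str(uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "claude-3-5-sonnet",
    "user_id": "a3f1c9d2e8b74f02",          # hashed, never the raw identifier
    "input_tokens": 812,
    "output_tokens": 243,
    "latency_ms": 1840,
    "prompt": "[truncated to 1,000 chars]",
    "response": "[truncated to 1,000 chars]",
    "status": "success",
    "temperature": 0.2,
    "cache_hit": True,
    "version": "checkout-assistant-v7",      # prompt template version
}
print(json.dumps(entry, indent=2))
```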

Logging LLM calls in production requires careful attention to data privacy regulations. The following framework ensures compliance while maintaining observability.

Redaction Strategies:

  1. Pattern-based redaction: Remove SSNs, credit cards, phone numbers using regex
  2. Hashing: Use SHA-256 for user IDs to maintain uniqueness without exposing identity
  3. Selective logging: Log metadata only for sensitive domains (e.g., healthcare, finance)

Retention Policies by Compliance Framework:

| Framework | Max Retention | Log Requirements | Special Notes |
| --- | --- | --- | --- |
| HIPAA | 6 years | Audit logs of all PHI access | Encrypt at rest, access logging |
| GDPR | 30-90 days (or on user request) | Right-to-erasure compliance | Must be able to delete a user's logs |
| SOC 2 | 1 year | All access and modifications | Immutable storage recommended |
| PCI DSS | 1 year | No PAN (card numbers) in logs | Tokenize before logging |
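For the GDPR right-to-erasure row, deletion has to work against whatever store the logs land in. A minimal sketch, assuming the BigQuery table configured in the setup steps below and the same truncated SHA-256 user_id hashing used at log time:

```python
import hashlib

from google.cloud import bigquery  # assumes google-cloud-bigquery is installed


def erase_user_logs(user_id: str, table: str = "your-project.llm_logs.request_response") -> int:
    """Delete all log rows for one user (GDPR right to erasure)."""
    # Hash the raw identifier the same way the logger did before writing.
    hashed = hashlib.sha256(user_id.encode()).hexdigest()[:16]
    client = bigquery.Client()
    job = client.query(
        f"DELETE FROM `{table}` WHERE user_id = @uid",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("uid", "STRING", hashed)]
        ),
    )
    job.result()  # wait for the DML job to finish
    return job.num_dml_affected_rows or 0
```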
The class below is a reference implementation of these rules: it redacts PII patterns, hashes user IDs, truncates payloads, and samples successful requests while always keeping errors.

```python
import hashlib
import random
import re
from typing import Any, Dict, Optional


class PrivacySafeLLMLogger:
    """Production logger with PII redaction and sampling."""

    def __init__(self, sampling_rate: float = 0.1):
        self.sampling_rate = sampling_rate

    def _redact_pii(self, text: str) -> str:
        """Remove SSNs, credit cards, and phone numbers before logging."""
        patterns = [
            (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]'),                        # SSN
            (r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', '[CC]'),    # Credit card
            (r'\b\d{3}[\s-]?\d{3}[\s-]?\d{4}\b', '[PHONE]'),            # Phone
        ]
        for pattern, replacement in patterns:
            text = re.sub(pattern, replacement, text)
        return text

    def _hash_user_id(self, user_id: str) -> str:
        """Deterministic hash for user attribution without exposing identity."""
        return hashlib.sha256(user_id.encode()).hexdigest()[:16]

    def should_log(self, status: str) -> bool:
        """Always log errors; sample successes at sampling_rate."""
        if status != "success":
            return True
        return random.random() < self.sampling_rate

    def format_log(self, data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Format a log entry with privacy protections, or return None if sampled out."""
        if not self.should_log(data.get("status", "success")):
            return None
        return {
            "request_id": data["request_id"],
            "timestamp": data["timestamp"],
            "model": data["model"],
            "user_id": self._hash_user_id(data["user_id"]),
            "input_tokens": data["input_tokens"],
            "output_tokens": data["output_tokens"],
            "latency_ms": data["latency_ms"],
            "prompt": self._redact_pii(data.get("prompt", ""))[:1000],
            "response": self._redact_pii(data.get("response", ""))[:1000],
            "status": data["status"],
            "temperature": data.get("temperature"),
            "cache_hit": data.get("cache_hit", False),
        }
```
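Usage is a single call at the end of your request handler; the record values below are illustrative placeholders:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = PrivacySafeLLMLogger(sampling_rate=0.1)

# Illustrative record assembled after an LLM call completes.
record = {
    "request_id": "9f2c4e9a-7c1b-4b0e-9a1d-3f5e8c2d1a77",
    "timestamp": "2025-01-15T09:30:00Z",
    "model": "claude-3-5-sonnet",
    "user_id": "user-4821@example.com",
    "input_tokens": 812,
    "output_tokens": 243,
    "latency_ms": 1840,
    "prompt": "Summarize the visit notes for SSN 123-45-6789 ...",
    "response": "Here is the summary ...",
    "status": "success",
    "temperature": 0.2,
}

entry = logger.format_log(record)
if entry is not None:                   # None means the request was sampled out
    logging.info(json.dumps(entry))     # ship to your log sink of choice
```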

Effective LLM logging directly impacts your bottom line and operational resilience: without proper observability, you're flying blind when costs spike or quality degrades. On Google Cloud, the setup below routes sampled request-response logs into BigQuery and puts a retention policy on them:

  1. Enable Request-Response Logging

    import vertexai
    from vertexai.preview.generative_models import GenerativeModel
    vertexai.init(project="your-project", location="us-central1")
    model = GenerativeModel("gemini-2.5-flash")
    # Configure logging with sampling
    model.set_request_response_logging_config(
        enabled=True,
        sampling_rate=0.1,  # 10% sampling for cost control
        bigquery_destination="bq://your-project.llm_logs.request_response",
        enable_otel_logging=True,
    )
  2. Configure BigQuery Schema

    CREATE TABLE llm_logs.request_response (
      request_id STRING,
      timestamp TIMESTAMP,
      model STRING,
      user_id STRING,
      input_tokens INT64,
      output_tokens INT64,
      latency_ms INT64,
      prompt STRING,
      response STRING,
      status STRING,
      temperature FLOAT64,
      cache_hit BOOL
    )
    PARTITION BY DATE(timestamp)
    CLUSTER BY model, user_id;
  3. Set Retention Policy

    -- Rolling 90-day retention: each daily partition is dropped 90 days after its date
    ALTER TABLE llm_logs.request_response
    SET OPTIONS (
      partition_expiration_days = 90
    );
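With logs landing in the BigQuery table from step 2, cost attribution (the second pillar above) becomes a query. A sketch using the google-cloud-bigquery client; the table name and per-million-token prices are placeholders to replace with your own:

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

# Placeholder prices in USD per 1M tokens -- substitute your model's actual rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

QUERY = """
SELECT
  user_id,
  SUM(input_tokens)  AS input_tokens,
  SUM(output_tokens) AS output_tokens,
  SUM(input_tokens) / 1e6 * @in_price
    + SUM(output_tokens) / 1e6 * @out_price AS est_cost_usd
FROM `your-project.llm_logs.request_response`
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user_id
ORDER BY est_cost_usd DESC
LIMIT 20
"""

client = bigquery.Client()
job = client.query(
    QUERY,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("in_price", "FLOAT64", INPUT_PRICE_PER_M),
            bigquery.ScalarQueryParameter("out_price", "FLOAT64", OUTPUT_PRICE_PER_M),
        ]
    ),
)
for row in job.result():
    print(row.user_id, f"${row.est_cost_usd:,.2f}")
```

If successful requests are sampled, scale the summed token counts by 1/sampling_rate (or log token counts for every request as metadata even when you drop the payload) before treating this as a spend report.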
Whatever providers you call, it helps to normalize every request into a single typed record that mirrors the canonical schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMLogEntry:
    """One logged LLM call, mirroring the canonical schema above."""
    request_id: str
    timestamp: str
    model: str
    user_id: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    prompt: str
    response: str
    status: str
    temperature: Optional[float] = None
    cache_hit: Optional[bool] = False
```
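A hedged sketch of how such a record gets populated in practice, here wrapping an OpenAI chat call (the helper name logged_chat and the model choice are illustrative, not a prescribed interface):

```python
import time
import uuid
from datetime import datetime, timezone

from openai import OpenAI  # assumes the openai>=1.x SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def logged_chat(user_id: str, prompt: str, model: str = "gpt-4o-mini") -> LLMLogEntry:
    """Run one chat completion and normalize it into an LLMLogEntry."""
    start = time.perf_counter()
    status = "success"
    text, usage = "", None
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        text = resp.choices[0].message.content or ""
        usage = resp.usage
    except Exception:
        status = "error"
    return LLMLogEntry(
        request_id=str(uuid.uuid4()),
        timestamp=datetime.now(timezone.utc).isoformat(),
        model=model,
        user_id=user_id,                      # hash before persisting (see logger above)
        input_tokens=usage.prompt_tokens if usage else 0,
        output_tokens=usage.completion_tokens if usage else 0,
        latency_ms=int((time.perf_counter() - start) * 1000),
        prompt=prompt,
        response=text,
        status=status,
        temperature=0.2,
    )
```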

Avoid these production mistakes that lead to cost overruns, compliance violations, or unusable logs:

| Pitfall | Impact | Prevention Strategy |
| --- | --- | --- |
| Full payload logging without sampling | Storage costs can exceed inference costs by 2-3x | Implement 1-10% sampling for success, 100% for errors |
| No PII redaction | GDPR/CCPA fines up to $50k per violation | Use regex patterns and hashing before logging |
| Unstructured text logs | Impossible to query or analyze at scale | Always use structured JSON with consistent schema |
| Ignoring token overhead | Logs consume 20-30% of token budget | Calculate storage costs in your total cost model |
| No retention policies | Indefinite storage growth | Set 30-90 day auto-deletion for GDPR compliance |
| Missing correlation IDs | Distributed tracing fails | Generate UUIDs and pass through all services |
| Synchronous logging | Adds 50-200ms latency to user calls | Use async queues or background workers |
| Storing API keys in logs | Security breach risk | Filter secrets with REDACTED patterns |
| No versioning | Breaking schema changes | Include schema_version field in all logs |
| Ignoring latency metrics | Performance regression blind spots | Always log latency_ms alongside responses |
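The synchronous-logging pitfall is worth a concrete illustration. A minimal background-worker sketch using only the standard library (the queue size and the drop-on-full policy are assumptions to tune for your traffic):

```python
import json
import queue
import threading

# Bounded queue so a slow log sink can never block or bloat the request path.
_log_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)


def _writer() -> None:
    """Background worker: drain the queue and write entries to the sink."""
    while True:
        entry = _log_queue.get()
        try:
            # Replace with your real sink (BigQuery streaming insert, Kafka, file, ...).
            print(json.dumps(entry))
        finally:
            _log_queue.task_done()


threading.Thread(target=_writer, daemon=True).start()


def log_async(entry: dict) -> None:
    """Enqueue without blocking; drop the entry if the queue is full."""
    try:
        _log_queue.put_nowait(entry)
    except queue.Full:
        pass  # dropping a sampled log beats adding latency to a user request
```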

Use this formula to determine your optimal sampling rate:
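A minimal sketch of that calculation, assuming the constraint you care about is a fixed monthly log budget (all volumes and prices here are illustrative placeholders):

```python
def optimal_sampling_rate(
    monthly_log_budget_usd: float,
    requests_per_month: float,
    avg_bytes_per_entry: float,     # after redaction and truncation
    log_cost_per_gb_usd: float,     # all-in ingest + index + retention price
) -> float:
    """Largest success-sampling rate that keeps log spend within budget (capped at 1.0)."""
    cost_at_full_logging = (
        requests_per_month * avg_bytes_per_entry / 1e9 * log_cost_per_gb_usd
    )
    return min(1.0, monthly_log_budget_usd / cost_at_full_logging)


# Illustrative numbers only -- substitute your own traffic and vendor pricing.
rate = optimal_sampling_rate(
    monthly_log_budget_usd=100,
    requests_per_month=30_000_000,
    avg_bytes_per_entry=4_000,
    log_cost_per_gb_usd=5.00,
)
print(f"sample successful requests at ~{rate:.0%}")   # ~17% with these inputs
```

Pair the computed rate with the always-log-errors rule from PrivacySafeLLMLogger so failures never get sampled away.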
