A healthcare AI startup faced a $120,000 surprise bill after enabling 100% request-response logging for their HIPAA-compliant LLM application. Their logs grew to 2TB in a single month—costing more in storage than the actual model inference. This guide shows you how to capture everything you need for observability and compliance without destroying your budget.
LLM call logging is the foundation of production observability, but it’s a double-edged sword. Without logs, you can’t debug failures, optimize prompts, or prove compliance. With poorly implemented logging, costs spiral out of control.
The core challenge is token economics: every logged request adds storage and ingestion costs that scale with your token usage. For a system processing 1M requests/day at an average of 1,000 tokens per request, storing full request and response payloads can consume the equivalent of 20-30% of your total token budget if not managed properly.
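The raw volume adds up quickly. A rough back-of-the-envelope sketch (the ~4 bytes-per-token figure is an assumption for English prose, not a measured value):

```python
# Rough payload volume for 1M requests/day at 1,000 tokens per request
requests_per_day = 1_000_000
tokens_per_request = 1_000
bytes_per_token = 4  # assumption: ~4 bytes of raw text per token for English prose

monthly_bytes = requests_per_day * 30 * tokens_per_request * bytes_per_token
print(f"~{monthly_bytes / 1e9:,.0f} GB of raw payload text per month")  # ~120 GB
```

Indexing, replication, and per-GB ingestion fees on managed logging platforms multiply that raw figure further.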
The three pillars of logging value are:

- **Debuggability**: When a response goes wrong, you need the exact prompt, context, and model parameters that caused it
- **Cost Attribution**: Break down spend by user, team, feature, or use case
- **Compliance**: For regulated industries, immutable audit trails are non-negotiable
A production-ready LLM log entry should capture the complete lifecycle of each request. Here’s the canonical schema:
Required fields:

| Field | Type | Purpose | Required |
|---|---|---|---|
| `request_id` | UUID | Unique identifier for correlation | Yes |
| `timestamp` | ISO 8601 | When the request was made | Yes |
| `model` | String | Model name (e.g., "claude-3-5-sonnet") | Yes |
| `user_id` | String | Attribution for cost and abuse detection | Yes |
| `input_tokens` | Integer | Token count for billing and optimization | Yes |
| `output_tokens` | Integer | Token count for billing and optimization | Yes |
| `latency_ms` | Integer | Performance monitoring | Yes |
| `prompt` | String | Full system + user prompt (may be truncated) | Yes |
| `response` | String | Model output (may be truncated) | Yes |
| `status` | Enum | `success` / `error` / `timeout` / `rate_limited` | Yes |
Optional fields:

| Field | Type | Purpose |
|---|---|---|
| `temperature` | Float | Reproducibility and debugging |
| `max_tokens` | Integer | Request configuration |
| `tools_used` | Array | Tool call tracking for agents |
| `context_window` | Integer | Model context size |
| `cache_hit` | Boolean | Prompt caching efficiency |
| `streaming` | Boolean | Streaming vs. synchronous |
| `region` | String | Multi-region deployment tracking |
| `version` | String | Prompt template version |
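Putting the two tables together, a single log entry might look like the following (a minimal sketch with illustrative values; field names match the schema above):

```python
log_entry = {
    # Required fields
    "request_id": "3f8a2c1e-7b4d-4e2a-9c61-5d0f8b2a1c9e",
    "timestamp": "2025-01-15T08:42:17Z",
    "model": "claude-3-5-sonnet",
    "user_id": "a1b2c3d4e5f60718",        # SHA-256 hash prefix, never the raw ID
    "input_tokens": 812,
    "output_tokens": 164,
    "latency_ms": 930,
    "prompt": "Summarize the claim for member [SSN]...",   # redacted, then truncated
    "response": "Here is the summary you asked for...",
    "status": "success",
    # Optional fields
    "temperature": 0.2,
    "cache_hit": True,
    "version": "summarizer-v3",
}
```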
PII Redaction is Mandatory
Never log raw user inputs containing PII (Personally Identifiable Information). Use deterministic hashing for user IDs and redact sensitive data before logging. Failure to do so violates GDPR, HIPAA, and CCPA, and can result in fines exceeding $50,000 per violation.
Logging LLM calls in production requires careful attention to data privacy regulations. The following framework ensures compliance while maintaining observability.
Redaction Strategies:
- **Pattern-based redaction**: Remove SSNs, credit cards, and phone numbers using regex
- **Hashing**: Use SHA-256 for user IDs to maintain uniqueness without exposing identity
- **Selective logging**: Log metadata only for sensitive domains (e.g., healthcare, finance)
Retention Policies by Compliance Framework:
| Framework | Max Retention | Log Requirements | Special Notes |
|---|---|---|---|
| HIPAA | 6 years | Audit logs of all PHI access | Encrypt at rest, access logging |
| GDPR | 30-90 days (or on user request) | Right-to-erasure compliance | Must be able to delete a user's logs |
| SOC 2 | 1 year | All access and modifications | Immutable storage recommended |
| PCI DSS | 1 year | No PAN (card numbers) in logs | Tokenize before logging |
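The GDPR row is the one that usually needs dedicated tooling: you must be able to delete a specific user's log rows on request. A minimal sketch, assuming logs land in the BigQuery table configured later in this guide and that user IDs are stored as SHA-256 hash prefixes (the table name here is an illustrative placeholder):

```python
import hashlib

from google.cloud import bigquery


def erase_user_logs(raw_user_id: str,
                    table: str = "your-project.llm_logs.request_response") -> int:
    """Delete every log row for one user (GDPR right to erasure)."""
    # Same deterministic hashing used by the privacy-safe logger below
    hashed = hashlib.sha256(raw_user_id.encode()).hexdigest()[:16]

    client = bigquery.Client()
    job = client.query(
        f"DELETE FROM `{table}` WHERE user_id = @uid",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("uid", "STRING", hashed)]
        ),
    )
    job.result()  # block until the delete finishes
    return job.num_dml_affected_rows or 0
```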
The logger below ties these pieces together: regex-based redaction, deterministic user-ID hashing, and error-biased sampling.

```python
import hashlib
import re
import time
from typing import Any, Dict, Optional


class PrivacySafeLLMLogger:
    """Production logger with PII redaction and sampling."""

    def __init__(self, sampling_rate: float = 0.1):
        self.sampling_rate = sampling_rate

    def _redact_pii(self, text: str) -> str:
        """Remove SSNs, credit cards, and phone numbers."""
        patterns = [
            (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]'),                      # SSN
            (r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', '[CC]'),  # Credit card
            (r'\b\d{3}[\s-]?\d{3}[\s-]?\d{4}\b', '[PHONE]'),          # Phone
        ]
        for pattern, replacement in patterns:
            text = re.sub(pattern, replacement, text)
        return text

    def _hash_user_id(self, user_id: str) -> str:
        """Deterministic hash for user attribution."""
        return hashlib.sha256(user_id.encode()).hexdigest()[:16]

    def should_log(self, status: str) -> bool:
        """Always log errors, sample successes."""
        return status != "success" or (hash(str(time.time())) % 100) < (self.sampling_rate * 100)

    def format_log(self, data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Format a log entry with privacy protections; returns None when sampled out."""
        if not self.should_log(data.get("status", "success")):
            return None
        return {
            "request_id": data["request_id"],
            "timestamp": data["timestamp"],
            "model": data.get("model"),
            "user_id": self._hash_user_id(data["user_id"]),
            "input_tokens": data["input_tokens"],
            "output_tokens": data["output_tokens"],
            "latency_ms": data["latency_ms"],
            "prompt": self._redact_pii(data.get("prompt", ""))[:1000],    # redact, then truncate
            "response": self._redact_pii(data.get("response", ""))[:1000],
            "status": data["status"],
            "temperature": data.get("temperature"),
            "cache_hit": data.get("cache_hit", False),
        }
```
Effective LLM logging directly impacts your bottom line and operational resilience. Without proper observability, you're flying blind when costs spike or quality degrades, and the token-budget economics described earlier apply on every platform (pricing reference: cloud.google.com/vertex-ai/generative-ai/pricing). The walkthroughs below cover Google Vertex AI, Azure Application Insights, and MLflow.
Enable Request-Response Logging
```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel

vertexai.init(project="your-project", location="us-central1")
model = GenerativeModel("gemini-2.5-flash")

# Configure request-response logging with sampling
model.set_request_response_logging_config(
    sampling_rate=0.1,  # 10% sampling for cost control
    bigquery_destination="bq://your-project.llm_logs.request_response",
)
```
Configure BigQuery Schema
```sql
CREATE TABLE llm_logs.request_response (
  -- Columns mirror the canonical schema above; adjust types as needed
  request_id STRING NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  model STRING,
  user_id STRING,
  input_tokens INT64,
  output_tokens INT64,
  latency_ms INT64,
  prompt STRING,
  response STRING,
  status STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY model, user_id;
```
Set Retention Policy
```sql
ALTER TABLE llm_logs.request_response
SET OPTIONS (
  -- Rolling 90-day retention: each daily partition expires on its own schedule.
  -- (A table-level expiration_timestamp would delete the whole table at once.)
  partition_expiration_days = 90
);
```
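With logs landing in BigQuery, the cost-attribution pillar becomes a query. A sketch (project, dataset, and column names follow the examples above):

```python
from google.cloud import bigquery

# Top 20 users by token consumption over the last 30 days
sql = """
SELECT
  user_id,
  SUM(input_tokens + output_tokens) AS total_tokens,
  COUNT(*) AS requests
FROM `your-project.llm_logs.request_response`
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user_id
ORDER BY total_tokens DESC
LIMIT 20
"""

for row in bigquery.Client().query(sql).result():
    print(f"{row.user_id}: {row.total_tokens:,} tokens across {row.requests} requests")
```

Note that if you sample successful requests, totals computed from logs are a lower bound; for exact billing, record token counts in a metrics pipeline that sees every request even when the payload is sampled out.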
Configure Application Insights
```typescript
import { TelemetryClient } from "applicationinsights";

const telemetryClient = new TelemetryClient(
  process.env.APPINSIGHTS_INSTRUMENTATIONKEY
);
```
Implement Structured Logging
```typescript
// hashUserId: SHA-256 helper assumed to be defined alongside this function
async function logLLMCall(entry: {
  requestId: string; prompt: string; response: string; userId: string;
  inputTokens: number; outputTokens: number; latencyMs: number;
}): Promise<void> {
  telemetryClient.trackEvent({
    name: "llm_call",
    properties: {
      requestId: entry.requestId,
      prompt: entry.prompt.substring(0, 1000),      // Truncate
      response: entry.response.substring(0, 1000),
      userId: hashUserId(entry.userId),             // Hash PII
    },
    measurements: {
      inputTokens: entry.inputTokens,
      outputTokens: entry.outputTokens,
      latencyMs: entry.latencyMs,
    },
  });
}
```
Set Up Cost Alerts
- In the Azure Portal: Monitor → Alerts → Create Alert Rule
- Condition: "Consumption Budget Exceeded"
- Action: Email + webhook to Slack
Enable MLflow Tracing
```python
import mlflow
import mlflow.openai
from openai import OpenAI

mlflow.set_tracking_uri("http://your-mlflow-server:5000")
mlflow.openai.autolog()  # Automatic instrumentation of OpenAI client calls
```
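With autologging enabled, ordinary client calls are traced without any extra code (the model name and prompt here are illustrative):

```python
client = OpenAI()

# Captured as a trace automatically: prompt, response, token usage, latency
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give me one sentence on log sampling."}],
)
print(response.choices[0].message.content)
```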
Manual Logging with Custom Metrics
```python
import time

client = OpenAI()


@mlflow.trace(span_type="llm_call")
def logged_llm_call(prompt: str, user_id: str) -> str:
    with mlflow.start_run():
        start = time.time()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        latency = time.time() - start
        content = response.choices[0].message.content

        mlflow.log_metric("input_tokens", response.usage.prompt_tokens)
        mlflow.log_metric("output_tokens", response.usage.completion_tokens)
        mlflow.log_metric("latency_ms", latency * 1000)
        mlflow.log_param("model", "gpt-4o-mini")
        mlflow.log_param("user_id", hash_user_id(user_id))  # SHA-256 helper, as in the privacy logger

        # Log artifacts (truncated)
        mlflow.log_text(prompt[:500], "prompt.txt")
        mlflow.log_text(content[:500], "response.txt")

        return content
```
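A quick way to exercise it end to end (the `hash_user_id` helper below is an assumed stand-in mirroring the privacy-safe logger's hashing; the prompt text is illustrative):

```python
import hashlib


def hash_user_id(user_id: str) -> str:
    # Same deterministic hashing as PrivacySafeLLMLogger._hash_user_id
    return hashlib.sha256(user_id.encode()).hexdigest()[:16]


answer = logged_llm_call("Summarize our refund policy in one sentence.", user_id="user-42")
print(answer)  # the corresponding trace and run appear in the MLflow UI
```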
For teams that prefer typed log entries over raw dicts, the canonical schema can also be expressed as a dataclass (a sketch; only a subset of the optional fields is shown):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMLogEntry:
    request_id: str
    model: str
    user_id: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    status: str
    temperature: Optional[float] = None
    cache_hit: Optional[bool] = False
```
Avoid these production mistakes that lead to cost overruns, compliance violations, or unusable logs:
| Pitfall | Impact | Prevention Strategy |
|---|---|---|
| Full payload logging without sampling | Storage costs can exceed inference costs by 2-3x | Implement 1-10% sampling for successes, 100% for errors |
| No PII redaction | GDPR/CCPA fines up to $50k per violation | Use regex patterns and hashing before logging |
| Unstructured text logs | Impossible to query or analyze at scale | Always use structured JSON with a consistent schema |
| Ignoring token overhead | Logs consume 20-30% of token budget | Calculate storage costs in your total cost model |
| No retention policies | Indefinite storage growth | Set 30-90 day auto-deletion for GDPR compliance |
| Missing correlation IDs | Distributed tracing fails | Generate UUIDs and pass them through all services |
| Synchronous logging | Adds 50-200ms latency to user calls | Use async queues or background workers |
| Storing API keys in logs | Security breach risk | Filter secrets with REDACTED patterns |
| No versioning | Breaking schema changes | Include a `schema_version` field in all logs |
| Ignoring latency metrics | Performance regression blind spots | Always log `latency_ms` alongside responses |
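The "synchronous logging" row deserves its own illustration. A minimal sketch of a background writer, assuming any callable sink (a BigQuery insert, an HTTP collector, a file append):

```python
import queue
import threading
from typing import Any, Callable, Dict


class AsyncLogWriter:
    """Hands log entries to a background thread so logging never blocks the request path."""

    def __init__(self, sink: Callable[[Dict[str, Any]], None], max_queue: int = 10_000):
        self._sink = sink
        self._queue: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=max_queue)
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, entry: Dict[str, Any]) -> None:
        try:
            self._queue.put_nowait(entry)  # drop rather than block if the queue backs up
        except queue.Full:
            pass

    def _drain(self) -> None:
        while True:
            self._sink(self._queue.get())


# Usage sketch: writer = AsyncLogWriter(sink=send_to_log_pipeline); writer.log(entry)
# (send_to_log_pipeline is a hypothetical delivery function for your platform)
```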
Your sampling rate should come out of a budget calculation, not a guess: work backwards from the monthly amount you are willing to spend on logs.
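A minimal sketch of that calculation, assuming you always keep 100% of errors and cap success sampling by a monthly log budget (all inputs, including the per-GB cost of your logging platform, are illustrative assumptions):

```python
def optimal_sampling_rate(
    requests_per_day: int,
    avg_bytes_per_entry: int,      # size of a redacted, truncated log entry
    cost_per_gb: float,            # ingestion + storage cost of your logging platform
    monthly_log_budget: float,     # what you are willing to spend on logs per month
    error_rate: float,             # fraction of requests that fail (always logged)
) -> float:
    """Highest success-sampling rate that keeps log spend inside the budget."""
    entries_per_month = requests_per_day * 30
    gb_per_entry = avg_bytes_per_entry / 1e9

    # Spend consumed by always-on error logging comes off the top
    error_cost = entries_per_month * error_rate * gb_per_entry * cost_per_gb
    remaining_budget = max(monthly_log_budget - error_cost, 0.0)

    full_success_cost = entries_per_month * (1 - error_rate) * gb_per_entry * cost_per_gb
    if full_success_cost == 0:
        return 1.0
    return min(remaining_budget / full_success_cost, 1.0)


# Example: 1M requests/day, ~4 KB per entry, $2.50/GB ingested, $100/month budget, 2% error rate
print(f"{optimal_sampling_rate(1_000_000, 4_000, 2.50, 100.0, 0.02):.0%}")  # ≈ 32%
```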