A healthcare AI startup faced a $120,000 surprise bill after enabling 100% request-response logging for their HIPAA-compliant LLM application. Their logs grew to 2TB in a single month—costing more in storage than the actual model inference. This guide shows you how to capture everything you need for observability and compliance without destroying your budget.
LLM call logging is the foundation of production observability, but it’s a double-edged sword. Without logs, you can’t debug failures, optimize prompts, or prove compliance. With poorly implemented logging, costs spiral out of control.
The core challenge is token economics: every logged request adds storage and ingestion costs that scale with your token usage. For a system processing 1M requests/day at an average of 1,000 tokens per request, storing full request and response payloads can consume the equivalent of 20-30% of your total token budget if not managed properly.
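The raw volume adds up quickly. A rough back-of-the-envelope sketch (the ~4 bytes-per-token figure is an assumption for English prose, not a measured value):

```python
# Rough payload volume for 1M requests/day at 1,000 tokens per request
requests_per_day = 1_000_000
tokens_per_request = 1_000
bytes_per_token = 4  # assumption: ~4 bytes of raw text per token for English prose

monthly_bytes = requests_per_day * 30 * tokens_per_request * bytes_per_token
print(f"~{monthly_bytes / 1e9:,.0f} GB of raw payload text per month")  # ~120 GB
```

Indexing, replication, and per-GB ingestion fees on managed logging platforms multiply that raw figure further.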
The three pillars of logging value are:

- **Debuggability**: When a response goes wrong, you need the exact prompt, context, and model parameters that caused it
- **Cost Attribution**: Break down spend by user, team, feature, or use case
- **Compliance**: For regulated industries, immutable audit trails are non-negotiable
A production-ready LLM log entry should capture the complete lifecycle of each request. Here’s the canonical schema:
Required fields:

| Field | Type | Purpose | Required |
|---|---|---|---|
| `request_id` | UUID | Unique identifier for correlation | Yes |
| `timestamp` | ISO 8601 | When the request was made | Yes |
| `model` | String | Model name (e.g., "claude-3-5-sonnet") | Yes |
| `user_id` | String | Attribution for cost and abuse detection | Yes |
| `input_tokens` | Integer | Token count for billing and optimization | Yes |
| `output_tokens` | Integer | Token count for billing and optimization | Yes |
| `latency_ms` | Integer | Performance monitoring | Yes |
| `prompt` | String | Full system + user prompt (may be truncated) | Yes |
| `response` | String | Model output (may be truncated) | Yes |
| `status` | Enum | `success` / `error` / `timeout` / `rate_limited` | Yes |
Optional fields:

| Field | Type | Purpose |
|---|---|---|
| `temperature` | Float | Reproducibility and debugging |
| `max_tokens` | Integer | Request configuration |
| `tools_used` | Array | Tool call tracking for agents |
| `context_window` | Integer | Model context size |
| `cache_hit` | Boolean | Prompt caching efficiency |
| `streaming` | Boolean | Streaming vs. synchronous |
| `region` | String | Multi-region deployment tracking |
| `version` | String | Prompt template version |
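Putting the two tables together, a single log entry might look like the following (a minimal sketch with illustrative values; field names match the schema above):

```python
log_entry = {
    # Required fields
    "request_id": "3f8a2c1e-7b4d-4e2a-9c61-5d0f8b2a1c9e",
    "timestamp": "2025-01-15T08:42:17Z",
    "model": "claude-3-5-sonnet",
    "user_id": "a1b2c3d4e5f60718",        # SHA-256 hash prefix, never the raw ID
    "input_tokens": 812,
    "output_tokens": 164,
    "latency_ms": 930,
    "prompt": "Summarize the claim for member [SSN]...",   # redacted, then truncated
    "response": "Here is the summary you asked for...",
    "status": "success",
    # Optional fields
    "temperature": 0.2,
    "cache_hit": True,
    "version": "summarizer-v3",
}
```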
PII Redaction is Mandatory
Never log raw user inputs containing PII (Personally Identifiable Information). Use deterministic hashing for user IDs and redact sensitive data before logging. Failure to do so violates GDPR, HIPAA, and CCPA, and can result in fines exceeding $50,000 per violation.
Logging LLM calls in production requires careful attention to data privacy regulations. The following framework ensures compliance while maintaining observability.
Redaction Strategies:
- **Pattern-based redaction**: Remove SSNs, credit cards, and phone numbers using regex
- **Hashing**: Use SHA-256 for user IDs to maintain uniqueness without exposing identity
- **Selective logging**: Log metadata only for sensitive domains (e.g., healthcare, finance)
Retention Policies by Compliance Framework:
| Framework | Max Retention | Log Requirements | Special Notes |
|---|---|---|---|
| HIPAA | 6 years | Audit logs of all PHI access | Encrypt at rest, access logging |
| GDPR | 30-90 days (or on user request) | Right-to-erasure compliance | Must be able to delete a user's logs |
| SOC 2 | 1 year | All access and modifications | Immutable storage recommended |
| PCI DSS | 1 year | No PAN (card numbers) in logs | Tokenize before logging |
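The GDPR row is the one that usually needs dedicated tooling: you must be able to delete a specific user's log rows on request. A minimal sketch, assuming logs land in the BigQuery table configured later in this guide and that user IDs are stored as SHA-256 hash prefixes (the table name here is an illustrative placeholder):

```python
import hashlib

from google.cloud import bigquery


def erase_user_logs(raw_user_id: str,
                    table: str = "your-project.llm_logs.request_response") -> int:
    """Delete every log row for one user (GDPR right to erasure)."""
    # Same deterministic hashing used by the privacy-safe logger below
    hashed = hashlib.sha256(raw_user_id.encode()).hexdigest()[:16]

    client = bigquery.Client()
    job = client.query(
        f"DELETE FROM `{table}` WHERE user_id = @uid",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("uid", "STRING", hashed)]
        ),
    )
    job.result()  # block until the delete finishes
    return job.num_dml_affected_rows or 0
```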
The logger below ties these pieces together: regex-based redaction, deterministic user-ID hashing, and error-biased sampling.

```python
import hashlib
import re
import time
from typing import Any, Dict, Optional


class PrivacySafeLLMLogger:
    """Production logger with PII redaction and sampling."""

    def __init__(self, sampling_rate: float = 0.1):
        self.sampling_rate = sampling_rate

    def _redact_pii(self, text: str) -> str:
        """Remove SSNs, credit cards, and phone numbers."""
        patterns = [
            (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]'),                      # SSN
            (r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', '[CC]'),  # Credit card
            (r'\b\d{3}[\s-]?\d{3}[\s-]?\d{4}\b', '[PHONE]'),          # Phone
        ]
        for pattern, replacement in patterns:
            text = re.sub(pattern, replacement, text)
        return text

    def _hash_user_id(self, user_id: str) -> str:
        """Deterministic hash for user attribution."""
        return hashlib.sha256(user_id.encode()).hexdigest()[:16]

    def should_log(self, status: str) -> bool:
        """Always log errors, sample successes."""
        return status != "success" or (hash(str(time.time())) % 100) < (self.sampling_rate * 100)

    def format_log(self, data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Format a log entry with privacy protections; returns None when sampled out."""
        if not self.should_log(data.get("status", "success")):
            return None
        return {
            "request_id": data["request_id"],
            "timestamp": data["timestamp"],
            "model": data.get("model"),
            "user_id": self._hash_user_id(data["user_id"]),
            "input_tokens": data["input_tokens"],
            "output_tokens": data["output_tokens"],
            "latency_ms": data["latency_ms"],
            "prompt": self._redact_pii(data.get("prompt", ""))[:1000],    # redact, then truncate
            "response": self._redact_pii(data.get("response", ""))[:1000],
            "status": data["status"],
            "temperature": data.get("temperature"),
            "cache_hit": data.get("cache_hit", False),
        }
```
Effective LLM logging directly impacts your bottom line and operational resilience. Without proper observability, you're flying blind when costs spike or quality degrades, and the token-budget economics described earlier apply on every platform (pricing reference: cloud.google.com/vertex-ai/generative-ai/pricing). The walkthroughs below cover Google Vertex AI, Azure Application Insights, and MLflow.
Enable Request-Response Logging
```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel

vertexai.init(project="your-project", location="us-central1")
model = GenerativeModel("gemini-2.5-flash")

# Configure request-response logging with sampling
model.set_request_response_logging_config(
    sampling_rate=0.1,  # 10% sampling for cost control
    bigquery_destination="bq://your-project.llm_logs.request_response",
)
```
Configure BigQuery Schema
```sql
CREATE TABLE llm_logs.request_response (
  -- Columns mirror the canonical schema above; adjust types as needed
  request_id STRING NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  model STRING,
  user_id STRING,
  input_tokens INT64,
  output_tokens INT64,
  latency_ms INT64,
  prompt STRING,
  response STRING,
  status STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY model, user_id;
```
Set Retention Policy
```sql
ALTER TABLE llm_logs.request_response
SET OPTIONS (
  -- Rolling 90-day retention: each daily partition expires on its own schedule.
  -- (A table-level expiration_timestamp would delete the whole table at once.)
  partition_expiration_days = 90
);
```
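With logs landing in BigQuery, the cost-attribution pillar becomes a query. A sketch (project, dataset, and column names follow the examples above):

```python
from google.cloud import bigquery

# Top 20 users by token consumption over the last 30 days
sql = """
SELECT
  user_id,
  SUM(input_tokens + output_tokens) AS total_tokens,
  COUNT(*) AS requests
FROM `your-project.llm_logs.request_response`
WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user_id
ORDER BY total_tokens DESC
LIMIT 20
"""

for row in bigquery.Client().query(sql).result():
    print(f"{row.user_id}: {row.total_tokens:,} tokens across {row.requests} requests")
```

Note that if you sample successful requests, totals computed from logs are a lower bound; for exact billing, record token counts in a metrics pipeline that sees every request even when the payload is sampled out.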
Configure Application Insights
```typescript
import { TelemetryClient } from "applicationinsights";

const telemetryClient = new TelemetryClient(
  process.env.APPINSIGHTS_INSTRUMENTATIONKEY
);
```
Implement Structured Logging
```typescript
// hashUserId: SHA-256 helper assumed to be defined alongside this function
async function logLLMCall(entry: {
  requestId: string; prompt: string; response: string; userId: string;
  inputTokens: number; outputTokens: number; latencyMs: number;
}): Promise<void> {
  telemetryClient.trackEvent({
    name: "llm_call",
    properties: {
      requestId: entry.requestId,
      prompt: entry.prompt.substring(0, 1000),      // Truncate
      response: entry.response.substring(0, 1000),
      userId: hashUserId(entry.userId),             // Hash PII
    },
    measurements: {
      inputTokens: entry.inputTokens,
      outputTokens: entry.outputTokens,
      latencyMs: entry.latencyMs,
    },
  });
}
```
Set Up Cost Alerts
- In the Azure Portal: Monitor → Alerts → Create Alert Rule
- Condition: "Consumption Budget Exceeded"
- Action: Email + webhook to Slack
Enable MLflow Tracing
```python
import mlflow
import mlflow.openai
from openai import OpenAI

mlflow.set_tracking_uri("http://your-mlflow-server:5000")
mlflow.openai.autolog()  # Automatic instrumentation of OpenAI client calls
```
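With autologging enabled, ordinary client calls are traced without any extra code (the model name and prompt here are illustrative):

```python
client = OpenAI()

# Captured as a trace automatically: prompt, response, token usage, latency
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give me one sentence on log sampling."}],
)
print(response.choices[0].message.content)
```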
Manual Logging with Custom Metrics
```python
import time

client = OpenAI()


@mlflow.trace(span_type="llm_call")
def logged_llm_call(prompt: str, user_id: str) -> str:
    with mlflow.start_run():
        start = time.time()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        latency = time.time() - start
        content = response.choices[0].message.content

        mlflow.log_metric("input_tokens", response.usage.prompt_tokens)
        mlflow.log_metric("output_tokens", response.usage.completion_tokens)
        mlflow.log_metric("latency_ms", latency * 1000)
        mlflow.log_param("model", "gpt-4o-mini")
        mlflow.log_param("user_id", hash_user_id(user_id))  # SHA-256 helper, as in the privacy logger

        # Log artifacts (truncated)
        mlflow.log_text(prompt[:500], "prompt.txt")
        mlflow.log_text(content[:500], "response.txt")

        return content
```
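A quick way to exercise it end to end (the `hash_user_id` helper below is an assumed stand-in mirroring the privacy-safe logger's hashing; the prompt text is illustrative):

```python
import hashlib


def hash_user_id(user_id: str) -> str:
    # Same deterministic hashing as PrivacySafeLLMLogger._hash_user_id
    return hashlib.sha256(user_id.encode()).hexdigest()[:16]


answer = logged_llm_call("Summarize our refund policy in one sentence.", user_id="user-42")
print(answer)  # the corresponding trace and run appear in the MLflow UI
```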
For teams that prefer typed log entries over raw dicts, the canonical schema can also be expressed as a dataclass (a sketch; only a subset of the optional fields is shown):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMLogEntry:
    request_id: str
    model: str
    user_id: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    status: str
    temperature: Optional[float] = None
    cache_hit: Optional[bool] = False
```
Avoid these production mistakes that lead to cost overruns, compliance violations, or unusable logs:
| Pitfall | Impact | Prevention Strategy |
|---|---|---|
| Full payload logging without sampling | Storage costs can exceed inference costs by 2-3x | Implement 1-10% sampling for successes, 100% for errors |
| No PII redaction | GDPR/CCPA fines up to $50k per violation | Use regex patterns and hashing before logging |
| Unstructured text logs | Impossible to query or analyze at scale | Always use structured JSON with a consistent schema |
| Ignoring token overhead | Logs consume 20-30% of token budget | Calculate storage costs in your total cost model |
| No retention policies | Indefinite storage growth | Set 30-90 day auto-deletion for GDPR compliance |
| Missing correlation IDs | Distributed tracing fails | Generate UUIDs and pass them through all services |
| Synchronous logging | Adds 50-200ms latency to user calls | Use async queues or background workers |
| Storing API keys in logs | Security breach risk | Filter secrets with REDACTED patterns |
| No versioning | Breaking schema changes | Include a `schema_version` field in all logs |
| Ignoring latency metrics | Performance regression blind spots | Always log `latency_ms` alongside responses |
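The "synchronous logging" row deserves its own illustration. A minimal sketch of a background writer, assuming any callable sink (a BigQuery insert, an HTTP collector, a file append):

```python
import queue
import threading
from typing import Any, Callable, Dict


class AsyncLogWriter:
    """Hands log entries to a background thread so logging never blocks the request path."""

    def __init__(self, sink: Callable[[Dict[str, Any]], None], max_queue: int = 10_000):
        self._sink = sink
        self._queue: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=max_queue)
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, entry: Dict[str, Any]) -> None:
        try:
            self._queue.put_nowait(entry)  # drop rather than block if the queue backs up
        except queue.Full:
            pass

    def _drain(self) -> None:
        while True:
            self._sink(self._queue.get())


# Usage sketch: writer = AsyncLogWriter(sink=send_to_log_pipeline); writer.log(entry)
# (send_to_log_pipeline is a hypothetical delivery function for your platform)
```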
Your sampling rate should come out of a budget calculation, not a guess: work backwards from the monthly amount you are willing to spend on logs.
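A minimal sketch of that calculation, assuming you always keep 100% of errors and cap success sampling by a monthly log budget (all inputs, including the per-GB cost of your logging platform, are illustrative assumptions):

```python
def optimal_sampling_rate(
    requests_per_day: int,
    avg_bytes_per_entry: int,      # size of a redacted, truncated log entry
    cost_per_gb: float,            # ingestion + storage cost of your logging platform
    monthly_log_budget: float,     # what you are willing to spend on logs per month
    error_rate: float,             # fraction of requests that fail (always logged)
) -> float:
    """Highest success-sampling rate that keeps log spend inside the budget."""
    entries_per_month = requests_per_day * 30
    gb_per_entry = avg_bytes_per_entry / 1e9

    # Spend consumed by always-on error logging comes off the top
    error_cost = entries_per_month * error_rate * gb_per_entry * cost_per_gb
    remaining_budget = max(monthly_log_budget - error_cost, 0.0)

    full_success_cost = entries_per_month * (1 - error_rate) * gb_per_entry * cost_per_gb
    if full_success_cost == 0:
        return 1.0
    return min(remaining_budget / full_success_cost, 1.0)


# Example: 1M requests/day, ~4 KB per entry, $2.50/GB ingested, $100/month budget, 2% error rate
print(f"{optimal_sampling_rate(1_000_000, 4_000, 2.50, 100.0, 0.02):.0%}")  # ≈ 32%
```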