Audit Logging for AI Systems: Security Event Tracking

Every AI interaction leaves a digital trail, but most engineering teams are logging the wrong things—or worse, logging too much without structure. A fintech company recently failed a SOC 2 audit because their LLM logs captured user prompts but not the model’s reasoning traces, making it impossible to prove responsible AI behavior. This guide will teach you how to build an audit logging system that satisfies both security auditors and debugging engineers.

Traditional application logging tracks who did what and when. AI systems add three critical dimensions: what the model was told, how it reasoned, and what decisions it made. Regulators and security teams now demand visibility into all three.

The compliance landscape has shifted dramatically. SOC 2 Type II, ISO 27001, and emerging AI regulations (EU AI Act, NIST AI RMF) explicitly require audit trails for AI systems. A 2024 survey by the AI Security Institute found that 67% of AI deployments faced audit findings related to insufficient logging, with average remediation costs of $125,000.

Beyond compliance, audit logs are your primary forensic tool. When a model generates harmful content or reveals sensitive data, logs must answer: What prompt triggered it? What context was provided? What safety filters fired? Without structured logs, you’re debugging blind.

Consider these real-world scenarios:

  • Prompt Injection Attack: An attacker discovers your system prompt through iterative probing. Without request/response logs tied to user sessions, you can’t identify the vulnerability window or affected users.
  • Hallucination Incidents: A medical AI chatbot provides incorrect dosage information. Audit logs must show the exact context window, model version, and temperature settings that led to the error.
  • Data Leakage: A customer uploads a confidential document to your RAG system. Without file access logs, you can’t prove which embeddings were retrieved and returned to which users.

AI audit logs must satisfy four distinct stakeholder needs: Security teams need breach detection, Compliance auditors need proof of controls, Engineers need debugging context, and Legal teams need defensible records.

Your logging architecture should capture these five event types as a baseline:

  1. Authentication & Authorization Events: Who accessed the AI system, with what permissions, and from which IP/tenant
  2. Prompt & Response Events: Full request/response pairs, model metadata, and token usage
  3. Context Retrieval Events: What data sources were accessed (vector DB queries, file retrievals, API calls)
  4. Safety & Filtering Events: Which content filters triggered, what was modified or blocked
  5. System Decision Events: Routing decisions, model selection, fallback behaviors, and error handling

Each event must include: timestamp (UTC, millisecond precision), correlation ID (for request tracing), actor ID (user/service identity), and tenant ID (for multi-tenant isolation).

Different frameworks require different log retention and granularity:

Compliance Framework      | Required Log Types          | Retention Period    | Special Requirements
SOC 2 Type II             | All five categories         | 1 year minimum      | Tamper-evident storage
ISO 27001                 | Auth, Prompt, Safety        | 2-7 years (varies)  | Access control on logs
EU AI Act (draft)         | Prompt, Context, Decisions  | 3 years             | Explainability traces
HIPAA                     | Auth, Prompt, Response      | 6 years             | Encryption at rest
PCI DSS (if applicable)   | Auth, Response              | 1 year              | No PII in logs

A production-ready AI audit log schema needs 30+ fields. We’ll break this down into logical groups.

The financial and operational risks of inadequate AI audit logging extend far beyond compliance failures. When audit trails are incomplete, organizations lose the ability to perform root cause analysis during incidents, defend against liability claims, or demonstrate due diligence to regulators.

Consider the cost structure of modern AI systems. With Claude 3.5 Sonnet costing $3.00 per million input tokens and $15.00 per million output tokens, or GPT-4o at $5.00/$15.00 per million tokens, a single production incident without proper logging can trigger expensive model re-runs for forensic reconstruction. More critically, without structured audit logs, you cannot prove what the model actually saw and generated—creating legal exposure.

The EU AI Act’s requirement for reasoning traces on high-risk systems means input/output-only logging will become non-compliant for those systems. Organizations that capture only inputs and outputs will need to retrofit their logging infrastructure to include intermediate reasoning steps, context retrieval decisions, and safety filter evaluations. This architectural shift requires planning and investment now, not after enforcement begins.
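To make that concrete, here is a minimal sketch of what a reasoning-trace event could look like; the structure and field names (steps, step_index, retrieved_chunks) are illustrative assumptions on our part, not a format prescribed by the Act.

# A minimal sketch of a reasoning-trace event for a high-risk system.
# All field names below are illustrative assumptions, not mandated by the EU AI Act.
reasoning_trace_event = {
    "event_type": "reasoning_trace",
    "correlation_id": "uuid-v4",            # ties the trace to the prompt/response event
    "model_version": "claude-3-5-sonnet-20241022",
    "steps": [
        {
            "step_index": 0,
            "kind": "context_retrieval",     # which documents/chunks were pulled in
            "retrieved_chunks": ["doc-123#p4", "doc-456#p1"],
            "retrieval_scores": [0.91, 0.84]
        },
        {
            "step_index": 1,
            "kind": "safety_filter",         # filter evaluation before generation
            "filter_name": "pii-detector",
            "verdict": "pass"
        },
        {
            "step_index": 2,
            "kind": "generation",            # final model call
            "finish_reason": "stop",
            "output_token_count": 128
        }
    ]
}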

Building a production-ready AI audit logging system requires three infrastructure components: event capture, secure storage, and query infrastructure.

Instrument your AI gateway or application layer to emit structured events. Every request must generate a correlation ID that persists through the entire lifecycle. Use a logging library that supports structured JSON output and asynchronous writing to avoid blocking inference pipelines.
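A minimal sketch of that instrumentation using only the standard library, with a QueueHandler so log writes never block the inference path; the helper names (set_correlation_id, JsonFormatter) are ours, not from any specific framework.

import contextvars
import json
import logging
import logging.handlers
import queue
import uuid
from datetime import datetime, timezone
from typing import Optional

# Correlation ID stored per request context so every log line can carry it.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)

def set_correlation_id(existing: Optional[str] = None) -> str:
    """Set (or generate) the correlation ID at ingress; reuse it for the whole request."""
    cid = existing or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid

class JsonFormatter(logging.Formatter):
    """Emit each record as one structured JSON line with a UTC millisecond timestamp."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "logger": record.name,
            "correlation_id": _correlation_id.get(),
            "message": record.getMessage(),
            "audit_event": getattr(record, "audit_event", None),
        })

# Queue-based handler so inference threads never block on log I/O.
_log_queue: queue.Queue = queue.Queue(-1)
_stream = logging.StreamHandler()
_stream.setFormatter(JsonFormatter())
listener = logging.handlers.QueueListener(_log_queue, _stream)
listener.start()

logger = logging.getLogger("ai.audit")
logger.addHandler(logging.handlers.QueueHandler(_log_queue))
logger.setLevel(logging.INFO)

# Usage: set the correlation ID at ingress, then log anywhere along the request path.
cid = set_correlation_id()
logger.info("inference started", extra={"audit_event": {"stage": "ingress"}})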

For multi-model architectures, normalize logs across providers. Each provider’s API returns different metadata—OpenAI gives usage objects, Anthropic provides stop_reason, and Azure adds content_filter results. Your schema must absorb these differences into a consistent format.
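One way to absorb those differences is a small normalization layer. The response shapes sketched below are simplified assumptions of what the OpenAI chat completions and Anthropic messages APIs return, so verify the field names against the client library version you actually run.

from typing import Any, Dict

def normalize_response(provider: str, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map provider-specific response metadata onto one shared audit schema."""
    if provider == "openai":
        # OpenAI chat completions report finish_reason per choice and a usage object.
        choice = raw["choices"][0]
        return {
            "content": choice["message"]["content"],
            "finish_reason": choice.get("finish_reason", "unknown"),
            "token_count": raw.get("usage", {}).get("completion_tokens", 0),
            "model_name": raw.get("model", ""),
        }
    if provider == "anthropic":
        # Anthropic messages report stop_reason and usage.output_tokens.
        return {
            "content": raw["content"][0].get("text", ""),
            "finish_reason": raw.get("stop_reason", "unknown"),
            "token_count": raw.get("usage", {}).get("output_tokens", 0),
            "model_name": raw.get("model", ""),
        }
    # Unknown providers fall back to an empty but schema-valid record.
    return {"content": "", "finish_reason": "unknown", "token_count": 0, "model_name": ""}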

Separate hot and cold storage based on retention requirements. Active investigation logs (last 30 days) should live in fast, queryable stores like Elasticsearch or ClickHouse. Long-term compliance storage (1+ years) can use immutable write-once-read-many (WORM) storage like AWS S3 Object Lock or Azure Immutable Blob Storage.
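For the cold tier, here is a hedged sketch of archiving a batch of events to S3 with Object Lock; the bucket name and key layout are placeholders, and Object Lock must already be enabled on the bucket for the call to succeed.

import gzip
import json
from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")

def archive_log_batch(events: list, bucket: str = "audit-logs-cold") -> str:
    """Write a compressed, immutable batch of audit events for long-term retention."""
    day = datetime.now(timezone.utc)
    key = f"ai-audit/{day:%Y/%m/%d}/batch-{day:%H%M%S}.json.gz"
    body = gzip.compress("\n".join(json.dumps(e) for e in events).encode("utf-8"))
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        # COMPLIANCE mode prevents deletion or overwrite until the retention date,
        # which provides the tamper-evident property auditors look for.
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=day + timedelta(days=365),
    )
    return key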

Implement role-based access control on log queries. Compliance auditors may need read-only access to all logs, while on-call engineers should only access logs for their services. Use query auditing to track who searches what logs and when.
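A minimal sketch of role-scoped log queries with query auditing; the role names and index patterns are illustrative, and in production this enforcement usually lives in the log platform's security layer (for example OpenSearch or Elasticsearch security plugins) rather than in application code.

import fnmatch
from typing import Dict, List

# Illustrative role-to-scope mapping; role names and index patterns are assumptions.
ROLE_SCOPES: Dict[str, List[str]] = {
    "compliance-auditor": ["ai-audit-*"],                 # read-only across all services
    "oncall-engineer": ["ai-audit-ai-gateway-prod-*"],    # only their own service
    "security-analyst": ["ai-audit-*", "ai-safety-*"],
}

def authorize_query(role: str, index: str, audit_logger=None) -> bool:
    """Return True if the role may query the index; record the attempt either way."""
    allowed = any(fnmatch.fnmatch(index, pattern) for pattern in ROLE_SCOPES.get(role, []))
    if audit_logger is not None:
        # Query auditing: who searched which logs, and whether it was permitted.
        audit_logger.info("log query", extra={"audit_event": {
            "event_type": "log_access", "role": role, "index": index, "allowed": allowed}})
    return allowed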

Below is a production-ready audit logging implementation for an AI gateway. This example uses Python with structured logging and includes all mandatory fields from our schema.

import json
import uuid
import logging
from datetime import datetime, timezone
from typing import Dict, Any, Optional

class AIAuditLogger:
    """
    Production AI audit logger with compliance-grade fields.
    Captures identity, model behavior, and system decisions.
    """

    def __init__(self, service_name: str, tenant_id: str):
        self.service_name = service_name
        self.tenant_id = tenant_id
        self.logger = logging.getLogger(f"ai.audit.{service_name}")

    def log_inference(
        self,
        user_id: str,
        user_type: str,
        model_provider: str,
        model_name: str,
        request: Dict[str, Any],
        response: Dict[str, Any],
        context_sources: Optional[list] = None,
        safety_filters: Optional[Dict[str, Any]] = None,
        system_decisions: Optional[Dict[str, Any]] = None,
        correlation_id: Optional[str] = None,
        ip_address: Optional[str] = None
    ) -> str:
        """
        Log a complete AI inference event with all compliance fields.
        Returns the correlation_id for request tracing.
        """
        # Generate correlation ID if not provided
        if not correlation_id:
            correlation_id = str(uuid.uuid4())

        # High-precision UTC timestamp
        timestamp = datetime.now(timezone.utc).isoformat()

        # Core audit event
        audit_event = {
            # Identity & Access Layer
            "event_id": str(uuid.uuid4()),
            "timestamp": timestamp,
            "correlation_id": correlation_id,
            "tenant_id": self.tenant_id,
            "user_id": user_id,
            "user_type": user_type,  # service-account, arthur-managed, idp-managed
            "ip_address": ip_address,
            "service_name": self.service_name,

            # Model Behavior Layer
            "model_provider": model_provider,
            "model_name": model_name,
            "request": {
                "messages": request.get("messages", []),
                "system_prompt": request.get("system_prompt", ""),
                "parameters": {
                    "temperature": request.get("temperature", 0.7),
                    "max_tokens": request.get("max_tokens", 1000),
                    "top_p": request.get("top_p", 1.0)
                },
                "token_count": request.get("token_count", 0)
            },
            "response": {
                "content": response.get("content", ""),
                "finish_reason": response.get("finish_reason", "unknown"),
                "token_count": response.get("token_count", 0),
                "model_name": response.get("model_name", model_name)
            },

            # Context Retrieval (if RAG or tool use)
            "context_retrieval": {
                "sources": context_sources or [],
                "source_count": len(context_sources) if context_sources else 0,
                "retrieval_time_ms": response.get("retrieval_time_ms", 0)
            },

            # Safety & Filtering
            "safety_filters": safety_filters or {
                "content_filtered": False,
                "filter_reason": None,
                "modified_content": None
            },

            # System Decisions
            "system_decisions": system_decisions or {
                "model_selected": model_name,
                "routing_reason": "default",
                "fallback_triggered": False,
                "error_handling": "none"
            },

            # Cost & Performance (from verified pricing data)
            "cost_metrics": {
                "input_cost": self._calculate_cost(
                    model_provider, model_name,
                    request.get("token_count", 0),
                    is_input=True
                ),
                "output_cost": self._calculate_cost(
                    model_provider, model_name,
                    response.get("token_count", 0),
                    is_input=False
                ),
                "total_cost": 0.0,  # Calculated below
                "latency_ms": response.get("latency_ms", 0),
                "time_per_token_ms": response.get("time_per_token_ms", 0)
            }
        }

        # Calculate total cost
        audit_event["cost_metrics"]["total_cost"] = (
            audit_event["cost_metrics"]["input_cost"] +
            audit_event["cost_metrics"]["output_cost"]
        )

        # Log with appropriate severity
        self.logger.info(
            f"AI inference event: {correlation_id}",
            extra={"audit_event": audit_event}
        )

        # Also emit as structured JSON for aggregation
        print(json.dumps(audit_event, indent=2))

        return correlation_id

    def _calculate_cost(
        self,
        provider: str,
        model: str,
        tokens: int,
        is_input: bool
    ) -> float:
        """
        Calculate cost based on verified pricing data.
        Note: Pricing data must be kept current with provider updates.
        """
        # Pricing per 1M tokens (verified as of 2024-11-15)
        pricing = {
            "anthropic": {
                "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
                "haiku-3.5": {"input": 1.25, "output": 5.00}
            },
            "openai": {
                "gpt-4o": {"input": 5.00, "output": 15.00},
                "gpt-4o-mini": {"input": 0.15, "output": 0.60}
            }
        }
        provider_key = provider.lower()
        model_key = model.lower()
        if provider_key in pricing and model_key in pricing[provider_key]:
            rate = pricing[provider_key][model_key]["input" if is_input else "output"]
            return (tokens / 1_000_000) * rate
        # Default to zero if pricing unknown
        return 0.0

# Example usage
if __name__ == "__main__":
    # Configure logging
    logging.basicConfig(level=logging.INFO)

    # Initialize logger
    audit_logger = AIAuditLogger(
        service_name="ai-gateway-prod",
        tenant_id="tenant-acme-corp"
    )

    # Simulate an inference request
    correlation_id = audit_logger.log_inference(
        user_id="user_12345",
        user_type="idp-managed",
        model_provider="anthropic",
        model_name="claude-3-5-sonnet",
        request={
            "messages": [{"role": "user", "content": "Explain quantum computing"}],
            "system_prompt": "You are a helpful assistant.",
            "temperature": 0.7,
            "max_tokens": 500,
            "token_count": 45
        },
        response={
            "content": "Quantum computing leverages quantum mechanical phenomena...",
            "finish_reason": "stop",
            "token_count": 128,
            "model_name": "claude-3-5-sonnet-20241022",
            "latency_ms": 1250,
            "time_per_token_ms": 9.77
        },
        context_sources=[
            {"type": "vector_db", "query": "quantum computing basics", "results": 3}
        ],
        safety_filters={
            "content_filtered": False,
            "filter_reason": None,
            "modified_content": None
        },
        system_decisions={
            "model_selected": "claude-3-5-sonnet",
            "routing_reason": "default",
            "fallback_triggered": False,
            "error_handling": "none"
        },
        ip_address="203.0.113.42"
    )

    print(f"Logged event with correlation_id: {correlation_id}")

Even well-intentioned logging strategies fail when they miss critical context or create compliance gaps. Based on production incidents and audit failures, these are the most common mistakes:

  • Logging only inputs/outputs. Impact: EU AI Act non-compliance for high-risk systems and no way to explain model decisions. Prevention: capture reasoning traces, context retrieval decisions, and safety filter evaluations.
  • Storing PII in logs. Impact: GDPR/HIPAA violations and data breach liability. Prevention: implement real-time PII detection and redaction before the log write (see the redaction sketch below).
  • No correlation IDs. Impact: cannot trace multi-step agent workflows or debug distributed systems. Prevention: generate a UUID at ingress and pass it through the entire request lifecycle.
  • Inconsistent cost tracking. Impact: budget overruns and no way to attribute spend to users or tenants. Prevention: normalize pricing data across providers and log actual token counts.
  • Missing safety filter logs. Impact: cannot prove due diligence during harmful content incidents. Prevention: log every filter trigger, modification, and block decision.
  • Unstructured log messages. Impact: query failures and no aggregation or analysis at scale. Prevention: use structured JSON with consistent field names.
  • No tamper-evident storage. Impact: audit logs rejected as evidence and compliance failure. Prevention: use WORM storage or cryptographic log signing.
  • Ignoring model versioning. Impact: cannot reproduce incidents, making debugging impossible. Prevention: log the exact model version, not just the provider name.
  • No access control on logs. Impact: sensitive data exposure through log access. Prevention: implement RBAC on log queries and audit log access itself.
  • Single-region storage. Impact: violates data residency requirements and fails cross-border audits. Prevention: map log storage to tenant jurisdiction requirements.
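For the PII pitfall above, here is a minimal redaction sketch that runs before the log write; the regex patterns cover only obvious formats (emails, US SSNs, 16-digit card numbers) and are illustrative, not a substitute for a dedicated PII detection service.

import re

# Illustrative patterns only; real deployments should use a dedicated PII detector.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before the event is logged."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

# Apply to prompt and response content before building the audit event.
safe_prompt = redact_pii("Contact me at jane.doe@example.com, SSN 123-45-6789")
print(safe_prompt)  # Contact me at [REDACTED_EMAIL], SSN [REDACTED_SSN]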

Every AI audit event must include these core fields:

{
  "event_id": "uuid-v4",
  "timestamp": "2024-11-15T18:30:42.123Z",
  "correlation_id": "uuid-v4",
  "tenant_id": "tenant-acme-corp",
  "user_id": "user_12345",
  "user_type": "idp-managed",
  "ip_address": "192.0.2.1",
  "service_name": "ai-gateway-prod"
}

Framework        | Retention  | Critical Fields         | Storage Requirement
SOC 2 Type II    | 1 year     | All categories          | Tamper-evident
ISO 27001        | 2-7 years  | Auth, Prompt, Safety    | Access controlled
EU AI Act        | 3 years    | Reasoning traces        | Jurisdiction-bound
HIPAA            | 6 years    | Auth, Prompt, Response  | Encrypted at rest
PCI DSS          | 1 year     | Auth, Response          | No PII in logs
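
A small sketch of turning those retention targets into configuration so an archival job can apply the strictest policy that covers a tenant; the framework keys and day counts mirror the table above, and the helper name is ours.

from typing import Iterable

# Retention targets in days, mirroring the quick-reference table above.
RETENTION_DAYS = {
    "soc2_type2": 365,
    "iso_27001": 7 * 365,   # upper bound of the 2-7 year range
    "eu_ai_act": 3 * 365,
    "hipaa": 6 * 365,
    "pci_dss": 365,
}

def required_retention(frameworks: Iterable[str]) -> int:
    """Return the longest retention period among the frameworks that apply."""
    applicable = [RETENTION_DAYS[f] for f in frameworks if f in RETENTION_DAYS]
    return max(applicable, default=365)

# Example: a healthcare tenant subject to HIPAA and SOC 2 keeps logs for 6 years.
print(required_retention(["hipaa", "soc2_type2"]))  # 2190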

Use these verified rates for cost tracking in logs:

  • Claude 3.5 Sonnet: $3.00 input / $15.00 output per 1M tokens
  • Claude Haiku 3.5: $1.25 input / $5.00 output per 1M tokens
  • GPT-4o: $5.00 input / $15.00 output per 1M tokens
  • GPT-4o-mini: $0.15 input / $0.60 output per 1M tokens

Cost Calculation Formula

# Cost calculation example (rates in USD per 1M tokens, Claude 3.5 Sonnet shown)
input_cost = (tokens_in / 1_000_000) * 3.00     # $3.00 per 1M input tokens
output_cost = (tokens_out / 1_000_000) * 15.00  # $15.00 per 1M output tokens
total_cost = input_cost + output_cost


Effective AI audit logging requires capturing three layers: user actions, model behavior, and system decisions. Without this triad, you cannot satisfy SOC 2, ISO 27001, or emerging AI regulations.

Key takeaways:

  1. Log everything: Authentication, prompts, responses, context retrieval, safety filters, and system decisions
  2. Structure matters: Use JSON with consistent field names for queryability
  3. Retention varies: Map storage to compliance requirements (1-7 years)
  4. Cost tracking is mandatory: Use verified pricing data to attribute spend
  5. EU AI Act is coming: Start capturing reasoning traces now for high-risk systems
  6. Separate storage: Hot storage for investigations, WORM for compliance
  7. Access control: RBAC on log queries with audit trails

The difference between failing and passing an audit is structured capture of decision metadata. Logs that show only inputs and outputs are unlikely to satisfy auditors once EU AI Act requirements for high-risk systems take effect. Start building with the code example above, or use TrackAI’s pre-built widget to accelerate compliance.

  • NIST AI RMF: Official guidance on AI governance and logging
  • OWASP LLM Security: Logging requirements for prompt injection and jailbreak detection