Anatomy of an Agent Failure: A Post-Mortem

Incident Date: October 15, 2024
Duration: 47 minutes
Cost: $8,412
Root Cause: Missing recursion limits
Status: Resolved

This is the post-mortem of an agent failure that cost us $8,412 and several hours of sleep. We’re publishing it because these failures are common, under-documented, and preventable.

We built an AI research agent for competitive intelligence. Users could ask questions like “What are the key features of [competitor]’s enterprise plan?” and the agent would:

  1. Search the web for relevant information
  2. Read and parse web pages
  3. Synthesize findings into a structured report
  4. Spawn sub-agents for deep dives on specific topics

The architecture looked like this:

Orchestrator Agent
├── Web Search Tool
├── Page Reader Tool
└── Deep Dive Agent (spawnable)
    ├── Web Search Tool
    └── Page Reader Tool

We were proud of it. Users loved the depth of analysis. Then came October 15th.

At 3:47 AM UTC, a user submitted a query:

“Comprehensive analysis of the AI observability market landscape including all major players, their features, pricing, and market positioning”

This was an ambitious query. The orchestrator correctly identified it as needing deep analysis and began spawning sub-agents.

Here’s what the logs showed:

3:47:12 — Orchestrator identifies 12 major players in the space
3:47:34 — Spawns 12 Deep Dive Agents, one per competitor
3:48:01 — Deep Dive Agents begin researching
3:49:45 — Agent #3 discovers “related tools” and decides to spawn sub-agents
3:50:12 — Agent #7 does the same
3:52:30 — 47 total agents running
3:55:00 — 156 agents running
3:58:00 — API rate limits hit, requests backing up
4:02:00 — Queue depth: 3,400 pending requests
4:15:00 — PagerDuty alert fires (30 minutes late due to alert misconfiguration)
4:23:00 — On-call engineer sees alert, begins investigation
4:34:00 — Manual kill switch triggered
4:34:00 — Final agent count: 312 spawned, 47 still active

| Resource | Usage | Cost |
| --- | --- | --- |
| GPT-4 tokens | 2.1M | $6,300 |
| Claude tokens | 890K | $1,340 |
| Web search API | 12,400 calls | $620 |
| Embedding API | 45K calls | $90 |
| Infrastructure | 47 min burst | $62 |
| Total | | $8,412 |

The Deep Dive Agent could spawn more Deep Dive Agents. There was no depth limit. This was a conscious “feature” decision—we wanted agents to be able to do thorough research. We didn’t anticipate this failure mode.

The fix: Maximum agent depth of 2. The orchestrator (depth 0) can spawn Deep Dive Agents (depth 1), but those cannot spawn further agents.
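
Conceptually, the enforcement lives in the spawn path. A minimal sketch in Python (the `spawn_child` function and `SpawnDenied` exception are illustrative, not our actual code):

```python
MAX_SPAWN_DEPTH = 2  # two levels total: orchestrator (depth 0) and Deep Dive Agents (depth 1)


class SpawnDenied(Exception):
    """Raised when a spawn request would exceed the depth limit."""


def spawn_child(parent_depth: int, task: str) -> dict:
    """Create a child agent one level below its parent, or refuse outright."""
    child_depth = parent_depth + 1
    if child_depth >= MAX_SPAWN_DEPTH:
        # Deep Dive Agents (depth 1) can no longer spawn agents of their own.
        raise SpawnDenied(f"depth {child_depth} would exceed the {MAX_SPAWN_DEPTH}-level limit")
    return {"task": task, "depth": child_depth}
```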

Even with depth limits, breadth was unbounded. An agent could spawn 100 sub-agents if it identified 100 sub-topics.

The fix: Maximum 10 concurrent agents per query. Queue additional spawns and process sequentially.
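
One semaphore per query does most of the work here; anything past the tenth agent simply waits for a slot. A sketch, assuming agents run as coroutines (the `run_sub_agent` helper is illustrative):

```python
import asyncio

MAX_CONCURRENT_AGENTS = 10  # per query

# One semaphore per query; spawn requests past the limit queue up
# instead of fanning out all at once.
_query_slots: dict[str, asyncio.Semaphore] = {}


async def run_sub_agent(query_id: str, agent_coro):
    """Run a sub-agent coroutine only when one of the query's 10 slots is free."""
    slots = _query_slots.setdefault(query_id, asyncio.Semaphore(MAX_CONCURRENT_AGENTS))
    async with slots:  # the 11th spawn blocks here until an earlier agent finishes
        return await agent_coro
```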

Each agent had a token budget, but there was no aggregate budget for the entire query. 12 agents × 100K tokens each is 1.2M tokens, and that is before any of them spawned children of their own.

The fix: Per-query budget of 500K tokens. If the query needs more, it should checkpoint and ask the user to continue.
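
A sketch of what the aggregate budget looks like; every agent working on the query charges the same counter (the class and exception names are illustrative):

```python
from dataclasses import dataclass

PER_QUERY_TOKEN_BUDGET = 500_000


class BudgetExhausted(Exception):
    """Signal that the query should checkpoint and ask the user whether to continue."""


@dataclass
class QueryBudget:
    limit: int = PER_QUERY_TOKEN_BUDGET
    used: int = 0

    def charge(self, tokens: int) -> None:
        """Record tokens spent by any agent on this query, across all depths."""
        self.used += tokens
        if self.used > self.limit:
            raise BudgetExhausted(f"{self.used:,} tokens used against a budget of {self.limit:,}")
```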

Our cost alerts were set at the account level, not per-query. The query burned $8K before any alert fired.

The fix: Per-query cost thresholds. Alert at $10, pause at $50, hard stop at $100.
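
The thresholds translate directly into a per-query check that runs after every billable call. A sketch; the action strings are whatever your orchestrator understands:

```python
# Per-query cost controls: alert at $10, pause at $50, hard stop at $100.
COST_ALERT_USD, COST_PAUSE_USD, COST_STOP_USD = 10.0, 50.0, 100.0


def cost_action(cost_so_far_usd: float) -> str:
    """Map a query's running cost to the action the orchestrator should take."""
    if cost_so_far_usd >= COST_STOP_USD:
        return "stop"    # terminate every agent working on the query
    if cost_so_far_usd >= COST_PAUSE_USD:
        return "pause"   # freeze spawning and wait for a human decision
    if cost_so_far_usd >= COST_ALERT_USD:
        return "alert"   # page the on-call, keep running
    return "ok"
```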

When API rate limits hit, agents kept retrying with exponential backoff. This extended the incident duration because agents weren’t failing fast.

The fix: Circuit breaker pattern. After 3 failures in 10 seconds, stop trying and report failure.
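
A minimal sketch of that breaker, counting failures over a sliding window (not a drop-in for any particular retry library):

```python
import time
from collections import deque

FAILURE_THRESHOLD = 3   # this many failures...
WINDOW_SECONDS = 10.0   # ...within this window opens the breaker


class CircuitBreaker:
    """Fail fast after 3 failures in 10 seconds instead of retrying forever."""

    def __init__(self) -> None:
        self._failures = deque()  # timestamps of recent failures
        self.open = False

    def record_failure(self) -> None:
        now = time.monotonic()
        self._failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > WINDOW_SECONDS:
            self._failures.popleft()
        if len(self._failures) >= FAILURE_THRESHOLD:
            self.open = True  # callers should stop retrying and report failure

    def allow_request(self) -> bool:
        return not self.open
```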

Our logs captured tool calls but not agent spawning decisions. We couldn’t easily see why Agent #3 decided to spawn sub-agents.

The fix: Log reasoning steps, not just actions. Every spawn decision should include the reasoning that led to it.

What we could reconstruct:

[3:47:12] Orchestrator spawned
[3:47:34] Sub-agent spawned (id: abc123)
[3:47:35] Tool call: web_search("competitor A features")
[3:47:38] Tool result: 2,400 characters
...

What we wish we had:

[3:47:12] Orchestrator spawned
[3:47:12] Reasoning: "Query requests comprehensive analysis. Identifying major players..."
[3:47:30] Decision: Spawn 12 sub-agents for deep dive
[3:47:30] Reasoning: "12 competitors identified. Each needs dedicated research."
[3:47:30] Budget check: 500K available, allocating 40K per sub-agent
[3:47:34] Sub-agent abc123 spawned (depth: 1, budget: 40K, parent: orchestrator)
[3:49:45] Sub-agent abc123 reasoning: "Found 5 related tools. Should I spawn sub-agents?"
[3:49:45] Sub-agent abc123 decision: Spawn 5 sub-sub-agents
[3:49:45] ⚠️ LIMIT HIT: Max depth exceeded, denying spawn request

This trace would have shown us the cascade in real-time and enabled automatic intervention.
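
Getting there mostly means emitting reasoning and spawn decisions as structured events, not just tool calls. A sketch of the kind of helper we mean (the field names are illustrative):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent_trace")


def trace(event: str, agent_id: str, **fields) -> None:
    """Emit one structured trace line: reasoning, spawn decisions, limit hits."""
    log.info(json.dumps({"ts": time.time(), "event": event, "agent_id": agent_id, **fields}))


# Log the *why* alongside the spawn, not just the spawn itself.
trace("spawn_decision", "abc123",
      reasoning="Found 5 related tools; requesting sub-agents",
      requested=5, depth=1)
```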

Our kill switch was a manual process:

  1. SSH into server
  2. Find running processes
  3. Send SIGTERM
  4. Hope cleanup handlers work

Our kill switch now:

  1. Single API call: POST /emergency/stop?query_id=xxx
  2. All agents receive immediate termination signal
  3. In-flight API calls are cancelled
  4. Partial results are saved
  5. User is notified with explanation

Also: a big red button in the admin dashboard.
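
Behind the `POST /emergency/stop` endpoint, the core can be as simple as cancelling every task registered for the query. A sketch, assuming agents run as `asyncio` tasks (the `_agent_tasks` registry is illustrative):

```python
import asyncio

# Every agent task is registered here when it starts, keyed by query id.
_agent_tasks: dict[str, set[asyncio.Task]] = {}


async def emergency_stop(query_id: str) -> int:
    """Cancel every agent task for a query and return how many were stopped."""
    tasks = _agent_tasks.pop(query_id, set())
    for task in tasks:
        task.cancel()  # cancellation propagates into in-flight awaits
    # Wait for the cancellations to settle; each agent's CancelledError handler
    # is where partial results get saved (not shown here).
    await asyncio.gather(*tasks, return_exceptions=True)
    return len(tasks)
```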

What we learned:

  1. Recursive systems need hard limits — Even if you trust your agents, bound the recursion
  2. Budget at every level — Per-token, per-agent, per-query, per-user, per-day
  3. Trace decisions, not just actions — You need to understand why things happened
  4. Alert on anomalies, not just thresholds — 50 agents spawning in 5 minutes is always weird
  5. Practice your kill switch — If you’ve never used it, it probably doesn’t work
  6. Sleep doesn’t scale — If you’re on-call, make sure alerts actually wake you

What we monitor now:

| Metric | Threshold | Action |
| --- | --- | --- |
| Concurrent agents | > 10 | Alert |
| Agent spawn rate | > 5/minute | Alert |
| Query cost | > $10 | Alert |
| Query cost | > $50 | Pause |
| Query cost | > $100 | Stop |
| Query duration | > 10 minutes | Alert |
| Token rate | > 50K/minute | Alert |
| Error rate | > 10% | Circuit breaker |
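
One way to keep these rules reviewable is to express them as plain data; a sketch of the shape (field names are illustrative, not our actual config):

```python
# Mirrors the table above: metric, threshold, and what happens when it's crossed.
MONITORING_RULES = [
    {"metric": "concurrent_agents", "threshold": 10,     "action": "alert"},
    {"metric": "spawns_per_minute", "threshold": 5,      "action": "alert"},
    {"metric": "query_cost_usd",    "threshold": 10,     "action": "alert"},
    {"metric": "query_cost_usd",    "threshold": 50,     "action": "pause"},
    {"metric": "query_cost_usd",    "threshold": 100,    "action": "stop"},
    {"metric": "query_duration_s",  "threshold": 600,    "action": "alert"},
    {"metric": "tokens_per_minute", "threshold": 50_000, "action": "alert"},
    {"metric": "error_rate",        "threshold": 0.10,   "action": "circuit_breaker"},
]
```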

Agent systems fail in ways that traditional software doesn’t. The combination of autonomy, recursion, and external API dependencies creates novel failure modes.

The only defense is defense in depth: limits at every level, monitoring of every metric, and kill switches that actually work.

Our agent is still in production. It’s better now. And we sleep better too.


Up next: Building Agent Traces — Instrumentation patterns for full visibility into agent execution.