Anatomy of an Agent Failure: A Post-Mortem
- Incident Date: October 15, 2024
- Duration: 47 minutes
- Cost: $8,412
- Root Cause: Missing recursion limits
- Status: Resolved
This is the post-mortem of an agent failure that cost us thousands of dollars and several hours of sleep. We’re publishing it because these failures are common, under-documented, and preventable.
The Setup
We built an AI research agent for competitive intelligence. Users could ask questions like “What are the key features of [competitor]’s enterprise plan?” and the agent would:
- Search the web for relevant information
- Read and parse web pages
- Synthesize findings into a structured report
- Spawn sub-agents for deep dives on specific topics
The architecture looked like this:
```
Orchestrator Agent
├── Web Search Tool
├── Page Reader Tool
└── Deep Dive Agent (spawnable)
    ├── Web Search Tool
    └── Page Reader Tool
```

We were proud of it. Users loved the depth of analysis. Then came October 15th.
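To make the failure mode concrete, here is a rough sketch of that design, in which nothing stops a deep-dive agent from spawning more of itself. Class and method names are illustrative, not our actual implementation:

```python
# Rough, illustrative sketch of the pre-incident design; class and method
# names are hypothetical, and the real system was asynchronous and larger.

class DeepDiveAgent:
    def __init__(self, topic: str, tools: dict):
        self.topic = topic
        self.tools = tools  # e.g. {"web_search": ..., "read_page": ...}

    def run(self) -> str:
        # Researches its topic and, critically, could construct and run more
        # DeepDiveAgents with no depth or count limit.
        return f"report on {self.topic}"


class Orchestrator:
    def __init__(self, tools: dict):
        self.tools = tools

    def identify_topics(self, query: str) -> list[str]:
        return []  # placeholder for the LLM planning step

    def handle_query(self, query: str) -> list[str]:
        topics = self.identify_topics(query)
        agents = [DeepDiveAgent(topic, self.tools) for topic in topics]
        return [agent.run() for agent in agents]
```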
The Trigger
At 3:47 AM UTC, a user submitted a query:
“Comprehensive analysis of the AI observability market landscape including all major players, their features, pricing, and market positioning”
This was an ambitious query. The orchestrator correctly identified it as needing deep analysis and began spawning sub-agents.
The Cascade
Here’s what the logs showed:
- 3:47:12 — Orchestrator identifies 12 major players in the space
- 3:47:34 — Spawns 12 Deep Dive Agents, one per competitor
- 3:48:01 — Deep Dive Agents begin researching
- 3:49:45 — Agent #3 discovers “related tools” and decides to spawn sub-agents
- 3:50:12 — Agent #7 does the same
- 3:52:30 — 47 total agents running
- 3:55:00 — 156 agents running
- 3:58:00 — API rate limits hit, requests backing up
- 4:02:00 — Queue depth: 3,400 pending requests
- 4:15:00 — PagerDuty alert fires (30 minutes late due to alert misconfiguration)
- 4:23:00 — On-call engineer sees alert, begins investigation
- 4:34:00 — Manual kill switch triggered
- 4:34:00 — Final agent count: 312 spawned, 47 still active
The Bill
| Resource | Usage | Cost |
|---|---|---|
| GPT-4 tokens | 2.1M | $6,300 |
| Claude tokens | 890K | $1,340 |
| Web search API | 12,400 calls | $620 |
| Embedding API | 45K calls | $90 |
| Infrastructure | 47 min burst | $62 |
| Total | | $8,412 |
What Went Wrong
1. No Recursion Limits
The Deep Dive Agent could spawn more Deep Dive Agents. There was no depth limit. This was a conscious “feature” decision—we wanted agents to be able to do thorough research. We didn’t anticipate this failure mode.
The fix: Maximum agent depth of 2. The orchestrator (depth 0) can spawn Deep Dive Agents (depth 1), but those cannot spawn further agents.
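A minimal sketch of that depth guard, assuming a simple agent record that carries its depth; `Agent`, `SpawnDenied`, and `MAX_DEPTH` are illustrative names, not our production code:

```python
# A sketch of the depth guard; Agent, SpawnDenied, and MAX_DEPTH are
# illustrative names, not our production code.
from dataclasses import dataclass

MAX_DEPTH = 2  # orchestrator = depth 0, deep-dive agents = depth 1


class SpawnDenied(Exception):
    pass


@dataclass
class Agent:
    id: str
    topic: str
    depth: int
    parent_id: str | None = None


def spawn_child(parent: Agent, topic: str, new_id: str) -> Agent:
    child_depth = parent.depth + 1
    if child_depth >= MAX_DEPTH:
        # A depth-1 deep-dive agent lands here: no grandchildren allowed.
        raise SpawnDenied(f"max depth reached: {parent.id} may not spawn for {topic!r}")
    return Agent(id=new_id, topic=topic, depth=child_depth, parent_id=parent.id)
```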
2. No Total Agent Limit
Even with depth limits, breadth was unbounded. An agent could spawn 100 sub-agents if it identified 100 sub-topics.
The fix: Maximum 10 concurrent agents per query. Queue additional spawns and process sequentially.
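One way to enforce that cap is a semaphore shared by all of a query’s agents, so spawns beyond the limit wait their turn instead of running immediately. This is a minimal asyncio sketch with illustrative names, not our actual scheduler:

```python
# A sketch of the breadth cap: at most 10 agents per query run at once,
# and extra spawns queue on the semaphore. Names are illustrative.
import asyncio

MAX_CONCURRENT_AGENTS = 10


async def run_agents(agent_coros):
    """agent_coros: callables that each return an agent's coroutine."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_AGENTS)

    async def bounded(make_coro):
        async with sem:  # the 11th spawn onwards waits here
            return await make_coro()

    return await asyncio.gather(*(bounded(c) for c in agent_coros))
```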
3. No Per-Query Budget
Each agent had a token budget, but there was no aggregate budget for the entire query. 12 agents × 100K tokens each is 1.2M tokens, and that’s before any of their sub-agents are counted.
The fix: Per-query budget of 500K tokens. If the query needs more, it should checkpoint and ask the user to continue.
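A sketch of that aggregate budget, assuming all of a query’s agents share one counter and charge it before every model call; the `QueryBudget` class is an illustrative name:

```python
# A sketch of a shared per-query token budget; the 500K cap matches the fix
# above, and the class and method names are illustrative.
import threading


class QueryBudget:
    def __init__(self, max_tokens: int = 500_000):
        self.max_tokens = max_tokens
        self.used = 0
        self._lock = threading.Lock()

    def charge(self, tokens: int) -> bool:
        """Record usage; returns False once the query should checkpoint and ask the user."""
        with self._lock:
            self.used += tokens
            return self.used <= self.max_tokens
```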
4. Delayed Alerting
Our cost alerts were set at the account level, not per-query. The query burned $8K before any alert fired.
The fix: Per-query cost thresholds. Alert at $10, pause at $50, hard stop at $100.
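Those thresholds are easiest to reason about as a tiny policy function evaluated after every cost update. A hedged sketch, with the dollar figures taken from the fix above and everything else illustrative:

```python
# A sketch of the per-query cost policy; dollar figures are from the fix above,
# the action strings and function name are illustrative.
ALERT_AT, PAUSE_AT, STOP_AT = 10.0, 50.0, 100.0


def cost_action(cost_usd: float) -> str:
    if cost_usd >= STOP_AT:
        return "stop"   # hard stop: kill every agent on the query
    if cost_usd >= PAUSE_AT:
        return "pause"  # suspend new spawns and tool calls, page someone
    if cost_usd >= ALERT_AT:
        return "alert"  # notify, keep running
    return "ok"
```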
5. No Circuit Breakers
When API rate limits hit, agents kept retrying with exponential backoff. This extended the incident duration because agents weren’t failing fast.
The fix: Circuit breaker pattern. After 3 failures in 10 seconds, stop trying and report failure.
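A minimal circuit breaker implementing that rule (three failures inside a ten-second window open the circuit) might look like this sketch; it is an illustrative class, not a specific library:

```python
# A minimal circuit breaker: 3 failures within 10 seconds opens the circuit
# and subsequent calls fail fast instead of retrying. Sketch only.
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window_s: float = 10.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.failures: deque = deque()

    def record_failure(self) -> None:
        self.failures.append(time.monotonic())

    def is_open(self) -> bool:
        cutoff = time.monotonic() - self.window_s
        while self.failures and self.failures[0] < cutoff:
            self.failures.popleft()  # drop failures outside the window
        return len(self.failures) >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.is_open():
            raise RuntimeError("circuit open: failing fast instead of retrying")
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
```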
6. Incomplete Tracing
Our logs captured tool calls but not agent spawning decisions. We couldn’t easily see why Agent #3 decided to spawn sub-agents.
The fix: Log reasoning steps, not just actions. Every spawn decision should include the reasoning that led to it.
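A sketch of what that could look like as structured log events, where the reasoning string travels with every spawn decision; the event shape and field names are illustrative, not a specific tracing library:

```python
# A sketch of structured spawn-decision logging; the event shape and field
# names are illustrative.
import json
import logging
import time

logger = logging.getLogger("agent.trace")


def log_spawn_decision(agent_id: str, reasoning: str, children: list,
                       depth: int, budget_tokens: int) -> None:
    # The reasoning field is the point: it records *why* the spawn happened,
    # not just that it did.
    logger.info(json.dumps({
        "event": "spawn_decision",
        "ts": time.time(),
        "agent_id": agent_id,
        "reasoning": reasoning,
        "children": children,
        "depth": depth,
        "budget_tokens": budget_tokens,
    }))
```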
The Trace We Wish We Had
What we could reconstruct:
```
[3:47:12] Orchestrator spawned
[3:47:34] Sub-agent spawned (id: abc123)
[3:47:35] Tool call: web_search("competitor A features")
[3:47:38] Tool result: 2,400 characters...
```

What we wish we had:
```
[3:47:12] Orchestrator spawned
[3:47:12] Reasoning: "Query requests comprehensive analysis. Identifying major players..."
[3:47:30] Decision: Spawn 12 sub-agents for deep dive
[3:47:30] Reasoning: "12 competitors identified. Each needs dedicated research."
[3:47:30] Budget check: 500K available, allocating 40K per sub-agent
[3:47:34] Sub-agent abc123 spawned (depth: 1, budget: 40K, parent: orchestrator)
[3:49:45] Sub-agent abc123 reasoning: "Found 5 related tools. Should I spawn sub-agents?"
[3:49:45] Sub-agent abc123 decision: Spawn 5 sub-sub-agents
[3:49:45] ⚠️ LIMIT HIT: Max depth exceeded, denying spawn request
```

This trace would have shown us the cascade in real time and enabled automatic intervention.
The Kill Switch We Should Have Had
Our kill switch was a manual process:
- SSH into server
- Find running processes
- Send SIGTERM
- Hope cleanup handlers work
Our kill switch now:
- Single API call: `POST /emergency/stop?query_id=xxx`
- All agents receive immediate termination signal
- In-flight API calls are cancelled
- Partial results are saved
- User is notified with explanation
Also: a big red button in the admin dashboard.
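As a sketch of what that single stop endpoint could look like, assuming a FastAPI service and an in-memory registry of running agents per query (both the registry and the `cancel()` method on agent handles are assumptions, not our actual code):

```python
# A sketch of the stop endpoint using FastAPI; the route matches the one above,
# while ACTIVE_AGENTS and the cancel() method on agent handles are assumptions.
from fastapi import FastAPI, HTTPException

app = FastAPI()
ACTIVE_AGENTS: dict = {}  # query_id -> running agent handles (illustrative)


@app.post("/emergency/stop")
async def emergency_stop(query_id: str):
    agents = ACTIVE_AGENTS.get(query_id)
    if agents is None:
        raise HTTPException(status_code=404, detail="unknown query_id")
    for agent in agents:
        agent.cancel()  # assumed to signal termination and cancel in-flight calls
    # Saving partial results and notifying the user would happen here.
    return {"query_id": query_id, "agents_stopped": len(agents)}
```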
Lessons Learned
- Recursive systems need hard limits — Even if you trust your agents, bound the recursion
- Budget at every level — Per-token, per-agent, per-query, per-user, per-day
- Trace decisions, not just actions — You need to understand why things happened
- Alert on anomalies, not just thresholds — 50 agents spawning in 5 minutes is always weird
- Practice your kill switch — If you’ve never used it, it probably doesn’t work
- Sleep doesn’t scale — If you’re on-call, make sure alerts actually wake you
What We Monitor Now
| Metric | Threshold | Action |
|---|---|---|
| Concurrent agents | > 10 | Alert |
| Agent spawn rate | > 5/minute | Alert |
| Query cost | > $10 | Alert |
| Query cost | > $50 | Pause |
| Query cost | > $100 | Stop |
| Query duration | > 10 minutes | Alert |
| Token rate | > 50K/minute | Alert |
| Error rate | > 10% | Circuit breaker |
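For reference, those thresholds can be kept as a small declarative rule set and evaluated against each metric reading. This is a sketch of that idea, not our actual monitoring configuration:

```python
# The table above expressed as a declarative rule set; field names and the
# evaluate() helper are illustrative.
MONITORING_RULES = [
    {"metric": "concurrent_agents", "threshold": 10,     "action": "alert"},
    {"metric": "agent_spawn_rate",  "threshold": 5,      "action": "alert"},   # per minute
    {"metric": "query_cost_usd",    "threshold": 10,     "action": "alert"},
    {"metric": "query_cost_usd",    "threshold": 50,     "action": "pause"},
    {"metric": "query_cost_usd",    "threshold": 100,    "action": "stop"},
    {"metric": "query_duration_s",  "threshold": 600,    "action": "alert"},
    {"metric": "tokens_per_minute", "threshold": 50_000, "action": "alert"},
    {"metric": "error_rate",        "threshold": 0.10,   "action": "circuit_breaker"},
]


def evaluate(metric: str, value: float) -> list:
    """Return every action whose threshold this reading crosses."""
    return [r["action"] for r in MONITORING_RULES
            if r["metric"] == metric and value > r["threshold"]]
```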
Conclusion
Agent systems fail in ways that traditional software doesn’t. The combination of autonomy, recursion, and external API dependencies creates novel failure modes.
The only defense is defense in depth: limits at every level, monitoring of every metric, and kill switches that actually work.
Our agent is still in production. It’s better now. And we sleep better too.
Up next: Building Agent Traces — Instrumentation patterns for full visibility into agent execution.