Anatomy of an Agent Failure: A Post-Mortem
- Incident Date: October 15, 2024
- Duration: 47 minutes
- Cost: $8,412
- Root Cause: Missing recursion limits
- Status: Resolved
This is the post-mortem of an agent failure that cost us thousands of dollars and several hours of sleep. We’re publishing it because these failures are common, under-documented, and preventable.
The Setup
We built an AI research agent for competitive intelligence. Users could ask questions like “What are the key features of [competitor]’s enterprise plan?” and the agent would:
- Search the web for relevant information
- Read and parse web pages
- Synthesize findings into a structured report
- Spawn sub-agents for deep dives on specific topics
The architecture looked like this:
```
Orchestrator Agent
├── Web Search Tool
├── Page Reader Tool
└── Deep Dive Agent (spawnable)
    ├── Web Search Tool
    └── Page Reader Tool
```

We were proud of it. Users loved the depth of analysis. Then came October 15th.
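To make the failure mode concrete, here is a rough sketch of that design, in which nothing stops a deep-dive agent from spawning more of itself. Class and method names are illustrative, not our actual implementation:

```python
# Rough, illustrative sketch of the pre-incident design; class and method
# names are hypothetical, and the real system was asynchronous and larger.

class DeepDiveAgent:
    def __init__(self, topic: str, tools: dict):
        self.topic = topic
        self.tools = tools  # e.g. {"web_search": ..., "read_page": ...}

    def run(self) -> str:
        # Researches its topic and, critically, could construct and run more
        # DeepDiveAgents with no depth or count limit.
        return f"report on {self.topic}"


class Orchestrator:
    def __init__(self, tools: dict):
        self.tools = tools

    def identify_topics(self, query: str) -> list[str]:
        return []  # placeholder for the LLM planning step

    def handle_query(self, query: str) -> list[str]:
        topics = self.identify_topics(query)
        agents = [DeepDiveAgent(topic, self.tools) for topic in topics]
        return [agent.run() for agent in agents]
```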
The Trigger
At 3:47 AM UTC, a user submitted a query:
“Comprehensive analysis of the AI observability market landscape including all major players, their features, pricing, and market positioning”
This was an ambitious query. The orchestrator correctly identified it as needing deep analysis and began spawning sub-agents.
The Cascade
Here’s what the logs showed:
- 3:47:12 — Orchestrator identifies 12 major players in the space
- 3:47:34 — Spawns 12 Deep Dive Agents, one per competitor
- 3:48:01 — Deep Dive Agents begin researching
- 3:49:45 — Agent #3 discovers “related tools” and decides to spawn sub-agents
- 3:50:12 — Agent #7 does the same
- 3:52:30 — 47 total agents running
- 3:55:00 — 156 agents running
- 3:58:00 — API rate limits hit, requests backing up
- 4:02:00 — Queue depth: 3,400 pending requests
- 4:15:00 — PagerDuty alert fires (30 minutes late due to alert misconfiguration)
- 4:23:00 — On-call engineer sees alert, begins investigation
- 4:34:00 — Manual kill switch triggered
- 4:34:00 — Final agent count: 312 spawned, 47 still active
The Bill
| Resource | Usage | Cost |
|---|---|---|
| GPT-4 tokens | 2.1M | $6,300 |
| Claude tokens | 890K | $1,340 |
| Web search API | 12,400 calls | $620 |
| Embedding API | 45K calls | $90 |
| Infrastructure | 47 min burst | $62 |
| Total | | $8,412 |
What Went Wrong
1. No Recursion Limits
The Deep Dive Agent could spawn more Deep Dive Agents. There was no depth limit. This was a conscious “feature” decision—we wanted agents to be able to do thorough research. We didn’t anticipate this failure mode.
The fix: Maximum agent depth of 2. The orchestrator (depth 0) can spawn Deep Dive Agents (depth 1), but those cannot spawn further agents.
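A minimal sketch of that depth guard, assuming a simple agent record that carries its depth; `Agent`, `SpawnDenied`, and `MAX_DEPTH` are illustrative names, not our production code:

```python
# A sketch of the depth guard; Agent, SpawnDenied, and MAX_DEPTH are
# illustrative names, not our production code.
from dataclasses import dataclass

MAX_DEPTH = 2  # orchestrator = depth 0, deep-dive agents = depth 1


class SpawnDenied(Exception):
    pass


@dataclass
class Agent:
    id: str
    topic: str
    depth: int
    parent_id: str | None = None


def spawn_child(parent: Agent, topic: str, new_id: str) -> Agent:
    child_depth = parent.depth + 1
    if child_depth >= MAX_DEPTH:
        # A depth-1 deep-dive agent lands here: no grandchildren allowed.
        raise SpawnDenied(f"max depth reached: {parent.id} may not spawn for {topic!r}")
    return Agent(id=new_id, topic=topic, depth=child_depth, parent_id=parent.id)
```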
2. No Total Agent Limit
Even with depth limits, breadth was unbounded. An agent could spawn 100 sub-agents if it identified 100 sub-topics.
The fix: Maximum 10 concurrent agents per query. Queue additional spawns and process sequentially.
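One way to enforce that cap is a semaphore shared by all of a query’s agents, so spawns beyond the limit wait their turn instead of running immediately. This is a minimal asyncio sketch with illustrative names, not our actual scheduler:

```python
# A sketch of the breadth cap: at most 10 agents per query run at once,
# and extra spawns queue on the semaphore. Names are illustrative.
import asyncio

MAX_CONCURRENT_AGENTS = 10


async def run_agents(agent_coros):
    """agent_coros: callables that each return an agent's coroutine."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_AGENTS)

    async def bounded(make_coro):
        async with sem:  # the 11th spawn onwards waits here
            return await make_coro()

    return await asyncio.gather(*(bounded(c) for c in agent_coros))
```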
3. No Per-Query Budget
Each agent had a token budget, but there was no aggregate budget for the entire query. 12 agents × 100K tokens each is 1.2M tokens, and that’s before any of their sub-agents are counted.
The fix: Per-query budget of 500K tokens. If the query needs more, it should checkpoint and ask the user to continue.
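A sketch of that aggregate budget, assuming all of a query’s agents share one counter and charge it before every model call; the `QueryBudget` class is an illustrative name:

```python
# A sketch of a shared per-query token budget; the 500K cap matches the fix
# above, and the class and method names are illustrative.
import threading


class QueryBudget:
    def __init__(self, max_tokens: int = 500_000):
        self.max_tokens = max_tokens
        self.used = 0
        self._lock = threading.Lock()

    def charge(self, tokens: int) -> bool:
        """Record usage; returns False once the query should checkpoint and ask the user."""
        with self._lock:
            self.used += tokens
            return self.used <= self.max_tokens
```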
4. Delayed Alerting
Our cost alerts were set at the account level, not per-query. The query burned $8K before any alert fired.
The fix: Per-query cost thresholds. Alert at $10, pause at $50, hard stop at $100.
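Those thresholds are easiest to reason about as a tiny policy function evaluated after every cost update. A hedged sketch, with the dollar figures taken from the fix above and everything else illustrative:

```python
# A sketch of the per-query cost policy; dollar figures are from the fix above,
# the action strings and function name are illustrative.
ALERT_AT, PAUSE_AT, STOP_AT = 10.0, 50.0, 100.0


def cost_action(cost_usd: float) -> str:
    if cost_usd >= STOP_AT:
        return "stop"   # hard stop: kill every agent on the query
    if cost_usd >= PAUSE_AT:
        return "pause"  # suspend new spawns and tool calls, page someone
    if cost_usd >= ALERT_AT:
        return "alert"  # notify, keep running
    return "ok"
```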
5. No Circuit Breakers
When API rate limits hit, agents kept retrying with exponential backoff. This extended the incident duration because agents weren’t failing fast.
The fix: Circuit breaker pattern. After 3 failures in 10 seconds, stop trying and report failure.
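A minimal circuit breaker implementing that rule (three failures inside a ten-second window open the circuit) might look like this sketch; it is an illustrative class, not a specific library:

```python
# A minimal circuit breaker: 3 failures within 10 seconds opens the circuit
# and subsequent calls fail fast instead of retrying. Sketch only.
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window_s: float = 10.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.failures: deque = deque()

    def record_failure(self) -> None:
        self.failures.append(time.monotonic())

    def is_open(self) -> bool:
        cutoff = time.monotonic() - self.window_s
        while self.failures and self.failures[0] < cutoff:
            self.failures.popleft()  # drop failures outside the window
        return len(self.failures) >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.is_open():
            raise RuntimeError("circuit open: failing fast instead of retrying")
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
```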
6. Incomplete Tracing
Our logs captured tool calls but not agent spawning decisions. We couldn’t easily see why Agent #3 decided to spawn sub-agents.
The fix: Log reasoning steps, not just actions. Every spawn decision should include the reasoning that led to it.
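A sketch of what that could look like as structured log events, where the reasoning string travels with every spawn decision; the event shape and field names are illustrative, not a specific tracing library:

```python
# A sketch of structured spawn-decision logging; the event shape and field
# names are illustrative.
import json
import logging
import time

logger = logging.getLogger("agent.trace")


def log_spawn_decision(agent_id: str, reasoning: str, children: list,
                       depth: int, budget_tokens: int) -> None:
    # The reasoning field is the point: it records *why* the spawn happened,
    # not just that it did.
    logger.info(json.dumps({
        "event": "spawn_decision",
        "ts": time.time(),
        "agent_id": agent_id,
        "reasoning": reasoning,
        "children": children,
        "depth": depth,
        "budget_tokens": budget_tokens,
    }))
```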
The Trace We Wish We Had
What we could reconstruct:
```
[3:47:12] Orchestrator spawned
[3:47:34] Sub-agent spawned (id: abc123)
[3:47:35] Tool call: web_search("competitor A features")
[3:47:38] Tool result: 2,400 characters...
```

What we wish we had:
```
[3:47:12] Orchestrator spawned
[3:47:12] Reasoning: "Query requests comprehensive analysis. Identifying major players..."
[3:47:30] Decision: Spawn 12 sub-agents for deep dive
[3:47:30] Reasoning: "12 competitors identified. Each needs dedicated research."
[3:47:30] Budget check: 500K available, allocating 40K per sub-agent
[3:47:34] Sub-agent abc123 spawned (depth: 1, budget: 40K, parent: orchestrator)
[3:49:45] Sub-agent abc123 reasoning: "Found 5 related tools. Should I spawn sub-agents?"
[3:49:45] Sub-agent abc123 decision: Spawn 5 sub-sub-agents
[3:49:45] ⚠️ LIMIT HIT: Max depth exceeded, denying spawn request
```

This trace would have shown us the cascade in real time and enabled automatic intervention.
The Kill Switch We Should Have Had
Our kill switch was a manual process:
- SSH into server
- Find running processes
- Send SIGTERM
- Hope cleanup handlers work
Our kill switch now:
- Single API call: `POST /emergency/stop?query_id=xxx`
- All agents receive immediate termination signal
- In-flight API calls are cancelled
- Partial results are saved
- User is notified with explanation
Also: a big red button in the admin dashboard.
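As a sketch of what that single stop endpoint could look like, assuming a FastAPI service and an in-memory registry of running agents per query (both the registry and the `cancel()` method on agent handles are assumptions, not our actual code):

```python
# A sketch of the stop endpoint using FastAPI; the route matches the one above,
# while ACTIVE_AGENTS and the cancel() method on agent handles are assumptions.
from fastapi import FastAPI, HTTPException

app = FastAPI()
ACTIVE_AGENTS: dict = {}  # query_id -> running agent handles (illustrative)


@app.post("/emergency/stop")
async def emergency_stop(query_id: str):
    agents = ACTIVE_AGENTS.get(query_id)
    if agents is None:
        raise HTTPException(status_code=404, detail="unknown query_id")
    for agent in agents:
        agent.cancel()  # assumed to signal termination and cancel in-flight calls
    # Saving partial results and notifying the user would happen here.
    return {"query_id": query_id, "agents_stopped": len(agents)}
```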
Lessons Learned
- Recursive systems need hard limits — Even if you trust your agents, bound the recursion
- Budget at every level — Per-token, per-agent, per-query, per-user, per-day
- Trace decisions, not just actions — You need to understand why things happened
- Alert on anomalies, not just thresholds — 50 agents spawning in 5 minutes is always weird
- Practice your kill switch — If you’ve never used it, it probably doesn’t work
- Sleep doesn’t scale — If you’re on-call, make sure alerts actually wake you
What We Monitor Now
| Metric | Threshold | Action |
|---|---|---|
| Concurrent agents | > 10 | Alert |
| Agent spawn rate | > 5/minute | Alert |
| Query cost | > $10 | Alert |
| Query cost | > $50 | Pause |
| Query cost | > $100 | Stop |
| Query duration | > 10 minutes | Alert |
| Token rate | > 50K/minute | Alert |
| Error rate | > 10% | Circuit breaker |
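For reference, those thresholds can be kept as a small declarative rule set and evaluated against each metric reading. This is a sketch of that idea, not our actual monitoring configuration:

```python
# The table above expressed as a declarative rule set; field names and the
# evaluate() helper are illustrative.
MONITORING_RULES = [
    {"metric": "concurrent_agents", "threshold": 10,     "action": "alert"},
    {"metric": "agent_spawn_rate",  "threshold": 5,      "action": "alert"},   # per minute
    {"metric": "query_cost_usd",    "threshold": 10,     "action": "alert"},
    {"metric": "query_cost_usd",    "threshold": 50,     "action": "pause"},
    {"metric": "query_cost_usd",    "threshold": 100,    "action": "stop"},
    {"metric": "query_duration_s",  "threshold": 600,    "action": "alert"},
    {"metric": "tokens_per_minute", "threshold": 50_000, "action": "alert"},
    {"metric": "error_rate",        "threshold": 0.10,   "action": "circuit_breaker"},
]


def evaluate(metric: str, value: float) -> list:
    """Return every action whose threshold this reading crosses."""
    return [r["action"] for r in MONITORING_RULES
            if r["metric"] == metric and value > r["threshold"]]
```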
Conclusion
Agent systems fail in ways that traditional software doesn’t. The combination of autonomy, recursion, and external API dependencies creates novel failure modes.
The only defense is defense in depth: limits at every level, monitoring of every metric, and kill switches that actually work.
Our agent is still in production. It’s better now. And we sleep better too.
Up next: Building Agent Traces — Instrumentation patterns for full visibility into agent execution.