
AI agents in production: the demo worked, the deploy didn't

May 16, 2026 | 10 min read


The demo of your AI agent ran for 30 seconds, called three tools, and booked a meeting on the founder's calendar. The board clapped. Production shipped two weeks later. Within the first 1,000 runs, the agent had double-booked four customers, retried the same API call 14 times in a loop, burned through $6,000 in token costs, and confidently emailed someone the wrong contract.

This isn't a hypothetical. It's the shape of every agent rollout we've seen in the past year. The technology works. The deployment patterns most teams use don't. Here's the gap, and how to close it.

What breaks between demo and 1,000 runs

Demos hide failure modes that production exposes. Five of them dominate.

Tool-call loops

The agent calls get_user, gets back an error, calls it again with the same args, gets the same error, and repeats until your retry budget runs out. We've seen agents make 40+ identical tool calls because the model couldn't reason its way out of a transient API failure. Each loop costs tokens. Each loop adds latency. Each loop pushes the agent further from its original goal as the context window fills with error messages.
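
One way to break this cycle outside the model is a guard in the tool dispatcher that counts identical calls. The sketch below is a minimal illustration, not a drop-in fix: `LoopGuard`, `RepeatedCallError`, and the threshold of three are names and values assumed for this example.

```python
import json
from collections import Counter


class RepeatedCallError(RuntimeError):
    """Raised when the same tool call has been attempted too many times."""


class LoopGuard:
    """Counts identical (tool, args) pairs within one agent run (hypothetical helper)."""

    def __init__(self, max_identical_calls: int = 3):
        self.max_identical_calls = max_identical_calls
        self._seen = Counter()

    def check(self, tool_name: str, args: dict) -> None:
        # Canonicalize the arguments so key order does not matter.
        signature = (tool_name, json.dumps(args, sort_keys=True, default=str))
        self._seen[signature] += 1
        if self._seen[signature] > self.max_identical_calls:
            raise RepeatedCallError(
                f"{tool_name} called {self._seen[signature]} times with identical arguments"
            )
```

Calling `guard.check(name, args)` before each tool execution lets the run halt or escalate instead of filling the context with the same error.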

Context window overflow

A multi-step task with five tool calls, RAG retrieval, and a long system prompt can eat 30,000 tokens before the agent does any real work. On a real task with 15 steps, where every earlier tool result is replayed on each later turn, you can hit a 200K context limit halfway through. Unless the framework is configured to raise an error, the agent doesn't fail loudly; the earliest part of the conversation is silently dropped, and that is usually the user's actual request.
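
A common mitigation is to pin the messages that must never be dropped and trim only the rest. The following is a rough sketch under stated assumptions: the `pinned` flag, the message shape, and the four-characters-per-token estimate are all invented for illustration, and a real deployment would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep every pinned message, then fill what is left with the newest turns."""
    pinned = [m for m in messages if m.get("pinned")]
    rest = [m for m in messages if not m.get("pinned")]

    used = sum(estimate_tokens(m["content"]) for m in pinned)
    if used > budget:
        # Fail loudly rather than silently dropping the user's request.
        raise ValueError("pinned messages alone exceed the context budget")

    kept_ids = set()
    for m in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(m["content"])
        if used + cost > budget:
            break
        used += cost
        kept_ids.add(id(m))

    # Preserve the original ordering of whatever survived.
    return [m for m in messages if m.get("pinned") or id(m) in kept_ids]
```

Marking the system prompt and the original user request as pinned means overflow removes old tool output first, and the run errors out instead of quietly forgetting the task.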

Confidently wrong outputs

The agent returns structured JSON that passes schema validation, contains plausible field values, and is completely wrong. The customer ID is hallucinated. The meeting time is in the wrong timezone. The contract amount is off by a decimal place. Schema validation catches type errors; it doesn't catch reality errors.
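
Reality errors need checks against the world, not against the schema. As a hedged sketch, the function below runs after schema validation and compares the output to a system of record; the field names `customer_id` and `start` and the `known_customer_ids` lookup are assumptions made for this example.

```python
from datetime import datetime, timezone


def reality_check(booking: dict, known_customer_ids: set[str], now: datetime) -> list[str]:
    """Return a list of problems that schema validation alone would not catch."""
    problems = []

    # A well-formed ID can still be hallucinated: look it up.
    if booking["customer_id"] not in known_customer_ids:
        problems.append("customer_id not found in system of record")

    # A well-formed timestamp can still be ambiguous or stale.
    start = datetime.fromisoformat(booking["start"])
    if start.tzinfo is None:
        problems.append("start time has no timezone offset")
    elif start <= now:
        problems.append("start time is in the past")

    return problems
```

An empty list means the output survived these particular checks, not that it is correct; each new incident is a candidate for another check.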

Non-deterministic costs

The same input produces a 4-step run on Monday and a 14-step run on Tuesday. Costs vary 5x between runs. Your unit economics assumption ("$0.50 per agent call") is a fiction because the agent sometimes spirals. Without a per-run budget cap, one bad task can cost more than 100 normal ones.

Silent prompt drift

Someone updates the system prompt to fix one edge case. The fix breaks three other behaviors nobody tested. You find out from a customer ticket six days later. Without an eval suite running on every change, prompt edits are blind code changes with no test coverage.

The real cost of running an agent

The headline price per million tokens is misleading. A working agent burns more on context replays, retries, and the long reasoning traces that models such as Claude and GPT-5 produce when their reasoning modes are enabled. Here's the math from production deployments we've shipped or audited in the past six months.

Use case	Avg tool calls	Cost per run	Monthly @ 10K runs
Doc Q&A (read-only)	2-3	$0.04	$400
Sales research agent	6-10	$0.40	$4,000
Booking / CRM updates	8-14	$0.90	$9,000
Multi-step ops agent	15-30	$3.80	$38,000

Those numbers assume the agent isn't looping. With a 5% loop rate (low for a fresh deployment), real costs run 30-50% higher. Budget for the failure modes; they're not edge cases.
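
The uplift can be reasoned about with a simple two-population model. This model is an assumption of the sketch, not something measured in the deployments above: a fraction of runs loop, and each looping run costs some multiple of a normal one.

```python
def effective_cost_per_run(base_cost: float, loop_rate: float, loop_multiplier: float) -> float:
    """Average cost when `loop_rate` of runs cost `loop_multiplier` times the base."""
    return base_cost * ((1 - loop_rate) + loop_rate * loop_multiplier)


def implied_loop_multiplier(loop_rate: float, uplift: float) -> float:
    """Solve (1 - p) + p * k = 1 + uplift for k."""
    return 1 + uplift / loop_rate


if __name__ == "__main__":
    for uplift in (0.30, 0.50):
        k = implied_loop_multiplier(0.05, uplift)
        cost = effective_cost_per_run(0.90, 0.05, k)
        print(f"uplift {uplift:.0%}: looping run = {k:.0f}x, booking agent = ${cost:.2f}/run")
```

Under that model, a 5% loop rate that raises costs by 30-50% implies each looping run costs roughly 7x to 11x a normal one. For the booking agent, $0.90 per run becomes about $1.17 to $1.35, or $11,700 to $13,500 per month at 10K runs.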

The deployment pattern that works

Across the deployments that survive past the first month, the same six pieces show up.

1. Per-run budget caps

Every agent run has a hard token budget, a tool-call cap, and a wall-clock timeout. When any one trips, the run halts and escalates. This single change stops the worst-case spend overnight. Set the budget at 2x the median run; tune from there.
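
A minimal sketch of the three caps in one object follows. `RunBudget`, `BudgetExceeded`, and `starting_token_budget` are hypothetical names, and the injectable `clock` parameter exists only so the timeout can be tested without sleeping.

```python
import time
from statistics import median


class BudgetExceeded(RuntimeError):
    """Raised when any cap trips; the caller should halt the run and escalate."""


class RunBudget:
    def __init__(self, max_tokens: int, max_tool_calls: int, max_seconds: float,
                 clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.tokens_used = 0
        self.tool_calls = 0

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record usage after each model or tool call and enforce all three caps."""
        self.tokens_used += tokens
        self.tool_calls += tool_calls
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap reached")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded("wall-clock timeout")


def starting_token_budget(past_run_tokens: list[int]) -> int:
    # The "2x the median run" starting point described above.
    return int(2 * median(past_run_tokens))
```

The agent loop calls `budget.charge(...)` after every step and treats `BudgetExceeded` as a signal to stop and hand off, never as something to retry.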

2. Idempotent tools, not chatty ones

Replace create_booking with create_booking(idempotency_key). Replace send_email with send_email(dedupe_window=5min). When the agent retries (it will), it retries safely. Most production incidents we've debugged came from non-idempotent tools called twice.
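
To make the retry-safety concrete, here is an in-memory sketch of an idempotent booking tool. `BookingStore` and `booking_key` are invented for illustration; a production version would persist the keys in the same transaction as the booking.

```python
import hashlib


class BookingStore:
    """Toy in-memory store: the same idempotency key always returns the same booking."""

    def __init__(self):
        self._by_key = {}
        self.bookings_created = 0

    def create_booking(self, idempotency_key: str, customer_id: str, slot: str) -> dict:
        if idempotency_key in self._by_key:
            # Retry: return the original result instead of booking twice.
            return self._by_key[idempotency_key]
        self.bookings_created += 1
        booking = {"id": f"bk_{self.bookings_created}", "customer_id": customer_id, "slot": slot}
        self._by_key[idempotency_key] = booking
        return booking


def booking_key(run_id: str, customer_id: str, slot: str) -> str:
    # Derive the key from the task, not from the model, so a retry reuses it.
    raw = f"{run_id}:{customer_id}:{slot}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

The design choice that matters is where the key comes from: if the model generates a fresh key on every attempt, the tool is idempotent in name only.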

3. Human-in-the-loop for the irreversible

For any tool that costs money, contacts a customer, or writes to a system of record, surface a confirmation step. The agent prepares the action; a human approves it. The UX is "review the booking, click confirm." This caps the blast radius of every failure mode listed above. You can automate the boring 90% and keep humans on the 10% that matters.
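
The confirmation step can be sketched as a gate in front of the tool dispatcher. Everything here is an assumption for illustration: the `ApprovalGate` class, the ticket numbering, and the idea that tools are plain callables in a dict.

```python
class ApprovalGate:
    """Runs reversible tools immediately and parks irreversible ones for a human."""

    def __init__(self, tools: dict, irreversible: set):
        self.tools = tools
        self.irreversible = irreversible
        self.pending = {}
        self._next_ticket = 1

    def dispatch(self, tool_name: str, args: dict) -> dict:
        if tool_name in self.irreversible:
            ticket = self._next_ticket
            self._next_ticket += 1
            # The agent has prepared the action; nothing has executed yet.
            self.pending[ticket] = (tool_name, args)
            return {"status": "awaiting_approval", "ticket": ticket}
        return {"status": "done", "result": self.tools[tool_name](**args)}

    def approve(self, ticket: int) -> dict:
        # pop() ensures one approval executes the action at most once.
        tool_name, args = self.pending.pop(ticket)
        return {"status": "done", "result": self.tools[tool_name](**args)}

    def reject(self, ticket: int) -> None:
        self.pending.pop(ticket)
```

The review UI only needs to render `pending` and call `approve` or `reject`; the agent never holds a direct reference to the irreversible tools.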

4. Eval suites on every prompt change

Build a fixture set of 50-200 representative inputs, each paired with its expected output. Run it on every prompt update, every model upgrade, every tool schema change. Grade structured fields deterministically and free-form text with an LLM judge. Without this, your prompt is untested code.
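
The deterministic half of such a suite fits in a few lines. This is a sketch under assumptions: `agent` is any callable that maps an input to a dict, the fixture shape is invented, and the 95% pass threshold is a placeholder rather than a recommendation. The LLM-judge half is omitted.

```python
def grade_exact(expected: dict, actual: dict, fields: list[str]) -> list[str]:
    """Return the names of structured fields that do not match exactly."""
    return [f for f in fields if expected.get(f) != actual.get(f)]


def run_evals(agent, fixtures: list[dict], fields: list[str], min_pass_rate: float = 0.95) -> dict:
    """Run every fixture through the agent and summarize the failures."""
    failures = []
    for case in fixtures:
        actual = agent(case["input"])
        wrong = grade_exact(case["expected"], actual, fields)
        if wrong:
            failures.append({"input": case["input"], "wrong_fields": wrong})
    pass_rate = 1 - len(failures) / len(fixtures)
    return {"pass_rate": pass_rate, "passed": pass_rate >= min_pass_rate, "failures": failures}
```

Wiring `run_evals` into CI so that a failed report blocks the merge is what turns a prompt edit into a tested change.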

5. Observability you can actually read

Log every tool call, every model response, every retry, with a trace ID that ties them to one user request. Tools like Langfuse, Braintrust, or a custom OpenTelemetry pipeline pay for themselves the first time you debug a customer complaint. Without traces, "why did the agent do that?" is unanswerable.
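
At its simplest, the trace is one JSON line per event with a shared ID. The sketch below uses only the standard library and does not reproduce the API of any tool named above; `Tracer`, the `kind` field, and the event shape are assumptions for illustration.

```python
import json
import time
import uuid
from typing import Callable, Optional


class Tracer:
    """Emits one JSON line per event, all tagged with the same trace ID."""

    def __init__(self, emit: Callable[[str], None] = print, trace_id: Optional[str] = None):
        # One trace ID per user request ties every event together.
        self.trace_id = trace_id or uuid.uuid4().hex
        self._emit = emit

    def record(self, kind: str, **fields) -> dict:
        event = {"trace_id": self.trace_id, "ts": time.time(), "kind": kind, **fields}
        # default=str keeps logging from crashing on non-JSON values.
        self._emit(json.dumps(event, default=str))
        return event
```

Recording `tool_call`, `model_response`, and `retry` events through the same tracer means one filter on `trace_id` reconstructs everything the agent did for a given request.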

6. A smaller graph, not a smarter agent

The teams shipping most reliably break the "one big agent" into a small graph of focused agents, each with 3-5 tools and a tight scope. A "researcher" agent gathers context; a "writer" agent drafts; a "validator" agent checks. Each step is testable on its own. Failures localize. The model has less to reason about per step, which collapses loop rates and cost variance.
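
Structurally, the graph can be as plain as an ordered list of named steps that pass a state dict along. In this sketch the three step functions are stubs standing in for model calls, and `run_graph` is a hypothetical runner rather than any framework's API.

```python
def run_graph(steps: list, state: dict) -> dict:
    """Run named steps in order; each returns only the keys it adds to the state."""
    state = dict(state)
    for name, step in steps:
        try:
            state.update(step(state))
        except Exception as exc:
            # Failures localize: the error names the step that broke.
            raise RuntimeError(f"step '{name}' failed: {exc}") from exc
    return state


def researcher(state: dict) -> dict:
    # Stub: a real version would call retrieval tools only.
    return {"facts": [f"fact about {state['topic']}"]}


def writer(state: dict) -> dict:
    # Stub: a real version would call the model with no tools at all.
    return {"draft": "; ".join(state["facts"])}


def validator(state: dict) -> dict:
    if not state["draft"]:
        raise ValueError("empty draft")
    return {"approved": True}


PIPELINE = [("researcher", researcher), ("writer", writer), ("validator", validator)]
```

Because each step is an ordinary function of the state, it can be unit-tested and evaluated on its own fixtures before the pieces are composed.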

When to use an agent vs. a workflow

Not every problem needs an agent. The fastest way to ship is often a deterministic workflow with one LLM call at each decision point. Agents earn their cost when the path is genuinely unknown ahead of time. For tasks with a fixed shape, a workflow is cheaper, faster, easier to test, and easier to explain to a customer.

Pattern	Use when	Cost shape
Workflow	Steps known in advance	Fixed, predictable
Graph of small agents	A few branching paths	Bounded, testable
Single autonomous agent	Truly open-ended research	High variance

At Savi, the default for client work is "workflow first, agent only where it pays for itself." It's not always sexy, but it's what survives the second invoice. If an agent is already burning more LLM spend than you budgeted, the fix is almost always to cut it down to a workflow with one or two judgment points, not to add more tools.
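
For a sense of what a workflow with a single judgment point can look like, consider the ticket-routing sketch below. The use case, the `ROUTES` table, and the `classify` callable (which would wrap one LLM call) are all hypothetical.

```python
# Fixed routing table: the model may only choose among these labels.
ROUTES = {
    "billing": "billing-queue",
    "bug": "eng-queue",
    "human_review": "support-queue",
}


def handle_ticket(ticket: dict, classify) -> dict:
    # Step 1 (deterministic): normalize the input.
    text = ticket["body"].strip()

    # Step 2 (the one judgment point): ask the model for a label.
    label = classify(text)
    if label not in ROUTES:
        # Never act on an out-of-vocabulary answer; fall back to a human.
        label = "human_review"

    # Step 3 (deterministic): route based on the label.
    return {"ticket_id": ticket["id"], "queue": ROUTES[label]}
```

The model's influence is confined to picking one key from a fixed table, so the cost per ticket is one call and the worst case is a ticket landing in the human review queue.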

Ship the boring version first

The agent demos that go viral are the ones that book a flight in 90 seconds with no human in the loop. The agents that bill a real customer next month are the ones with budget caps, idempotent tools, a confirm button, and an eval suite. Build the boring version first. Add autonomy where the eval suite proves it survives.

Frequently asked questions

Why do AI agents fail in production?

Five failure modes dominate: tool-call loops where the agent keeps retrying the same step, context windows that overflow on long tasks, confidently wrong outputs that pass validation but fail in the real world, costs that vary widely from run to run, and prompt edits that silently break behaviors nobody tested.

How much does running an AI agent cost?

A single multi-step agent run costs $0.40-$4.00 depending on the model, the number of tool calls, and the size of the context. A product handling 10,000 agent runs per month pays $4,000-$40,000 in inference alone.

Do AI agents need human oversight?

For any action that costs money, sends a message, or writes to a system of record, yes. Best practice today is human-in-the-loop for high-stakes actions and audit logs for everything else.

How do you test an AI agent?

Eval suites with 50-200 representative inputs, deterministic graders for structured outputs, and LLM-as-judge for free-form responses. Re-run on every prompt change and every model version.
