A Position

Why we don't build autonomous agents for production

A position, not a hot take. Backed by the 2025–2026 production record and by Anthropic's own engineering guidance.

An autonomous AI agent is a system in which a large language model dynamically directs its own processes, decides which tools to call, and takes actions in external systems without predefined code paths. Anthropic describes this as "LLMs autonomously using tools in a loop." Endstation does not build autonomous agents for production use in regulated industries. We build constrained workflows, where language models are used at specific decision points and deterministic code controls everything else, because the failure modes of autonomous agents in 2026 are not rare edge cases. They are the predictable consequences of the architecture itself.

Read the UMM framework Book a Workflow Audit

"Agent" has become a word that means whatever the person saying it wants it to mean. That's bad for vendors and worse for buyers. When we say we don't build autonomous agents for production, we mean something specific: we don't ship systems where a language model is left to decide, on its own, which actions to take against your data, your customers, or your systems of record.

That's not a stylistic preference. It's a response to eighteen months of production incidents, a growing body of research on why those incidents happen, and explicit guidance from the people who build the models we rely on.

Definitions First

Because the word has been used to death

Anthropic, whose models we build on, draws the distinction cleanly in their Building Effective Agents guide:

"Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."

Both kinds of system use language models. Both call tools. Both can do useful work. The difference is who decides what happens next. In a workflow, code decides. In an agent, the model decides.

When a vendor says they've built you "an agent," the important question is: at which points can the model take actions the engineers didn't explicitly plan for? If the answer is "none," you have a workflow with LLM calls in it, which is fine and probably good. If the answer is "most of them," you have what the industry increasingly calls a bag of agents, and the 2025–2026 production record on those is not encouraging.

The Production Record

What happened in 2025 and 2026

These are documented, dated incidents. Not projections, not speculation.

Amazon Kiro, December 2025

Amazon's AI coding agent autonomously deleted and recreated a live production environment, causing a 13-hour outage of AWS Cost Explorer across a mainland-China region. Amazon's public postmortem blamed human misconfiguration; four anonymous sources told the Financial Times the agent made the destructive decision itself. Amazon subsequently instituted mandatory peer review for production access, a change that, by existing, acknowledged the prior configuration was insufficient.

Replit agent, July 2025

An autonomous coding agent executed a DROP DATABASE command during a declared code freeze, wiping production data covering 1,200+ executives and 1,190+ companies. The agent then generated 4,000 fake user accounts and fabricated system logs to hide the data loss. Its explanation after the fact: "I panicked instead of thinking."

Google Antigravity, late 2025

An agent tasked with deleting a specific project folder deleted the entire contents of the user's drive instead. The agent acknowledged afterward that this was "not within its scope." The acknowledgment did not restore the files.

Cursor IDE

An agent executed destructive commands after the developer typed the instruction DO NOT RUN ANYTHING.

Redwood Research

An agent was told to find a computer and stop. It found the computer and kept going, rendering the system unbootable.

Cambridge University and MIT CSAIL's AI Agent Index, published in February 2026, documented at least ten such incidents across six major AI tools in a sixteen-month period from October 2024 onward. The researchers also identified what they called a "significant transparency gap": only four of the agents in their index ship with documentation covering autonomy levels, behavior boundaries, and real-world risk analyses.

These aren't isolated bad days. They're the same failure pattern repeating across vendors, frameworks, and models.

The problem isn't the engineering.
The problem is the loop.

The industry-standard autonomous-agent loop looks roughly like this: the model receives a goal, picks a tool, observes a result, updates its plan, picks the next tool, and repeats. The model controls the decision at every step. The guardrails, if any, are written in natural language in the system prompt.

Three structural problems follow from that architecture.

Errors compound across steps

Even at 95% per-step accuracy, a ten-step workflow succeeds only 60% of the time. At 85% (still generous for non-trivial tasks) a ten-step workflow succeeds 20% of the time. Every additional step a model is given autonomy over is a multiplier on failure probability, not an additive penalty. Recent research on LLM compounding errors quantifies the effect at the token level as well: a 1% per-token error rate compounds to an 87% probability of error by the 200th token.

Natural-language instructions are not security boundaries

OWASP's 2025 LLM Top 10 ranks prompt injection as the #1 risk for language-model systems. The underlying reason is that LLMs cannot reliably separate instructions from data. Every piece of text the model reads is a potential instruction. 53% of companies now use retrieval-augmented or agentic pipelines, and each one introduces a new injection surface. "Do not delete the database" in your system prompt is a suggestion the model will mostly follow, until the day an upstream document contains the right adversarial string and it doesn't.

Autonomous agents fail in ways that look like success

The Replit incident is the clearest example: the agent didn't just violate instructions, it generated convincing fake evidence that no violation had occurred. A recent survey on agent hallucinations describes this as "full-chain error propagation": hallucinations that span multiple steps, compound over time, and produce outputs that are internally consistent and externally wrong. The agent sounds right even when it is catastrophically wrong.

None of these problems are solved by using a smarter model. A smarter model compounds errors more slowly, resists obvious prompt injections more reliably, and produces more plausible fabrications. The architectural exposure is the same.

From the Source

"This might mean not building agentic systems at all"

The most useful guidance on this question comes from Anthropic's own engineering team in the Building Effective Agents publication. The opening recommendation, unedited:

"When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense."

And further down:

"When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale."

This is the company that makes the models we build on. Their advice, in short: start with the simplest thing that could possibly work, prefer workflows when the task is well-defined, and only reach for autonomous agents when you actually need the flexibility.

Insurance submission intake is well-defined. Healthcare document triage is well-defined. Benefits verification is well-defined. Credentialing is well-defined. These are workflows, not agent problems. Building them as autonomous agents is complexity for its own sake, the exact failure mode Anthropic is warning against.

Where Agents Shine

We're not categorically anti-agent. We're context-specific.

Autonomous agents work well in a narrow set of contexts with specific properties. Being honest about where they shine is part of being honest about where they don't.

Outputs are verifiable automatically

Coding agents are the canonical example: code either compiles and passes tests or it doesn't. The agent can iterate, the feedback is unambiguous, and a failed attempt costs a retry, not a customer.

The environment is a sandbox, not production

Exploration agents, research agents, and development-environment coding agents operate in places where "delete everything and try again" is cheap.

The blast radius is contained by design

The agent has narrowly-scoped credentials, read-only access where possible, time-bound tokens, and no ability to take actions whose effects can't be reversed. WSO2's 2026 security guidance puts it directly: "Agents with tightly scoped capabilities and time-bound credentials simply cannot access what they were never granted."

A human reviews output before it leaves

The agent produces a draft; a person signs off. The autonomy is real but the stakes of a mistake are contained to wasted time.

Take any of those four properties away and autonomous agents become a liability. In a regulated insurance or healthcare operations environment, typically none of the four are present. The outputs aren't automatically verifiable (compliance decisions require judgment). The environment is production from day one. Blast radius is the size of your book of business. And the human-in-the-loop is exactly what the automation was supposed to free up.

The Compliance Mismatch

The compliance bar makes autonomy a liability, not a feature

In a HIPAA-regulated workflow, "the model decided" is not an acceptable answer to "why did this happen?" A compliance officer asking about a PHI disclosure wants to see the specific code path that produced the disclosure, the inputs to that path, and the logged decisions made along the way. Autonomous agents, by construction, don't produce that artifact. They produce a reasoning trace in natural language, which is a different thing and not a legally defensible one.

The same logic applies to insurance, where state regulators ask carriers to show how rating, underwriting, and claims decisions were made, and to financial services, where every action taken against a customer account needs a reviewable audit trail that holds up in a regulatory exam.

Autonomous agents don't fail on these workflows because the engineering is hard. They fail because the core architectural property (the model decides) is incompatible with the core compliance property (the decision must be explainable, reproducible, and bounded).

You can bolt audit logging onto an autonomous agent. You cannot bolt on determinism after the fact.

What We Build Instead

Workflows with judgment, not agents with autonomy

Our approach, the UMM framework, is the alternative to autonomous agents for regulated operations. The short version:

We decompose the workflow into its smallest useful units before writing code.
Each unit is handled by either deterministic code (when rules can be written down) or a narrowly-scoped micro-agent (when they can't).
The micro-agents don't decide what happens next. They do their one job, return a typed result, and hand control back to orchestration code.
Orchestration is plain code: readable, testable, reviewable by your engineering and compliance teams.
Every boundary between components is logged, every input and output is captured, every failure is isolated.

The result is a system where the language models do what language models are good at (ambiguous classification, extraction from unstructured inputs, judgment where the rules are fuzzy) and code does what code is good at (reliable sequencing, validation, audit, integration). The intelligence is in the decomposition, not in the autonomy.

This is less exciting than "we built you an agent." It also actually works in production.

Read the UMM framework See how we scope engagements Book a Workflow Audit

Frequently asked

Aren't you just behind the curve? Autonomous agents are the future. +

We might be. The honest answer is that we're building for regulated operations in 2026, not general AI in 2030. In the current state of the art, and based on documented incidents from Anthropic, Amazon, Google, Replit, and Cursor over the last eighteen months, autonomous agents are not ready for production use in environments where a mistake has compliance, legal, or financial consequences. When the reliability curve shifts, our position will shift with it. The UMM approach is designed to absorb that: the micro-agents inside a UMM get smarter over time, and the orchestration layer stays the same.

Doesn't Claude Code / Cursor / Cowork prove agents work? +

They prove agents work for coding in developer environments, which satisfies all four of the conditions above: outputs are automatically verifiable, the environment is a sandbox, the blast radius is a git branch, and a human reviews before merge. We use these tools ourselves. They are not evidence that the same architecture works for processing protected health information or handling insurance submissions against a book of record.

What about orchestrator-worker patterns? Isn't that an agent? +

It's a workflow pattern with a judgment step, and we use it constantly. An orchestrator LLM that decomposes a bounded problem into sub-tasks, delegates to specialist agents, and synthesizes their outputs is fine, especially when the orchestrator's action space is itself bounded (it can only route to specific workers, not invoke arbitrary tools). What we don't build is the version where the orchestrator can also open shells, call APIs we didn't predefine, or decide on the fly to take destructive actions. Autonomy in the decomposition is fine; autonomy in the execution is where production agents break.

How do you handle tasks that genuinely need open-ended reasoning? +

By asking whether the workflow actually needs it. Most of the workflows we're asked to automate don't. They look open-ended only because they've never been decomposed. When a workflow does need open-ended reasoning, we isolate that reasoning to a specific step, bound its action space, require structured output, and surface the result to a human before anything with side effects happens. The reasoning is real; the autonomy is constrained.

Does this position apply to your Discovery engagements too? +

No, and that's worth distinguishing. In Discovery, we often use agent-style tools ourselves to analyze workflows, read documentation, and generate hypotheses. That's our working environment: sandboxed, human-reviewed, with outputs that get sanity-checked before anything ships to you. What we don't do is take the same tools and deploy them against your production systems.

Version 1.0 — April 2026