Endstation

The UMM Framework

The Ultimate Macro Machine

A micro-agent factory for workflows that can't afford to fail silently.

An Ultimate Macro Machine (UMM) is an AI architecture pattern that decomposes a complex workflow into specialized micro-agents, each with a single job, known inputs, known outputs, and isolated failure modes, connected by deterministic code. Endstation developed the UMM framework for production AI in regulated industries where reliability, auditability, and compliance matter more than autonomous behavior.

One agent, twelve jobs, compounding failure

Most AI failures in production aren't model failures. They're architecture failures. A single agent asked to do twelve things will do eleven of them reasonably and one of them wrong, and you won't know which one until a customer calls.

UMMs invert that. We decompose the workflow first, then build a small specialized agent for each step, then chain them with deterministic code you can read, test, and audit. Each agent has exactly one job. If something breaks, it breaks in a single place, with a logged input and a logged output, and a human can see exactly why.

The industry default for agentic AI is still what Google Research calls the "bag of agents" pattern: hand a single LLM a long task, a handful of tools, and some instructions, and let it loop until it thinks it's done. This works beautifully in demos. It fails in production for two compounding reasons.

01

Errors multiply across steps

If a single agent has 95% accuracy on each sub-step of a ten-step workflow, the end-to-end accuracy is not 95%. It's 0.95^10, about 60%. Stretch it to twenty steps and you're below 40%. Research from Wand.ai puts the token-level version of the problem in starker terms: a 1% per-token error rate escalates to an 87% probability of error by the 200th token. Long-horizon autonomous agents are fighting exponential decay, not a fixed error budget.
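The arithmetic behind both numbers is easy to check directly (a minimal sketch; the independence-of-errors assumption is the same simplification the cited figures use):

```python
# End-to-end accuracy of a sequential workflow where every step must
# succeed, assuming per-step errors are independent.
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    return per_step ** steps

print(round(end_to_end_accuracy(0.95, 10), 2))   # ~0.60 for ten steps
print(round(end_to_end_accuracy(0.95, 20), 2))   # ~0.36 for twenty steps

# Token-level version: probability of at least one error in 200 tokens
# at a 1% per-token error rate.
print(round(1 - 0.99 ** 200, 2))                 # ~0.87
```

The exponent, not the per-step rate, is what kills long chains: halving the error rate buys far less than halving the number of unchecked steps.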

02

Hallucinations don't stay local

In traditional single-turn LLM use, a hallucination is a bad sentence. In an agent, it's a bad sentence that becomes the input to the next tool call, which becomes the input to the next decision, which becomes part of the output you ship to a customer. Recent survey work on agent hallucinations calls this "full-chain error propagation": errors that span multiple steps and accumulate and amplify over time, in ways that are fundamentally harder to debug than single-model errors.

Both problems get worse the more the agent is allowed to do on its own.

What the research says

More agents is not the answer either

The obvious reaction to single-agent fragility is to throw more agents at the problem. Specialized agents, debating agents, judge agents, critic agents. The premise is that collective reasoning beats individual reasoning.

The 2026 research does not support this as a general rule.

Google Research's Towards a Science of Scaling Agent Systems tested five agent architectures across four benchmarks and found that adding agents can drive massive gains on parallelizable tasks but often leads to diminishing returns, or performance drops, on sequential workflows. Multi-agent systems improved financial reasoning accuracy by 80.9% over a single-agent baseline, and simultaneously degraded planning performance by 70% on the same class of problems.

A separate information-theoretic analysis published in February 2026 showed that scaling homogeneous agents exhibits strong diminishing returns: accuracy improves at small agent counts, but the marginal gain per additional agent rapidly collapses toward zero. Practitioner-facing writeups of this work put the saturation point at around four agents, after which additional agents contribute little.

And in tool-heavy environments (the exact kind most operations workflows live in), multi-agent systems suffer a 2–6× efficiency penalty compared to single agents once the tool count exceeds ten. The coordination overhead eats the specialization gains.

The takeaway is not "don't use multiple agents." The takeaway is that topology dominates quantity: how agents are organized and what glues them together matters more than how many of them there are. Bags of agents without structure don't just stop improving; they can actively amplify errors. The same body of research identified error-amplification factors as high as 17× in unconstrained multi-agent setups.

The UMM Approach

Decompose first. Specialize second. Glue with deterministic code.

A UMM is not a smarter agent. It's a smaller one, repeated, with rigid connective tissue between instances.

Every UMM engagement starts the same way: we map your workflow into its smallest useful units of work. A submission arriving in an inbox isn't one task. It's receive-the-email → classify-the-document → extract-the-fields → validate-against-schema → route-to-the-right-queue → notify-the-handler. Six jobs, not one. Some require judgment. Most don't.

Then we build a specialized micro-agent for each job that genuinely needs intelligence: classification, extraction with ambiguous layouts, exception handling, decisions where the rules can't be fully written down. Everything else gets deterministic code: validators, schema enforcers, routers, retries, loggers, integration calls.

The result is an architecture where every step has:

  • One job. The agent is prompted, tooled, and evaluated against a single task.
  • Known inputs and outputs. Typed, validated, and logged at every boundary.
  • Isolated failure. If the extractor breaks, the router doesn't. The upstream deterministic validator catches bad output before it propagates.
  • A human checkpoint where it matters. Not after every step (that defeats the automation), but at the decisions your compliance team actually needs to sign off on.

This is the opposite of the autonomous-agent pattern. Our agents don't decide what to do next. They do the thing they were built to do, return a structured result, and let the orchestration layer (plain code) decide what happens next.
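The orchestration pattern can be sketched in a few lines of plain Python. Everything here is illustrative, not Endstation's actual code: `classify_document` and `extract_fields` stand in for model-backed micro-agents, and the rest is deterministic glue you can read and test.

```python
def classify_document(text: str) -> str:
    # Placeholder for a classification micro-agent (a model call in practice).
    return "renewal" if "renewal" in text.lower() else "new_submission"

def extract_fields(text: str) -> dict:
    # Placeholder for an extraction micro-agent.
    return {"policy_id": "A-123", "doc_type": classify_document(text)}

def validate(fields: dict) -> bool:
    # Deterministic schema check: bad agent output stops here,
    # before it can propagate downstream.
    return isinstance(fields.get("policy_id"), str) and "doc_type" in fields

# Routing is a lookup table, not a model decision.
ROUTES = {"renewal": "renewals_queue", "new_submission": "intake_queue"}

def handle_submission(text: str) -> str:
    fields = extract_fields(text)
    if not validate(fields):
        return "human_review_queue"   # isolated failure, escalated with context
    return ROUTES[fields["doc_type"]]

print(handle_submission("Policy renewal attached"))  # renewals_queue
```

Note where the control flow lives: the agents return structured results, and `handle_submission`, ordinary code, decides what happens next.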

Probabilistic where you need judgment.
Deterministic everywhere else.

The central design question in every UMM is the same: which steps need a model, and which steps need a function?

The honest answer is: fewer steps need a model than most vendors will tell you.

Parsing a structured PDF with a known schema? That's code. Validating an NPI number against a registry? Code. Routing a document to the right queue based on classification output? Code. Computing a ratio, checking a date range, enforcing a business rule, retrying a failed API call, writing to an audit log? All code.

What actually needs a language model is the handful of steps where inputs are unstructured or ambiguous and the rules can't be fully enumerated in advance: document classification when the sender format varies, information extraction from mixed-quality scans, decision recommendations where precedent matters more than policy, and exception-handling where something novel has shown up that nobody planned for.

Putting a model at those points, and only those points, is what makes UMMs reliable. The deterministic code around each agent is what makes the whole system auditable. Your compliance team can read it. Your engineering team can test it. Your ops team can monitor it.

When something goes wrong (and things go wrong), the failure is contained to one agent, logged with its exact input, and either retried, escalated to a human, or routed down an alternate path, deterministically.

Anatomy of a UMM

What's inside a single micro-agent

Every micro-agent in a UMM is built to the same spec, regardless of what it does.

A specialized extraction micro-agent running against NYC ACRIS records: one job, one contract, one logged result.

01

A bounded task definition

Written as if you were handing it to a new employee on their first day. No scope creep.

02

A typed input contract

Pydantic schema, JSON schema, or equivalent. Inputs that don't match the contract don't reach the model.

03

A typed output contract

Structured output, validated before it leaves the agent. Agents don't return prose; they return objects.
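A dependency-free sketch of the contract pair. In practice Pydantic models usually play this role; plain dataclasses with explicit checks show the same idea, and every field name here is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionInput:
    document_id: str
    text: str

    def __post_init__(self):
        # Inputs that don't match the contract never reach the model.
        if not self.document_id or not self.text:
            raise ValueError("input contract violated: empty field")

@dataclass(frozen=True)
class ExtractionOutput:
    document_id: str
    fields: dict
    confidence: float

    def __post_init__(self):
        # Outputs that don't match the contract never leave the agent.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("output contract violated: confidence out of range")
```

Construction *is* validation: an object that exists is an object that passed its contract, which is what makes the boundaries between agents safe to log and chain.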

04

An eval harness

Representative inputs with known-good outputs. We run it before deployment and on every change. If eval accuracy drops, the change doesn't ship.
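The harness itself can be very small. A minimal sketch, assuming the agent is any callable honoring the micro-agent's contract and the eval set is a list of (input, expected) pairs; the accuracy floor and cases are illustrative.

```python
def run_evals(agent, cases, floor=0.95):
    """Run the agent over known-good cases; block the change below the floor."""
    passed = sum(1 for inp, expected in cases if agent(inp) == expected)
    accuracy = passed / len(cases)
    return accuracy, accuracy >= floor

# Hypothetical classifier stand-in and a tiny eval set.
cases = [
    ("renewal notice for policy A-123", "renewal"),
    ("new application form attached", "new_submission"),
]
classifier = lambda text: "renewal" if "renewal" in text else "new_submission"

accuracy, ship_it = run_evals(classifier, cases)
print(accuracy, ship_it)  # 1.0 True
```

Wired into CI, the boolean is the gate: a prompt or model change that drops `accuracy` below the floor simply doesn't ship.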

05

A telemetry surface

Token counts, latency, model, prompt version, tool calls, and full input/output logged to a store your team can query.
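Concretely, the record written at each agent boundary looks something like this (field names are illustrative; the point is that every call is a queryable row, not a line in a log file):

```python
import json, time

def telemetry_record(agent, prompt_version, model, inp, out,
                     tokens, latency_ms, tool_calls):
    # One structured record per agent invocation, written to a
    # queryable store (e.g. a warehouse table or log index).
    return {
        "agent": agent,
        "prompt_version": prompt_version,
        "model": model,
        "input": inp,
        "output": out,
        "tokens": tokens,
        "latency_ms": latency_ms,
        "tool_calls": tool_calls,
        "ts": time.time(),
    }

record = telemetry_record("extractor", "v3", "gpt-4o",
                          {"document_id": "d1"}, {"fields": {}},
                          tokens=812, latency_ms=640, tool_calls=[])
print(json.dumps(record, default=str)[:80])
```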

06

A fallback

What happens when the agent returns low-confidence output, when the model is down, or when a tool call fails. Usually: escalate to a human with the full context attached.
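The fallback path is itself deterministic code wrapped around the agent call. A sketch, with an assumed confidence threshold and illustrative names throughout:

```python
CONFIDENCE_FLOOR = 0.8  # assumed threshold; tuned per agent in practice

def escalate(inp, reason, output=None):
    # In production: queue the item for a human with full context attached.
    return {"status": "escalated", "reason": reason,
            "input": inp, "draft": output}

def with_fallback(agent, inp):
    try:
        out = agent(inp)                      # agent returns a structured dict
    except Exception as exc:                  # model down, tool call failed, etc.
        return escalate(inp, reason=f"agent error: {exc}")
    if out["confidence"] < CONFIDENCE_FLOOR:
        return escalate(inp, reason="low confidence", output=out)
    return out

result = with_fallback(lambda i: {"confidence": 0.4, "fields": {}},
                       {"doc": "scan.pdf"})
print(result["status"])  # escalated
```

The human sees the original input, the agent's draft output, and the reason for escalation, never a bare error.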

This is not glamorous work. It's the difference between a demo and a production system.

When a UMM fits

Not every workflow should be a UMM

We've turned down engagements where the honest answer was "your workflow doesn't need AI." A UMM is the right tool when the workflow has most of these properties.

High volume, repetitive structure

The same kind of work, done many times, with variations that matter.

Mixed structured and unstructured inputs

PDFs, emails, portal exports, handwritten forms: the kind of input a pure ETL pipeline can't handle.

Clear decision points

Not "figure out what to do." Specific, nameable decisions: Is this a renewal or a new submission? Does this document contain PHI? Which queue does this belong in?

An auditable outcome requirement

Regulated industry, compliance oversight, or internal QA that will ask "why did the system do this?" and expect a real answer.

Human judgment where it matters

Exception handling, final approval, or decisions with real downside remain with people. The UMM handles the 90% that's mechanical so the humans can focus on the 10% that isn't.

If a workflow is fully deterministic, you don't need AI; you need a script. If it's fully ambiguous and requires end-to-end human judgment, you don't need AI either; you need a person. UMMs live in the middle: the workflows where 80% of the work is pattern-matching and 20% is judgment, and where the current approach is asking people to do all 100% manually.

Discover → Build → Run

From workflow map to running UMM

The UMM framework is the how behind the engagement model we use on every project.

01

Discover

Produces the workflow map and the decomposition: which steps are deterministic, which need a model, and what each micro-agent's contract looks like. Output: a ranked list of UMM candidates with expected ROI.

02

Build

Implements the UMM. Each micro-agent ships with its eval harness, its input/output contracts, its telemetry, and its fallback path. The orchestration layer is deterministic code in a language your team can maintain.

03

Run

Monitors every agent in production. When accuracy drifts, we know before you do. When a new edge case shows up, it gets routed to a human and added to the eval set.

You own the whole thing. Codebase, prompts, evals, infrastructure definitions, and documentation. No runtime dependency on Endstation, no black boxes, no proprietary middleware. If we stop being useful, you fire us and keep running.

Frequently asked

How is a UMM different from a LangGraph / CrewAI / AutoGen setup? +

Those are orchestration libraries: the connective tissue. A UMM is an architectural approach that can be implemented on top of any of them (we most often use a combination of typed Python, LangGraph for stateful flows, and plain function calls where state isn't needed). The distinguishing feature of a UMM isn't the library; it's the discipline of decomposition, typed contracts at every boundary, and deterministic glue code around probabilistic components.

Isn't this just a pipeline with LLMs in it? +

Essentially, yes, and that's the point. The industry spent two years convincing itself that pipelines were outdated and autonomous agents were the future. The research from 2025 and 2026 has mostly walked that back. What we call a UMM is a disciplined, modern pipeline: strongly typed, observable, evaluable, with language models used precisely at the points where they add value and nowhere else.

How many agents are in a typical UMM? +

Most production UMMs we've built use between three and seven specialized micro-agents, with the rest of the workflow handled by deterministic code. The research on diminishing returns past ~4 agents is one reason we keep the count low; the other is that every additional agent is another set of prompts, evals, and failure modes to maintain. Fewer, more focused agents age better.

What if the workflow needs end-to-end reasoning across steps? +

Some workflows genuinely need a reasoning step that sees the whole picture, usually at the decision point, not throughout. We handle this with a dedicated "judge" agent that receives structured outputs from upstream specialists and makes a single bounded decision. The judge doesn't execute tools or take autonomous action; it returns a decision, and deterministic code acts on it. This preserves the isolation properties that make UMMs debuggable.
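The judge pattern described above can be sketched as a single bounded function. The decision set, inputs, and `call_model` stand-in are all hypothetical; the structural point is that the judge returns a value from a closed set and never acts on it.

```python
ALLOWED_DECISIONS = {"approve", "deny", "escalate"}

def judge(classification, extraction, validation_ok, call_model):
    # call_model stands in for one LLM call over the upstream specialists'
    # structured outputs, returning a single decision string.
    if not validation_ok:
        return "escalate"                 # deterministic guard runs first
    decision = call_model(classification, extraction)
    if decision not in ALLOWED_DECISIONS:
        return "escalate"                 # out-of-contract output never ships
    return decision

# Deterministic code acts on the decision; the judge itself never does.
decision = judge("renewal", {"policy_id": "A-123"}, True,
                 lambda c, e: "approve")
print(decision)  # approve
```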

Do UMMs work with PHI and regulated data? +

Yes. All the architectural properties that make UMMs reliable (typed boundaries, full logging, isolated agents, deterministic orchestration) are the same properties compliance teams require. We build on Azure OpenAI with zero-data-retention, HIPAA-trained engineers, and audit logging at every agent boundary. The UMM pattern is specifically useful for regulated workloads because every step in the chain is inspectable.

Can we see one running? +

Most of our production UMMs are under NDA. The closest public artifact is chat.endstation.ai, the workbench we built for our training sessions, which uses the same deterministic-orchestration-around-model-calls pattern internally. It's not a full UMM, but it's the right mental model.

Version 1.2 — April 2026