Partech Systems — Agentic AI in production: the demo lies

The demo is always beautiful. Someone types a sentence, the agent thinks out loud, calls three tools, books the flight, files the expense, sends a polite Slack message, and the room claps. I've sat in that room. I've also been the one woken up six weeks later when the same agent quietly double-booked a customer's order because step four returned an empty string and nobody had decided what an empty string meant.

That gap — between the demo and the thing you can actually run for paying customers — is where most of the engineering lives, and it's where the marketing has nothing to say.

Start with the maths, because the maths is unforgiving. Say each step in your agent's chain is 95% reliable and the failures are independent. Sounds fine. Now chain ten of them. 0.95 to the tenth power is about 0.60. Your beautiful ten-step workflow succeeds three times in five. By step fourteen you're flipping a coin, and at twenty steps you're down to about 0.36, closer to one in three. Errors don't add, they compound, and an agent that plans its own steps will happily generate more steps than you expected. The single most common failure I see isn't a model "hallucinating"; it's a perfectly reasonable chain that was simply too long to succeed end to end.
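If you want to run the numbers for your own chain, here is a minimal sketch. It assumes every step has the same reliability and that failures are independent, which real agents rarely honour exactly; the function names are mine, purely for illustration.

```python
import math


def chain_success(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps


def max_steps_for(target: float, step_reliability: float) -> int:
    """Longest chain whose end-to-end success stays at or above `target`."""
    return math.floor(math.log(target) / math.log(step_reliability))


for n in (1, 5, 10, 20):
    print(f"{n:>2} steps: {chain_success(0.95, n):.2f}")

print("longest chain that beats a coin flip:", max_steps_for(0.5, 0.95))
```

At 95% per step that gives roughly 0.60 for ten steps and 0.36 for twenty, and the longest chain that still beats a coin flip is thirteen steps.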

So the first instinct people reach for is exactly the wrong one. "Give the agent more autonomy. Let it figure it out." No. More autonomy means longer chains, more branching, more places for a 95% step to bite you, and far less ability to reason about what the thing will do before it does it. The teams shipping agents that actually hold up in production are doing the opposite. They're shortening the leash. Fewer steps. Narrower tool sets. Explicit decision points where a human, or at least a deterministic check, gets to say yes before anything irreversible happens.
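One way to shorten the leash, sketched under assumptions: the agent hands you a plan as a list of (tool name, arguments) pairs, and you refuse the whole plan before anything runs if it is too long or reaches for a tool that isn't on the list. `run_bounded` and both exception names are hypothetical, not from any framework.

```python
from typing import Any, Callable


class StepBudgetExceeded(Exception):
    pass


class ToolNotAllowed(Exception):
    pass


def run_bounded(
    plan: list[tuple[str, dict]],
    tools: dict[str, Callable[[dict], Any]],
    max_steps: int,
) -> list[Any]:
    """Run a plan only if it fits the step budget and the tool allowlist."""
    # Check everything up front so a rejected plan has no side effects.
    if len(plan) > max_steps:
        raise StepBudgetExceeded(f"plan has {len(plan)} steps, budget is {max_steps}")
    for tool_name, _ in plan:
        if tool_name not in tools:
            raise ToolNotAllowed(f"tool {tool_name!r} is not in the allowlist")
    return [tools[tool_name](args) for tool_name, args in plan]
```

The point of validating before executing is that you can reason about what the agent is about to do while saying no is still free.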

That's the bit that earns its keep: guardrails, approval gates, idempotency, rollback. None of it is glamorous and none of it shows up in a launch video.

Approval gates first. Anything that spends money, sends an external message, deletes data, or touches a regulated record should pause and ask. Not because the model is stupid, but because the cost of being wrong is asymmetric. A wrong summary costs you a frown. A wrong refund costs you actual pounds and a furious customer. Gate by consequence, not by confidence: the confidence scores these models report are not reliably calibrated, and you should stop pretending they are.
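A minimal sketch of gating by consequence, assuming each tool is labelled by hand as reversible or irreversible and that `approve` is whatever asks a human or a deterministic check. All names here (`Tool`, `Consequence`, `call_tool`, `ApprovalRequired`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Consequence(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


class ApprovalRequired(Exception):
    pass


@dataclass(frozen=True)
class Tool:
    name: str
    consequence: Consequence  # set by an engineer, never by the model
    run: Callable[[dict], str]


def call_tool(tool: Tool, args: dict, approve: Callable[[str, dict], bool]) -> str:
    """Run a tool, pausing for approval when the action cannot be undone."""
    if tool.consequence is Consequence.IRREVERSIBLE and not approve(tool.name, args):
        raise ApprovalRequired(f"{tool.name} was not approved")
    return tool.run(args)
```

Note what is missing: the model's confidence never enters the decision. The gate depends only on what the tool does.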

Idempotency is the one people forget until it hurts. Agents retry. Frameworks retry. Networks drop and the orchestrator quietly runs your step again. If "create order" runs twice you've created two orders. Every tool an agent can call needs an idempotency key or a dedupe check, the same discipline you'd apply to any distributed system, because that's what this is. An agent is a distributed system with a language model as a very confident, occasionally drunk, scheduler.
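Here is roughly what that looks like for "create order", as a sketch. It assumes the key is derived from a run ID and step number, so a retry of the same step produces the same key, and it uses an in-memory dict where a real service would need a durable store with a uniqueness constraint. `OrderService` and `step_key` are illustrative names.

```python
import hashlib


def step_key(run_id: str, step: int) -> str:
    """Stable key for one step of one run; a retry yields the same key."""
    return hashlib.sha256(f"{run_id}:{step}".encode()).hexdigest()


class OrderService:
    def __init__(self) -> None:
        self.orders: list[dict] = []
        self._seen: dict[str, str] = {}  # idempotency key -> order id

    def create_order(self, idempotency_key: str, payload: dict) -> str:
        # A repeated key returns the original result instead of acting again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        order_id = f"order-{len(self.orders) + 1}"
        self.orders.append({"id": order_id, **payload})
        self._seen[idempotency_key] = order_id
        return order_id
```

Call `create_order` twice with the same key and you get the same order ID back and one order in the list, which is the whole point.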

Rollback follows from that. When step seven fails, what undoes steps one through six? If the answer is "nothing, we hope it doesn't happen," you don't have a production system, you have a prototype with good PR. Saga patterns, compensating actions, transactional outboxes — the same toolkit we used for payment systems twenty years ago. The model is new. The failure modes are not.
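The core of a saga fits in a few lines. This sketch assumes each step comes paired with a compensating action and that the compensations themselves succeed, which in production you cannot take for granted; `run_saga` is a hypothetical name.

```python
from typing import Callable

# (step name, action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> None:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: list[Callable[[], None]] = []
    for _name, action, compensate in steps:
        try:
            action()
        except Exception:
            # The failed step is assumed to have had no effect, so only
            # the steps that finished are compensated, newest first.
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensate)
```

If the third step of three raises, the second step's compensation runs, then the first's, and the original exception still reaches the caller so the failure isn't hidden.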

Then there's tool use, which is where the cracks usually show first. The model calls a tool with a malformed argument. The tool returns an error the model has never seen and the model, being a pattern matcher, invents a plausible recovery that's completely wrong. Or it calls the right tool with stale data because it forgot what it retrieved four steps ago. Tools need strict schemas, validation at the boundary, and error messages written for the model to read — terse, structured, telling it exactly what to do next. Treat your tool layer like a public API being hit by an enthusiastic intern, because functionally it is.
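As a sketch of validation at the boundary, assuming a deliberately tiny schema format (field name to Python type) rather than a real schema library: check the arguments before the tool runs, and hand back an error the model can act on. The schema, field names, and the `next_action` wording are all made up for illustration.

```python
import json

REFUND_SCHEMA = {"order_id": str, "amount_pence": int}


def validate_args(schema: dict[str, type], args: dict) -> list[dict]:
    """Return a list of problems; an empty list means the args are valid."""
    problems = []
    for field, expected in schema.items():
        if field not in args:
            problems.append({"field": field, "problem": "missing"})
        elif type(args[field]) is not expected:  # exact match, so True is not an int
            got = type(args[field]).__name__
            problems.append(
                {"field": field, "problem": f"expected {expected.__name__}, got {got}"}
            )
    for field in args:
        if field not in schema:
            problems.append({"field": field, "problem": "unknown field"})
    return problems


def tool_error(problems: list[dict]) -> str:
    """Terse, structured error text meant for the model to read."""
    return json.dumps({
        "ok": False,
        "errors": problems,
        "next_action": "correct the listed fields and retry; do not guess values",
    })
```

Passing `"500"` instead of `500` for `amount_pence` comes back as a single named problem plus an instruction, rather than a stack trace the model will improvise around.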

And none of this is manageable without observability and evals, which I'll be blunt about: they are not optional, and if your team treats them as a phase-two nicety you will be debugging in production with print statements and prayer. You need to trace every step — inputs, outputs, tool calls, latencies, the lot — because when something goes wrong at step twelve of a forty-step run, "the agent messed up" is not a diagnosis. And you need evals that run a representative set of real tasks on every change, so you know whether last night's prompt tweak fixed one thing and broke nine others. Models drift, prompts rot, a provider updates something on their end and your success rate quietly drops four points. Without evals you find out from the customer.
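The bare minimum for both fits on one screen. This is a sketch that assumes an in-memory trace list and evals expressed as (task, check) pairs; a real setup would ship traces to a proper backend and run against a much larger task set. `traced`, `run_evals`, and `TRACE` are hypothetical names.

```python
import time
from typing import Any, Callable

TRACE: list[dict] = []


def traced(step: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn and record its inputs, output or error, and latency."""
    record: dict = {"step": step, "args": args, "kwargs": kwargs}
    start = time.perf_counter()
    try:
        record["output"] = fn(*args, **kwargs)
        record["ok"] = True
        return record["output"]
    except Exception as exc:
        record["ok"] = False
        record["error"] = repr(exc)
        raise
    finally:
        # Runs on success and on failure, so every call leaves a record.
        record["latency_s"] = time.perf_counter() - start
        TRACE.append(record)


def run_evals(
    agent: Callable[[str], Any],
    cases: list[tuple[str, Callable[[Any], bool]]],
) -> float:
    """Fraction of cases the agent passes; a crash counts as a failure."""
    passed = 0
    for task, check in cases:
        try:
            passed += bool(check(agent(task)))
        except Exception:
            pass
    return passed / len(cases)
```

Run `run_evals` on every prompt or model change and compare the number with the last run. A four-point drop then shows up in CI rather than in a support ticket.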

So where do agents genuinely pull their weight today? In bounded, well-instrumented loops where the cost of error is low or fully reversible. Drafting, triage, code that a human reviews, research with citations you can check, classification, routing. Short chains, tight tools, a human in or near the loop. That's not a consolation prize — that's a large and genuinely valuable surface, and we ship it.

Where are they still a liability? Long autonomous chains touching money, irreversible external actions with no gate, anything where you can't explain after the fact why it did what it did. Putting an unsupervised agent on those is not innovation, it's a future incident with a launch date.

My honest read after building these for real customers: the technology is real and the value is real. But the engineering discipline around it lags the marketing by about two years. The vendors are selling you 2028's autonomy on 2026's reliability. Build for the reliability you actually have. Short chains, hard gates, idempotent tools, traces on everything, evals on every change. Do that and agents earn their place. Skip it and you're just automating your incidents.
