The shiny demo isn't the hard part. The hard part is making the agentic system reliable enough to actually trust with real work, every day, for months on end. That reliability layer is where most teams fall down — not because they don't know it matters, but because the work is unglamorous and easy to defer.
This is a list of what's actually required, in our experience, to move agentic systems from prototype to production. It's not an exhaustive checklist. It's the set of things I'd want any team to have in place before they let an agent loose on real customer or operational work.
Observability — every agent decision logged
The first thing most teams underbuild. Production agentic systems make a lot of decisions per task: which tool to call, what arguments to pass, how to interpret intermediate results, when to escalate, when to retry. If the system fails — and it will, at some point — you need to be able to reconstruct what happened.
Concretely: every model call should log the inputs, the prompt, the tools available, the tool that was called (if any), the arguments, the response, and the next state. Every agent transition should log the trigger and the outcome. Every escalation to a human should log why.
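What that record can look like in practice: a minimal sketch, assuming a flat JSON-lines log and hypothetical field names; your logging backend and schema will differ.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

@dataclass
class AgentStepRecord:
    """One record per model call or agent transition. Field names are illustrative."""
    task_id: str
    prompt: str
    tools_available: list[str]
    tool_called: Optional[str]
    tool_arguments: Optional[dict[str, Any]]
    response: str
    next_state: str
    escalation_reason: Optional[str] = None   # set only when handing off to a human
    step_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

def log_agent_step(record: AgentStepRecord) -> None:
    # Swap the print for whatever structured logger or event bus you already run;
    # JSON lines keep the trail greppable and replayable.
    print(json.dumps(asdict(record), default=str))
```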
This isn't a nice-to-have. When the system produces a bad output for a specific customer, you need to be able to trace back exactly what the system was thinking, what data it had access to, and where the reasoning went wrong. Without observability, debugging is guesswork. With it, debugging is mechanical.
Most teams skip this in the prototype phase because the prototype is small and the failures are visible. By the time the system is at production volume, retroactively adding observability is expensive and incomplete.
Logging and audit trail — separate from observability
Observability is for engineers debugging the system. The audit trail is for the business — for compliance, for customer disputes, for regulatory questions, for recovery from incidents.
The audit trail should contain enough information that, if a customer queries an output the agent produced, you can show them exactly what data was used, what prompts were run, what version of the system produced the output, and what human (if any) reviewed it.
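One way to make that concrete is a write-once audit record per customer-facing output, kept apart from the engineering logs. The schema below is a hypothetical sketch, not a standard; the point is that each field answers one of the questions above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AuditRecord:
    """Business-facing, write-once record for a single customer-visible output."""
    output_id: str                      # the artefact the customer is asking about
    task_id: str                        # link back to the engineering observability trail
    data_sources: tuple[str, ...]       # exactly what data was used
    prompt_versions: tuple[str, ...]    # which prompts, at which versions, were run
    system_version: str                 # the version of the pipeline that produced the output
    reviewed_by: Optional[str]          # the human reviewer, if any
    produced_at: str                    # ISO-8601 timestamp
```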
This matters more in some industries than others. In ours — running audit reports for clients, producing strategic recommendations — the audit trail is genuinely important. We've had customers ask, six months after the fact, what data went into a specific finding. The audit trail let us answer cleanly.
In agencies and services businesses, the audit trail is also a defence against the implicit liability of replacing senior humans with agentic systems. If something goes wrong, the audit trail demonstrates due care.
Error handling and retry logic — defined, bounded, reviewed
Agents fail in a variety of ways. The model can return malformed output. A tool call can fail. An external API can rate-limit. A request can time out. The agent can produce a response that the system can't parse.
Each of these needs a defined behaviour. The default behaviour most prototypes ship with — retry until something works — is fine in a demo and dangerous in production. Aggressive retry can run up significant cost. It can also turn a temporary issue into a sustained failure pattern that's harder to debug than a clean failure would have been.
The retry logic in production agentic systems should be (a sketch that pulls all four properties together follows the list):
Bounded. A maximum number of retries per agent call, per workflow stage, per task. Beyond the bound, the task halts and escalates rather than continuing to retry.
Backed off. Exponential backoff between retries, particularly for external API calls. Don't hammer a rate-limited API. Don't fire 50 model calls in 30 seconds because the first call failed.
Distinguished by error type. Retry behaviour should depend on the error. A rate limit deserves backoff. A malformed response from the model deserves a single retry with a slightly different prompt. A tool call that returns a logical error deserves a different recovery path entirely.
Audited. When retries happen, log them. A workflow that's silently retrying ten times per call is fine until it isn't. You want to know about the retry rate before it surprises you.
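Here is what that can look like as a single wrapper. A minimal sketch: the exception classes and the call signature are hypothetical stand-ins for whatever your model and tool clients actually raise, and the recovery paths will be richer in practice.

```python
import logging
import random
import time

logger = logging.getLogger("agent.retry")

# Hypothetical stand-ins for whatever your model/tool clients actually raise.
class RateLimitError(Exception): ...
class MalformedResponseError(Exception): ...
class ToolLogicError(Exception): ...

class EscalateToHuman(Exception):
    """Raised when retries are exhausted; caught by the escalation path."""

def call_with_retries(call, *, max_attempts: int = 3, base_delay: float = 2.0):
    """Bounded, backed-off, error-aware, audited retries around a single agent call."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call(attempt)  # the caller can vary the prompt based on the attempt number
        except RateLimitError:
            # Rate limits get exponential backoff with jitter; never hammer the API.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            logger.warning("rate limited (attempt %d/%d), backing off %.1fs",
                           attempt, max_attempts, delay)
            time.sleep(delay)
        except MalformedResponseError:
            # A malformed model response gets one retry, ideally with a nudged prompt.
            logger.warning("malformed response (attempt %d/%d)", attempt, max_attempts)
            if attempt >= 2:
                break
        except ToolLogicError:
            # A logical error from a tool is not a transient fault; don't retry blindly.
            logger.error("tool returned a logical error; halting for recovery")
            break
    raise EscalateToHuman("retry budget exhausted; halting and escalating")
```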
Escalation paths — explicit, not chat-based
When the agent can't proceed, what happens? Most prototype systems handle this by, effectively, failing silently or producing a half-finished output. Production systems need an explicit escalation path.
Escalation should be structured. The agent doesn't just stop and wait for a human. The agent flags what it can't do, why, what context it has gathered, and what the human needs to provide to unblock the workflow. The human gets a specific question with specific context, not a generic "this failed".
In our pipelines, escalation comes via a dedicated review queue with structured tickets. Each ticket has the agent's state at the point of escalation, the question that needs answering, and the context the human needs to make the decision. The human responds with a structured input. The agent picks up where it left off.
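A sketch of what such a ticket can carry. The field names are hypothetical; the essential property is that the human gets a specific question with its context attached, and the agent's state survives the round trip.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EscalationTicket:
    """What lands in the review queue when an agent can't proceed."""
    task_id: str
    blocked_stage: str                   # where in the workflow the agent stopped
    question_for_human: str              # the specific decision or input needed
    agent_state: dict[str, Any]          # state at the point of escalation, so work can resume
    context: dict[str, Any] = field(default_factory=dict)   # what the agent has gathered so far

@dataclass
class HumanResponse:
    """Structured answer the agent can act on, not a free-text chat reply."""
    task_id: str
    decision: str
    notes: str = ""

def resume_workflow(ticket: EscalationTicket, response: HumanResponse) -> dict[str, Any]:
    # The agent picks up where it left off: its saved state plus the human's input.
    return {**ticket.agent_state,
            "human_decision": response.decision,
            "human_notes": response.notes}
```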
This is more work to build than "send an email when something fails". It's worth the work. The alternative — humans having to dig through logs to understand what an agent was trying to do — eats more time than the original automation saved.
Quality assurance loops — agents checking agents
For production agentic work that goes to customers or affects business decisions, the output needs verification. Verification by humans is expensive and slow. Verification by other agents is fast and scalable, and surprisingly good if designed carefully.
Our pipelines have a QA agent that runs after the workflow agents have produced their output. The QA agent is given the original brief, the workflow's intermediate state, and the final output, and asked to identify errors, inconsistencies, hallucinations, and quality issues. It produces a structured report.
A correction agent then reads the QA report and addresses each item. Then the QA agent runs again. The loop runs until the QA agent has no more issues to flag, or until a maximum iteration count is reached.
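The loop itself is small once the two agents exist. A minimal sketch, assuming hypothetical run_qa_agent and run_correction_agent callables that wrap differently-prompted (and possibly different) models.

```python
def qa_correction_loop(brief, intermediate_state, output,
                       run_qa_agent, run_correction_agent, max_iterations: int = 3):
    """Alternate QA and correction until QA has nothing to flag or the cap is hit."""
    issues: list = []
    for iteration in range(1, max_iterations + 1):
        issues = run_qa_agent(brief=brief, state=intermediate_state, output=output)
        if not issues:
            return output, issues, iteration          # clean: nothing left to flag
        output = run_correction_agent(output=output, issues=issues)
    # Cap reached with issues still open: route to human review rather than loop forever.
    return output, issues, max_iterations
```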
This catches a meaningful proportion of the errors that would otherwise have shipped. It's not perfect. Agents checking agents will sometimes miss things, will sometimes flag things that aren't actually wrong, and will occasionally produce loops where two agents disagree about a finding. But the proportion of errors caught is much higher than what either humans-only or no-QA versions catch.
The QA agent shouldn't be the same model or the same setup as the workflow agents. Different prompts, different framing, sometimes different models. Diversity of perspective catches things that one perspective wouldn't.
Cost monitoring — real-time, with circuit breakers
Agentic systems can run up significant cost quickly if something goes wrong. A bad retry pattern, an unexpected loop, an aggressive prompt that consumes more tokens than budgeted — any of these can produce a single task that costs 5-10x the average.
Production systems need real-time cost monitoring with explicit budgets per task and per workflow. Beyond the budget, the system halts. The halt is loud — the operator knows immediately, not at the end of the month when the bill arrives.
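A per-task circuit breaker can be this small. The ceiling and the alert hook are placeholders; in practice it sits wherever you already meter model calls, and the alert goes to whatever pages the operator.

```python
class BudgetExceeded(Exception):
    """Raised when a task crosses its cost ceiling; the workflow halts immediately."""

class CostCircuitBreaker:
    def __init__(self, task_id: str, ceiling_usd: float, alert):
        self.task_id = task_id
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0
        self.alert = alert                 # e.g. a pager or chat hook; loud, not a monthly bill

    def record(self, call_cost_usd: float) -> None:
        """Called after every model or tool call with that call's cost."""
        self.spent_usd += call_cost_usd
        if self.spent_usd > self.ceiling_usd:
            self.alert(f"task {self.task_id} exceeded its ${self.ceiling_usd:.2f} ceiling "
                       f"(spent ${self.spent_usd:.2f}); halting")
            raise BudgetExceeded(self.task_id)
```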
We had an early version that didn't have this. One audit ran 30x the average cost because of a particular interaction pattern we hadn't anticipated. We caught it the next day, looking at metrics. The current version has per-audit cost ceilings and would have halted that audit before it ran up the unexpected spend.
Human-in-the-loop interfaces — designed, not bolted on
If humans are reviewing agent output, the interface they review through matters. Most teams use whatever default exists — usually a Slack ping, an email, or a web dashboard that shows the raw output.
The interfaces that work best in our experience are purpose-built. The reviewer sees the agent's output formatted appropriately for the work, with the agent's reasoning and uncertainty surfaced where relevant, with one-click ability to approve, reject, or escalate further. The interface should make the human's job as fast as possible, because the human is the bottleneck once the agent's work is done.
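Under the hood, that one-click decision can be a small, structured payload rather than a free-text reply. A sketch with hypothetical names:

```python
from dataclasses import dataclass
from enum import Enum

class ReviewAction(Enum):
    APPROVE = "approve"
    REJECT = "reject"           # send back to the workflow with a reason
    ESCALATE = "escalate"       # push further up rather than back to the agent

@dataclass
class ReviewDecision:
    output_id: str
    reviewer: str
    action: ReviewAction
    reason: str = ""            # required by the UI for reject/escalate, optional here
```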
A good rule of thumb: if your reviewer is opening multiple tabs or copying things between systems to do the review, the interface needs more work.
The cultural piece
A note worth making, because it's harder than the technical pieces.
Production agentic systems require a different operational discipline than most teams have built. The work is more like running a service than running a project. There's monitoring, alerting, on-call, incident response, post-mortems. The team that built the system has to stay engaged with it in production.
Most marketing or operational teams haven't operated like this. The shift from "we built a thing and shipped it" to "we operate a service that runs every day" is a cultural transition. Some teams make it well. Some don't. The teams that don't make the transition end up with systems that work for two months and then degrade — exactly the chatbot-shaped failure pattern we've written about — not because the technology failed, but because no one was watching it.
If you're moving agentic work to production, this cultural piece deserves explicit attention. The right operating model is closer to a small SRE team than to a typical project team. The team needs the time and the mandate to actually operate the systems, not just to build them.
Closing
None of the above is exotic. It's mostly engineering hygiene applied to a new class of system. The reason it's worth being explicit about is that the gap between a prototype that works and a production system that's reliable is, in our experience, ten times more work than most teams expect.
The teams that capture the leverage of agentic systems are the teams that make the investment in this layer. The teams that don't, build prototypes that demo well and never become operationally trustworthy. The technology is the same. The discipline isn't.
If you're building an agentic system right now and you've got it working in dev, the question I'd push you on isn't "is the agent doing the right thing". It's "what would happen if this ran 1,000 times overnight on real work". If the answer involves a lot of hand-waving, the production layer needs more attention before you let it loose.