I've been writing in fairly abstract terms about what an agentic system is and why it's structurally different from a chatbot. This post is the concrete version of the argument, walking through one specific agentic system we built — amivisible.co, our AI-search-visibility audit pipeline — in enough detail to be useful to anyone considering shipping something similar.
The end deliverable is a long-form audit report, typically 40 to 60 pages, covering a brand's visibility across ChatGPT, Perplexity, Google AI Overviews, and Copilot, with competitor comparison, share-of-voice analysis, and prioritised recommendations.
The workflow before the agent
Worth describing the manual version first, because the leverage of the agentic version only makes sense in contrast.
The original workflow involved three people for two to three weeks.
A senior strategist scoped the engagement, defined the prompt set (typically 30-60 prompts spanning the buyer journey), identified the competitor set, and set up the brand-mention classification taxonomy.
A research analyst ran the prompts against the AI engines, captured the raw responses, classified each response (was the brand mentioned, was a competitor mentioned, was a source cited), and built the data exhibits.
A senior strategist drafted the report, with editorial input from another senior. The report went through two rounds of internal QA before going to the client.
Total cost: roughly £4-5k of internal time per audit. Total elapsed time: 15-20 working days. The output was good. The unit economics, at the price we wanted to charge, didn't really work. We were essentially trading senior time for client deliverables on margins that weren't compelling.
That's the pre-agent state. Familiar to anyone who's run a knowledge-work consulting line.
The architecture of the agentic version
The agentic version takes the same workflow and reorganises it around a stage pipeline with autonomous handoffs. Five stages, each owned by a specific agent or agent cluster, with a quality-assurance layer that checks the work between stages.
Stage 1 — Scoping. Input: a target brand and its industry context. Output: a structured prompt set, a competitor list, a brand-mention taxonomy. The agent here uses a large model with a long context window to ingest information about the brand (from its website, public materials, and a small structured questionnaire we capture from the customer at signup), and produces the artefacts that the rest of the pipeline operates on. This stage is the closest in the pipeline to "use a chatbot well" — a single agent producing structured output from context.
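To make that concrete, here is a minimal sketch of what the Stage 1 artefacts might look like as plain data structures. The field names are illustrative, not our production schema.

```python
from dataclasses import dataclass, field

@dataclass
class Prompt:
    text: str
    journey_stage: str   # e.g. "awareness", "comparison", "purchase"

@dataclass
class ScopingOutput:
    brand: str
    prompts: list[Prompt] = field(default_factory=list)    # typically 30-60
    competitors: list[str] = field(default_factory=list)
    # maps each mention label to the guidance the downstream classifier receives
    mention_taxonomy: dict[str, str] = field(default_factory=dict)
```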
Stage 2 — Prompt execution. Input: the prompt set and competitor list from Stage 1. Output: raw responses from each AI engine for each prompt, with metadata. The agent here is essentially an orchestration layer that fires each prompt against multiple engine APIs in parallel, captures the responses, retries on rate limits and errors, and stores the raw results in a structured form. Less "agent" in the cognitive sense, more "agent" in the workflow sense — it owns the multi-step process end-to-end.
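A stripped-down sketch of that orchestration pattern, assuming a placeholder engine client called query_engine rather than any real SDK: fire everything concurrently, back off on failures, give up after a few attempts.

```python
import asyncio
import random

async def query_engine(engine: str, prompt: str) -> dict:
    # placeholder for the real API call; raises on rate limits and transient errors
    raise NotImplementedError

async def run_with_retries(engine: str, prompt: str, attempts: int = 3) -> dict:
    for attempt in range(attempts):
        try:
            return await query_engine(engine, prompt)
        except Exception:
            if attempt == attempts - 1:
                raise
            # exponential backoff with jitter before retrying
            await asyncio.sleep(2 ** attempt + random.random())

async def execute_prompts(engines: list[str], prompts: list[str]) -> list:
    tasks = [run_with_retries(e, p) for e in engines for p in prompts]
    # return_exceptions=True so one failed prompt doesn't sink the whole batch
    return await asyncio.gather(*tasks, return_exceptions=True)
```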
Stage 3 — Classification. Input: the raw response set from Stage 2. Output: a classified dataset where each response is tagged for brand mention, competitor mentions, sentiment, citation behaviour, and answer adequacy. This is where the largest cluster of agent work happens. We initially tried doing this with a single agent processing all responses. The result was good but inconsistent — the agent's classification of mention sentiment, in particular, drifted across long sessions. We split the work into a per-response classification agent (stateless, processes one response at a time) plus a separate consistency-checker agent that reviews the classifications across the dataset for outliers.
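The shape of that split, sketched in Python. classify_one and check_consistency stand in for the two model calls; the structural point is that the classifier carries no session state from one response to the next.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    response_id: str
    brand_mentioned: bool
    competitors_mentioned: list[str]
    sentiment: str            # "positive" | "neutral" | "negative"
    sources_cited: list[str]

def classify_one(response: dict) -> Classification:
    # one model call per response, no shared session, so the criteria can't drift
    ...

def check_consistency(classified: list[Classification]) -> list[str]:
    # a second agent reviews the whole dataset and returns the response_ids
    # whose labels look like outliers relative to similar responses
    ...

def classify_dataset(responses: list[dict]) -> tuple[list[Classification], list[str]]:
    classified = [classify_one(r) for r in responses]
    flagged = check_consistency(classified)   # outliers get a second look before Stage 4
    return classified, flagged
```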
Stage 4 — Analysis and exhibit generation. Input: the classified dataset. Output: share-of-voice metrics, competitor comparison data, citation-source analysis, and the data exhibits (charts, tables) that go into the report. Mostly deterministic data work with a model-based summary layer on top.
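Most of Stage 4 is ordinary, deterministic data work. Share of voice, for instance, is just counting; a minimal version over hypothetical classified records looks like this.

```python
from collections import Counter

def share_of_voice(classified: list[dict], brands: list[str]) -> dict[str, float]:
    # fraction of responses in which each brand (ours or a competitor) appears
    counts = Counter()
    for record in classified:
        for brand in brands:
            if brand in record.get("mentions", []):
                counts[brand] += 1
    total = len(classified) or 1
    return {brand: counts[brand] / total for brand in brands}

rows = [
    {"mentions": ["Acme"]},
    {"mentions": ["Acme", "Rival"]},
    {"mentions": []},
]
print(share_of_voice(rows, ["Acme", "Rival"]))   # Acme: 2/3, Rival: 1/3
```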
Stage 5 — Report drafting. Input: the analysis from Stage 4 plus the original scoping context from Stage 1. Output: a complete draft of the report, formatted as PDF, with all sections written. Multiple agent calls running in parallel on different sections of the report, with a shared style guide and a shared reference to the analysis from Stage 4.
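The drafting fan-out looks much like Stage 2's orchestration, except every call is handed the same style guide and the same Stage 4 analysis. A sketch, with draft_section standing in for the model call and illustrative section names:

```python
import asyncio

SECTIONS = ["executive summary", "methodology", "share of voice",
            "competitor comparison", "recommendations"]

async def draft_section(name: str, analysis: dict, style_guide: str) -> str:
    # placeholder for one model call; every section sees the same shared context
    raise NotImplementedError

async def draft_report(analysis: dict, style_guide: str) -> dict[str, str]:
    drafts = await asyncio.gather(
        *(draft_section(name, analysis, style_guide) for name in SECTIONS))
    return dict(zip(SECTIONS, drafts))
```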
A separate QA agent runs after Stage 5 — reads the entire draft, checks for inconsistencies between sections, verifies data references, flags hallucinations, and produces a list of items that need correction. The QA agent's output is fed back to a correction agent that addresses each flagged item.
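The QA-then-correct cycle is a small loop. Both calls below are placeholders for the agents described above; the only structural point worth showing is the round cap, so a disagreement between the two agents can't spin forever.

```python
def qa_review(draft: str, analysis: dict) -> list[str]:
    # returns flagged items: cross-section inconsistencies, bad data references,
    # citations that don't appear in the underlying engine responses
    ...

def apply_corrections(draft: str, issues: list[str]) -> str:
    # rewrites only the flagged passages, leaving the rest of the draft untouched
    ...

def qa_loop(draft: str, analysis: dict, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        issues = qa_review(draft, analysis)
        if not issues:
            break
        draft = apply_corrections(draft, issues)
    return draft
```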
A human reviews the final output before it's sent to the client. The human's job at this point is sense-checking the strategic recommendations, not catching errors in the data — the QA agents have caught most of them.
The cost difference
End-to-end cost, in agent compute and API spend, is roughly £15-25 for a Pro-tier audit. Some Rapid-tier audits come in under £5.
The manual version cost us £4-5k of internal time per audit. The agentic version costs about 0.5% of that in pure run cost, plus the human review time at the end, which is typically under an hour.
The point of the cost comparison isn't to brag. It's to make the leverage point concrete. The leverage isn't 2x or 5x. It's two orders of magnitude. That's what changes when you move from chatbot-assisted manual work to genuine end-to-end agent ownership.
The corollary: the cost reduction is so dramatic that the unit economics of the product change entirely. We can sell at a price point where the audit is genuinely accessible to brands that wouldn't have engaged a manual version, and the margin is healthy enough that the product is the engine of a SaaS line, not a consulting line.
The failure modes we saw in production
The agentic version isn't just the manual workflow with humans replaced. The failure modes are different and deserve naming.
Hallucinated source citations. Early versions of the report-drafting agents would, occasionally, cite specific sources or quotes that hadn't appeared in the AI engines' responses. The agent was inventing references that fitted the narrative. Caught by the QA agent, but only after we built the QA agent specifically to check for this. Rule: production agentic systems doing factual work need a verification layer. Not optional.
Drift in long classification runs. As mentioned above. A single agent classifying 600 responses in a single session would, by the end, classify slightly differently from how it classified at the beginning. The drift wasn't catastrophic, but it was enough to corrupt the share-of-voice numbers. Solved by per-response classification calls instead of session-based ones.
Cascade failures in pipeline stages. Early versions would, occasionally, produce a perfectly formed but wrong artefact at Stage 1, and the rest of the pipeline would happily proceed on the wrong inputs. The audit got produced. It was internally consistent. It was for the wrong target brand. Solved by Stage 1 verification — explicit human (or agent) confirmation that the scoping output matches the customer's intent before Stage 2 runs.
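The gate itself can be almost trivially simple. A sketch of the kind of check that now has to pass before Stage 2 is allowed to run, with illustrative field names:

```python
def confirm_scoping(scoping: dict, signup: dict) -> None:
    # refuse to start Stage 2 if the scoping output doesn't match the customer's intent
    if scoping["brand"].strip().lower() != signup["brand_name"].strip().lower():
        raise ValueError(
            f"Scoping is for {scoping['brand']!r}, signup is for {signup['brand_name']!r}")
    if not scoping.get("prompts") or not scoping.get("competitors"):
        raise ValueError("Scoping output is incomplete; halting before Stage 2")
```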
Cost runaway on retries. Early versions had aggressive retry logic. Combined with the parallel orchestration, one bad retry pattern could consume 5-10x the expected token budget for a single audit. Solved by per-stage budget limits and circuit breakers.
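The budget limit is the simplest piece of the whole system and one of the more valuable ones. Something along these lines, with every model call in a stage booking its token spend against a hard cap:

```python
class BudgetExceeded(RuntimeError):
    pass

class StageBudget:
    def __init__(self, stage: str, max_tokens: int):
        self.stage = stage
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            # circuit breaker: abort the stage rather than retry into a runaway bill
            raise BudgetExceeded(
                f"{self.stage}: {self.used} tokens used, cap is {self.max_tokens}")

budget = StageBudget("classification", max_tokens=500_000)
budget.record(12_000)       # fine
# budget.record(600_000)    # would raise BudgetExceeded and halt the stage
```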
None of these are dealbreakers. All of them are the kind of operational concern that doesn't show up in chatbot demos and becomes very real at production volume. They're covered in more detail in a separate post on what you need before shipping an agentic system to production.
What we'd build differently next time
A few things, in honest retrospect.
Start with the QA agent, not the workflow agents. Building the QA layer first would have caught the failure modes in earlier iterations and saved us several rounds of debugging. Production agentic systems need verification before they need new capabilities.
Be more explicit about state. Our first version had implicit state — what each stage was doing was inferred from the data shape, not declared. When something went wrong, debugging required tracing the data through several stages to figure out where the error was introduced. The current version has explicit state declarations at each stage. Much easier to reason about.
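In practice, "explicit state" means something as mundane as this: every stage writes down what it is doing, what it consumed, and what it produced, rather than leaving that to be inferred from the data shape. The names below are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class StageStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"

@dataclass
class StageRecord:
    name: str                      # e.g. "classification"
    status: StageStatus
    input_ref: str                 # the artefact this stage consumed
    output_ref: Optional[str]      # the artefact it produced, if any
    error: Optional[str] = None    # populated on failure, so debugging starts here
```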
Cost monitoring from day one. We weren't watching token spend per audit closely enough in early production. By the time we noticed certain audits were costing 5x the average, we'd lost meaningful margin on a chunk of customers. The current version has a per-audit cost ceiling and a real-time monitoring layer.
Smaller, more specialised agents. The early versions had bigger, more general agents trying to do more. The agents that ship now are smaller, more specialised, and easier to reason about. The discipline of "one agent, one job" produced better results than "one agent, all the jobs".
What this means for anyone building similar systems
The high-level lesson from building this is that the leverage of agentic systems is real, the cost reduction can be dramatic, and the operational discipline required to capture either is significant.
If I were giving advice to a team thinking about building an equivalent system in their own domain — and I get this question a lot — the framing I'd offer is:
Pick a workflow that currently consumes meaningful senior time. Map it stage by stage. Identify the verification points. Build the verification first. Then build the workflow agents. Spend more time than you think you need to on observability, logging, and cost monitoring. Run it in shadow mode against real work for several weeks before customers see the output.
The technology is genuinely capable of replacing meaningful chunks of senior knowledge work. The discipline to do that reliably, in production, is what most teams don't yet have. That gap is where the actual leverage gets captured.
If you're interested in the product itself, amivisible.co is the front door. If you're interested in the architecture for building something similar in your domain, that's a longer conversation — drop me a note.