Your agent is hallucinating in production. Here's the playbook.

Agents hallucinate when they generate claims their context doesn't support: wrong facts, invented tool results, confident answers to questions they had no basis to answer. In production you won't catch this by reading transcripts; you catch it by measuring groundedness on sampled traffic and alerting when the hallucination rate moves.

The symptom: what it looks like in the wild

Hallucination in agents rarely looks like nonsense. It looks like competence — fluent, specific, and wrong. Three patterns we see repeatedly:

The support agent invents a refund policy. A customer asks about returns outside the window. The retrieval step finds nothing relevant, and the agent — trained to be helpful — generates a plausible policy: "you're eligible for a full refund within 60 days." Your policy says 30. The customer now has it in writing.

The RAG agent cites a source that doesn't exist. Asked a question at the edge of its knowledge base, the agent answers and attributes the claim to "the Q3 compliance guidelines, section 4.2." There is no section 4.2. The citation makes the fabrication more convincing, not less.

The voice agent confirms a booking that didn't happen. The booking API timed out. The agent, mid-conversation and unwilling to disappoint, says "you're all set for Tuesday at 2pm." Nothing was booked. The failure surfaces three days later as a furious customer standing in a lobby.

The common thread: no error was thrown, no log line turned red, and nobody knew until a human downstream hit the consequence. That is what makes hallucination a different class of problem from a crash — and why the fix is a measurement loop, not a bug fix.

Root causes, ranked

When you trace hallucinated sessions back to the moment the unsupported claim appeared, the causes rank roughly like this:

1. Retrieval gaps. The single biggest cause. The agent needed information, retrieval returned nothing or the wrong thing, and the model filled the gap, because next-token prediction always produces something. If your agent is RAG-backed, assume this first.
2. Prompt ambiguity about missing information. Most prompts say what to do; few say what to do when the answer isn't available. Without an explicit "if the context does not contain the answer, say so" instruction, and an escape hatch the agent is rewarded for using, refusal loses to fabrication.
3. Tool-result misreading. Multi-step agents act on their own interpretation of tool output. An empty array gets read as success, an error payload as data, a partial result as the whole answer. The fabrication happens at the interpretation step, then compounds through everything downstream.
4. Context overflow. Long sessions push early constraints (the policy excerpt, the user's actual request) out of the effective window. The agent keeps answering, now from memory of the conversation's vibe rather than its facts.
5. Model choice. Real, but last on the list deliberately: teams reach for "we need a better model" first because it requires no investigation. Upgrade the model after you've ruled out the four causes above, or you'll pay more to hallucinate more fluently.
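
Tool-result misreading is the easiest of these causes to make concrete. The sketch below is one possible guard, not a prescribed implementation: it assumes tool results arrive as a dict with hypothetical `ok`, `error`, `data`, and `truncated` keys (none of these names come from a specific framework) and attaches an explicit status label so the model is not left to interpret an empty array or an error payload on its own.

```python
from enum import Enum
from typing import Any


class ToolStatus(Enum):
    SUCCESS = "success"
    EMPTY = "empty"
    ERROR = "error"
    PARTIAL = "partial"


def classify_tool_result(result: dict[str, Any]) -> ToolStatus:
    """Label a tool result so the agent never has to guess what it means.

    Assumed (hypothetical) shape:
    {"ok": bool, "error": str | None, "data": list | dict | None, "truncated": bool}
    """
    # An error payload is never data, even if a "data" field is present.
    if not result.get("ok", False) or result.get("error"):
        return ToolStatus.ERROR
    data = result.get("data")
    # None, an empty list, or an empty dict means "nothing found", not success.
    if data is None or (isinstance(data, (list, dict)) and len(data) == 0):
        return ToolStatus.EMPTY
    # A truncated result must not be presented as the whole answer.
    if result.get("truncated", False):
        return ToolStatus.PARTIAL
    return ToolStatus.SUCCESS


def render_for_agent(tool_name: str, result: dict[str, Any]) -> str:
    """Prefix the raw payload with an explicit status line for the model."""
    status = classify_tool_result(result)
    return f"[{tool_name}] status={status.value}\n{result.get('data')!r}"
```

Passing the output of `render_for_agent` into the agent's context, instead of the raw payload, gives the grounding instructions something unambiguous to refer to, such as "only report success when status=success".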

Detection: measure groundedness, track the rate

You cannot read every transcript, and spot-reading finds the failures you already expected. Production detection is a metric pair: groundedness scoring on sampled sessions, rolled up into a hallucination rate per agent and per version.

The mechanism is LLM-as-a-judge: a judge model receives the agent's output plus the context the agent had — retrieved documents and tool results — and flags every claim the context doesn't support. A minimal judge prompt:

    You are auditing an AI agent's response for groundedness.

    CONTEXT the agent had available:
    {retrieved_documents}
    {tool_results}

    AGENT RESPONSE:
    {response}

    List each factual claim in the response. For each claim, mark SUPPORTED (directly backed by the context above) or UNSUPPORTED (not present in, or contradicted by, the context).

    Return: verdict (GROUNDED if all claims supported, else UNGROUNDED), the list of unsupported claims, and one sentence of reasoning.
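
To show how the judge prompt above might be wired into code, here is a minimal sketch. Two things in it are assumptions rather than part of the prompt as written: the extra instruction asking the judge to reply in JSON, and `call_judge`, a hypothetical callable that sends a prompt to whichever judge model you use and returns its text. No specific model API is implied.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

JUDGE_TEMPLATE = """You are auditing an AI agent's response for groundedness.

CONTEXT the agent had available:
{retrieved_documents}
{tool_results}

AGENT RESPONSE:
{response}

List each factual claim in the response. For each claim, mark SUPPORTED
(directly backed by the context above) or UNSUPPORTED (not present in, or
contradicted by, the context).

Reply with JSON only, using the keys "verdict" ("GROUNDED" or "UNGROUNDED"),
"unsupported_claims" (a list of strings), and "reasoning" (one sentence)."""


@dataclass
class Verdict:
    grounded: bool
    unsupported_claims: list[str] = field(default_factory=list)
    reasoning: str = ""


def judge_groundedness(
    response: str,
    documents: list[str],
    tool_results: list[str],
    call_judge: Callable[[str], str],  # hypothetical: prompt in, judge text out
) -> Verdict:
    prompt = JUDGE_TEMPLATE.format(
        retrieved_documents="\n".join(documents) or "(none)",
        tool_results="\n".join(tool_results) or "(none)",
        response=response,
    )
    raw = call_judge(prompt)
    try:
        parsed = json.loads(raw)
        claims = [str(c) for c in parsed.get("unsupported_claims", [])]
        # Grounded only if the judge says so and lists no unsupported claims.
        grounded = parsed.get("verdict") == "GROUNDED" and not claims
        return Verdict(grounded, claims, str(parsed.get("reasoning", "")))
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Unparseable judge output is flagged so a human reviews it.
        return Verdict(False, [], "judge output could not be parsed")
```

Treating unparseable judge output as ungrounded is a deliberate, conservative choice here: it inflates the rate slightly but keeps judge failures visible instead of silently counting them as passes.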

Sampling guidance: start at 5–10% of sessions, weighted toward high-stakes flows (anything that commits money, confirms actions, or quotes policy). At typical judge-model prices this costs a fraction of a cent per scored session for short contexts (long retrieved contexts cost more), which is still orders of magnitude cheaper than one bad refund commitment. Human-review a slice of judge verdicts weekly to keep the judge calibrated. Then compute the hallucination rate as flagged outputs divided by scored outputs, and plot it per agent, per version, over time. That chart is your detection system.
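
The sampling and rate calculation can be sketched in a few lines. The flow names and the per-flow sampling rates below are hypothetical placeholders, chosen only to illustrate weighting toward high-stakes flows; the only figure taken from the guidance above is the 5% default.

```python
import random
from collections import defaultdict

# Hypothetical flow names and rates: high-stakes flows are oversampled.
SAMPLE_RATES = {"refund": 0.5, "booking": 0.5, "policy_quote": 0.5}
DEFAULT_RATE = 0.05  # low end of the 5-10% starting range


def should_score(flow: str, rng: random.Random) -> bool:
    """Decide whether a session in this flow is sent to the judge."""
    return rng.random() < SAMPLE_RATES.get(flow, DEFAULT_RATE)


def hallucination_rates(scored):
    """Compute flagged / scored per (agent, version).

    `scored` is an iterable of (agent, version, flagged) tuples, one per
    session the judge has scored.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for agent, version, is_flagged in scored:
        key = (agent, version)
        total[key] += 1
        flagged[key] += int(is_flagged)
    return {key: flagged[key] / total[key] for key in total}
```

One caveat with weighted sampling: the per-agent rate is then dominated by the oversampled flows, so it is worth tracking the rate per flow as well, or reweighting, before comparing agents with different traffic mixes.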

The fix, ordered by effort

1. Grounding instructions (hours)

Tell the agent explicitly: answer only from provided context; if the context doesn't contain the answer, say so and offer escalation; quote tool results verbatim rather than paraphrasing them; never state an action succeeded without a confirming tool response. It is crude and immediate, and it measurably cuts the rate, though it is a mitigation, not a cure.
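
As an illustration, the four instructions above could be assembled into a system prompt like this. The wording of the rules, the section labels, and the `build_system_prompt` helper are all assumptions made for the sketch; adapt them to your own prompt layout.

```python
GROUNDING_RULES = """\
Answer only from the CONTEXT section below.
If the context does not contain the answer, say so and offer to escalate to a human.
Quote tool results verbatim; do not paraphrase them.
Never state that an action succeeded unless a tool response in the context confirms it."""

# Shown to the model when retrieval and tools produced nothing, so that an
# empty context is stated explicitly rather than left as a blank section.
NO_CONTEXT_NOTE = "(No relevant documents or tool results were found for this request.)"


def build_system_prompt(base_instructions: str, context_blocks: list[str]) -> str:
    """Combine base instructions, grounding rules, and whatever context exists."""
    context = "\n\n".join(b for b in context_blocks if b.strip()) or NO_CONTEXT_NOTE
    return (
        f"{base_instructions}\n\n"
        f"GROUNDING RULES:\n{GROUNDING_RULES}\n\n"
        f"CONTEXT:\n{context}"
    )
```

Stating the empty-context case in words gives the "say so and offer escalation" rule something explicit to trigger on.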

2. Retrieval tuning (days)

Since retrieval gaps are cause number one, fix what the agent sees: better chunking, hybrid search, and reranking. The most under-used step is auditing the questions retrieval failed on (your ungrounded sessions from the detection step are exactly this list) and patching the knowledge base where it's thin.
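
The audit can start very simply. The sketch below assumes each flagged session carries a hypothetical `topic` label (from a routing tag or classifier, for example) and the number of documents retrieval returned; both field names are illustrative. It ranks topics by how often a flagged session had nothing retrieved at all.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FlaggedSession:
    question: str
    topic: str            # hypothetical label, e.g. a routing tag
    retrieved_count: int  # number of documents retrieval returned


def knowledge_gaps(sessions: list[FlaggedSession], top_n: int = 5) -> list[tuple[str, int]]:
    """Rank topics by the number of flagged sessions with empty retrieval."""
    empty = Counter(s.topic for s in sessions if s.retrieved_count == 0)
    return empty.most_common(top_n)
```

Topics at the top of this list are candidates for new knowledge-base content; flagged sessions where retrieval did return documents point instead at chunking, ranking, or prompt problems.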

3. Output validation (days)

For structured claims — booking IDs, order statuses, amounts, dates — verify the claimed value against the actual tool response in code before the response ships. A regex and an equality check beat a judge model for facts that have a source of truth in the session.
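
A minimal sketch of such a check for the booking case follows. The `BK-` ID format, the `booking_id` and `status` field names, and the list of success phrases are all hypothetical; substitute whatever your booking tool actually returns.

```python
import re
from typing import Optional

BOOKING_ID_PATTERN = re.compile(r"\bBK-\d{6}\b")  # hypothetical ID format
# Crude phrase list; it can produce false positives (e.g. "not confirmed").
SUCCESS_LANGUAGE = re.compile(r"\b(all set|confirmed|booked)\b", re.IGNORECASE)


def validate_booking_claim(response_text: str, tool_response: Optional[dict]) -> list[str]:
    """Return a list of problems; an empty list means the response may ship."""
    problems = []
    tool = tool_response or {}
    actual_id = tool.get("booking_id")
    confirmed = tool.get("status") == "confirmed"

    # Equality check: every booking ID the agent mentions must match the tool's.
    for claimed in sorted(set(BOOKING_ID_PATTERN.findall(response_text))):
        if claimed != actual_id:
            problems.append(f"claimed booking ID {claimed} not in tool response")

    # Success language is only allowed when the tool actually confirmed.
    if SUCCESS_LANGUAGE.search(response_text) and not confirmed:
        problems.append("response claims success but no tool response confirms it")
    return problems
```

For example, `validate_booking_claim("You're all set for Tuesday at 2pm.", None)` returns one problem, which is the timed-out booking scenario described earlier. When the list is non-empty, the safer default is to replace the response with a fallback and escalate.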

4. Eval-gated deploys (a week, then forever)

Every hallucination you catch becomes a test case: input, context, and the claim that went wrong. Run that growing golden dataset on every prompt or model change, with a groundedness judge scoring each case, and block deploys that regress. This is the step that converts hallucination from a recurring incident into a regression class you've closed. Our agent evals guide walks through building this loop from zero.
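
The gate itself can be small. In the sketch below, `run_agent` and `is_grounded` are hypothetical callables standing in for your agent under test and your groundedness judge, and the baseline pass rate is whatever the currently deployed version achieves on the same dataset.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    context: str  # the documents and tool results the agent had


def run_eval_gate(
    cases: list[GoldenCase],
    run_agent: Callable[[str, str], str],     # hypothetical: (input, context) -> response
    is_grounded: Callable[[str, str], bool],  # hypothetical: (response, context) -> verdict
    baseline_pass_rate: float,
    tolerance: float = 0.0,
) -> tuple[bool, float, list[str]]:
    """Return (gate_passed, pass_rate, failed_case_ids)."""
    if not cases:
        raise ValueError("golden dataset is empty")
    failed = []
    for case in cases:
        response = run_agent(case.user_input, case.context)
        if not is_grounded(response, case.context):
            failed.append(case.case_id)
    pass_rate = 1 - len(failed) / len(cases)
    # Block the deploy if the candidate scores below the baseline.
    return pass_rate >= baseline_pass_rate - tolerance, pass_rate, failed
```

In CI, a failing gate would typically exit non-zero and print the failed case IDs, so the person making the prompt or model change sees exactly which past incident came back.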

Catch it continuously

Everything above works once — the failure mode is that it decays. The retrieval fix ships, the rate drops, attention moves on, and six weeks later a model update or a knowledge-base change quietly moves the rate back up. Nobody notices, because nobody is looking.
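
One simple way to notice that kind of drift is to compare a recent window of the daily hallucination rate against a longer baseline. The window lengths and the 2-point threshold in this sketch are hypothetical defaults, not recommendations; set them from your own traffic volume and tolerance.

```python
def rate_regressed(
    daily_rates: list[float],
    baseline_days: int = 14,
    recent_days: int = 3,
    max_increase: float = 0.02,
) -> bool:
    """Return True when the recent average rate exceeds the baseline average
    by more than `max_increase`. `daily_rates` is ordered oldest first."""
    if len(daily_rates) < baseline_days + recent_days:
        return False  # not enough history to compare
    recent = daily_rates[-recent_days:]
    baseline = daily_rates[-(baseline_days + recent_days):-recent_days]
    recent_avg = sum(recent) / len(recent)
    baseline_avg = sum(baseline) / len(baseline)
    return recent_avg - baseline_avg > max_increase
```

Running this per agent and per version on a schedule turns the chart into an alert, which matters most in the weeks after attention has moved on.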

The difference between an incident and a metric is whether you were measuring. Prefactor runs this loop as infrastructure: groundedness scoring on sampled live traffic, hallucination rate tracked per agent and per version, alerts when the trend regresses, and every flagged session one click from its full trace so the fix starts with evidence instead of a reproduction hunt.

It sits alongside the rest of your agent analytics — quality, cost, and performance in one place — so the question "are the agents okay?" has a dashboard, not a debate.

Measure your hallucination rate this week

Connect your traces, turn on groundedness scoring, and see the rate per agent before you change a single prompt.

Book a demo →

Frequently asked questions

Why is my AI agent hallucinating?

The most common causes, in order of likelihood: the retrieval step did not supply the information the agent needed, so it filled the gap; the prompt is ambiguous about what to do when information is missing; the agent misread a tool result (empty result treated as confirmation, error treated as data); the context window overflowed and earlier constraints fell out; or the model itself is too weak for the task. Diagnose by reading the full traces of five hallucinated sessions — the cause is usually visible in what the agent had in context at the moment it made the unsupported claim.

How do I detect hallucinations in production without reading every output?

Sample and score. Route a percentage of sessions (start at 5–10%) to an LLM-as-a-judge groundedness check: the judge receives the agent output plus the context the agent had — retrieved documents and tool results — and flags any claim not supported by that context. Track the flagged share as a hallucination rate per agent and per version, and alert when it moves. Spot-audit a slice of judge verdicts by hand each week to keep the judge honest.

How do I stop an agent from making up tool results?

Three controls, in order of effort. First, prompt-level: require the agent to quote tool output verbatim when reporting results, and to state explicitly when a tool returned nothing — fabrication thrives on paraphrase. Second, validation: where outputs are structured (a booking ID, an order status), verify the claimed value against the actual tool response in code before the response ships. Third, eval-gate it: add the failure to a golden dataset, write a judge eval that checks claimed tool results against actual ones, and block deploys that regress.

What hallucination rate is acceptable?

There is no universal number — it depends on blast radius. Internal copilots with expert users reviewing output can tolerate rates that would be unacceptable elsewhere. Customer-facing agents need materially lower rates because each fabrication is a support ticket or a refund commitment you did not authorise. In regulated contexts, what matters as much as the rate is that you measure it, can show its trend, and alert on regression. Whatever your threshold, the unmeasured rate is the unacceptable one.