Your chatbot just told a customer that your refund window is 60 days. It is 30. Nobody programmed that lie. The model made it up because making things up is, technically, what language models do.
This is called hallucination, and it is the most common complaint founders raise after deploying an AI chatbot. The good news is it is fixable. The bad news is the fix is not a single setting you toggle. It is a combination of architecture, prompting, and testing that you put in place before users discover the problem on their own.
What causes a chatbot to hallucinate incorrect answers?
Language models do not look things up. They predict the next word based on patterns learned during training. That sounds like a subtle distinction until you realize the implication: the model has no idea whether a sentence it generates is true. It only knows whether that sentence sounds like the kind of thing that follows the previous sentence.
When a user asks your chatbot about your pricing, the model has two options. If pricing information was part of what it learned from, it will try to recall it. If it was not, it will generate something plausible based on what similar-looking businesses charge in similar-looking contexts. Both outputs look identical to the user. One is accurate. One is invented.
A 2023 Stanford study found hallucination rates of 15-27% in enterprise chatbots operating without a grounding mechanism. That means somewhere between one in seven and one in four responses contains something the model fabricated. For a customer-facing support bot, that rate is costly.
The root cause is not a bug. It is the fundamental design of how these models work. Treating it like a bug you can patch misses the point. The real fix changes the architecture so the model is not put in a position to guess.
How does retrieval-augmented generation reduce fabrication?
Retrieval-augmented generation, or RAG, works by giving the model the answer before it has to generate one. Instead of asking the model to recall your refund policy from training data that may not include it, a RAG system searches your documents the moment a user asks a question, pulls the most relevant passages, and hands those passages directly to the model. The model's job is no longer to remember. It is to summarize what it just read.
The business impact is measurable. A 2023 Meta research paper on RAG found it reduced hallucination rates by 60-85% compared to unaided language models on knowledge-intensive tasks. Separate benchmarks from Anthropic and OpenAI internal testing show similar ranges for production chatbots grounded in company documents.
Here is how that plays out in practice. You have 200 support articles, a pricing page, and a returns policy. Those documents go into a search index. When a user asks a question, the system searches that index in under a second, finds the three most relevant passages, and tells the model: answer this question using only the content below. The model reads the passages and writes a response. If the answer is in your documents, the response will be accurate. If it is not, the model can be instructed to say it does not have information on that, rather than guessing.
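The retrieval step described above can be sketched in a few lines of Python. Everything here is illustrative: real systems score relevance with vector embeddings rather than word overlap, and the finished prompt would be sent to a model API, which is stubbed out here. The control flow is the same either way: search, select the top passages, constrain the model to them.

```python
# Minimal sketch of the retrieval step in a RAG pipeline. Relevance is
# scored by simple word overlap so the example runs on its own;
# production systems use vector embeddings instead.
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and split it on non-alphanumerics."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Return the k passages that share the most words with the query."""
    ranked = sorted(documents,
                    key=lambda d: len(tokens(query) & tokens(d)),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Hand the model the relevant passages and restrict it to them."""
    context = "\n\n".join(retrieve(query, documents))
    return ("Answer the question using ONLY the content below. If the "
            "answer is not in the content, say you do not have "
            "information on that.\n\n"
            f"Content:\n{context}\n\nQuestion: {query}")

docs = [
    "Your refund window is 30 days from the purchase date.",
    "The Pro plan costs $49 per month, billed annually.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
]
prompt = build_grounded_prompt("What is your refund window?", docs)
# The prompt now carries the 30-day refund passage; the model's job
# is to summarize what it just read, not to recall from training data.
```

The key design choice is the instruction wrapped around the retrieved passages: the model is told what to do when the answer is absent, which is where ungrounded chatbots fabricate.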
For a Timespade-built chatbot, this grounded architecture is the default. The difference between an ungrounded chatbot and one backed by RAG is roughly the difference between a new employee who guesses and one who checks the manual before answering.
| Approach | Hallucination Rate | Best For | Limitation |
|---|---|---|---|
| Base language model (no grounding) | 15-27% | Casual conversation, creative tasks | Not suitable for factual business queries |
| Prompt-constrained model | 8-15% | Simple single-topic bots with limited scope | Breaks down as question variety grows |
| RAG-grounded chatbot | 3-8% | Customer support, documentation Q&A, product assistants | Requires a maintained document library |
| Fine-tuned model plus RAG | 1-4% | High-stakes verticals: legal, medical, financial | Higher build cost, more ongoing maintenance |
Can prompt engineering alone solve the problem?
Prompt engineering is a phrase that covers everything from carefully written instructions to wishful thinking. The honest answer: it helps, but it cannot solve hallucination on its own.
Writing a system prompt that says "only answer from verified sources" does not give the model verified sources. It tells the model to prefer confident-sounding answers that match your tone. The model still draws from the same pool of uncertain training data.
What prompt engineering can do is reduce the blast radius of hallucination. Instructions like "if you are unsure, say so" or "do not answer questions outside these topics" catch some edge cases. A well-structured system prompt that defines the chatbot's role, limits its scope, and specifies fallback behavior will outperform a vague one. Research from Anthropic's 2023 model card testing showed careful prompting reduced hallucination frequency by about 20-30% without any architectural changes.
That is meaningful, especially for a narrow-purpose chatbot with limited question variety. But 20-30% improvement starting from a 20% baseline still leaves you at 14-16% fabrication. For anything customer-facing, that is not good enough.
The reliable path is to treat prompt engineering as one layer in a stack, not the whole solution. It works best when the model is already grounded in your documents via RAG, and prompting adds behavioral guardrails on top of that foundation. At Timespade, every AI chatbot build includes both: a knowledge base wired in through RAG and a system prompt tuned to the client's specific use case.
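A system prompt layered on top of grounding might read like the following. The company name, wording, and rules are placeholders for illustration, not a template from any particular provider:

```text
You are the support assistant for Acme Co. Answer only questions about
Acme's products, pricing, and policies.

Rules:
- Use only the reference content provided with each question.
- If the answer is not in the reference content, say "I don't have
  information on that" and offer to connect the user with support.
- Never guess at numbers, dates, or policy details.
- If a question is outside your scope, say so and stop.
```

Note that the second rule only works because RAG actually supplies reference content with each question; without that, the instruction has nothing to point at.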
What testing process catches hallucinations before users do?
Most chatbot launches skip systematic testing entirely. The team tries a few questions, it looks fine, and the product goes live. That approach misses the cases that matter most: unusual phrasings, questions the system was not designed for, and the specific queries where models tend to confabulate.
A workable testing process for a non-technical founder has three parts.
Build an adversarial question set before launch. Write 40-60 questions that your chatbot might realistically receive, weighted toward the edges: ambiguous questions, questions about policies that changed recently, questions where the right answer is a clear admission of uncertainty. Run every question through the bot and log the responses. This is your baseline. If 10% of answers are wrong before launch, you know exactly what you are shipping.
Score responses against your source documents, not against your intuition. A human reviewer reading chatbot responses will unconsciously fill in gaps and judge answers as close enough. Compare each response to the specific document passage it should have cited. A response that is directionally correct but cites the wrong policy or the wrong number is a hallucination, even if it sounds fine.
Set a failure rate threshold and treat it as a launch criterion. A practical target for a customer support chatbot is under 5% factual error on your adversarial set. If the bot is above that before launch, it is not ready. This sounds obvious, but most teams do not write the number down or hold to it. Testing also does not stop at launch: a 2022 Gartner survey found that 45% of companies monitoring AI systems caught quality issues within the first three months that pre-launch testing had not detected. A lightweight review of 20-30 real conversations per week, flagged for factual errors, will catch drift before it compounds.
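The three steps above can be wired into a small harness. In this sketch, `ask_bot` is a stub standing in for whatever function calls your real chatbot, and the questions and expected facts are hypothetical examples of an adversarial set:

```python
# Sketch of a pre-launch adversarial test harness. Each question is
# paired with a fact from the source documents that a correct answer
# must contain; the launch gate is a hard failure-rate threshold.

ADVERSARIAL_SET = [
    # (question, fact the answer must contain, per the source docs)
    ("What is your refund window?", "30 days"),
    ("How much is the Pro plan?", "$49"),
    ("Do you offer phone support on weekends?", "I don't have information"),
]

def ask_bot(question: str) -> str:
    """Stub standing in for the real chatbot call."""
    canned = {
        "What is your refund window?": "Our refund window is 30 days.",
        "How much is the Pro plan?": "The Pro plan is $49 per month.",
        "Do you offer phone support on weekends?":
            "Yes, weekend phone support is available.",  # a fabrication
    }
    return canned[question]

def run_suite(threshold: float = 0.05) -> tuple[float, bool]:
    """Score every response against its source fact; gate the launch."""
    failures = [q for q, fact in ADVERSARIAL_SET
                if fact not in ask_bot(q)]
    rate = len(failures) / len(ADVERSARIAL_SET)
    return rate, rate <= threshold

rate, ready = run_suite()
# One of the three responses is fabricated, so rate = 1/3 and
# ready = False: the bot fails the 5% launch criterion.
```

Checking for a required substring is the crudest possible scorer; a reviewer comparing responses to the cited passage catches more, but even this mechanical version enforces the discipline of writing the threshold down.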
Should I add a confidence score or disclaimer to responses?
The instinct to add a disclaimer is reasonable. If the chatbot might be wrong, tell users. The problem is that most implementations of this instinct do not work the way founders expect them to.
When users see a disclaimer on every response, they habituate to it within days. A 2021 study in the journal Computers in Human Behavior found that persistent AI disclaimers reduced user trust initially but had no effect on behavior after two weeks of regular use. The disclaimer becomes invisible, which means it stops protecting users while continuing to undermine confidence in every correct answer.
Confidence scores face a different problem: language models are not reliably calibrated. A model can be highly confident in a fabricated answer. The score shown often reflects fluency, not accuracy. Displaying a 94% confidence score next to an incorrect response is worse than showing no score, because it actively misleads.
The more effective alternative is a scoped response that appears only when the model genuinely cannot find an answer in your documents: "I could not find that in our documentation. Here is how to reach our support team." This reserves uncertainty signals for cases where they are actually true, rather than attaching them to every response as a liability shield.
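That scoped fallback can be driven directly by the retrieval score: if no document passage matches the question well enough, return the not-found message instead of generating anything. A minimal sketch, using keyword overlap as a stand-in for real retrieval scoring; the support address and the overlap cutoff are placeholders:

```python
# Scoped fallback: the "could not find that" message appears only when
# retrieval comes back empty-handed, not on every response.
import re

FALLBACK = ("I could not find that in our documentation. "
            "Here is how to reach our support team: support@example.com")

def tokens(text: str) -> set[str]:
    """Lowercase the text and split it on non-alphanumerics."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def answer(query: str, documents: list[str], min_overlap: int = 2) -> str:
    best = max(documents, key=lambda d: len(tokens(query) & tokens(d)))
    if len(tokens(query) & tokens(best)) < min_overlap:
        return FALLBACK  # genuinely not found, so say so
    return best          # in practice: pass `best` to the model as context

docs = ["Your refund window is 30 days from the purchase date."]
print(answer("What is your refund window?", docs))  # grounded answer
print(answer("Do you ship to Antarctica?", docs))   # scoped fallback
```

Because the fallback fires only on genuine retrieval misses, users see the uncertainty signal exactly when it is true, which is what keeps it credible.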
For high-stakes applications where errors have financial or legal consequences, the better investment is in the grounding architecture and testing process described above. A chatbot that rarely hallucinates does not need a blanket disclaimer in the same way a chatbot with a 20% error rate does.
| Approach | User Experience | Accuracy Impact | Recommended |
|---|---|---|---|
| Blanket disclaimer on every response | Users ignore it within days | None | No |
| Confidence score per response | Misleads when model is confidently wrong | Negative in some cases | No |
| Scoped not-found response | Builds trust on edge cases | Positive | Yes |
| RAG grounding plus adversarial testing | Seamless for users | Reduces errors by 60-85% | Yes |
Building a chatbot that gets this right from the start costs less than repairing trust after a string of bad answers goes public. Timespade builds grounded chatbot systems across generative AI, product engineering, and data infrastructure. One team, one contract, and a testing process that catches problems before your users do.
