Claude generated three citations last week. None of them existed. The paper titles sounded plausible, the authors were real, but the journals were invented. This wasn’t a glitch — it was hallucination, and it happens because of how these models actually work.
Hallucinations occur when an LLM generates text that sounds confident but contradicts reality, the provided context, or its instructions. They're not bugs, and they're not unpredictable. They're a direct consequence of how transformer models predict tokens, and they happen at scale across every production deployment.
What’s Actually Happening When an LLM Hallucinates
Language models don’t retrieve facts. They predict the statistically most likely next token based on training data patterns. When you ask Claude or GPT-4o a question, the model isn’t querying a database. It’s calculating a probability distribution across tens of thousands of possible tokens (its entire vocabulary) and picking winners, token by token, until it reaches a stop condition.
This works beautifully for many tasks. But when the model encounters a prompt that sits outside its training data — or where multiple plausible continuations exist — it doesn’t say “I don’t know.” It generates the statistically probable next token anyway. That token becomes context for the next prediction. Confidence compounds. A hallucination is born.
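To make the mechanism concrete, here’s a toy sketch of next-token sampling: a softmax over scores, a weighted random pick, and no truth check anywhere in the loop. The tokens and scores are invented for illustration; real models do this over their full vocabulary at every position.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

def sample_next_token(logits, rng=random.Random(0)):
    """Pick one token according to its probability. Note what's missing:
    the model never checks whether the winner is true, only likely."""
    probs = softmax(logits)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy scores for continuing "The capital of Australia is":
# a plausible-but-wrong token can carry the most probability mass.
logits = {"Sydney": 2.1, "Canberra": 1.9, "Melbourne": 0.4}
print(sample_next_token(logits))
```

Whichever token wins becomes context for the next prediction, which is exactly how one wrong pick snowballs into a full hallucinated paragraph.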
The problem accelerates with longer outputs. Each new token depends on previous tokens, and if earlier predictions were off-base, downstream text diverges further from reality. One study from Anthropic (March 2024) found that Claude’s error rate on factual questions roughly doubles when responses exceed 2,000 tokens compared to responses under 500 tokens.
Temperature and Randomness Aren’t the Real Culprit
Most developers initially blame temperature settings. Lower temperature = fewer hallucinations, right? Partially true, but incomplete. Temperature controls sampling randomness, not what the model believes is likely. Setting temperature to 0 (deterministic mode) stops the model from picking unlikely tokens — but it doesn’t prevent confident false statements when the wrong token is the high-probability choice.
This is the gap most guides miss. Lowering temperature reduces variability but not accuracy. You get consistent hallucinations instead of random ones.
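A minimal sketch makes the point. Temperature rescales the distribution, but the ranking of tokens never changes: if the model assigns its highest score to a wrong answer, that answer stays on top at every temperature. The logits and candidate years here are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T before softmax. T=0 collapses to greedy
    argmax; higher T flattens the distribution. Neither changes which
    token the model scores highest."""
    if temperature == 0:  # deterministic mode: pure argmax
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# The model is "sure" about the wrong year; temperature can't fix that.
logits = {"1912": 3.0, "1915": 1.0}
for t in (0, 0.3, 1.0):
    print(t, softmax_with_temperature(logits, t))
```

At every temperature, "1912" wins. Temperature 0 just makes it win every single time.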
Four Techniques That Actually Reduce Hallucination Rates
1. Grounding: Force the Model to Cite Sources
This is the simplest lever. When you require the model to quote or cite source material in its response, hallucinations drop significantly — not to zero, but measurably. The model becomes constrained by what actually exists in your input.
Bad prompt:
Summarize the key findings from this research paper about machine learning efficiency.
[paper text here]
What happens: The model generates summary points that sound like they could be from the paper, but may invent findings or misattribute them.
Improved prompt:
Summarize the key findings from this research paper. For each finding, quote the exact sentence from the paper that supports it. If a point is not directly stated in the paper, mark it as [INFERRED] and explain your reasoning.
[paper text here]
Why this works: The model now has to match its output against the actual text. It still makes mistakes, but the error rate drops because it can’t fabricate without violating the citation requirement. In practice, this cuts hallucination rate by 40–60% on factual extraction tasks.
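One advantage of the citation requirement is that it makes fabrication mechanically checkable: a quoted sentence either appears in the source or it doesn’t. Here’s a sketch of a post-hoc guard along those lines; `verify_citations` is a hypothetical helper, and the paper text and findings are invented.

```python
def verify_citations(summary_points, source_text):
    """Flag any quoted support sentence that does not appear verbatim
    in the source. `summary_points` is a list of
    (finding, quoted_sentence) pairs pulled from the model's response."""
    results = []
    for finding, quote in summary_points:
        grounded = quote.strip().lower() in source_text.lower()
        results.append((finding, grounded))
    return results

paper = "We observed a 3x speedup from pruning. Accuracy dropped by 1%."
points = [
    ("Pruning tripled throughput", "We observed a 3x speedup from pruning."),
    ("Memory use fell 50%", "Memory consumption was halved."),  # fabricated
]
for finding, ok in verify_citations(points, paper):
    print(finding, "->", "grounded" if ok else "NOT IN SOURCE")
```

A real implementation would normalize whitespace and handle paraphrased quotes, but even this exact-match version catches outright fabrications.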
2. RAG (Retrieval-Augmented Generation): Let It Search, Not Remember
Hallucinations often happen because the model tries to answer from memory (training data) when it should answer from context. Retrieval-Augmented Generation flips this: you provide relevant documents before the prompt, and the model builds its response from what’s actually there.
This requires infrastructure — a vector database, a retriever, a chunking strategy — but it’s the most reliable technique for knowledge-heavy workflows. Hallucination rates on retrieval tasks with solid RAG implementations sit around 5–8%, compared to 20–30% without grounding.
Workflow:
- User asks a question
- Retriever searches your knowledge base and returns top 3–5 relevant documents
- Those documents are injected into the prompt as context
- LLM generates response anchored to that context
- Output cites which document sections informed the answer
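The workflow above can be sketched end to end. This toy version substitutes bag-of-words overlap for a real embedding model and vector database, and the documents, helper names, and prompt wording are all illustrative, not a production pattern.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. A real system would call an
    embedding model and store vectors in a vector database."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, docs, k=3):
    """Return the top-k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, docs):
    """Inject retrieved documents as context and instruct the model
    to answer only from them, citing doc numbers."""
    context = "\n".join(f"[doc {i}] {d}" for i, d in enumerate(docs, 1))
    return ("Answer using ONLY the documents below. Cite doc numbers.\n"
            "If the answer is not present, say so.\n\n"
            f"{context}\n\nQuestion: {question}")

docs = ["Refunds are issued within 14 days.",
        "Shipping takes 3-5 business days.",
        "Support is available on weekdays."]
question = "How long do refunds take?"
print(build_prompt(question, retrieve(question, docs, k=2)))
```

The retrieval step is where real systems differ most: chunking strategy, embedding quality, and re-ranking all affect whether the right document reaches the prompt at all.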
The trade-off: RAG adds latency and requires maintaining document sources. It also fails silently if relevant documents aren’t in your database — the model will hallucinate an answer instead of saying “not found.”
3. Constrained Output Formats
When you force structured output — JSON, XML, predefined categories — you reduce the space in which hallucinations can occur. The model can still make mistakes, but it can’t invent entire fields.
Bad prompt:
Extract the company name, founding year, and CEO from this press release.
[press release text]
Typical (partially hallucinated) output:
Company: TechVision Inc
Founding Year: 2015
CEO: Sarah Martinez
Improved approach:
Extract information from the press release. Return valid JSON only. If a field is not mentioned in the text, return null.
{
"company_name": "string or null",
"founding_year": "number or null",
"ceo": "string or null"
}
[press release text]
With structured output and a null requirement, the model is forced to either extract accurate data or admit uncertainty. GPT-4o and Claude both handle this pattern well — hallucination rates on structured extraction drop to 8–12% with this approach.
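Structured output has a second benefit: it makes violations detectable in code. A sketch of a validation layer for the schema above — `SCHEMA_FIELDS` and `validate_extraction` are hypothetical names, and the sample payload is invented.

```python
import json

# Mirrors the schema in the prompt: each field is its type or null.
SCHEMA_FIELDS = {"company_name": str, "founding_year": int, "ceo": str}

def validate_extraction(raw_model_output):
    """Parse model output as JSON and enforce the null-or-typed
    contract. Raises ValueError on any invented field or wrong type;
    structure makes hallucinations detectable, not impossible."""
    data = json.loads(raw_model_output)
    for field in data:
        if field not in SCHEMA_FIELDS:
            raise ValueError(f"unexpected field: {field}")
    for field, expected in SCHEMA_FIELDS.items():
        value = data.get(field)
        if value is not None and not isinstance(value, expected):
            raise ValueError(f"{field}: expected {expected.__name__} or null")
    return data

# The press release never named a CEO, so null is the only valid value:
print(validate_extraction(
    '{"company_name": "TechVision Inc", "founding_year": null, "ceo": null}'))
```

Rejecting invalid output and retrying the request is usually cheaper than letting an invented field flow downstream.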
4. Temperature + Max Tokens + Explicit Refusal
This is conservative but effective: set temperature to 0.3 (low but not zero), limit max_tokens to match expected response length, and include explicit instructions to refuse uncertain answers.
Example instruction:
You are answering questions about company policy. Answer only if you are confident in the answer based on the provided policy documents. If the answer is not in the documents or you are unsure, respond with exactly: "I cannot find this information in the provided documents."
This doesn’t eliminate hallucination — but it changes the failure mode. Instead of confident wrong answers, you get honest refusals. For user-facing systems, that’s often better.
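Wiring this up means pairing the instruction with a sentinel check on the response, so refusals route to a distinct UI state instead of being displayed as answers. A sketch under those assumptions — the parameter names vary by provider, and `classify_response` is a hypothetical helper.

```python
REFUSAL = "I cannot find this information in the provided documents."

SYSTEM_PROMPT = (
    "You are answering questions about company policy. Answer only if "
    "you are confident in the answer based on the provided policy "
    "documents. If the answer is not in the documents or you are "
    f'unsure, respond with exactly: "{REFUSAL}"'
)

# Hypothetical request parameters for this pattern (exact names and
# sensible limits depend on your provider and expected answer length):
REQUEST_PARAMS = {"temperature": 0.3, "max_tokens": 300}

def classify_response(model_output):
    """Route refusals separately from answers so the caller can show
    an honest 'not found' state instead of a fabricated answer."""
    if model_output.strip().strip('"') == REFUSAL:
        return ("refusal", None)
    return ("answer", model_output.strip())

print(classify_response(REFUSAL))
print(classify_response("Employees accrue 1.5 vacation days per month."))
```

The exact-string instruction is what makes the sentinel check reliable; a vaguer refusal phrasing would force fragile substring matching instead.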
What You Should Do Today
Pick one workflow in your codebase where hallucination costs something — incorrect data extraction, wrong customer info, fabricated references. Start with grounding: add a single instruction requiring citations or source quotes. Run 20 test cases. Measure whether output quality improves. If it does, you’ve found a 10-minute fix that works for your specific problem.
If you’re building something that requires factual accuracy at scale, plan for RAG. Not next quarter. Now. Hallucinations aren’t edge cases — they’re the default mode. Treating them as a feature to add later means shipping broken systems first.