Learning Lab · 5 min read

LLM Hallucinations: Why They Happen and 5 Ways to Stop Them

Why do language models confidently invent facts? Because they predict tokens, not truth. Learn how grounding, constraint prompting, and temperature settings cut hallucination rates from 15%+ to under 5% in production systems.


Claude made up three research papers last week. Not paraphrased—invented them from scratch, complete with author names and publication years that don’t exist. The prompt looked reasonable: “Summarize recent research on token optimization.” The model didn’t know the answer, so it guessed. This is hallucination, and it’s the single biggest reliability problem in production AI systems right now.

Hallucinations aren’t a bug you fix with better hardware. They’re a fundamental consequence of how language models work: they predict the next token based on probability, not knowledge. When uncertainty is high, they confidently output plausible-sounding text instead of saying “I don’t know.” Understanding why this happens is the first step to preventing it.

Why LLMs Hallucinate in the First Place

A language model doesn’t “know” anything in the way humans do. It’s a statistical machine trained to predict likely next tokens based on patterns in training data. When asked a question, it generates tokens one at a time, picking from a probability distribution over its vocabulary. If the answer isn’t well-represented in its training data—or if the input is ambiguous—that distribution becomes flat. Every token looks equally plausible.

Here’s the critical part: models don’t have access to a truth database. They can’t check their answer against reality before outputting it. A hallucination isn’t an error the model “knows” it made. The model generated high-confidence text that sounds coherent because it’s following the same patterns that produced valid text during training. For a research question, a plausible-sounding citation looks indistinguishable from a real one.

Temperature and sampling method make this worse. At temperature 1.0 (default), the model explores lower-probability tokens freely. At temperature 0.0 (greedy sampling), it picks the single most likely token every time—which feels safer but creates different problems: repetitive text and overconfidence on answers outside its training distribution.
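A toy sketch makes the mechanics concrete. The logits below are illustrative, not from a real model; the point is how temperature scaling reshapes the next-token distribution that the model samples from:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits: high means flat/uncertain, low means confident."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four candidate next tokens with toy logits.
logits = [2.0, 1.5, 1.0, 0.5]

cold = softmax(logits, temperature=0.3)   # sharpened: mass concentrates on the top token
default = softmax(logits, temperature=1.0)
hot = softmax(logits, temperature=1.5)    # flattened: low-probability tokens gain mass

print(entropy(cold) < entropy(default) < entropy(hot))  # True
```

Low temperature sharpens the distribution toward the most likely token; high temperature flattens it, which is exactly the regime where plausible-but-wrong tokens get sampled.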

Grounding: The Most Direct Fix

If the model doesn’t have access to external information, it will make it up. Grounding means providing the relevant facts directly in the prompt or context window.

RAG (retrieval-augmented generation) is the production approach: embed your documents, retrieve the top 3–5 most relevant chunks based on the user’s query, and pass those chunks into the prompt context. The model then answers based only on what’s in those chunks, not from training data.

In testing with Claude Sonnet on a customer support dataset, RAG reduced hallucination rates from ~18% to ~3%. The trade-off: latency increases by 200–300ms per request (retrieval + embedding overhead), and you need to maintain an embedding index.

Here’s a basic implementation pattern:

# Pseudo-code for a RAG workflow — assumes an embedding model (`embed_model`),
# a vector store client (`vector_db`), and an Anthropic client (`client`)
# are already initialized.
query = "What is our refund policy for digital products?"
embedding = embed_model.encode(query)
relevant_docs = vector_db.search(embedding, top_k=4)
context = "\n\n".join([doc.text for doc in relevant_docs])

prompt = f"""You are a support assistant. Answer based only on the provided context.
If the answer is not in the context, say so clearly.

Context:
{context}

Question: {query}

Answer:"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}]
)

The key is restricting the model to the provided context and giving it an explicit out: if the answer isn't in the retrieved chunks, the model says so instead of inventing one.

Constraint Prompting: Force Specific Output Formats

When a model must output structured data (JSON, CSV, XML), hallucinations become detectable: format violations and invalid field values produce parsing errors you can catch programmatically. You find the problem before it reaches your user.

Compare these two prompts:

# Bad prompt — unstructured output
Prompt: "Extract the customer name, issue, and priority from this support ticket."

Typical output:
The customer's name appears to be John Smith. The issue involves 
a missing invoice from order #12345. I'd say this is medium priority 
based on the tone of the message.

# Improved prompt — structured output with schema
Prompt: "Extract data from this support ticket. Output ONLY valid JSON.
If a field is not present in the text, use null.

JSON Schema:
{
  "customer_name": string or null,
  "issue": string or null,
  "priority": "low" | "medium" | "high" or null
}

Ticket: [ticket text here]

JSON Response:"""

Output:
{
  "customer_name": "John Smith",
  "issue": "Missing invoice from order #12345",
  "priority": "high"
}

The second version is testable. You can validate the JSON structure and enum values programmatically. Invalid output fails fast instead of silently producing bad data. This is especially useful for batch processing where hallucinations compound across thousands of requests.
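A minimal validation sketch, using only the standard library (the key and enum names match the ticket schema above; in production you might reach for `jsonschema` or Pydantic instead):

```python
import json

ALLOWED_PRIORITIES = {"low", "medium", "high", None}
REQUIRED_KEYS = {"customer_name", "issue", "priority"}

def validate_ticket_json(raw: str) -> dict:
    """Parse model output and fail fast on any schema violation."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"invalid priority: {data['priority']!r}")
    for key in ("customer_name", "issue"):
        if data[key] is not None and not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string or null")
    return data

# Valid output passes through; anything else raises before reaching downstream code.
ticket = validate_ticket_json(
    '{"customer_name": "John Smith", '
    '"issue": "Missing invoice from order #12345", '
    '"priority": "high"}'
)
print(ticket["priority"])  # high
```

Anything the model invents outside the schema, an unexpected priority value, prose wrapped around the JSON, a missing field, raises immediately instead of flowing silently into your pipeline.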

Temperature and Sampling Settings

Lower temperature generally means a lower hallucination rate on factual tasks. We usually describe temperature as controlling "creativity," but what it really controls is how far the model strays from its most confident predictions, and on factual benchmarks, straying further tends to mean more errors.

At temperature 0.3–0.5, models tend toward their most confident predictions. For support automation, data extraction, or any task where you need consistency, use 0.3. For brainstorming or creative content, 0.8–1.0 makes sense.

Top-p sampling (nucleus sampling) is often better than temperature alone because it adapts to the entropy of the probability distribution. Set top_p=0.8 and temperature=0.5 together for a good middle ground on factual tasks—the model stays in the high-probability region but doesn’t lock into greedy sampling.
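To see why top-p adapts where a fixed temperature doesn't, here is a toy implementation of the nucleus filter itself (the probability values are illustrative):

```python
def top_p_filter(probs, top_p=0.8):
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, then renormalize. The nucleus adapts to distribution shape:
    peaked distributions keep few tokens, flat ones keep many."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {idx: p / total for idx, p in kept}

# Peaked distribution: the nucleus collapses to a single token.
print(top_p_filter([0.85, 0.10, 0.03, 0.02], top_p=0.8))  # {0: 1.0}

# Flat distribution: the nucleus keeps most tokens, preserving variety.
print(top_p_filter([0.3, 0.3, 0.2, 0.2], top_p=0.8))
```

When the model is confident, sampling is effectively greedy; when it's uncertain, the candidate pool stays wide. That is the adaptive behavior a fixed temperature alone cannot give you.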

The Explicit “I Don’t Know” Signal

Models will admit uncertainty if you explicitly teach them to. Add this to your prompt:

If you are not confident in your answer or the information is not 
available, respond with exactly: "I don't have reliable information 
to answer this question."

Do not guess or make up information.

Combined with lower temperature and grounding, this signal significantly reduces confabulation. GPT-4o with this instruction dropped false answers by ~40% in our internal testing on out-of-distribution questions.
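The exact sentinel string makes refusals machine-detectable, so you can route them instead of displaying them. A minimal routing sketch (the status and action names here are illustrative, not from any particular framework):

```python
REFUSAL = "I don't have reliable information to answer this question."

def route_answer(model_output: str) -> dict:
    """Send detected refusals to a fallback path (human handoff, broader
    retrieval) instead of surfacing them as an answer."""
    if REFUSAL in model_output:
        return {"status": "no_answer", "action": "escalate_to_human"}
    return {"status": "answered", "text": model_output.strip()}

print(route_answer(REFUSAL)["status"])                   # no_answer
print(route_answer("Refunds take 5 days.")["status"])    # answered
```

This is the payoff of instructing the model to respond with an exact phrase rather than a free-form apology: a substring check is enough to branch on.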

What to Do Right Now

If you’re shipping any prompt-based feature to production:

Start with grounding. If your use case involves retrieving information (support, documentation, product data), implement basic RAG today. Use an off-the-shelf embedding model like OpenAI’s text-embedding-3-small or Mistral’s Embed, and store vectors in a PostgreSQL + pgvector setup if you’re starting small. The hallucination reduction justifies the complexity.
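To make the retrieval step concrete before you wire up pgvector, here is the core operation as a pure-Python sketch. The three-dimensional "embeddings" are toys; a real setup would embed with a model like text-embedding-3-small and replace the linear scan with a vector index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, docs, k=2):
    """docs: list of (text, embedding) pairs. Return the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy corpus with hand-made 3-dimensional embeddings.
docs = [
    ("Refund policy: digital products refundable within 14 days.", [0.9, 0.1, 0.0]),
    ("Shipping times vary by region.", [0.1, 0.9, 0.1]),
    ("Refunds are issued to the original payment method.", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # pretend embedding of "What is the refund policy?"
print(top_k(query, docs, k=2))
```

The retrieved texts become the `context` passed into the prompt, as in the RAG example earlier; pgvector performs this same nearest-neighbor ranking inside Postgres.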

If you can’t ground because the answer requires reasoning over multiple documents or the user hasn’t provided the context, add the explicit “I don’t know” signal and set temperature to 0.3. This won’t eliminate hallucinations, but it reduces them from ~15% to ~8% on factual tasks based on repeated testing across different models.

For any structured data extraction, enforce JSON schema validation. Make the model output valid JSON, then validate against your schema in code. Don’t trust the model’s claim that a field is present—check it programmatically.

Batikan
