Learning Lab · 6 min read

Stop Hallucinating: How RAG Actually Grounds LLMs

RAG grounds LLMs with your actual data, sharply reducing hallucinations. This guide explains how RAG works in production, why basic setups fail, and the specific patterns that work, with code examples and trade-offs.


Your LLM just confidently cited a research paper that doesn’t exist. You asked it about your company’s API docs, and it described endpoints that were deprecated in 2019. This happens because language models generate text based on patterns in training data, not by querying your actual information.

Retrieval Augmented Generation (RAG) fixes this. Not by making models smarter, but by giving them access to real data before they generate a response. The technique has become essential for production systems, but most implementations fail quietly: either retrieval returns irrelevant documents, or it returns so much that the model gets confused by excess context.

This guide walks through how RAG actually works, why basic setups fail, and the specific patterns that work in production.

What RAG Actually Does

RAG has three steps:

  1. Retrieve: Search your document store for content relevant to the user’s question
  2. Augment: Inject the retrieved documents into the prompt alongside the original question
  3. Generate: The LLM responds using both the question and the retrieved context
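The three steps can be sketched end to end in a few lines. Everything here is illustrative: `search_store` (a toy keyword-overlap search) and `call_llm` (a placeholder) are hypothetical stand-ins for a real vector database query and a real model API call.

```python
def search_store(question, store, k=2):
    # Retrieve: naive keyword-overlap search over a list of documents.
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def augment(question, docs):
    # Augment: inject the retrieved documents into the prompt.
    context = "\n".join(f"- {d}" for d in docs)
    return f"Documents:\n{context}\n\nQuestion: {question}"

def call_llm(prompt):
    # Generate: placeholder for a real LLM call.
    return f"(model answers using {prompt.count('- ')} document(s))"

store = [
    "The /v2/users endpoint replaced /v1/users in 2020.",
    "Billing is handled by the invoices service.",
]
question = "Which endpoint lists users?"
prompt = augment(question, search_store(question, store))
answer = call_llm(prompt)
```

A production system swaps the keyword search for embedding search and the placeholder for a model call, but the control flow stays exactly this shape.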

The critical insight: the model is far less likely to hallucinate about your data because it’s looking at your actual data while generating the response. It has the information right in front of it.

But here’s where implementations diverge. A naive RAG system retrieves every vaguely relevant document, floods the context window, and the model gets lost in noise. A well-tuned RAG system retrieves exactly what matters, ranks it by relevance, and the model has clear guidance.

Why Basic RAG Fails: The Three Common Breakdowns

Before fixing anything, understand where it breaks.

Breakdown 1: Retrieval returns noise.

You search your document database using basic keyword matching or simple embeddings, and get back 10 results where only 2 are relevant. The LLM sees the noise and either gets confused or ignores the good information entirely. This is the most common failure mode.

Breakdown 2: Retrieved documents are too long.

Even if retrieval works, you retrieve full documents (800+ tokens each). Your 4K context window is gone. Claude or GPT-4o wastes tokens parsing irrelevant sections and has no room for nuance in the response.
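The arithmetic is unforgiving: ten 800-token documents is 8,000 tokens against a 4K window. A minimal sketch of budget enforcement, using a rough 4-characters-per-token heuristic (an assumption for illustration; a real system should count with the model's actual tokenizer):

```python
def rough_tokens(text):
    # Crude heuristic: ~4 characters per token. Use the model's real
    # tokenizer (e.g. tiktoken) in production; this is illustrative.
    return max(1, len(text) // 4)

def fit_to_budget(docs, budget_tokens):
    # Keep documents in ranked order until the budget is exhausted;
    # drop the rest rather than truncating mid-document.
    kept, used = [], 0
    for doc in docs:
        cost = rough_tokens(doc)
        if used + cost > budget_tokens:
            break
        kept.append(doc)
        used += cost
    return kept

docs = ["x" * 3200] * 10          # ten ~800-token documents
kept = fit_to_budget(docs, budget_tokens=4000)
```

Under a 4K budget, only the first five of those ten documents survive, which is exactly why ranking order matters before this step.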

Breakdown 3: The retrieval-generation mismatch.

Your retrieval system finds documents optimized for one type of query (factual lookup), but the generation step needs documents structured differently (for reasoning, for examples, for edge cases). You’re solving two different problems with one retriever.

Building a Production RAG Pipeline: Step by Step

A working RAG system needs four components: a document store, a retriever, a ranker, and a prompt structure. Let’s build one.

Step 1: Document Ingestion and Chunking

Raw documents are too large to retrieve efficiently. You need to split them into chunks, typically 300-500 tokens each, with 10-20% overlap between chunks.

```python
# Example: chunking a document for RAG
documents = load_documents("company_docs/")

chunks = []
for doc in documents:
    text = doc.content
    # Naive approach: fixed 400-character slices every 300 characters
    # (i.e. 100 characters of overlap)
    for i in range(0, len(text), 300):
        chunk = text[i:i + 400]
        chunks.append({
            "id": f"{doc.name}_chunk_{len(chunks)}",
            "content": chunk,
            "source": doc.name,
            "metadata": doc.metadata
        })

# Better approach: chunk at semantic boundaries
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)

chunks = []
for doc in documents:
    splits = splitter.split_text(doc.content)
    for split in splits:
        chunks.append({
            "id": f"{doc.name}_chunk_{len(chunks)}",
            "content": split,
            "source": doc.name
        })
```

The key here: don’t split randomly. Use RecursiveCharacterTextSplitter to break at paragraph boundaries first, then sentences, then words. This keeps semantic units intact.
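The strategy behind that splitter can be shown in miniature: try the coarsest separator first, greedily merge the pieces back up to the size limit, and only fall back to finer separators when a piece is still too big. This is a simplified sketch of the idea, not LangChain's actual implementation.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    # Base case: the piece already fits.
    if len(text) <= max_len:
        return [text] if text.strip() else []
    # Pick the coarsest separator present in the text.
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # No separator left: hard-split as a last resort.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    # Greedily merge split pieces back together up to max_len,
    # recursing into any single piece that is itself too large.
    chunks, buf = [], ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= max_len:
            buf = candidate
            continue
        if buf:
            chunks.append(buf)
        if len(piece) > max_len:
            chunks.extend(recursive_split(piece, max_len, separators))
            buf = ""
        else:
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks

doc = "First paragraph here.\n\nSecond paragraph is a bit longer than the first one."
chunks = recursive_split(doc, max_len=40)
```

The first paragraph survives intact as one chunk; the oversized second paragraph falls back to word-boundary splitting.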

Step 2: Embedding and Storage

Convert each chunk into a vector embedding. This is what makes semantic search possible: similar meanings produce similar vectors, regardless of exact word matches.

```python
# Embedding chunks and storing them
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="your-key")
index = pc.Index("rag-index")

# Embed each chunk
embedded_chunks = []
for chunk in chunks:
    response = client.embeddings.create(
        input=chunk["content"],
        model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding
    embedded_chunks.append({
        "id": chunk["id"],
        "values": embedding,
        "metadata": {
            "text": chunk["content"],
            "source": chunk["source"]
        }
    })

# Upsert to vector database
index.upsert(vectors=embedded_chunks)
```

Use text-embedding-3-small (or similar, like Mistral Embed): it’s cheaper than large models, fast enough for real-time retrieval, and strong enough for semantic search. For most production use cases, you don’t need large embeddings.
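Under the hood, what the vector index computes is essentially cosine similarity between the query embedding and each stored chunk embedding. A dependency-free sketch with toy 3-dimensional vectors (real embeddings from text-embedding-3-small have 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings"; real ones are long float vectors.
query   = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # similar meaning -> similar direction
chunk_b = [0.0, 0.1, 0.9]   # unrelated meaning -> different direction

assert cosine_similarity(query, chunk_a) > cosine_similarity(query, chunk_b)
```

Identical directions score 1.0, orthogonal ones score 0.0; the vector database is just doing this comparison at scale with an approximate nearest-neighbor index.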

Step 3: Retrieval with Ranking

This is where most systems get it wrong. Retrieval finds candidates; ranking orders them by relevance. You need both.

```python
# Retrieval + ranking pipeline
import anthropic

claude = anthropic.Anthropic()  # separate client for the re-ranking step

def retrieve_and_rank(query, top_k=10, final_k=3):
    # Step 1: Vector search (recall-focused)
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    candidates = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Step 2: Re-ranking (precision-focused)
    # Use a small ranker model or LLM to re-score
    documents = [
        {
            "id": match["id"],
            "text": match["metadata"]["text"],
            "source": match["metadata"]["source"]
        }
        for match in candidates["matches"]
    ]

    # Score each candidate against the query
    scores = []
    for doc in documents:
        # Use Claude to check relevance
        response = claude.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=10,
            system="Rate relevance of this document to the query (1-10, number only).",
            messages=[{
                "role": "user",
                "content": f"Query: {query}\n\nDocument: {doc['text']}"
            }]
        )
        score = int(response.content[0].text.strip())
        scores.append((doc, score))

    # Return top-ranked documents
    ranked = sorted(scores, key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:final_k]]
```

This two-stage approach works because vector search is fast but imperfect. A second ranking step, even a simple one, dramatically improves what actually gets to the LLM. In production at AlgoVesta, adding re-ranking cut our irrelevant retrieval by ~60%.
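One LLM call per candidate adds real cost and latency, though. As an illustrative cheaper alternative (not from the pipeline above), a second stage can blend the vector score with plain lexical overlap; dedicated cross-encoder rerankers sit between the two in cost and quality.

```python
def lexical_overlap(query, text):
    # Fraction of query words that also appear in the document.
    q = set(query.lower().split())
    d = set(text.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(query, candidates, final_k=3, alpha=0.5):
    # candidates: list of {"text": ..., "vector_score": ...} dicts.
    # Blend vector similarity with lexical overlap; alpha is a tuning knob.
    scored = [
        (alpha * c["vector_score"] + (1 - alpha) * lexical_overlap(query, c["text"]), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:final_k]]

candidates = [
    {"text": "deprecated endpoints were removed in 2019", "vector_score": 0.71},
    {"text": "the users endpoint returns a paginated list", "vector_score": 0.70},
    {"text": "billing invoices are emailed monthly", "vector_score": 0.72},
]
top = rerank("which endpoint lists users", candidates, final_k=2)
```

Note how the document that actually mentions "users endpoint" jumps ahead of documents with marginally higher vector scores: that is the precision boost the second stage buys.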

Step 4: Prompt Construction

Now that you have relevant documents, how do you inject them into the prompt? Context matters enormously.

```python
# Bad prompt: too vague about context
bad_prompt = f"""
Answer this question: {user_question}

Here's some background:
{retrieved_documents}
"""

# Better prompt: explicit instructions for using context
better_prompt = f"""
You have access to the following documents:

---
{retrieved_documents}
---

Answer the user's question using ONLY information from the documents above.
If the documents don't contain enough information to answer, say so explicitly.
Always cite the source document when you use information from it.

Question: {user_question}
"""
```

The improvement: explicit instruction about what to do with the context. Tell the model to use only the documents provided, to say when information is missing, and to cite sources. Without these instructions, models will sometimes fall back on training data when documents are ambiguous.
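The better prompt is worth wrapping in a reusable function that also labels each document with its source, so citations have something concrete to point at. The function name and document format here are illustrative:

```python
def build_rag_prompt(question, docs):
    # docs: list of {"source": ..., "text": ...} dicts.
    # Label each document so the model can cite sources by name.
    blocks = "\n\n".join(f"[{d['source']}]\n{d['text']}" for d in docs)
    return (
        "You have access to the following documents:\n\n"
        f"---\n{blocks}\n---\n\n"
        "Answer the user's question using ONLY information from the "
        "documents above. If the documents don't contain enough "
        "information to answer, say so explicitly. Always cite the "
        "source document when you use information from it.\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "When was /v1/users deprecated?",
    [{"source": "api_changelog.md", "text": "/v1/users was deprecated in 2019."}],
)
```

The bracketed source labels give the model an unambiguous handle for citations, and they make spot-checking answers against sources trivial.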

Common RAG Patterns and When They Work

Three patterns dominate production RAG systems. Each solves different problems.

Pattern 1: Retrieval-Augmented Generation (Standard RAG)

How it works: Retrieve documents, add to prompt, generate response in one pass.

When it works: Factual lookup, Q&A over documentation, straightforward question-answering. Response time matters. You need a single coherent answer.

Latency: ~500ms-1s (embedding + retrieval + LLM call)

Example use case:

Batikan · 6 min read