Your LLM just confidently cited a research paper that doesn't exist. You asked it about your company's API docs, and it described endpoints that were deprecated in 2019. This happens because language models generate text based on patterns in training data, not by querying your actual information.

Retrieval Augmented Generation (RAG) fixes this. Not by making models smarter, but by giving them access to real data before they generate a response. The technique has become essential for production systems, but most implementations fail quietly: either returning irrelevant documents, or retrieving so much context that the model gets lost in it.

This guide walks through how RAG actually works, why basic setups fail, and the specific patterns that work in production.
What RAG Actually Does

RAG has three steps:
- Retrieve: Search your document store for content relevant to the user's question
- Augment: Inject the retrieved documents into the prompt alongside the original question
- Generate: The LLM responds using both the question and the retrieved context
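The three steps above can be sketched end to end. Everything here is illustrative, not from the original: the keyword-overlap retriever is a toy stand-in for real vector search, and the final LLM call is left as a comment.

```python
# Toy sketch of the Retrieve -> Augment -> Generate loop.
# The retriever here is a stand-in for real semantic search.

def retrieve(question, store, k=2):
    # Score each document by naive keyword overlap with the question.
    q_words = set(question.lower().split())
    scored = sorted(
        store,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(question, docs):
    # Inject the retrieved documents into the prompt with the question.
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

store = [
    "The /v2/orders endpoint replaced /v1/orders in 2020.",
    "Quarterly reports are published every March.",
    "Refunds are processed through the /v2/refunds endpoint.",
]
question = "Which endpoint handles orders?"
prompt = augment(question, retrieve(question, store))
# `prompt` now contains the order-related document; passing it to an
# LLM would complete the Generate step.
```

The point of the sketch: the model only ever sees `prompt`, so whatever the retriever puts there is the model's entire view of your data.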
The critical insight: the model is far less likely to hallucinate about your data because it's looking at your actual data while generating the response. It has the information right in front of it.

But here's where implementations diverge. A naive RAG system retrieves every vaguely relevant document, floods the context window, and the model gets lost in noise. A well-tuned RAG system retrieves exactly what matters, ranks it by relevance, and the model has clear guidance.
Why Basic RAG Fails: The Three Common Breakdowns

Before fixing anything, understand where it breaks.
Breakdown 1: Retrieval returns noise. You search your document database using basic keyword matching or simple embeddings and get back 10 results where only 2 are relevant. The LLM sees the noise and either gets confused or ignores the good information entirely. This is the most common failure mode.
Breakdown 2: Retrieved documents are too long. Even if retrieval works, you retrieve full documents (800+ tokens each). Your 4K context window is gone. Claude or GPT-4o wastes tokens parsing irrelevant sections and has no room for nuance in the response.
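The arithmetic behind that breakdown is worth making explicit. A quick back-of-the-envelope check, with numbers taken from the scenario above:

```python
# Back-of-the-envelope context budget check (illustrative numbers).
context_window = 4096   # tokens available in the window
retrieved_docs = 10     # documents returned by the retriever
tokens_per_doc = 800    # full, unchunked documents

used = retrieved_docs * tokens_per_doc
print(used)                    # 8000
print(used > context_window)   # True: the window overflows before the answer starts
```

Ten full documents nearly double the 4K budget before the model writes a single token of its answer, which is why chunking (Step 1 below) matters.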
Breakdown 3: The retrieval-generation mismatch. Your retrieval system finds documents optimized for one type of query (factual lookup), but the generation step needs documents structured differently (for reasoning, for examples, for edge cases). You're solving two different problems with one retriever.
Building a Production RAG Pipeline: Step by Step

A working RAG system needs four components: a document store, a retriever, a ranker, and a prompt structure. Let's build one.
Step 1: Document Ingestion and Chunking

Raw documents are too large to retrieve efficiently. You need to split them into chunks, typically 300-500 tokens each, with 10-20% overlap between chunks.
```python
# Example: chunking a document for RAG
documents = load_documents("company_docs/")

chunks = []
for doc in documents:
    text = doc.content
    # Naive approach: 400-character slices starting every 300 characters,
    # so consecutive chunks overlap by 100 characters
    for i in range(0, len(text), 300):
        chunk = text[i:i+400]
        chunks.append({
            "id": f"{doc.name}_chunk_{len(chunks)}",
            "content": chunk,
            "source": doc.name,
            "metadata": doc.metadata
        })

# Better approach: chunk at semantic boundaries
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)

chunks = []
for doc in documents:
    splits = splitter.split_text(doc.content)
    for split in splits:
        chunks.append({
            "id": f"{doc.name}_chunk_{len(chunks)}",
            "content": split,
            "source": doc.name
        })
```
The key here: don't split randomly. Use RecursiveCharacterTextSplitter to break at paragraph boundaries first, then sentences, then words. This keeps semantic units intact.
Step 2: Embedding and Storage

Convert each chunk into a vector embedding. This is what makes semantic search possible: similar meanings produce similar vectors, regardless of exact word matches.
```python
# Embedding chunks and storing them
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="your-key")
index = pc.Index("rag-index")

# Embed each chunk
embedded_chunks = []
for chunk in chunks:
    response = client.embeddings.create(
        input=chunk["content"],
        model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding
    embedded_chunks.append({
        "id": chunk["id"],
        "values": embedding,
        "metadata": {
            "text": chunk["content"],
            "source": chunk["source"]
        }
    })

# Upsert to vector database
index.upsert(vectors=embedded_chunks)
```
Use text-embedding-3-small (or similar, like Mistral Embed): it's cheaper than large models, fast enough for real-time retrieval, and strong enough for semantic search. For most production use cases, you don't need large embeddings.
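"Similar meanings produce similar vectors" concretely means a high cosine similarity between the embeddings. A minimal sketch with hand-made 3-d vectors (real embeddings such as text-embedding-3-small have 1536 dimensions, and the vectors below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" standing in for real 1536-d ones.
refund_policy = [0.9, 0.1, 0.0]
money_back    = [0.8, 0.2, 0.1]   # similar meaning, similar direction
lunch_menu    = [0.0, 0.1, 0.9]   # unrelated meaning

print(cosine_similarity(refund_policy, money_back) >
      cosine_similarity(refund_policy, lunch_menu))  # True
```

This is why "money back" can retrieve the refund-policy chunk even though the exact words never match: the vectors point the same way.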
Step 3: Retrieval with Ranking

This is where most systems get it wrong. Retrieval finds candidates; ranking orders them by relevance. You need both.
```python
# Retrieval + ranking pipeline
import anthropic

# `client` above is the OpenAI client; Claude calls need their own client
anthropic_client = anthropic.Anthropic()

def retrieve_and_rank(query, top_k=10, final_k=3):
    # Step 1: Vector search (recall-focused)
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    candidates = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Step 2: Re-ranking (precision-focused)
    # Use a small ranker model or LLM to re-score
    documents = [
        {
            "id": match["id"],
            "text": match["metadata"]["text"],
            "source": match["metadata"]["source"]
        }
        for match in candidates["matches"]
    ]

    # Score each candidate against the query
    scores = []
    for doc in documents:
        # Use Claude to check relevance
        response = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=10,
            system="Rate relevance of this document to the query (1-10, number only).",
            messages=[{
                "role": "user",
                "content": f"Query: {query}\nDocument: {doc['text']}"
            }]
        )
        score = int(response.content[0].text.strip())
        scores.append((doc, score))

    # Return top-ranked documents
    ranked = sorted(scores, key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:final_k]]
```
This two-stage approach works because vector search is fast but imperfect. A second ranking step, even a simple one, dramatically improves what actually gets to the LLM. In production at AlgoVesta, adding re-ranking cut our irrelevant retrieval by ~60%.
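The second stage doesn't have to be an LLM call. Even a crude term-overlap scorer captures the idea; this helper is a hypothetical stand-in for a real cross-encoder or LLM ranker, not part of any library:

```python
def lexical_rerank(query, candidates, final_k=3):
    # Re-score vector-search candidates by query-term overlap.
    # A cheap stand-in for a cross-encoder or LLM-based ranker.
    q_terms = set(query.lower().split())

    def score(doc):
        d_terms = set(doc["text"].lower().split())
        return len(q_terms & d_terms) / max(len(q_terms), 1)

    return sorted(candidates, key=score, reverse=True)[:final_k]

candidates = [
    {"text": "billing cycles run monthly", "source": "billing.md"},
    {"text": "the refund window is 30 days", "source": "refunds.md"},
    {"text": "contact support for refund status", "source": "support.md"},
]
top = lexical_rerank("what is the refund window", candidates, final_k=2)
print(top[0]["source"])  # refunds.md
```

Swapping this scorer for an LLM or cross-encoder changes only the `score` function; the two-stage shape stays the same.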
Step 4: Prompt Construction

Now that you have relevant documents, how do you inject them into the prompt? Context matters enormously.
```python
# Bad prompt: too vague about context
bad_prompt = f"""
Answer this question: {user_question}

Here's some background: {retrieved_documents}
"""

# Better prompt: explicit instructions for using context
better_prompt = f"""
You have access to the following documents:

---
{retrieved_documents}
---

Answer the user's question using ONLY information from the documents above.
If the documents don't contain enough information to answer, say so explicitly.
Always cite the source document when you use information from it.

Question: {user_question}
"""
```
The improvement: explicit instruction about what to do with the context. Tell the model to use only the documents provided, to say when information is missing, and to cite sources. Without these instructions, models will sometimes fall back on training data when documents are ambiguous.
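For the citation instruction to work, each chunk needs a visible source label when it's injected as `retrieved_documents`. One way to format them; the labeling scheme here is an assumption for illustration, not from the original:

```python
def format_context(docs):
    # Label each chunk with its source so the model can cite it.
    blocks = []
    for doc in docs:
        blocks.append(f"[source: {doc['source']}]\n{doc['text']}")
    return "\n---\n".join(blocks)

# Hypothetical retrieved chunks, shaped like retrieve_and_rank's output
docs = [
    {"source": "api_reference.md", "text": "POST /v2/orders creates an order."},
    {"source": "changelog.md", "text": "/v1/orders was deprecated in 2019."},
]
retrieved_documents = format_context(docs)
print("[source: changelog.md]" in retrieved_documents)  # True
```

Without labels like these in the context itself, "always cite the source" gives the model nothing concrete to cite.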
Common RAG Patterns and When They Work

Three patterns dominate production RAG systems. Each solves different problems.
Pattern 1: Retrieval-Augmented Generation (Standard RAG)

How it works: Retrieve documents, add to prompt, generate response in one pass.

When it works: Factual lookup, Q&A over documentation, straightforward question-answering. Response time matters. You need a single coherent answer.

Latency: ~500ms-1s (embedding + retrieval + LLM call)

Example use case: