Your LLM just confidently cited a research paper that doesn't exist. You asked it about your company's API docs, and it described endpoints that were deprecated in 2019. This happens because language models generate text based on patterns in training data, not by querying your actual information.

Retrieval Augmented Generation (RAG) fixes this. Not by making models smarter, but by giving them access to real data before they generate a response. The technique has become essential for production systems, but most implementations fail quietly: either returning irrelevant documents, or retrieving so much context that the model gets lost in it.

This guide walks through how RAG actually works, why basic setups fail, and the specific patterns that work in production.
What RAG Actually Does

RAG has three steps:
- Retrieve: Search your document store for content relevant to the user's question
- Augment: Inject the retrieved documents into the prompt alongside the original question
- Generate: The LLM responds using both the question and the retrieved context
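The three steps above can be sketched end to end. Everything here is illustrative, not from the original: the keyword-overlap retriever is a toy stand-in for real vector search, and the final LLM call is left as a comment.

```python
# Toy sketch of the Retrieve -> Augment -> Generate loop.
# The retriever here is a stand-in for real semantic search.

def retrieve(question, store, k=2):
    # Score each document by naive keyword overlap with the question.
    q_words = set(question.lower().split())
    scored = sorted(
        store,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(question, docs):
    # Inject the retrieved documents into the prompt with the question.
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

store = [
    "The /v2/orders endpoint replaced /v1/orders in 2020.",
    "Quarterly reports are published every March.",
    "Refunds are processed through the /v2/refunds endpoint.",
]
question = "Which endpoint handles orders?"
prompt = augment(question, retrieve(question, store))
# `prompt` now contains the order-related document; passing it to an
# LLM would complete the Generate step.
```

The point of the sketch: the model only ever sees `prompt`, so whatever the retriever puts there is the model's entire view of your data.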
The critical insight: the model is far less likely to hallucinate about your data because it's looking at your actual data while generating the response. It has the information right in front of it.

But here's where implementations diverge. A naive RAG system retrieves every vaguely relevant document, floods the context window, and the model gets lost in noise. A well-tuned RAG system retrieves exactly what matters, ranks it by relevance, and the model has clear guidance.
Why Basic RAG Fails: The Three Common Breakdowns

Before fixing anything, understand where it breaks.
Breakdown 1: Retrieval returns noise. You search your document database using basic keyword matching or simple embeddings and get back 10 results where only 2 are relevant. The LLM sees the noise and either gets confused or ignores the good information entirely. This is the most common failure mode.
Breakdown 2: Retrieved documents are too long. Even if retrieval works, you retrieve full documents (800+ tokens each). Your 4K context window is gone. Claude or GPT-4o wastes tokens parsing irrelevant sections and has no room for nuance in the response.
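The arithmetic behind that breakdown is worth making explicit. A quick back-of-the-envelope check, with numbers taken from the scenario above:

```python
# Back-of-the-envelope context budget check (illustrative numbers).
context_window = 4096   # tokens available in the window
retrieved_docs = 10     # documents returned by the retriever
tokens_per_doc = 800    # full, unchunked documents

used = retrieved_docs * tokens_per_doc
print(used)                    # 8000
print(used > context_window)   # True: the window overflows before the answer starts
```

Ten full documents nearly double the 4K budget before the model writes a single token of its answer, which is why chunking (Step 1 below) matters.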
Breakdown 3: The retrieval-generation mismatch. Your retrieval system finds documents optimized for one type of query (factual lookup), but the generation step needs documents structured differently (for reasoning, for examples, for edge cases). You're solving two different problems with one retriever.
Building a Production RAG Pipeline: Step by Step

A working RAG system needs four components: a document store, a retriever, a ranker, and a prompt structure. Let's build one.
Step 1: Document Ingestion and Chunking

Raw documents are too large to retrieve efficiently. You need to split them into chunks, typically 300-500 tokens each, with 10-20% overlap between chunks.
```python
# Example: chunking a document for RAG
documents = load_documents("company_docs/")

chunks = []
for doc in documents:
    text = doc.content
    # Naive approach: 400-character slices starting every 300 characters,
    # so consecutive chunks overlap by 100 characters
    for i in range(0, len(text), 300):
        chunk = text[i:i+400]
        chunks.append({
            "id": f"{doc.name}_chunk_{len(chunks)}",
            "content": chunk,
            "source": doc.name,
            "metadata": doc.metadata
        })

# Better approach: chunk at semantic boundaries
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)

chunks = []
for doc in documents:
    splits = splitter.split_text(doc.content)
    for split in splits:
        chunks.append({
            "id": f"{doc.name}_chunk_{len(chunks)}",
            "content": split,
            "source": doc.name
        })
```
The key here: don't split randomly. Use RecursiveCharacterTextSplitter to break at paragraph boundaries first, then sentences, then words. This keeps semantic units intact.
Step 2: Embedding and Storage

Convert each chunk into a vector embedding. This is what makes semantic search possible: similar meanings produce similar vectors, regardless of exact word matches.
```python
# Embedding chunks and storing them
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="your-key")
index = pc.Index("rag-index")

# Embed each chunk
embedded_chunks = []
for chunk in chunks:
    response = client.embeddings.create(
        input=chunk["content"],
        model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding
    embedded_chunks.append({
        "id": chunk["id"],
        "values": embedding,
        "metadata": {
            "text": chunk["content"],
            "source": chunk["source"]
        }
    })

# Upsert to vector database
index.upsert(vectors=embedded_chunks)
```
Use text-embedding-3-small (or similar, like Mistral Embed): it's cheaper than large models, fast enough for real-time retrieval, and strong enough for semantic search. For most production use cases, you don't need large embeddings.
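"Similar meanings produce similar vectors" concretely means a high cosine similarity between the embeddings. A minimal sketch with hand-made 3-d vectors (real embeddings such as text-embedding-3-small have 1536 dimensions, and the vectors below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" standing in for real 1536-d ones.
refund_policy = [0.9, 0.1, 0.0]
money_back    = [0.8, 0.2, 0.1]   # similar meaning, similar direction
lunch_menu    = [0.0, 0.1, 0.9]   # unrelated meaning

print(cosine_similarity(refund_policy, money_back) >
      cosine_similarity(refund_policy, lunch_menu))  # True
```

This is why "money back" can retrieve the refund-policy chunk even though the exact words never match: the vectors point the same way.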
Step 3: Retrieval with Ranking

This is where most systems get it wrong. Retrieval finds candidates; ranking orders them by relevance. You need both.
```python
# Retrieval + ranking pipeline
import anthropic

# `client` above is the OpenAI client; Claude calls need their own client
anthropic_client = anthropic.Anthropic()

def retrieve_and_rank(query, top_k=10, final_k=3):
    # Step 1: Vector search (recall-focused)
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    candidates = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Step 2: Re-ranking (precision-focused)
    # Use a small ranker model or LLM to re-score
    documents = [
        {
            "id": match["id"],
            "text": match["metadata"]["text"],
            "source": match["metadata"]["source"]
        }
        for match in candidates["matches"]
    ]

    # Score each candidate against the query
    scores = []
    for doc in documents:
        # Use Claude to check relevance
        response = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=10,
            system="Rate relevance of this document to the query (1-10, number only).",
            messages=[{
                "role": "user",
                "content": f"Query: {query}\nDocument: {doc['text']}"
            }]
        )
        score = int(response.content[0].text.strip())
        scores.append((doc, score))

    # Return top-ranked documents
    ranked = sorted(scores, key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:final_k]]
```
This two-stage approach works because vector search is fast but imperfect. A second ranking step, even a simple one, dramatically improves what actually gets to the LLM. In production at AlgoVesta, adding re-ranking cut our irrelevant retrieval by ~60%.
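The second stage doesn't have to be an LLM call. Even a crude term-overlap scorer captures the idea; this helper is a hypothetical stand-in for a real cross-encoder or LLM ranker, not part of any library:

```python
def lexical_rerank(query, candidates, final_k=3):
    # Re-score vector-search candidates by query-term overlap.
    # A cheap stand-in for a cross-encoder or LLM-based ranker.
    q_terms = set(query.lower().split())

    def score(doc):
        d_terms = set(doc["text"].lower().split())
        return len(q_terms & d_terms) / max(len(q_terms), 1)

    return sorted(candidates, key=score, reverse=True)[:final_k]

candidates = [
    {"text": "billing cycles run monthly", "source": "billing.md"},
    {"text": "the refund window is 30 days", "source": "refunds.md"},
    {"text": "contact support for refund status", "source": "support.md"},
]
top = lexical_rerank("what is the refund window", candidates, final_k=2)
print(top[0]["source"])  # refunds.md
```

Swapping this scorer for an LLM or cross-encoder changes only the `score` function; the two-stage shape stays the same.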
Step 4: Prompt Construction

Now that you have relevant documents, how do you inject them into the prompt? Context matters enormously.
```python
# Bad prompt: too vague about context
bad_prompt = f"""
Answer this question: {user_question}

Here's some background: {retrieved_documents}
"""

# Better prompt: explicit instructions for using context
better_prompt = f"""
You have access to the following documents:

---
{retrieved_documents}
---

Answer the user's question using ONLY information from the documents above.
If the documents don't contain enough information to answer, say so explicitly.
Always cite the source document when you use information from it.

Question: {user_question}
"""
```
The improvement: explicit instruction about what to do with the context. Tell the model to use only the documents provided, to say when information is missing, and to cite sources. Without these instructions, models will sometimes fall back on training data when documents are ambiguous.
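For the citation instruction to work, each chunk needs a visible source label when it's injected as `retrieved_documents`. One way to format them; the labeling scheme here is an assumption for illustration, not from the original:

```python
def format_context(docs):
    # Label each chunk with its source so the model can cite it.
    blocks = []
    for doc in docs:
        blocks.append(f"[source: {doc['source']}]\n{doc['text']}")
    return "\n---\n".join(blocks)

# Hypothetical retrieved chunks, shaped like retrieve_and_rank's output
docs = [
    {"source": "api_reference.md", "text": "POST /v2/orders creates an order."},
    {"source": "changelog.md", "text": "/v1/orders was deprecated in 2019."},
]
retrieved_documents = format_context(docs)
print("[source: changelog.md]" in retrieved_documents)  # True
```

Without labels like these in the context itself, "always cite the source" gives the model nothing concrete to cite.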
Common RAG Patterns and When They Work

Three patterns dominate production RAG systems. Each solves different problems.
Pattern 1: Retrieval-Augmented Generation (Standard RAG)

How it works: Retrieve documents, add to prompt, generate response in one pass.

When it works: Factual lookup, Q&A over documentation, straightforward question-answering. Response time matters. You need a single coherent answer.

Latency: ~500ms-1s (embedding + retrieval + LLM call)

Example use case: