Learning Lab · 11 min read

Ground Your LLM: RAG Architecture That Actually Works

RAG grounds LLMs in your own data, sharply reducing hallucinations. This guide covers the four-stage pipeline, embedding model choices that actually matter, advanced retrieval techniques, and the setup decisions that work in production. Includes code examples and a monitoring framework.

RAG Architecture That Works in Production

You built a chatbot last month. It confidently told your customer that you ship to Mongolia. You don’t. The LLM had no idea what it didn’t know, so it guessed.

This is the RAG problem — and the solution isn’t better prompting.

Retrieval-Augmented Generation (RAG) addresses a specific failure mode: when an LLM doesn’t have the context it needs to answer correctly, it hallucinates. RAG inserts your actual data into the conversation before the LLM generates a response. With the right context in the prompt, the model has far less room to guess. Simple enough in theory. Execution is where most implementations fail.

This guide walks through production RAG — why it works, where teams get stuck, how to build the architecture, and the specific setup decisions that make or break performance.

Why RAG Beats Fine-Tuning (and When It Doesn’t)

Before deciding on RAG, you need to know what you’re actually solving. There’s a fundamental difference between teaching an LLM and giving it data to reference.

Fine-tuning updates model weights. You feed the model thousands of examples of your data and expected outputs, and it learns patterns. The knowledge becomes part of the model. This costs money (hundreds to thousands per training run), takes time (hours to days), and permanently changes the model’s behavior. You can’t easily update it with new information without retraining.

RAG retrieves relevant context at inference time. You store your data in a separate database, and when a user asks a question, you fetch the relevant pieces and include them in the prompt. The model sees your current data on every request. You can update the data without retraining anything.

Here’s the practical choice:

  • Use RAG when: your data changes regularly (product docs, pricing, inventory, customer profiles), you need current information, you want to update without redeployment, or you’re not sure what the LLM needs to learn yet.
  • Use fine-tuning when: your data is stable and voluminous, the pattern is subtle (writing style, tone, specialized reasoning), and inference speed matters more than data freshness.
  • Use both when: you have stable domain knowledge (fine-tune) plus dynamic data (RAG). This is common in specialized domains — fine-tune on medical terminology, use RAG for patient histories.

Most teams should start with RAG. It’s cheaper to build, easier to iterate, and doesn’t require the data science depth that fine-tuning demands. The performance gap between well-tuned RAG and fine-tuning is narrower than it was two years ago — especially with Claude Sonnet 4 and GPT-4o, which have larger context windows.

The RAG Pipeline: Four Non-Negotiable Stages

A production RAG system has four stages. Skip one and the whole thing breaks.

1. Ingestion. You load your data (PDFs, web pages, databases, structured documents) and convert it into chunks small enough for embedding. A chunk is typically 300–800 tokens — a few paragraphs, not a book chapter. The goal is granularity: each chunk should answer a single question without requiring context from five other chunks.

2. Embedding. Each chunk gets converted to a vector — a list of numbers representing its semantic meaning. You use an embedding model (like OpenAI’s text-embedding-3-small or Cohere’s embed-english-v3.0) to create these vectors. The vectors live in a vector database (Pinecone, Weaviate, Milvus, or even PostgreSQL with pgvector extension). The embedding model is separate from your LLM.

3. Retrieval. When a user asks a question, you embed the question using the same embedding model, then search the vector database for the most similar chunks. Similarity is usually measured by cosine similarity — how closely the vectors point in the same direction. You typically retrieve the top 3–10 chunks, depending on how much context your LLM can handle.

4. Generation. You construct a prompt that includes the retrieved chunks, then send it to your LLM. The LLM generates a response based on both the question and the context. If the context is good, the response is accurate. If the context is wrong, the LLM will still hallucinate — just with higher confidence, because it has something to work with.
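The similarity search in stage 3 boils down to cosine similarity over vectors. Here is a toy, dependency-free sketch of what the vector database computes under the hood (real systems use approximate nearest-neighbor indexes rather than a full scan, so treat this as illustration, not implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_search(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The vector database does exactly this ranking — just over millions of vectors, with indexing tricks to avoid scoring every one.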

Here’s a concrete flow:

# Step 1: Ingestion (one-time, offline)
raw_documents = load_pdfs("./docs/")
chunks = split_into_chunks(raw_documents, chunk_size=500)
embeddings = [embedding_model.embed(chunk) for chunk in chunks]
vector_db.insert(chunks, embeddings)  # Store in Pinecone/Weaviate

# Step 2: Retrieval (per user query, real-time)
user_question = "What's your shipping policy to Canada?"
query_embedding = embedding_model.embed(user_question)
relevant_chunks = vector_db.search(query_embedding, top_k=5)

# Step 3: Generation
context = "\n\n".join([chunk.text for chunk in relevant_chunks])
prompt = f"""
You are a helpful assistant. Answer the user's question based on 
the context below. If the context doesn't contain the answer, 
say so explicitly.

Context:
{context}

Question: {user_question}
"""

response = llm.generate(prompt)  # Claude or GPT-4o
print(response)

This is the skeleton. Every production system adds complexity: handling ambiguous queries, re-ranking results, multi-step retrieval, fallback strategies, and monitoring for drift. But if you understand this four-stage pipeline, you understand RAG.

Embedding Models: Why Your Choice Matters

Most teams pick an embedding model once and forget about it. This is a mistake. Your embedding model determines what “relevant” means.

There are three main categories:

| Embedding Model | Strengths | Weaknesses | Cost (per 1M tokens) |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | Fast, good general-purpose performance, compact (1536 dims), scores 62.3 on MTEB | Proprietary, requires API key, re-embedding large corpora is slow | $0.02 |
| Cohere embed-english-v3.0 | Supports search tasks natively (can optimize for retrieval vs. semantics), strong MTEB performance, flexible | Higher latency than OpenAI small, cost adds up at scale | $0.10 |
| Sentence-transformers (open-source, local) | Free, runs on your hardware, no API calls; all-MiniLM-L6-v2 is tiny (384 dims) | Weaker semantic understanding than the best commercial models; benefits from a GPU at scale | $0 (infra cost) |

In 2024–2025, the choice usually comes down to:

  • Public SaaS (fast iteration, content diversity): OpenAI’s text-embedding-3-small. It’s cheap, it’s fast, and it’s been battle-tested. Use this unless you have a specific reason not to.
  • Private data (security, compliance, offline): sentence-transformers run locally, or another self-hosted open-source embedder (e.g., the BGE or E5 model families). You trade some quality for control.
  • Domain-specific retrieval (medical, legal, finance): Try a domain-specific model first (e.g., a BioBERT-derived embedder for medical text), then fall back to text-embedding-3-small if it underperforms.

Here’s a decision tree I use in practice: Start with OpenAI’s small model. If top-k retrieval misses obvious answers (the relevant chunk exists but doesn’t rank in top-5), switch to text-embedding-3-large. If latency becomes a problem and accuracy is still acceptable, downgrade to sentence-transformers and run it locally. Don’t optimize for cost first — optimize for correctness first, cost second.

When Basic Retrieval Fails: Advanced Ranking Techniques

You’ve built the pipeline. You retrieve the top-5 chunks. And sometimes they’re useless.

This happens because vector similarity doesn’t always match semantic relevance. Your query might be “How do I cancel?” and the retriever pulls back five chunks about refunds, when the user actually needs cancellation policy. Similar vectors, different intent.

Two techniques fix this:

Re-ranking. You retrieve more chunks than you need (say, top-20), then use a dedicated re-ranking model to re-score them. The re-ranker is far lighter than the main LLM, so it’s cheap to run. Models like Cohere’s rerank-english-v3.0 or Jina’s jina-reranker-v2 score each chunk’s relevance to the original query directly, rather than relying on vector similarity.

Example code:

# Retrieve more candidates
retrieved_chunks = vector_db.search(query_embedding, top_k=20)

# Re-rank them
from cohere import Client
co = Client(api_key="your-key")
reranked = co.rerank(
    query=user_question,
    documents=[chunk.text for chunk in retrieved_chunks],
    top_n=5,
    model="rerank-english-v3.0"
)

# Use the top-5 re-ranked results
final_chunks = [retrieved_chunks[result.index] for result in reranked.results]

Hybrid search. Don’t rely on semantic search alone. Combine it with keyword search (BM25). Semantic search finds “similar ideas,” keyword search finds exact terms. For a query like “shipping to Canada,” keyword search immediately finds chunks containing “Canada” and “shipping.” Semantic search might take a longer path.

Implement hybrid search like this:

# Semantic search (vector similarity)
semantic_results = vector_db.search(query_embedding, top_k=10)

# Keyword search (BM25)
keyword_results = vector_db.bm25_search(user_question, top_k=10)

# Merge and deduplicate by chunk ID (chunk objects aren't reliably hashable)
merged = list({chunk.id: chunk for chunk in semantic_results + keyword_results}.values())

# Re-rank the merged list
final_chunks = rerank(merged, user_question, top_n=5)

In production systems I’ve built (at AlgoVesta, for financial data), hybrid search catches ~15% more relevant results than semantic search alone. The extra query cost is minimal — BM25 is fast.
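A common refinement of the naive merge-and-deduplicate step is reciprocal rank fusion (RRF), which combines the two ranked lists by position rather than by raw score — useful because BM25 scores and cosine similarities live on incomparable scales. A minimal sketch (the document IDs are hypothetical; `k=60` is the conventional smoothing constant):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each ID is scored as sum(1 / (k + rank)) over every list it
    appears in, so items ranked highly by multiple retrievers win.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; duplicates are merged automatically
    return sorted(scores, key=scores.get, reverse=True)
```

You would feed it the ID lists from the semantic and keyword searches, then fetch the top fused IDs for re-ranking or generation.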

Chunking Strategies: Size, Overlap, and Why It Matters

Your embedding model sees one chunk at a time. A chunk that’s too small misses context. A chunk that’s too large buries the answer in noise.

There’s no universal chunk size. It depends on your data and your use case:

  • Chunk size 256–512 tokens: Good for Q&A datasets, FAQs, customer support docs. The answer fits in one chunk. English runs roughly 0.75 words per token, so 256 tokens ≈ 190 words — a paragraph or two.
  • Chunk size 512–1024 tokens: Good for technical documentation, product guides, policy documents. The full context is sometimes needed to answer a question — you need the introduction plus the specific section.
  • Chunk size 1024+ tokens: Good for dense, interconnected material (research papers, legal contracts). Context matters across sections, and you want to minimize splits that separate related ideas.

Test this locally before deploying. Here’s a quick test:

# Load a sample document
doc = load_pdf("sample.pdf")

# Test different chunk sizes
for chunk_size in [256, 512, 1024]:
    chunks = split_into_chunks(doc, size=chunk_size)
    print(f"Size {chunk_size}: {len(chunks)} chunks")
    
    # For each test query, check if the answer is fully contained
    # in the retrieved chunk (not split across two chunks)
    for query in test_queries:
        embedding = embedding_model.embed(query)
        top_chunk = vector_db.search(embedding, top_k=1)[0]
        answer_in_chunk = check_answer_coverage(query, top_chunk)
        print(f"  {query}: {'✓' if answer_in_chunk else '✗'}")

Overlap is another variable. If chunk A is tokens 0–512 and chunk B is tokens 513–1024, they’re adjacent with zero overlap. The boundary between them might split a sentence. Instead, use overlap: chunk A is 0–512, chunk B is 256–768. Now the overlapping 256–512 region is represented in both chunks, and sentence boundaries are less likely to cause retrieval misses.

Good default: chunk_size=800, overlap=200 — each chunk repeats 25% of its predecessor. For technical docs, increase overlap to 300 or 400. For FAQ-style data, reduce to 100.
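The sliding-window scheme above can be sketched directly. This toy version splits on words for simplicity — production chunkers split on tokens and respect sentence boundaries:

```python
def chunk_with_overlap(words, chunk_size=800, overlap=200):
    """Split a word list into chunks where each chunk shares
    `overlap` words with its predecessor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

With chunk_size=4 and overlap=2 on ten words, you get windows starting at positions 0, 2, 4, 6 — each sharing two words with its neighbor, so a sentence straddling a boundary still appears whole in at least one chunk.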

Building the Feedback Loop: Why Monitoring Beats Intuition

You deploy RAG. Users ask questions. How do you know if it’s working?

Intuition doesn’t scale. You need metrics.

Retrieval metrics (are you fetching the right chunks?):

  • Mean Reciprocal Rank (MRR): For each query, take the reciprocal of the position of the correct answer in your retrieved results — 1 if it’s the top result, 0.2 if it’s fifth, 0 if it’s missing — then average across all queries. Aim for >0.7 in production.
  • Normalized Discounted Cumulative Gain (NDCG): Similar to MRR but accounts for ranking quality across multiple results. More sophisticated, harder to calculate, but better for large result sets.
  • Hit rate (top-k): Percentage of queries where the correct answer appears in your top-k results. Simple and useful. Aim for >85% at k=5.
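Given a small labeled evaluation set of (ranked result IDs, correct chunk ID) pairs, both MRR and hit rate reduce to a few lines. A sketch — the `results` structure and ID scheme are hypothetical stand-ins for however you log retrievals:

```python
def evaluate_retrieval(results, k=5):
    """Compute MRR and hit rate@k.

    `results` is a list of (ranked_ids, gold_id) pairs, where
    ranked_ids is the ordered list of retrieved chunk IDs and
    gold_id is the chunk known to contain the answer.
    """
    reciprocal_ranks, hits = [], 0
    for ranked_ids, gold_id in results:
        if gold_id in ranked_ids:
            rank = ranked_ids.index(gold_id) + 1  # 1-based position
            reciprocal_ranks.append(1.0 / rank)
            if rank <= k:
                hits += 1
        else:
            reciprocal_ranks.append(0.0)  # miss contributes zero
    n = len(results)
    return {"mrr": sum(reciprocal_ranks) / n, "hit_rate": hits / n}
```

Run it weekly against a fixed query set and you have the trend line that intuition can’t give you.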

Generation metrics (does the LLM produce good responses given good context?):

  • BLEU / ROUGE: Measure overlap between generated response and expected response. Imperfect — two different correct answers score differently — but fast to compute at scale.
  • Semantic similarity: Embed both the generated response and the expected response, then measure cosine similarity. More forgiving than BLEU, but slower.
  • Human evaluation (spot checks): Every week, sample 10–20 queries, manually check if the LLM’s response is accurate. Slow but ground truth.

Here’s a minimal monitoring stack:

import json
from datetime import datetime

class RAGMonitor:
    def __init__(self):
        self.metrics = []
    
    def log_retrieval(self, query, retrieved_chunks, correct_chunk_id):
        """Log retrieval results for MRR calculation."""
        rank = next(
            (i for i, chunk in enumerate(retrieved_chunks) 
             if chunk.id == correct_chunk_id), 
            None
        )
        mrr = 1 / (rank + 1) if rank is not None else 0
        
        self.metrics.append({
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "rank": rank,
            "mrr": mrr,
            "type": "retrieval"
        })
    
    def log_generation(self, query, response, expected_response):
        """Log generation results, scored by embedding similarity."""
        from sentence_transformers import SentenceTransformer, util
        model = SentenceTransformer("all-MiniLM-L6-v2")  # cache this in production
        
        # Cosine similarity between generated and expected responses (-1 to 1)
        emb = model.encode([response, expected_response], convert_to_tensor=True)
        similarity = util.cos_sim(emb[0], emb[1]).item()
        
        self.metrics.append({
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "generated": response,
            "expected": expected_response,
            "similarity": similarity,
            "type": "generation"
        })
    
    def report(self):
        retrieval = [m for m in self.metrics if m["type"] == "retrieval"]
        generation = [m for m in self.metrics if m["type"] == "generation"]
        
        if retrieval:
            print(f"Retrieval MRR: {sum(m['mrr'] for m in retrieval) / len(retrieval):.3f}")
        if generation:
            print(f"Generation avg similarity: {sum(m['similarity'] for m in generation) / len(generation):.3f}")
        print(f"Total queries: {len(self.metrics)}")

The key: collect data from day one. A month of data tells you if your system is drifting. Six months tells you what happens when your data grows or user patterns change.

Common Failure Modes and How to Fix Them

RAG looks simple on paper. Here’s where it breaks:

1. The retrieved chunk is relevant but too generic. You retrieve a chunk that mentions the topic but doesn’t answer the specific question. Example: query is “Can I return items after 30 days?” and you retrieve a chunk that says “We have a return policy,” but not the specific timeframe.

Fix: Reduce chunk size. The chunk now contains just the return timeframe, not the whole returns section. Trade off: you might retrieve more chunks per query, but each one is more specific.

2. The embedding model doesn’t understand your domain. You’re in healthcare, and the model trained on general internet text doesn’t know that “EHR” and “electronic health record” are the same thing. Your chunks about EHR don’t match queries that say “electronic health record.”

Fix: Use a domain-specific embedding model (e.g., a BioBERT-derived embedder for biomedical text) or fine-tune an embedding model on domain pairs. Fine-tuning embeddings is far cheaper than fine-tuning LLMs — a few hundred labeled examples are often enough.

3. The LLM ignores the context and hallucinates anyway. You give the LLM perfect context, and it still makes up an answer. This usually happens when the LLM is uncertain — it has a bit of the answer and fills in the rest.

Fix: Use a chain-of-thought prompt that forces the LLM to cite the context. Instead of “Answer the question,” try “Answer the question based only on the context below. If the answer isn’t in the context, say ‘I don’t know.’ Quote the relevant sentence that supports your answer.”

4. Latency creep. You started with 5 chunks retrieved. Now you’re retrieving 20, re-ranking, and doing semantic search + keyword search. The whole pipeline takes 3 seconds. For a chat interface, that’s unacceptable.

Fix: Profile each stage. Is embedding slow? Use a smaller embedding model. Is retrieval slow? Switch to Pinecone or Weaviate (they’re optimized for this). Is re-ranking slow? Do it asynchronously, return top-5 while re-ranking in the background. For AlgoVesta’s trading data retrieval, we cache the top embeddings for common queries and do real-time updates only for new data — 80% latency reduction.
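One cheap version of the caching fix: memoize query embeddings in memory, so repeated queries skip the embedding call entirely. A sketch — `embed_fn` stands in for whatever embedding client you use, and a production version would bound the cache size and add TTL eviction:

```python
class CachedEmbedder:
    """Wrap any embedding function with an in-memory cache keyed
    on the query text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.hits = 0  # track savings for monitoring

    def embed(self, text):
        if text in self.cache:
            self.hits += 1
        else:
            self.cache[text] = self.embed_fn(text)
        return self.cache[text]
```

Wrap your real embedding call (`CachedEmbedder(embedding_model.embed)`) and every repeated query — which in support chatbots is a large fraction of traffic — costs zero embedding latency.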

5. Vector database costs explode. Pinecone and similar services charge per vector stored. You have millions of chunks. The bill is painful.

Fix: Self-host an open-source option (Milvus, Weaviate, Qdrant), or use PostgreSQL with the pgvector extension. For smaller datasets (<1M vectors), PostgreSQL + pgvector is cheap and sufficient. For larger datasets, self-hosted Milvus on Kubernetes costs nothing in licensing but takes engineering effort.

Production Checklist: What You Need Before Launch

RAG in development is different from RAG in production. Here’s the checklist:

  • [ ] You’ve tested chunk size against your actual data with actual test queries. Retrieval hit rate ≥85% at top-5.
  • [ ] You’ve chosen an embedding model and measured baseline performance. Switching embedding models later is expensive (requires re-embedding all data).
  • [ ] You have a data update process. When docs change, your RAG system is updated within X hours. Is this automated or manual?
  • [ ] You’ve built a fallback. If retrieval fails, what does the LLM do? Hallucinate? Refuse to answer? (Refuse is better.)
  • [ ] You have monitoring in place. MRR and hit rate on a weekly dashboard. You can spot performance drops before users complain.
  • [ ] You’ve tested against adversarial queries. What happens when a user asks something your data doesn’t cover? Does the system say “I don’t know” or confidently guess?
  • [ ] You have a cost model. Embedding cost + vector DB cost + LLM cost per request. Is it sustainable?

RAG Stack Comparison: What to Use in 2025

| Component | Option 1 | Option 2 | Option 3 |
| --- | --- | --- | --- |
| Embedding Model | OpenAI text-embedding-3-small ($0.02/1M tokens, proprietary) | Cohere embed-english-v3.0 ($0.10/1M tokens, native search optimization) | sentence-transformers (free, open-source, runs locally) |
| Vector Database | Pinecone ($0.04–0.40 per 1M vectors/month, managed) | PostgreSQL + pgvector (self-hosted, free software, slower) | Milvus (open-source, Kubernetes, fast, ops overhead) |
| Re-ranker | Cohere rerank-english-v3.0 ($1.00 per 1M tokens) | jina-reranker-v2 (open-source, runs locally) | No re-ranker (faster, lower accuracy) |
| LLM | Claude Sonnet 4 ($3/$15 per 1M in/out tokens, long context) | GPT-4o ($5/$15 per 1M in/out tokens, faster) | Open-source (Llama 3 70B, free to run, slower) |

My recommendation for most teams: OpenAI embeddings + Pinecone + no re-ranker initially (add re-ranking if hit rate drops below 80%) + Claude Sonnet 4. This combination is straightforward, has good performance, and costs scale with usage. If you’re cost-conscious and can handle engineering overhead, go PostgreSQL + pgvector + sentence-transformers + GPT-4o.

What to Build This Week

Start with a 30-minute prototype. Don’t overthink it.

  1. Pick a small dataset you know well (your own documentation, a public dataset, anything under 100 documents).
  2. Load it, chunk it (500-token chunks, 20% overlap), and embed it using OpenAI’s text-embedding-3-small.
  3. Store the embeddings in a simple vector DB (Pinecone free tier or local Weaviate in Docker).
  4. Write a function that takes a user question, retrieves the top-5 chunks, and passes them to Claude with a grounding prompt (the one shown earlier in this article).
  5. Test it with 10 questions you know the answer to. Do the retrieved chunks actually help the LLM answer correctly?

This gives you ground truth on your own data. From there, you’ll know what to optimize: embedding quality, chunk size, re-ranking, or LLM choice.

Batikan
