You have a problem your off-the-shelf LLM won’t solve. Claude or GPT-4 handles basic tasks fine, but for your specific use case — extracting data from contract clauses, grading essays in your curriculum, or classifying customer support tickets with 95% accuracy — the standard model drifts. So now you’re choosing between three paths: fine-tuning the model, engineering better prompts, or building a RAG system. Each one works. Each one fails for different reasons. The difference between picking right and picking wrong is the difference between shipping this week and sinking three months into a solution that costs too much and runs too slow.
Know What You’re Actually Trying to Fix
Before you commit to any approach, diagnose the actual problem.
Run 20-30 examples through your baseline model. Save the failures. Look at the pattern. Does the model:
- Hallucinate facts it doesn’t know? RAG solves this. Fine-tuning won’t.
- Misunderstand your domain’s specific terminology or logic? Fine-tuning or very specific prompting fixes this.
- Forget instructions halfway through a long response? Prompt engineering (structure, token optimization) fixes this. Sometimes RAG helps.
- Make sloppy mistakes in formatting or reasoning it clearly understands? Prompt engineering almost always fixes this first.
I’ve watched teams spend $40k fine-tuning a model to handle hallucinations when a RAG layer would have cost $800 and shipped in two weeks. Diagnosis comes before prescription.
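The diagnosis step above is easy to automate. Here's a minimal triage sketch — `call_model`, `classify_failure`, and the category labels are placeholders you'd swap for your own client and failure taxonomy:

```python
from collections import Counter

# Hypothetical failure categories, mirroring the bullets above.
CATEGORIES = ["hallucination", "domain_misunderstanding",
              "instruction_drift", "formatting"]

def triage(examples, call_model, classify_failure):
    """Run each example through the baseline model, save the failures,
    and tally them by category so the dominant pattern is obvious."""
    failures = []
    for ex in examples:
        output = call_model(ex["input"])
        if output != ex["expected"]:
            failures.append({"input": ex["input"],
                             "output": output,
                             "category": classify_failure(ex, output)})
    return failures, Counter(f["category"] for f in failures)
```

Whichever category dominates the counter tells you which of the three paths to try first.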
Prompt Engineering: Your First Stop
Start here. Always.
Prompt engineering is free (in compute terms), reversible, and solves 60% of problems people think require fine-tuning. A disciplined approach takes four hours, not four weeks.
Here’s a realistic before/after:
```
# Bad prompt
Extract the key contract terms from this document:

{document_text}
```
Failure mode: The model picks random terms, misses critical dates, confuses effective dates with termination dates.
```
# Improved prompt
You are a legal contract analyst. Extract only the following terms from the contract:

- Effective Date (when the contract begins)
- Termination Date (when it ends)
- Renewal Terms (automatic or manual)
- Payment Amount (total contract value)
- Payment Schedule (when payments are made)

For each term:
1. Find the exact clause reference (e.g., "Section 3.2")
2. Quote the relevant text
3. Provide the extracted value
4. If a term is not present, write "NOT FOUND" — do not guess

Document:
{document_text}

Analysis:
```
Better. Not magic — you’ve just removed ambiguity and added explicit instructions for handling missing data. In practice, this approach drops extraction errors by 40-60% on structured tasks. Use grounding prompts (feeding the model context it needs) and chain-of-thought (making the model show its work) before you think about fine-tuning.
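A chain-of-thought version of the same task can be as simple as making the reasoning steps explicit before the answer. This is one illustrative shape, not a fixed recipe:

```
You are a legal contract analyst. Before giving your final answer,
reason step by step:

1. List every date mentioned in the contract and the clause it appears in.
2. For each date, decide whether it is an effective date, a termination
   date, or something else, and explain why.
3. Only then state the final extracted values.

Document:
{document_text}
```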
When prompt engineering alone fails: You’ve optimized the prompt, tested it across 100+ examples, and you’re still getting 75% accuracy when you need 95%. That’s your signal to move to the next layer.
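The "tested it across 100+ examples" step is just an accuracy count. A minimal harness, assuming you supply your own `call_model` and correctness check:

```python
def measure_accuracy(examples, call_model, is_correct):
    """Fraction of labeled examples where the model output passes `is_correct`.

    `examples` is a list of {"input": ..., "expected": ...} dicts;
    `call_model` and `is_correct` are whatever your stack provides.
    """
    hits = sum(1 for ex in examples
               if is_correct(call_model(ex["input"]), ex["expected"]))
    return hits / len(examples)
```

If this number plateaus well below your target after several prompt iterations, that's the signal to move on.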
RAG: For Knowledge Your Model Doesn’t Have
RAG (Retrieval-Augmented Generation) works by feeding the model relevant context it doesn’t have in its training data. You give it a vector database of your domain documents, and before each response, the system retrieves the most relevant passages.
Build RAG when:
- Your model needs to cite specific sources or answer questions about your proprietary data
- Your knowledge base changes monthly (contracts, policies, product docs)
- You need to reduce hallucinations on factual questions
- You want the model to say “I don’t know” rather than guess
Here’s a practical Python example using Claude and a mock retrieval function:
```python
import anthropic

def retrieve_context(query: str) -> list[str]:
    # In production, this queries a vector database (Pinecone, Weaviate, etc.)
    # For the demo, match simple keywords against mock documents.
    docs = {
        ("refund", "return"): "Refunds are allowed within 30 days of purchase.",
        ("shipping", "delivery"): "Standard shipping takes 5-7 business days.",
    }
    query_lower = query.lower()
    return [text for keywords, text in docs.items()
            if any(k in query_lower for k in keywords)]

def rag_query(user_question: str) -> str:
    client = anthropic.Anthropic()

    # Retrieve relevant documents
    retrieved_docs = retrieve_context(user_question)
    context = "\n".join(retrieved_docs) if retrieved_docs else "No relevant documents found."

    # Build prompt with context
    prompt = f"""Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say 'I don't have that information.'

Context:
{context}

Question: {user_question}"""

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

# Test
print(rag_query("Can I return something I bought last week?"))
# Expected: an answer grounded in the 30-day refund policy
```
RAG’s strength: It’s cheap to maintain, handles fresh data without retraining, and cuts hallucinations on factual tasks by 70-90%. Its weakness: It only helps if your problem is missing knowledge, not reasoning or format.
Fine-Tuning: When the Basics Aren’t Enough
Fine-tuning teaches a model a new skill or consistent style it doesn’t naturally exhibit. This costs time and money: training runs are billed separately, and fine-tuned models typically carry a per-token premium over base inference rates. For reference, Claude 3.5 Sonnet’s base API pricing is $3 per 1M input tokens and $15 per 1M output tokens — check your provider’s current fine-tuning pricing before committing.
Fine-tune when:
- You need the model to follow a very specific output format (XML structure, JSON with exact keys) consistently
- The model needs to learn domain-specific reasoning (e.g., grading rubrics your teaching platform uses)
- Prompt engineering has plateaued — you’re at 85% accuracy and stuck there
- You need this behavior across 500+ API calls per month (the cost-benefit math starts to favor it)
Fine-tuning will NOT fix hallucinations. It will NOT teach the model facts it doesn’t know. It WILL teach it patterns.
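Fine-tuning runs on labeled prompt/completion pairs, usually supplied as JSONL. The exact schema varies by provider, so treat this shape — and the grading-rubric examples in it — as an illustration, not a spec:

```python
import json

# Hypothetical training pairs for the essay-grading use case above.
# Each pair shows the model the exact output pattern you want it to learn.
examples = [
    {"prompt": "Grade this essay against rubric item 2 (thesis clarity): ...",
     "completion": '{"score": 3, "rationale": "Thesis stated but not developed."}'},
    {"prompt": "Grade this essay against rubric item 4 (evidence): ...",
     "completion": '{"score": 4, "rationale": "Two primary sources cited correctly."}'},
]

with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Note that the completions demonstrate a consistent JSON output pattern — exactly the kind of thing fine-tuning teaches — not new facts.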
Here’s the cost math: If you run 1,000 inference calls/month with Claude 3.5 Sonnet, each call ~2,000 output tokens, that’s 2M tokens/month × $15/1M = $30/month in inference cost. Fine-tuning setup is ~$100-300. Payoff happens around month 5 if fine-tuning cuts latency or improves output quality enough to skip human review on 30% of responses. Below that threshold, it’s not worth it.
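The arithmetic, spelled out with the article's own figures:

```python
# Monthly inference cost at the article's assumed volumes and rates.
calls_per_month = 1_000
output_tokens_per_call = 2_000
price_per_million_output = 15.0  # USD, Claude 3.5 Sonnet output rate

tokens = calls_per_month * output_tokens_per_call        # 2,000,000 tokens
inference_cost = tokens / 1_000_000 * price_per_million_output
print(inference_cost)  # 30.0 (USD/month)
```

At $30/month of inference, a $100-300 fine-tuning run only pays for itself through quality gains (skipped reviews), not through token savings.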
The Decision Matrix
| Problem | First Try | If That Fails |
|---|---|---|
| Model makes up facts | RAG + grounding prompt | Upgrade model (GPT-4o vs Sonnet) |
| Wrong output format | Structured prompt + examples | Fine-tuning or JSON mode |
| Misses domain concepts | Few-shot prompting + definitions | Fine-tuning (150+ labeled examples) |
| Forgets instructions mid-response | Prompt engineering (token optimization) | Break task into smaller steps |
| Slow inference latency | Use smaller model (Sonnet vs Opus) | Fine-tune or local model (Mistral 7B) |
What to Do Today
Take your hardest failing case — the one you’ve been thinking about fine-tuning or RAG-ing. Run it through a properly structured prompt first. Use the contract extraction template above if it’s a data problem, or write a chain-of-thought prompt if it’s a reasoning problem. Test it against 20 examples. Count the accuracy.
If you hit 90%+ accuracy with prompt engineering alone, stop. You’re done. If you’re stuck at 75-85%, diagnose: Is the model missing knowledge (RAG) or missing a skill (fine-tuning)? That distinction will save you weeks.