Learning Lab · 13 min read

When to Fine-tune, Prompt, or Use RAG — A Decision Framework

Fine-tuning, prompt engineering, and RAG solve different problems at different costs. This framework shows you which one actually fits your constraints — with real data, decision matrices, and production-tested workflows.


Three months into building AlgoVesta’s trading signal extraction layer, we burned $8,000 on fine-tuning GPT-3.5 across 15,000 financial documents. The accuracy improved 3%. Then we switched to a different prompting approach and RAG architecture — same data, same models, different structure. Accuracy jumped to 89%. The problem wasn’t the technique. It was using the wrong technique for the problem.

Fine-tuning, prompt engineering, and retrieval-augmented generation (RAG) are not interchangeable tools competing for the same job. They solve different problems, operate at different costs, and carry different maintenance burdens. Most teams pick one because it sounds familiar or because they read a tutorial, not because it matches their actual constraints.

This framework cuts through that. It’s built on repeated production failures — mine and others — not on theoretical comparisons.

The Core Differences: What Each Actually Does

Start here because the distinctions matter.

Prompt engineering shapes the model’s behavior without changing its weights. You write instructions, structure examples, and design the input format. The model itself stays frozen. Cost is per-token, latency is predictable, and changes take seconds to deploy.

Fine-tuning updates the model’s weights by training it on your specific data. You create a dataset, run a training job (hours to days), and deploy a new model version. Cost includes training time upfront, then per-token inference. It’s permanent unless you retrain.

RAG doesn’t modify the model at all. It retrieves relevant context from an external database, injects it into the prompt, and lets the base model respond with grounded information. Cost scales with the amount of context injected; latency depends on retrieval speed. It scales with your knowledge base size, not your training data.

The distinction looks obvious when written out. In practice, teams confuse them constantly. A common error: someone trains a fine-tuned model to “improve accuracy” when the real problem is that the prompt wasn’t specific enough. Another: building a RAG system when fine-tuning would be faster and cheaper.

When Prompt Engineering Is Enough (And When It Stops Working)

Prompt engineering is your first move. Always.

It works when the base model already understands the concept and just needs better instructions. It costs nothing upfront. You get feedback in seconds. You can iterate 20 times in an hour. For classification, summarization, and structured extraction from text the model has seen during pretraining, prompt engineering often solves 70–85% of the problem.

Here’s a real before/after from extracting risk factors from SEC filings using Claude Sonnet 4:

## Bad prompt
Extract the main risk factors from this document.

## Output
- General market conditions could impact our business
- We face competition from other companies
- Regulations may change
- We depend on key personnel
(Vague, misses specificity, conflates severity levels)

## Improved prompt
From the attached 10-K filing, extract all risk factors that appear
in the "Item 1A. Risk Factors" section. For each risk:

1. Include the exact heading from the filing
2. Provide a 1-sentence core claim (what bad thing could happen)
3. Flag if the company quantifies likelihood or impact (if so, include it)
4. Mark as: MARKET / OPERATIONAL / REGULATORY / STRATEGIC / FINANCIAL
5. If the filing explicitly states this risk declined vs prior year,
   note that

Omit boilerplate risks that appear in every 10-K (e.g., "general
economic conditions"). Focus on risks specific to this company's business.

## Output
- Cybersecurity Breaches and Data Losses (OPERATIONAL):
  Could compromise customer data and trigger regulatory action.
  Company estimated exposure at "up to $500M" if "material breach"
  occurred.
- Loss of Key Partnerships (STRATEGIC):
  Specific to this company. Two of top 5 revenue channels depend
  on contracts renewing annually.
(Specific, hierarchical, sortable)

The second prompt works because it:

  • Specifies output structure (the model follows structured instructions better than vague requests)
  • Adds negative examples (what NOT to include helps as much as what to include)
  • Creates a classification system (MARKET/OPERATIONAL/etc. — this gives the model a decision tree)
  • Quantifies precision (“1-sentence core claim” not “summarize the risk”)
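
The four rules above can be encoded as a reusable template so every extraction request follows the same structure. This is a hypothetical sketch — the function name, parameters, and boilerplate examples are illustrative, not from the original workflow:

```python
RISK_CATEGORIES = ["MARKET", "OPERATIONAL", "REGULATORY", "STRATEGIC", "FINANCIAL"]

def build_risk_extraction_prompt(section_name: str, boilerplate_examples: list[str]) -> str:
    # Encodes the rules above: structured output, a fixed category set,
    # quantified precision, and explicit negative examples.
    categories = " / ".join(RISK_CATEGORIES)
    negatives = "\n".join(f'- "{ex}"' for ex in boilerplate_examples)
    return (
        f'From the attached 10-K filing, extract all risk factors in the '
        f'"{section_name}" section. For each risk:\n'
        f"1. Include the exact heading from the filing\n"
        f"2. Provide a 1-sentence core claim (what bad thing could happen)\n"
        f"3. Flag if the company quantifies likelihood or impact (if so, include it)\n"
        f"4. Mark as: {categories}\n"
        f"5. Note if the filing explicitly states this risk declined vs prior year\n\n"
        f"Omit boilerplate risks such as:\n{negatives}\n"
        f"Focus on risks specific to this company's business."
    )

prompt = build_risk_extraction_prompt(
    "Item 1A. Risk Factors",
    ["general economic conditions", "competition from other companies"],
)
```

The payoff is consistency: every query hits the model with the same decision tree, so you iterate on one template instead of dozens of ad-hoc prompts.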

Prompt engineering stops working when:

  • The concept is outside the model’s training data. You can’t prompt your way around this. If you ask GPT-4o to extract domain-specific jargon from medical imaging reports and the model never saw that data during pretraining, it will guess.
  • You need consistent behavior across 10,000+ similar queries. Prompt variance compounds. Some requests will hit edge cases and the prompt won’t handle them. You need systematic retraining of behavior.
  • Your accuracy floor is ~85–90% but you need 96%+. Prompt engineering plateaus. Small gains after that require either fine-tuning or architectural changes (like RAG).
  • The task requires facts not in the training data. If you’re extracting 2024 product pricing and your model was trained through April 2023, prompting won’t help. RAG will.

RAG: When Your Problem Is Missing or Outdated Context

RAG solves a specific, common problem: the model knows how to think but lacks the facts it needs.

You build a RAG system when:

  • Your knowledge updates faster than model training cycles. If you retrain a fine-tuned model every week to add new data, RAG is cheaper. You update a vector database instead. Cost per query is higher (retrieval + embedding lookup) but you avoid retraining overhead.
  • Your accuracy is failing because of hallucinations, not reasoning. The model knows how to extract information and structure it — it just invents details when it doesn’t know something. RAG grounds it in actual data.
  • You need attribution. “Which document did this fact come from?” Fine-tuning can’t answer that. RAG can because it explicitly retrieves the source.
  • Your task involves facts the model genuinely never saw. A fine-tuned model can’t learn something that wasn’t in its training data. A RAG system can retrieve it from a database.

Real example from AlgoVesta: we process 500+ research reports daily and extract buy/sell signals. The reports reference specific metrics and dates. Prices and date context shift constantly. We built RAG by:

  1. Chunking each report into 300–500 token sections (semantic boundaries matter — splitting mid-sentence kills context)
  2. Embedding each chunk with OpenAI’s text-embedding-3-small (cheap, fast, good enough for financial data)
  3. Storing embeddings + original text in a vector database (Pinecone, in our case)
  4. At query time: embed the user’s question, retrieve top-5 semantically similar chunks, inject them into the prompt, send to Claude Sonnet 4
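
Step 1 is where most RAG pipelines quietly fail, so here is a minimal chunking sketch that respects sentence boundaries. The token count is approximated from word count (~0.75 words per token), which is a rough heuristic, not a real tokenizer:

```python
import re

def chunk_text(text: str, max_tokens: int = 400) -> list[str]:
    # Approximate tokens via word count (~0.75 words per token heuristic).
    # Splits on sentence boundaries so no chunk breaks mid-sentence.
    max_words = int(max_tokens * 0.75)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For production you would swap the heuristic for a real tokenizer count, but the invariant to preserve is the same: chunk edges land on semantic boundaries.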

The key metric: without RAG, the model hallucinated specific numbers ~15% of the time. With RAG, 1.2%. We didn’t fine-tune. We didn’t improve the prompt (much). We just gave the model the source material it needed.

RAG fails when:

  • Your retrieval logic is broken. If your vector database returns irrelevant chunks, the model will work with bad input. You’ve just moved the problem upstream. You need to validate retrieval quality — actually look at top-5 results for a sample of queries.
  • You’re injecting too much context. Claude Sonnet 4’s context window is 200K tokens. Sounds infinite until you’re injecting 15 chunks × 500 tokens each, plus a long prompt, plus the user’s request. Context bloat causes latency and cost spikes. More context doesn’t always improve accuracy — it can introduce noise.
  • Your embedding model doesn’t match your domain. If you use a general-purpose embedding model on highly specialized data (legal contracts, medical records, trading signals), semantic similarity breaks down. You might need domain-specific embeddings or fine-tuned retrievers.
  • Your knowledge base is small or static. If you have 50 documents that rarely change and your data is from the model’s training period, RAG adds complexity for minimal gain. Prompt engineering is faster.
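
Validating retrieval quality doesn’t require your production stack. A sketch like the following — assuming you already have embeddings for your chunks, from whatever model you chose — lets you eyeball the top-k results for a sample of queries:

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, chunks: list[str], k: int = 5):
    # Cosine similarity between the query and every stored chunk embedding,
    # highest first -- this is the list you should actually read per query.
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(scores)[::-1][:k]
    return [(chunks[i], round(float(scores[i]), 3)) for i in order]

# Toy 2-d vectors standing in for real embeddings
chunks = ["fed rate cut", "earnings beat", "rate hike odds"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
results = top_k(np.array([1.0, 0.0]), vecs, chunks, k=2)
```

If the chunks this returns look irrelevant for even a handful of real queries, fix retrieval before touching the prompt.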

Fine-tuning: When You Need Systematic Behavior Change

Fine-tuning is your last resort — not because it’s weak, but because it’s expensive and creates a deployment burden.

You fine-tune when:

  • The base model struggles with your specific format or domain consistently. If prompt engineering gets you to 75% and you’ve exhausted prompt variations, fine-tuning can push you to 88–92% by reweighting the model toward your data distribution.
  • You need latency guarantees that prompt engineering can’t provide. A fine-tuned model is smaller, faster, and more predictable than base models with long, complex prompts. If you’re serving 10K queries/sec and latency SLAs are tight, this matters.
  • Cost per query is your constraint, not upfront cost. Fine-tuning is expensive upfront ($50–$500 depending on dataset size and model) but per-token inference is cheaper than base model inference. If you’re running high volume, the math flips.
  • You need consistent behavior on edge cases your prompt can’t cover. If you’ve built a 2,000-token prompt trying to handle 12 different input variations and it still fails 8% of the time, a fine-tuned model trained on those variations will be more robust.
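
The cost argument in the third bullet is just a breakeven calculation. The numbers below are hypothetical — plug in your own training cost and per-query prices:

```python
def breakeven_queries(upfront_cost: float,
                      base_cost_per_query: float,
                      tuned_cost_per_query: float) -> float:
    # Number of queries at which cumulative savings cover the upfront cost.
    savings = base_cost_per_query - tuned_cost_per_query
    if savings <= 0:
        return float("inf")  # the tuned model never pays for itself
    return upfront_cost / savings

# Hypothetical: $200 training run, $0.004/query base vs $0.0008/query tuned
n = breakeven_queries(200.0, 0.004, 0.0008)  # queries until the math flips
```

If your expected volume clears the breakeven point within a few months, fine-tuning is worth considering; if not, the upfront cost never amortizes.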

Example: a customer support team using GPT-4o to classify tickets. Prompt engineering got them to 82% accuracy across 50 ticket categories. They manually labeled 2,000 tickets, fine-tuned Mistral 7B (smaller model, faster inference), and hit 91% accuracy. Upfront: 4 hours of labeling + $60 fine-tuning cost. Monthly savings: $1,200 in API calls (fewer GPT-4o tokens needed). Payback: 3 weeks.

Fine-tuning fails when:

  • You don’t have high-quality training data. If your labeled dataset is small (<500 examples) or noisy, fine-tuning will overfit or learn spurious patterns. Prompt engineering is more robust with limited data.
  • Your problem is domain-specific and your model is too general. Fine-tuning GPT-4o on 1,000 molecular structure classification examples won’t teach it chemistry. A domain-specific model or specialized embedding approach would be better.
  • Your data distribution shifts frequently. Fine-tuned models degrade when they encounter data unlike their training set. If you fine-tune today and your input patterns change next month, you need to retrain. RAG handles distribution shift better because you update the knowledge base, not the model.
  • You need to deploy multiple variants. If you fine-tune one model for English and one for Spanish and one for a different domain, you now manage three separate models, three separate deployments, three separate monitoring pipelines. This operational overhead is real.

The Decision Matrix: Concrete Guidance for Your Problem

Use this table to shortcut the analysis. Find your constraint on the left, follow the row, and it points to the right tool.

| Constraint / Requirement | Prompt Engineering | RAG | Fine-tuning |
|---|---|---|---|
| Need >90% accuracy on domain task | ❌ Often plateaus at 80–85% | ✅ If data is external | ✅ Primary choice |
| Knowledge updates daily/weekly | ⚠️ Only if facts in prompt | ✅ Best choice | ❌ Retraining overhead |
| Latency SLA <200ms, high volume | ❌ Long prompts slow inference | ⚠️ Retrieval adds 50–100ms | ✅ Smaller, faster models |
| Need fact attribution (which source?) | ❌ Can’t trace reasoning | ✅ Returns source chunks | ❌ Can’t trace reasoning |
| Task involves facts outside model training data | ❌ Model can’t know it | ✅ Best choice | ⚠️ Only if in training data |
| Limited labeled data (<500 examples) | ✅ Best choice | ✅ Works well | ❌ Overfitting risk |
| Cost per query is critical (>10K queries/month) | ⚠️ Long prompts = expensive | ⚠️ Medium cost | ✅ Best at scale |
| Need to adjust behavior based on feedback | ✅ Changes deploy in seconds | ⚠️ Retrieval logic changes slowly | ❌ Retraining required |
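
The matrix reduces to a handful of branches, which you can encode as a first-pass helper. This is a deliberately rough sketch of the table above — a starting point for discussion, not a verdict:

```python
def recommend(needs_fresh_facts: bool,
              needs_attribution: bool,
              labeled_examples: int,
              monthly_queries: int) -> str:
    # Rows where RAG is the only green column dominate the decision.
    if needs_fresh_facts or needs_attribution:
        return "rag"
    # Fine-tuning needs both enough labeled data and enough volume to amortize.
    if labeled_examples >= 500 and monthly_queries > 10_000:
        return "fine-tuning"
    # Default: the cheapest, fastest-to-iterate option.
    return "prompt-engineering"
```

Real decisions layer techniques (see the hybrid section), but forcing the constraints into a function like this exposes which one actually binds.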

A Real Decision Flow: Three Case Studies

Case 1: Legal Contract Risk Flagging

Startup needs to scan incoming contracts and flag unusual terms. They have 200 labeled contracts. Their base model (Claude Sonnet 4) catches 70% of risks but misses domain-specific language.

Decision: Start with prompt engineering (2 hours, free). Results: 78%. Then add RAG using their labeled contracts as a knowledge base (8 hours setup, $50 embedding costs). Results: 86%. Accuracy ceiling feels near. Fine-tuning would require 1,000+ labeled contracts (weeks of work). Decision: stop at RAG. Cost: ~$2/month in vector database. This works for 3–6 months until they have more labeled data.

Case 2: Product Review Classification (50K reviews/month)

SaaS company needs to classify 50K monthly reviews into 12 sentiment categories. GPT-4o API costs $400/month for the task. They label 3,000 examples and fine-tune Mistral 7B. Cost: $120 fine-tuning + $8/month inference (using Together AI for cheap inference). Results: 88% accuracy (vs. 82% with prompt engineering). Monthly savings: $370. Fine-tuning ROI: 4 weeks.

Decision: fine-tuning. But here’s the catch: they redeploy a new fine-tune every month as new reviews arrive. After 6 months they’ve run 6 fine-tuning cycles. At month 7, the newest version performs worse on old patterns (drift). They should have invested in RAG or a refreshed prompting strategy instead. Live and learn.

Case 3: Q&A Over Company Documentation (1,000+ pages)

HR department wants employees to ask questions about benefits, policies, and procedures. They have a 1,500-page handbook that updates quarterly. Base model (GPT-4o) hallucinates policy details 12% of the time.

Option A: Prompt engineering. Inject the entire handbook into the prompt. Results: 100K+ token overhead per query. Latency: 4–5 seconds. Cost: $0.80 per query. Hallucinations: 5%. Untenable at scale.

Option B: RAG. Chunk the handbook (1,500 pages = ~1,000 chunks). Embed once. At query time: retrieve top-5 relevant chunks, inject them (~2K tokens), send to model. Latency: 800ms. Cost: $0.12 per query. Hallucinations: 0.8%. Decision: obvious. Build RAG. Implementation: Pinecone + Claude Sonnet 4 = two weeks start-to-finish.

Hybrid Approaches: When One Technique Isn’t Enough

The real world rarely picks one. Teams that succeed combine them strategically.

Prompt engineering + RAG (common, effective)

You use RAG to ground facts, then use prompt engineering to guide reasoning. Example: Extract risk factors from a 10-K (RAG retrieves relevant sections) and then classify their severity using a structured prompt template (prompt engineering). Cost: retrieval overhead + inference. Latency: acceptable. Accuracy: high.

Fine-tuning + RAG (advanced, expensive)

You fine-tune a model to understand your domain, then use RAG to give it current facts. Example: fine-tune a medical model to classify patient symptoms, but use RAG to ground it in current drug interaction data that wasn’t in training. Cost: high (two systems to maintain). Benefit: domain understanding + factual accuracy. Use case: high-stakes domains where both matter.

Cascaded prompting with two models (underrated)

You use a cheap model for initial filtering, then a powerful model for refined output. Example: GPT-3.5 classifies whether an email is support-related (cheap, fast), then Claude Sonnet 4 drafts a response only if it is (saves money). Cost: two API calls but 70% cheaper overall. Latency: acceptable if first model is fast. Works well for routing and filtering tasks.
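
The routing logic is simple enough to sketch. The classifier and drafter below are stand-ins — in production they would be API calls to the cheap and expensive models respectively; here a keyword stub plays the cheap model so the control flow is visible:

```python
from typing import Callable, Optional

def cascade(email: str,
            cheap_classifier: Callable[[str], bool],
            expensive_drafter: Callable[[str], str]) -> Optional[str]:
    # Cheap model filters first; the expensive model runs only on hits.
    if not cheap_classifier(email):
        return None  # not support-related: skip the expensive call entirely
    return expensive_drafter(email)

# Keyword stub standing in for the cheap classifier call
is_support = lambda e: any(w in e.lower() for w in ("refund", "broken", "help"))
# Stub standing in for the expensive drafting call
draft = lambda e: f"Draft reply to: {e[:40]}"

reply = cascade("My order arrived broken, please help", is_support, draft)
```

The savings come from the asymmetry: the expensive model only ever sees the fraction of traffic that passes the filter.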

Benchmarking Your Approach: What to Measure

You can’t pick the right technique without baseline metrics. Before building anything, measure:

  • Accuracy on a holdout test set (50–100 examples you don’t train on). Accuracy is context-dependent — 80% might be great for some tasks, terrible for others. Know your target.
  • Cost per query (tokens × model price). Include retrieval costs if using RAG. Include fine-tuning amortized over expected query volume if fine-tuning.
  • Latency percentiles, not averages (p50, p95, p99). Average latency hides tail slowness. If 1% of queries take 30 seconds, it breaks user experience even if average is 2 seconds.
  • Failure modes (not just overall accuracy). Classify your errors: hallucinations, reasoning mistakes, format errors, ambiguous input. Different techniques fail differently. Hallucinations point toward RAG. Format errors point toward better prompting or fine-tuning.
  • Variance across input types (not one monolithic accuracy number). If your model works great on short inputs but fails on long ones, you have a structural problem, not a technique problem.
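
To make the percentile point concrete, here is a minimal nearest-rank-style percentile over a batch of latencies, with a hypothetical tail outlier that the average would smear out:

```python
def percentile(values: list[float], p: float) -> float:
    # Simple nearest-rank-style percentile: small and dependency-free.
    ordered = sorted(values)
    idx = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[idx]

latencies_ms = [120, 130, 125, 140, 135, 128, 132, 127, 131, 2900]  # one tail outlier
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
avg = sum(latencies_ms) / len(latencies_ms)
```

Here p50 is ~130ms while the average is pulled above 400ms by a single slow query — and p99 is where the 2.9-second request actually shows up. Report percentiles, not means.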

Example from AlgoVesta: we measured trading signal extraction accuracy not as a single number but as:

  • Accuracy on recent reports (2024): 91%
  • Accuracy on archived reports (2020–2022): 73% (knowledge gap)
  • Accuracy when price data is referenced: 88%
  • Accuracy on edge-case tickers: 62% (rare stocks, hallucinations spike)

This breakdown revealed that RAG solved the 2020–2022 problem and the price reference problem (they’re in the knowledge base), but didn’t solve the edge-case ticker problem (model needs fine-tuning or better ticker disambiguation in the prompt). One number wouldn’t have told us that.
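
Producing a breakdown like this is a few lines of bookkeeping. The record shape below (`slice`, `correct`) is illustrative — use whatever labels your evaluation set carries:

```python
from collections import defaultdict

def accuracy_by_slice(records: list[dict]) -> dict[str, float]:
    # Each record: {"slice": label, "correct": bool}. Field names are illustrative.
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["slice"]] += 1
        hits[r["slice"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

records = [
    {"slice": "recent", "correct": True},
    {"slice": "recent", "correct": True},
    {"slice": "archived", "correct": True},
    {"slice": "archived", "correct": False},
]
acc = accuracy_by_slice(records)
```

One aggregate number averages the slices together; this keeps them apart, which is what reveals where each technique is actually failing.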

What to Do Starting Monday

If you’re facing this decision now:

Step 1: Pick a small sample (10–50 examples) from your use case and measure baseline accuracy with the cheapest option first: GPT-3.5 Turbo with a simple prompt. This takes 30 minutes and gives you a floor. Might be good enough. Often isn’t.

Step 2: Write a more specific prompt (reference the section earlier for prompt template rules). Measure again. Usually 5–15% improvement with zero cost. Stop if this achieves your target accuracy. Don’t over-engineer.

Step 3: If prompt engineering plateaus, determine which of these is your actual blocker:

  • Missing facts (e.g., data outside model training window): build RAG
  • Consistent hallucinations on the same edge cases: build RAG or fine-tune
  • Format/structure mistakes (model understands but doesn’t output right format): improve prompting or fine-tune
  • Latency or cost at scale: fine-tune or switch to smaller model

Step 4: If you choose RAG: implement a small RAG system first (one vector database, one retriever, one model). Measure actual retrieval quality — spot-check top-5 results on 10 queries. If retrieval is broken, accuracy will be broken. Don’t iterate on prompting when retrieval is the problem.

Step 5: If you choose fine-tuning: collect 500+ labeled examples first. If you have fewer, prompt engineering or RAG will likely outperform because fine-tuning will overfit. Don’t skip this step to save time.

The decision isn’t permanent. You’ll start with one approach, hit its limits, add another layer. That’s normal. The teams that avoid catastrophic overspending are the ones who measure at each step instead of betting everything on a single technique upfront.

Batikan
