You have a problem your off-the-shelf LLM won’t solve. Claude or GPT-4 handles basic tasks fine, but for your specific use case — extracting data from contract clauses, grading essays in your curriculum, or classifying customer support tickets with 95% accuracy — the standard model drifts. So now you’re choosing between three paths: fine-tuning the model, engineering better prompts, or building a RAG system. Each one works. Each one fails for different reasons. The difference between picking right and picking wrong is the difference between shipping this week and sinking three months into a solution that costs too much and runs too slow.
Know What You’re Actually Trying to Fix
Before you commit to any approach, diagnose the actual problem.
Run 20-30 examples through your baseline model. Save the failures. Look at the pattern. Does the model:
- Hallucinate facts it doesn’t know? RAG solves this. Fine-tuning won’t.
- Misunderstand your domain’s specific terminology or logic? Fine-tuning or very specific prompting fixes this.
- Forget instructions halfway through a long response? Prompt engineering (structure, token optimization) fixes this. Sometimes RAG helps.
- Make sloppy mistakes in formatting or reasoning it clearly understands? Prompt engineering almost always fixes this first.
I’ve watched teams spend $40k fine-tuning a model to handle hallucinations when a RAG layer would have cost $800 and shipped in two weeks. Diagnosis comes before prescription.
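The diagnosis step above is easy to automate. Here's a minimal triage sketch — `call_model`, `classify_failure`, and the category labels are placeholders you'd swap for your own client and failure taxonomy:

```python
from collections import Counter

# Hypothetical failure categories, mirroring the bullets above.
CATEGORIES = ["hallucination", "domain_misunderstanding",
              "instruction_drift", "formatting"]

def triage(examples, call_model, classify_failure):
    """Run each example through the baseline model, save the failures,
    and tally them by category so the dominant pattern is obvious."""
    failures = []
    for ex in examples:
        output = call_model(ex["input"])
        if output != ex["expected"]:
            failures.append({"input": ex["input"],
                             "output": output,
                             "category": classify_failure(ex, output)})
    return failures, Counter(f["category"] for f in failures)
```

Whichever category dominates the counter tells you which of the three paths to try first.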
Prompt Engineering: Your First Stop
Start here. Always.
Prompt engineering is free (in compute terms), reversible, and solves 60% of problems people think require fine-tuning. A disciplined approach takes four hours, not four weeks.
Here’s a realistic before/after:
```
# Bad prompt
Extract the key contract terms from this document:

{document_text}
```
Failure mode: The model picks random terms, misses critical dates, confuses effective dates with termination dates.
```
# Improved prompt
You are a legal contract analyst. Extract only the following terms from the contract:

- Effective Date (when the contract begins)
- Termination Date (when it ends)
- Renewal Terms (automatic or manual)
- Payment Amount (total contract value)
- Payment Schedule (when payments are made)

For each term:
1. Find the exact clause reference (e.g., "Section 3.2")
2. Quote the relevant text
3. Provide the extracted value
4. If a term is not present, write "NOT FOUND" — do not guess

Document:
{document_text}

Analysis:
```
Better. Not magic — you’ve just removed ambiguity and added explicit instructions for handling missing data. In practice, this approach drops extraction errors by 40-60% on structured tasks. Use grounding prompts (feeding the model context it needs) and chain-of-thought (making the model show its work) before you think about fine-tuning.
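A chain-of-thought version of the same task can be as simple as making the reasoning steps explicit before the answer. This is one illustrative shape, not a fixed recipe:

```
You are a legal contract analyst. Before giving your final answer,
reason step by step:

1. List every date mentioned in the contract and the clause it appears in.
2. For each date, decide whether it is an effective date, a termination
   date, or something else, and explain why.
3. Only then state the final extracted values.

Document:
{document_text}
```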
When prompt engineering alone fails: You’ve optimized the prompt, tested it across 100+ examples, and you’re still getting 75% accuracy when you need 95%. That’s your signal to move to the next layer.
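The "tested it across 100+ examples" step is just an accuracy count. A minimal harness, assuming you supply your own `call_model` and correctness check:

```python
def measure_accuracy(examples, call_model, is_correct):
    """Fraction of labeled examples where the model output passes `is_correct`.

    `examples` is a list of {"input": ..., "expected": ...} dicts;
    `call_model` and `is_correct` are whatever your stack provides.
    """
    hits = sum(1 for ex in examples
               if is_correct(call_model(ex["input"]), ex["expected"]))
    return hits / len(examples)
```

If this number plateaus well below your target after several prompt iterations, that's the signal to move on.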
RAG: For Knowledge Your Model Doesn’t Have
RAG (Retrieval-Augmented Generation) works by feeding the model relevant context it doesn’t have in its training data. You give it a vector database of your domain documents, and before each response, the system retrieves the most relevant passages.
Build RAG when:
- Your model needs to cite specific sources or answer questions about your proprietary data
- Your knowledge base changes monthly (contracts, policies, product docs)
- You need to reduce hallucinations on factual questions
- You want the model to say “I don’t know” rather than guess
Here’s a practical Python example using Claude and a mock retrieval function:
```python
import anthropic

def retrieve_context(query: str) -> list[str]:
    # In production, this queries a vector database (Pinecone, Weaviate, etc.)
    # For the demo, match simple keywords against mock documents.
    docs = {
        ("refund", "return"): "Refunds are allowed within 30 days of purchase.",
        ("shipping", "delivery"): "Standard shipping takes 5-7 business days.",
    }
    query_lower = query.lower()
    return [text for keywords, text in docs.items()
            if any(k in query_lower for k in keywords)]

def rag_query(user_question: str) -> str:
    client = anthropic.Anthropic()

    # Retrieve relevant documents
    retrieved_docs = retrieve_context(user_question)
    context = "\n".join(retrieved_docs) if retrieved_docs else "No relevant documents found."

    # Build prompt with context
    prompt = f"""Answer the question using ONLY the provided context.
If the context doesn't contain the answer, say 'I don't have that information.'

Context:
{context}

Question: {user_question}"""

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

# Test
print(rag_query("Can I return something I bought last week?"))
# Expected: an answer grounded in the 30-day refund policy
```
RAG’s strength: It’s cheap to maintain, handles fresh data without retraining, and cuts hallucinations on factual tasks by 70-90%. Its weakness: It only helps if your problem is missing knowledge, not reasoning or format.
Fine-Tuning: When the Basics Aren’t Enough
Fine-tuning teaches a model a new skill or consistent style it doesn’t naturally exhibit. This costs time and money: training runs are billed separately, and fine-tuned models typically carry a per-token premium over base inference rates. For reference, Claude 3.5 Sonnet’s base API pricing is $3 per 1M input tokens and $15 per 1M output tokens — check your provider’s current fine-tuning pricing before committing.
Fine-tune when:
- You need the model to follow a very specific output format (XML structure, JSON with exact keys) consistently
- The model needs to learn domain-specific reasoning (e.g., grading rubrics your teaching platform uses)
- Prompt engineering has plateaued — you’re at 85% accuracy and stuck there
- You need this behavior across 500+ API calls per month (the cost-benefit math starts to favor it)
Fine-tuning will NOT fix hallucinations. It will NOT teach the model facts it doesn’t know. It WILL teach it patterns.
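Fine-tuning runs on labeled prompt/completion pairs, usually supplied as JSONL. The exact schema varies by provider, so treat this shape — and the grading-rubric examples in it — as an illustration, not a spec:

```python
import json

# Hypothetical training pairs for the essay-grading use case above.
# Each pair shows the model the exact output pattern you want it to learn.
examples = [
    {"prompt": "Grade this essay against rubric item 2 (thesis clarity): ...",
     "completion": '{"score": 3, "rationale": "Thesis stated but not developed."}'},
    {"prompt": "Grade this essay against rubric item 4 (evidence): ...",
     "completion": '{"score": 4, "rationale": "Two primary sources cited correctly."}'},
]

with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Note that the completions demonstrate a consistent JSON output pattern — exactly the kind of thing fine-tuning teaches — not new facts.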
Here’s the cost math: If you run 1,000 inference calls/month with Claude 3.5 Sonnet, each call ~2,000 output tokens, that’s 2M tokens/month × $15/1M = $30/month in inference cost. Fine-tuning setup is ~$100-300. Payoff happens around month 5 if fine-tuning cuts latency or improves output quality enough to skip human review on 30% of responses. Below that threshold, it’s not worth it.
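The arithmetic, spelled out with the article's own figures:

```python
# Monthly inference cost at the article's assumed volumes and rates.
calls_per_month = 1_000
output_tokens_per_call = 2_000
price_per_million_output = 15.0  # USD, Claude 3.5 Sonnet output rate

tokens = calls_per_month * output_tokens_per_call        # 2,000,000 tokens
inference_cost = tokens / 1_000_000 * price_per_million_output
print(inference_cost)  # 30.0 (USD/month)
```

At $30/month of inference, a $100-300 fine-tuning run only pays for itself through quality gains (skipped reviews), not through token savings.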
The Decision Matrix
| Problem | First Try | If That Fails |
|---|---|---|
| Model makes up facts | RAG + grounding prompt | Upgrade model (GPT-4o vs Sonnet) |
| Wrong output format | Structured prompt + examples | Fine-tuning or JSON mode |
| Misses domain concepts | Few-shot prompting + definitions | Fine-tuning (150+ labeled examples) |
| Forgets instructions mid-response | Prompt engineering (token optimization) | Break task into smaller steps |
| Slow inference latency | Use smaller model (Sonnet vs Opus) | Fine-tune or local model (Mistral 7B) |
What to Do Today
Take your hardest failing case — the one you’ve been thinking about fine-tuning or RAG-ing. Run it through a properly structured prompt first. Use the contract extraction template above if it’s a data problem, or write a chain-of-thought prompt if it’s a reasoning problem. Test it against 20 examples. Count the accuracy.
If you hit 90%+ accuracy with prompt engineering alone, stop. You’re done. If you’re stuck at 75-85%, diagnose: Is the model missing knowledge (RAG) or missing a skill (fine-tuning)? That distinction will save you weeks.