You’ve been using GPT-4o and Claude through their APIs. The cost per token is predictable. The latency is acceptable. But you keep hitting the same wall: the model doesn’t understand your domain well enough, and you can’t afford to prompt-engineer your way out of it.
Fine-tuning solves this. Not by magic. By showing a model 100–1,000 examples of exactly what you want it to do, then retraining its weights to match your specifics. I built AlgoVesta’s trading signal classifier on fine-tuned Llama 3 8B, and it cuts our inference cost by 70% compared to GPT-4 while matching accuracy on our domain.
Here’s what actually matters: data quality beats data quantity, you need a baseline before you tune, and deployment is where most people fail.
When Fine-Tuning Actually Makes Sense
Fine-tuning is not the answer to “my prompt isn’t working.” Fix the prompt first. Chain-of-thought, retrieval, or a system message usually solve the problem at 1/100th the cost.
Fine-tune when:
- You have 100+ labeled examples of your specific task (fewer doesn’t help much)
- The model’s base behavior is close but not quite right — you’re adjusting, not overwriting
- You need consistent output format or domain-specific terminology that generic models struggle with
- Inference cost or latency is a hard constraint (smaller, tuned models beat larger, generic ones on your task)
You should NOT fine-tune if you’re trying to add factual knowledge the model doesn’t have. Use RAG instead. If you’re trying to make the model more “helpful” or “friendly,” rewrite your prompt. Fine-tuning isn’t a personality transplant.
Data Preparation: The 80/20 of the Process
Spend most of your time here. Bad data ruins everything downstream.
Your training set needs: input, expected output, and ideally a label for confidence (some examples are more representative than others). For classification, this is straightforward. For generation (text summarization, email drafting), it’s messier — one input can have multiple correct outputs.
The format depends on the model and framework. OpenAI’s fine-tuning API wants JSONL:
{"messages": [{"role": "user", "content": "Classify this trade signal: MSFT closed 3% above MA50"}, {"role": "assistant", "content": "bullish_breakout"}]}
{"messages": [{"role": "user", "content": "Classify this trade signal: TSLA gapped down with high volume"}, {"role": "assistant", "content": "bearish_reversal"}]}
If you’re using Hugging Face with an open model, the format depends on your framework (usually the same structure, but different serialization).
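Before uploading, it’s worth a quick sanity check that every line parses and has the message structure the API expects. A minimal validator sketch (the function name is mine, not part of any SDK):

```python
import json

def validate_jsonl(path):
    """Report lines that aren't valid JSON or lack a user/assistant turn."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append((lineno, "not valid JSON"))
                continue
            roles = [m.get("role") for m in record.get("messages", [])]
            if "user" not in roles or "assistant" not in roles:
                errors.append((lineno, f"missing user/assistant turn: {roles}"))
    return errors

# errors = validate_jsonl("training_data.jsonl")  # empty list means clean
```

Run it before every upload; a single malformed line will fail the whole job.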
Before you finalize your dataset:
- Audit 50 random examples manually. Check for labeling errors, unclear inputs, or data leakage (test data accidentally in training). I found 12% of our first dataset was mislabeled. We didn’t catch it until inference failed on known edge cases.
- Split 80/10/10: training/validation/test. Don’t let validation data touch training. Test data should be held completely separate and only used at the very end.
- Balance classes if you’re doing classification. If 90% of your examples are “no action” and 10% are “buy,” the model learns to always predict “no action.” Upsample the minority class or use class weights during training.
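The split step above can be sketched with stdlib Python alone. A simple stratified 80/10/10 split that preserves class proportions, assuming each example is an (input, label) pair (the helper name is mine):

```python
import random
from collections import defaultdict

def stratified_split(examples, seed=42):
    """80/10/10 train/val/test split, keeping label proportions per class."""
    by_label = defaultdict(list)
    for inp, label in examples:
        by_label[label].append((inp, label))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = int(len(items) * 0.8)
        n_val = int(len(items) * 0.1)
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Stratifying per class matters for exactly the imbalance case above: a naive random split of a 90/10 dataset can leave the validation set with almost no minority examples.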
Choosing Your Base Model
Start with what you know. Anthropic doesn’t offer fine-tuning yet, so even if you’ve been testing with Claude, OpenAI’s fine-tuning API is the path of least resistance. If cost is critical, use an open model: Llama 3 8B or Mistral 7B run on consumer hardware and tune in hours instead of days.
Model choice has trade-offs:
- OpenAI GPT-3.5 Turbo: ~$1.50 per 1M training tokens. Fastest to productionize. Runs on their infrastructure. Best for when you want fine-tuning without managing compute.
- Llama 3 8B or Mistral 7B: Free model, but you pay for GPU compute. $0.24–$0.50 per hour on most cloud providers. Better if you have 500+ examples and want to iterate. Inference cost drops to nearly zero once deployed.
- Claude or GPT-4: Both are more capable base models, but fine-tuning them is either more expensive (GPT-4, where OpenAI supports it) or unavailable (Claude). Don’t start here.
I recommend starting with GPT-3.5 Turbo if you have <300 examples. Use Llama 3 8B if you have >500 and can tolerate a slightly less capable base model that will still be excellent on your domain after tuning.
The Actual Fine-Tuning Process
This varies by platform, but the pattern is identical.
If you’re using OpenAI’s API:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment (openai>=1.0 client)

# Upload training file
upload = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3},  # 3 passes through your data
)
print(f"Job ID: {job.id}")

# Check status
job_status = client.fine_tuning.jobs.retrieve(job.id)
print(job_status.status)
For Llama 3 on Hugging Face, you’d use the transformers library with LoRA (Low-Rank Adaptation) to tune only a small set of parameters instead of the full model — dramatically faster:
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# Pass model and tokenizer to your training loop (Hugging Face Trainer or custom)
The hyperparameters matter: n_epochs (3 is standard), learning rate (usually 2e-5 to 5e-4), and batch size (depends on your GPU memory). Start with defaults. Don’t chase performance tweaks until you have a baseline.
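For budgeting, it helps to know how many optimizer steps those defaults imply: total steps = ceil(dataset size / batch size) × epochs. A quick helper (the numbers below are illustrative, not benchmarks):

```python
import math

def training_steps(n_examples, batch_size, n_epochs):
    """Number of optimizer steps for a full fine-tuning run."""
    return math.ceil(n_examples / batch_size) * n_epochs

# e.g. 500 examples, batch size 8, 3 epochs -> 189 steps
steps = training_steps(500, 8, 3)
```

If your GPU forces a tiny batch size, gradient accumulation keeps the effective batch size up without changing this arithmetic much.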
Evaluating Before You Ship
Your training loss will decrease. That means nothing. What matters: does the tuned model perform better on data it hasn’t seen?
Run your held-out test set through both the base model and your fine-tuned version. Compare accuracy, F1 score (for classification), or exact-match rate (for structured outputs). If the improvement is <5%, fine-tuning didn't help — you probably had a prompt or data problem instead.
Test on edge cases manually. Feed the model examples that are borderline, intentionally ambiguous, or domain-specific in ways your training data might not have covered. A 90% accuracy score means one in ten predictions is wrong. Know which ones.
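The base-vs-tuned comparison needs nothing heavier than a few lines of Python once you have each model's predictions on the held-out test set. A minimal sketch (the helper names are mine; swap in scikit-learn's metrics if you already depend on it):

```python
def accuracy(preds, labels):
    """Fraction of predictions that exactly match the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def macro_f1(preds, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in set(labels):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# improvement = accuracy(tuned_preds, labels) - accuracy(base_preds, labels)
```

Macro F1 is the safer headline number for imbalanced classes, since plain accuracy rewards always predicting the majority label.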
Deployment: Where It Actually Breaks
Once tuned, your model sits in an API endpoint or on your hardware. The critical mistakes happen here:
You don’t version your fine-tunes. You push a new version to production and inference breaks. Name them with a version and a date, e.g. trading-signals-v1-20250115, and keep the old endpoint alive in parallel for two weeks.
You don’t test the production endpoint with fresh data. Your test set was from last month. Real inference today is hitting different distributions. Monitor output entropy, latency, and error rates in production. If accuracy drops, your data shifted.
You forget the cost of ownership. A fine-tuned 8B model running 24/7 on an A100 costs ~$700/month. A cached GPT-3.5 Turbo endpoint for the same volume costs ~$90. Do the math before you deploy.
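That ownership math is worth scripting before you commit. A back-of-the-envelope calculator, assuming an on-demand A100 at roughly $1/hour (the rate is an assumption; plug in your provider's actual price):

```python
def monthly_gpu_cost(hourly_rate, hours_per_day=24, days=30):
    """Cost of keeping a dedicated GPU warm for a month."""
    return hourly_rate * hours_per_day * days

# ~$1/hr running 24/7 -> $720/month, the same ballpark
# as the ~$700 figure above for a self-hosted 8B model
self_hosted = monthly_gpu_cost(1.00)
```

If your traffic is bursty, compare against autoscaled or serverless GPU pricing too; a 24/7 reservation is the worst case.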
If you’re deploying on your infrastructure, use vLLM or TensorRT for inference optimization. You’ll cut latency by 50% and GPU memory by 30% compared to a raw transformers deployment.
Your Next Step
Gather 150–200 examples of your task. Spend a week cleaning and labeling them. Then run a cheap fine-tune on OpenAI’s GPT-3.5 Turbo (costs ~$5). Compare outputs side-by-side with the base model. That experiment tells you if fine-tuning is worth the time for your use case. Most of the time, a better prompt or RAG solves the problem. But when it doesn’t, you’ll know.