You’re running inference 50 times a day. A cloud API costs $0.03 per 1K tokens. That’s manageable until your usage scales. Local LLM on your hardware: zero per-token cost, runs offline, stays private. But it’s slower, requires setup, and your 8GB laptop can’t run Llama 2 70B effectively.
The decision isn’t “cloud or local.” It’s: which trade-off makes sense for your specific workload, budget, and constraints. This guide walks through the actual math — not theoretical comparisons, but numbers from running both in production.
The Real Cost Calculation: Token Costs vs Infrastructure
Cloud APIs charge per token. Local models charge in infrastructure, electricity, and developer time.
Let’s work through two scenarios with real numbers from March 2025:
| Metric | GPT-4o (OpenAI) | Claude 3.5 Sonnet (Anthropic) | Llama 3.1 70B (Local) | Mistral 7B (Local) |
|---|---|---|---|---|
| Input cost per 1M tokens | $5.00 | $3.00 | $0.00 | $0.00 |
| Output cost per 1M tokens | $15.00 | $15.00 | $0.00 | $0.00 |
| Min hardware requirement | N/A (cloud) | N/A (cloud) | GPU with 24GB+ VRAM | GPU with 8GB+ VRAM |
| Hardware cost (amortized/month)* | N/A | N/A | ~$40-80 | ~$30-60 |
| Inference latency (avg) | 800-1200ms | 600-900ms | 2000-4000ms | 1500-2500ms |
*Hardware costs assume 3-year amortization for a mid-tier GPU (an RTX 4070 is ~$550 to buy; an A100 runs ~$10k to buy or roughly $1-2/hour to rent in the cloud) plus electricity. Varies significantly by model and batch size.
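The footnote's range is easy to sanity-check. A quick sketch of the amortization math, assuming an always-on box with a $550 GPU drawing ~200W average at $0.15/kWh (all three figures are assumptions; plug in your own):

```python
def monthly_gpu_cost(hardware_usd=550, amortize_months=36,
                     avg_watts=200, usd_per_kwh=0.15):
    """Amortized hardware plus electricity for an always-on inference box."""
    amortized = hardware_usd / amortize_months      # ~$15.28/month
    kwh_per_month = avg_watts * 24 * 30 / 1000      # 144 kWh
    electricity = kwh_per_month * usd_per_kwh       # $21.60
    return amortized + electricity

print(round(monthly_gpu_cost(), 2))  # → 36.88
```

That lands near the low end of the table's $40-80 range; a beefier card or pricier electricity pushes it toward the top.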
Here’s what this means in practice. A production system processing 10M tokens per month (call it a 50/50 input/output split):
- GPT-4o: ~$100/month token cost + $0 hardware = $100
- Claude 3.5 Sonnet: ~$90/month token cost + $0 hardware = $90
- Llama 3.1 70B (local): ~$0 token cost + $60/month hardware = $60
- Mistral 7B (local): ~$0 token cost + $45/month hardware = $45
The local option is cheaper at this volume, though not dramatically. Below roughly 5-6M tokens/month at these blended rates, hardware amortization dominates and cloud wins on total cost. Beyond that, a purchased GPU pays for itself within months.
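The break-even point falls out of the pricing table directly. Using blended per-token rates (assuming a 50/50 input/output split) against an amortized monthly hardware cost:

```python
def break_even_tokens(input_rate_per_m, output_rate_per_m, hardware_per_month):
    """Monthly token volume at which cloud spend equals local hardware cost."""
    blended = (input_rate_per_m + output_rate_per_m) / 2   # $ per 1M tokens
    return hardware_per_month / blended * 1_000_000

print(break_even_tokens(5.00, 15.00, 60))  # GPT-4o vs a $60/month Llama box → 6000000.0
print(break_even_tokens(3.00, 15.00, 45))  # Claude vs a $45/month Mistral box → 5000000.0
```

A heavier output skew lowers the break-even; a read-heavy workload raises it.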
But cost isn’t the only variable. Speed matters.
Latency: Why Cloud Is Often Faster Despite What You’d Expect
Local LLMs should be faster — inference happens on your hardware, no network hop. In practice, they’re slower. Why?
Cloud API providers optimize for two things you don’t: batch processing and specialized hardware. OpenAI and Anthropic run thousands of concurrent requests on A100 clusters. They’ve optimized every millisecond of the inference stack. Your local GPU is one machine.
Real-world latency comparison (from AlgoVesta systems, January 2025):
- Claude 3.5 Sonnet (API): 400-token response = ~840ms end-to-end (including network)
- Mistral 7B (local, RTX 4070): 400-token response = ~1,800ms
- Llama 3.1 8B (local, RTX 4070): 400-token response = ~1,200ms
- Llama 3.1 70B (local, RTX 4090): 400-token response = ~3,200ms
Smaller local models can match cloud latency in some cases (Mistral 7B is close). Larger local models are strictly slower. Network latency adds 100-300ms to cloud calls, but inference on the cloud side is that much faster.
If your use case requires sub-1000ms response times and you’re considering local, plan on Mistral 7B or smaller. Anything larger will disappoint.
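A rough latency model makes that budget check concrete before you buy anything. Decode throughput is the dominant term; the ~250 tokens/s figure below is back-solved from the Mistral 7B measurement above, and the fixed overheads are assumptions to replace with your own numbers:

```python
def estimated_latency_ms(output_tokens, decode_tokens_per_sec,
                         prompt_eval_ms=150, overhead_ms=50):
    """Back-of-envelope end-to-end latency for local inference."""
    decode_ms = output_tokens / decode_tokens_per_sec * 1000
    return prompt_eval_ms + overhead_ms + decode_ms

print(round(estimated_latency_ms(400, 250)))  # → 1800, matching the Mistral 7B row
```

Invert it for planning: a 1-second budget with 200ms of fixed overhead leaves room for only ~200 output tokens at 250 tok/s.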
Accuracy and Capability: When Local Models Fall Behind
This is the constraint nobody talks about honestly. Smaller local models sacrifice accuracy.
Benchmark data from MMLU (multitask knowledge and reasoning across 57 subjects):
- Claude 3.5 Sonnet: 88% accuracy
- GPT-4o: 86% accuracy
- Llama 3.1 70B: 83% accuracy
- Mistral 7B: 62% accuracy
- Phi-3.5-mini (3.8B): 51% accuracy
That gap widens on domain-specific tasks (medical reasoning, code generation, structured extraction). Local models also struggle with:
- Long-context reasoning: Llama 3.1 nominally supports a 128K context window, but serving long contexts locally requires far more VRAM for the KV cache than most single-GPU setups have; Claude’s 200K is available out of the box via the API. Matters for RAG systems and document processing.
- Instruction following: Cloud models have better alignment. Smaller local models hallucinate instructions that don’t exist.
- Multilingual support: GPT-4o and Claude handle 50+ languages. Llama 3.1 officially supports eight; outputs in other languages are less reliable.
- Tool use: Cloud models reliably call functions. Local models fumble function parameter formatting.
For classification, summarization, and simple Q&A, the gap doesn’t matter. For anything requiring reasoning, creativity, or complex instruction parsing, local models lose.
Privacy and Data: The Real Win for Local
This isn’t theoretical. It’s contractual.
Using cloud APIs means data goes to their servers. Even with “no retention” clauses, it transits through their infrastructure. GDPR, HIPAA, and other regulations can forbid this. So can your customer agreements.
Local models solve this completely. Data never leaves your hardware. No logs, no cloud infrastructure, no third-party exposure.
But “local” has degrees. You still need:
- Model weights: Downloaded from HuggingFace or a vendor. That one-time download is logged.
- Hardware security: Your GPU server must be physically/network isolated. A misconfigured firewall defeats the entire privacy benefit.
- Inference framework: Tools like Ollama, vLLM, or Hugging Face’s inference server add another layer. Verify they don’t cache or log output.
For regulated industries (finance, healthcare, legal), local is often mandatory. For everything else, it’s an option if other constraints align.
Workload-Specific Recommendations: Where Each Wins
Use Cloud APIs When:
- High accuracy matters more than cost. Complex reasoning, code generation, creative tasks. Claude 3.5 Sonnet is the best option for this category.
- Load is variable and hard to forecast. You can’t predict token volume month-to-month. Cloud scales automatically. Local requires over-provisioning to handle peaks.
- Response time must be under 1 second for non-trivial outputs. Even Mistral 7B will struggle beyond short responses (~1.8s for 400 tokens on an RTX 4070). Cloud wins.
- Context length matters. Processing full documents, long conversations, or retrieval results. Claude’s 200K context window or GPT-4o’s 128K beats local options.
- You want zero infrastructure overhead. API key + HTTP request. That’s it. No GPU procurement, no version management, no CUDA debugging.
Use Local LLMs When:
- Privacy is non-negotiable. HIPAA workloads, regulated data, customer data that can’t transit cloud infrastructure.
- Token volume is predictable and high. 10M+ tokens/month, consistent load. Hardware ROI occurs within 6 months.
- The task is narrow and well-defined. Classification, extraction, summarization. Smaller models (7B-13B) work fine and are fast enough.
- Latency variability matters more than absolute speed. Local inference is more consistent than cloud (no queue fluctuations). Useful for real-time systems requiring predictable performance.
- You already have GPU infrastructure. Sunk-cost amortization changes the equation entirely.
Implementation Guide: Building a Hybrid Stack
The optimal setup for most production systems isn’t pure cloud or pure local. It’s both.
Architecture Pattern: Routing by Workload
Route different tasks to different models based on requirements:
```python
from anthropic import Anthropic

class InferenceRouter:
    def __init__(self):
        self.cloud_client = Anthropic()              # Claude API; reads ANTHROPIC_API_KEY
        self.local_model = LocalModel("mistral-7b")  # stand-in for your local serving client

    def process_request(self, task_type, input_text, metadata):
        # Privacy-sensitive → local, checked first so regulated data never reaches the cloud
        if metadata.get("is_regulated") or metadata.get("pii_present"):
            return self.local_inference(input_text)
        # High-accuracy, complex reasoning → cloud
        if task_type in ["code_generation", "analysis", "reasoning"]:
            return self.cloud_inference(input_text)
        # Simple classification/extraction → local (cheaper, fast enough)
        if task_type in ["classification", "extraction"]:
            return self.local_inference(input_text)
        # Default: cost-optimized (use local, fall back to cloud if confidence is low)
        result = self.local_inference(input_text)
        if result.get("confidence", 1.0) < 0.7:
            return self.cloud_inference(input_text)
        return result

    def cloud_inference(self, text):
        message = self.cloud_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": text}],
        )
        return {"output": message.content[0].text, "source": "cloud"}

    def local_inference(self, text):
        output = self.local_model.generate(text, max_tokens=1024)
        return {"output": output, "source": "local"}
```
This pattern handles real constraints: you use local for routine work and cost savings, cloud for complex tasks and accuracy guarantees.
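One advantage of this shape: the routing decision is pure logic, so it can be unit-tested without touching either backend. A minimal sketch (the name `choose_backend` is illustrative; the privacy check runs first so regulated data never routes to cloud):

```python
def choose_backend(task_type, metadata):
    """Return 'cloud' or 'local' for a request."""
    # Privacy first: regulated data must never transit cloud infrastructure
    if metadata.get("is_regulated") or metadata.get("pii_present"):
        return "local"
    if task_type in {"code_generation", "analysis", "reasoning"}:
        return "cloud"
    return "local"  # classification, extraction, and the cost-optimized default

assert choose_backend("reasoning", {}) == "cloud"
assert choose_backend("reasoning", {"pii_present": True}) == "local"
assert choose_backend("classification", {}) == "local"
```

Keeping the decision separate from the clients also makes it trivial to log which rule fired for each request.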
Cost-Tracking Setup
Without measurement, you won't know if the hybrid approach is actually saving money. Implement cost tracking from day one:
```python
import json
from datetime import datetime, timezone

class CostTracker:
    # Claude 3.5 Sonnet pricing: $3/1M input tokens, $15/1M output tokens
    CLOUD_INPUT_PER_TOKEN = 0.000003
    CLOUD_OUTPUT_PER_TOKEN = 0.000015

    def __init__(self, log_file="inference_costs.jsonl"):
        self.log_file = log_file
        self.local_cost_per_hour = 0.05  # amortized hardware + electricity

    def log_inference(self, source, input_tokens, output_tokens, latency_ms):
        """Log every inference for cost analysis."""
        if source == "cloud":
            cost = (input_tokens * self.CLOUD_INPUT_PER_TOKEN
                    + output_tokens * self.CLOUD_OUTPUT_PER_TOKEN)
        elif source == "local":
            # Charge local inference by GPU-time consumed, amortized
            cost = (latency_ms / 1000 / 3600) * self.local_cost_per_hour
        else:
            cost = 0
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "source": source,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "latency_ms": latency_ms,
            "cost_usd": round(cost, 6),
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(record) + "\n")
        return cost

# Usage:
tracker = CostTracker()
tracker.log_inference("cloud", input_tokens=450, output_tokens=200, latency_ms=850)
tracker.log_inference("local", input_tokens=450, output_tokens=200, latency_ms=1800)
```
Run this for a month. You'll know exactly where your cost savings come from.
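Turning the log into an answer takes a few more lines of stdlib Python: aggregate cost and request counts per backend from the JSONL file the tracker writes.

```python
import json
from collections import defaultdict

def cost_by_source(log_file="inference_costs.jsonl"):
    """Sum logged cost and request count for each backend."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0})
    with open(log_file) as f:
        for line in f:
            rec = json.loads(line)
            totals[rec["source"]]["requests"] += 1
            totals[rec["source"]]["cost_usd"] += rec["cost_usd"]
    return dict(totals)
```

Compare per-request averages, not just totals: if local handles systematically simpler requests, raw totals will flatter it more than they should.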
Setup Checklist: Getting Local Models Running
If you decide local makes sense for your workload, here's what deployment actually requires:
Hardware Check
Model size determines GPU requirements:
| Model | Size | Min VRAM | Recommended GPU | Inference Speed (4K context) |
|---|---|---|---|---|
| Mistral 7B | 7B params | 8GB | RTX 4060 / A10 | 1500-2000ms |
| Llama 2 13B | 13B params | 16GB | RTX 4070 / L40 | 2000-3000ms |
| Llama 3.1 70B | 70B params | 40GB | A100 40GB / RTX 6000 | 3000-5000ms |
| Phi-3.5-mini | 3.8B params | 4GB | Laptop GPU / RTX 4050 | 800-1200ms |
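The Min VRAM column follows a rule of thumb you can apply to any model: roughly 2 bytes per parameter for fp16 weights, plus overhead for the KV cache and activations. A sketch (the 20% overhead factor is an assumption; real usage depends on context length and batch size):

```python
def min_vram_gb(params_billions, bytes_per_param=2.0, overhead=0.20):
    """Rough VRAM to serve a model: weight memory plus cache/activation overhead."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + overhead)

print(round(min_vram_gb(7), 1))                       # fp16 Mistral 7B → 16.8
print(round(min_vram_gb(7, bytes_per_param=0.5), 1))  # 4-bit quantized → 4.2
```

Note the table's 8GB figure for Mistral 7B assumes a quantized build; at full fp16 precision a 7B model doesn't fit in 8GB.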
Software Setup (using vLLM)
vLLM is the standard inference framework for production. It's faster than transformers-based loading and handles batching automatically:
```shell
# Install vLLM
pip install vllm

# Run a model server
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8

# In another terminal, test it
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "What is 2+2?",
    "max_tokens": 100
  }'
```
vLLM exposes an OpenAI-compatible API. This means you can swap endpoints without changing application code — just point your client to localhost:8000 instead of api.openai.com.
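A minimal sketch of that swap using only the standard library. It builds the request without sending it, so the payload shape is visible (the endpoint path and fields follow the OpenAI completions format; the model name matches the serve command above):

```python
import json
from urllib import request

def completion_request(base_url, model, prompt, max_tokens=100):
    """Build an OpenAI-style /v1/completions request.
    Swapping base_url moves traffic between local vLLM and a hosted endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = completion_request("http://localhost:8000",
                         "mistralai/Mistral-7B-Instruct-v0.2", "What is 2+2?")
# request.urlopen(req) sends it once the vLLM server is up
```

The same function targets a cloud endpoint by changing `base_url` (and adding an Authorization header), which is exactly the property a hybrid stack relies on.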
Docker for Production
Don't run vLLM directly on your production machine. Use Docker for isolation and reproducibility:
```dockerfile
# Dockerfile for vLLM inference server
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# The CUDA base image ships without Python; install it before vLLM
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Pin to whichever vLLM version you've tested; `vllm serve` needs a recent release
RUN pip3 install vllm

CMD ["vllm", "serve", "mistralai/Mistral-7B-Instruct-v0.2", \
     "--host", "0.0.0.0", "--port", "8000"]
```
Build and run:
```shell
docker build -t local-llm .
docker run --gpus all -p 8000:8000 local-llm
```
Common Failure Modes and How to Avoid Them
Learning from actual deployments:
- Out-of-memory errors during inference: You tested with batch_size=1 but production runs batch_size=8. GPU runs out of VRAM mid-batch. Always test with production batch sizes and context lengths.
- Latency spikes from garbage collection: Local GPU memory fills, triggers GC, inference pauses. Set vLLM's `gpu_memory_utilization` to 0.85-0.90, not 0.95+.
- Model outputs degrading over time: Some local models have quality issues at longer token sequences. Measure output quality (via human review or automated metrics) at 1000 tokens, 4000 tokens, and 8000 tokens. Know your model's limits.
- API switch costs more than expected: You're using vLLM locally but cloud API remotely. Response formats differ slightly. Small differences in prompt formatting compound. Build a normalization layer in your client code.
- Cold-start latency kills performance: First inference after server restart is 2-3x slower (model loading). Either keep the server warm or pre-load on startup.
Decision Matrix: What to Do Today
Stop theorizing. Answer these three questions:
1. How many tokens/month do you process?
- Under 500K: Cloud APIs are cheaper. Use Claude 3.5 Sonnet.
- 500K-5M: Break-even zone. Depends on accuracy needs and privacy constraints.
- 5M+: Local models pay for hardware within 6 months. Implement hybrid routing.
2. Is data privacy a requirement?
- Yes: Go local, no debate. GDPR, HIPAA, and customer contracts often require it.
- No: Cloud API is simpler. Skip the infrastructure overhead.
3. How much accuracy do you need?
- Classification/extraction: Local 7B models work fine. Use Mistral 7B.
- Reasoning/code generation: Cloud models are significantly better. Use Claude 3.5 Sonnet.
- Unsure: Start with cloud, measure quality, swap to local if it meets thresholds.
If you're processing 2M+ tokens/month and accuracy isn't critical, implement the hybrid router from the earlier section today. Test it on 10% of your workload. Measure cost per inference across both sources. In 30 days, you'll have data to make the actual decision.
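For that 10% split, a deterministic hash-based router beats random sampling: the same request always lands on the same backend, so before/after comparisons stay stable. A sketch (the request-ID scheme and fraction are placeholders for your own):

```python
import hashlib

def route_for_test(request_id, local_fraction=0.10):
    """Deterministically send a fixed fraction of traffic to local inference."""
    bucket = hashlib.sha256(request_id.encode()).digest()[0] / 255  # map to [0, 1]
    return "local" if bucket < local_fraction else "cloud"

routes = [route_for_test(f"req-{i}") for i in range(10_000)]
print(routes.count("local") / len(routes))  # close to 0.10
```

Ratchet `local_fraction` upward as the cost and quality numbers come in.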