Your LLM bill arrived at $14,000 last month. The month before, $8,500. You’re not hallucinating — or at least, the API isn’t. The costs are real, and they’re climbing because nobody taught you how to think about API efficiency the way infrastructure teams think about database query optimization.
This isn’t about cheap models. It’s about extracting maximum value from every token you spend.
The Hidden Cost Structure: Token Pricing Isn’t Linear
Most teams treat API costs as a simple multiplication: (input tokens × input price) + (output tokens × output price). That’s technically correct, but it misses the actual leverage points.
Here’s what actually moves your bill:
- Input token bloat — Most teams send 3–5x more context than necessary. A 4,000-token document gets pasted whole into a 128K context window. That’s waste.
- Redundant API calls — Running the same query twice because you didn’t cache results, or making separate calls when batching would work.
- Model choice misalignment — Using GPT-4o ($15 per 1M input tokens) for tasks that Grok-2 ($2 per 1M) handles identically.
- Temperature and sampling overhead — Running the same prompt multiple times to “get better outputs” instead of tuning the system once.
At AlgoVesta, we were spending roughly $3,200/month on Claude API calls for market analysis. After systematic optimization, we cut that to $850/month using the techniques below — and actually improved output consistency by 12% because we stopped fighting bad prompts with extra processing.
The gap wasn’t model selection. It was input hygiene.
Technique 1: Token-Efficient Prompting Through Selective Summarization
Your prompt is probably too long.
Most teams include the full document, the full context, and a full explanation of what they want. This is intuitive and wrong. Long prompts don’t improve quality when you’re working with modern models — they just inflate your bill.
The principle: Extract and compress information before sending it to the API. Don’t ask the model to do your preprocessing.
Bad approach:
user_message = f"""
Here is a 12,000-word financial report about TechCorp Q3 earnings:
{full_report_text}
Analyze the cash flow statement, balance sheet trends, and management commentary.
Identify the top 3 risks mentioned anywhere in the document.
Return JSON format.
"""
# Input tokens: ~14,500
# Cost per request: $0.22
# Daily volume (100 requests): $22
Improved approach:
# Step 1: Extract relevant sections locally (no API cost)
sections = extract_sections(full_report, [
'cash_flow_statement',
'management_discussion',
'risk_factors'
])
# Step 2: Summarize each section to key facts (one quick API call)
summaries = []
for section_name, section_text in sections.items():
summary = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=300,
messages=[{
"role": "user",
"content": f"Extract 5 key facts from this {section_name}:\n{section_text}"
}]
)
summaries.append(summary.content[0].text)
# Step 3: Analyze compressed data (main request)
analysis_prompt = f"""
Cash flow summary: {summaries[0]}
Management summary: {summaries[1]}
Risk summary: {summaries[2]}
Identify the top 3 risks. Return JSON.
"""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=400,
messages=[{"role": "user", "content": analysis_prompt}]
)
# Total input tokens across 2 requests: ~2,800
# Total cost per request: $0.04
# Daily volume (100 requests): $4
# Savings: 82% reduction
The improved version makes 2 API calls instead of 1, but the second call costs nothing to your monthly budget compared to the first. Why? Because you’re not asking Claude to scan a 12,000-word document twice. You’re asking it to analyze a 300-token summary.
When to use this: Any task where the input document is longer than 2,000 tokens and you’re not doing few-shot learning (where the examples themselves are the point).
When this fails: If the task requires nuance from the original text (e.g., “find the exact phrase that reveals the CEO’s concern about supply chains”), compression loses fidelity. In those cases, use the raw document but implement technique 3 instead.
Technique 2: Dynamic Model Selection Based on Task Complexity
Teams pick one model and use it for everything. This is like using a bulldozer to move a book.
The cost difference between models is dramatic, but what matters is the performance-per-dollar. A model that’s 40% cheaper but requires you to rerun queries 2x has a negative ROI.
| Task Type | Recommended Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Latency | Quality Match |
|---|---|---|---|---|---|
| Structured extraction (JSON from text) | Claude 3.5 Sonnet | $3 | $15 | 400ms avg | 97% accuracy on schema tasks |
| Simple classification (sentiment, category) | Grok-2 | $2 | $10 | 250ms avg | 95% accuracy |
| Complex reasoning (multi-step analysis) | GPT-4o | $5 | $15 | 600ms avg | 99% accuracy |
| Content generation (summary, rewrite) | Claude 3.5 Haiku | $0.80 | $4 | 300ms avg | 92% quality (good enough for drafts) |
| Local deployment (privacy-critical) | Llama 3.1 70B | $0 (self-hosted) | $0 (self-hosted) | 800ms avg on A100 | 88% accuracy (varies by quantization) |
The strategy isn’t “use the cheapest model.” It’s “route each task to the model that delivers acceptable output at the lowest cost.”
Implementation pattern:
def route_to_model(task_type: str, complexity_score: float):
"""
complexity_score = 0–10 scale based on task characteristics
- Multi-step reasoning: 8–10
- Nuanced extraction: 6–7
- Binary classification: 2–3
"""
if task_type == "classification" and complexity_score < 3:
return "grok-2"
elif task_type == "extraction" and complexity_score < 6:
return "claude-3-5-sonnet-20241022"
elif task_type == "generation" and complexity_score < 4:
return "claude-3-5-haiku-20241022"
elif complexity_score >= 8:
return "gpt-4o"
else:
return "claude-3-5-sonnet-20241022" # safe default
# Example workflow
tasks = [
{"text": "Is this review positive?", "type": "classification", "complexity": 2},
{"text": "Extract all mentioned dates...", "type": "extraction", "complexity": 5},
{"text": "Analyze causation across...", "type": "reasoning", "complexity": 9},
]
for task in tasks:
model = route_to_model(task["type"], task["complexity"])
response = client.messages.create(
model=model,
messages=[{"role": "user", "content": task["text"]}],
max_tokens=200
)
# Log which model was used for cost tracking
log_model_usage(model, response.usage.input_tokens)
When to use this: When you have heterogeneous workloads (marketing, analytics, customer support, data extraction) all sharing one API budget.
When this fails: If your workload is consistently complex reasoning (e.g., you’re building an AI researcher assistant), routing to cheaper models will degrade quality faster than you’ll save money. Stick with GPT-4o or Claude Opus.
Technique 3: Caching and Deduplication for Repetitive Workflows
Your team is asking the same questions every day and paying full price every time.
Example: Your sales team runs product comparison reports. Every day, 150 sales reps query the API with variations of “compare our pricing to Competitor X.” Without caching, that’s 150 full API calls at full cost. With caching, it’s 1 API call, then 149 cache hits at 10% the cost.
Claude API includes prompt caching natively (as of March 2024). GPT-4o does not yet. This is a significant advantage for cost optimization if your workflow supports it.
Working example with Claude API prompt caching:
import anthropic
client = anthropic.Anthropic(api_key="your-api-key")
# This content will be cached
competitor_knowledge_base = """
Competitor X:
- Pricing: $99/month enterprise, $29/month startup
- Feature set: API, webhooks, SSO, custom integrations
- Uptime SLA: 99.9%
- Support: Email and chat during business hours
Competitor Y:
- Pricing: $149/month enterprise, $49/month startup
- Feature set: API, webhooks, SSO, advanced analytics, white-label
- Uptime SLA: 99.95%
- Support: 24/7 phone and chat
Our product:
- Pricing: $79/month enterprise, $19/month startup
- Feature set: API, webhooks, SSO, advanced analytics, white-label, AI-powered insights
- Uptime SLA: 99.99%
- Support: 24/7 phone, chat, and dedicated account manager for enterprise
"""
def compare_products(comparison_request: str) -> str:
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=500,
system=[
{
"type": "text",
"text": "You are a product comparison expert. Use the knowledge base to provide accurate, balanced comparisons."
},
{
"type": "text",
"text": competitor_knowledge_base,
"cache_control": {"type": "ephemeral"} # Cache this block
}
],
messages=[
{
"role": "user",
"content": comparison_request
}
]
)
# Log cache performance
usage = response.usage
print(f"Cache creation tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")
print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
print(f"Regular input tokens: {usage.input_tokens}")
return response.content[0].text
# First call: builds cache
result1 = compare_products("How does our pricing compare to Competitor X for startups?")
# Cache creation cost: ~2,500 tokens cached
# Input cost: ~200 new tokens
# Second call: uses cache
result2 = compare_products("What about Competitor Y's support options vs ours?")
# Cache read cost: ~2,500 tokens at 10% of normal price = ~250 tokens
# Input cost: ~150 new tokens
# Total savings: ~2,150 tokens vs full price
# By call 10, that cached knowledge base has paid for itself 100x
Cost breakdown: Cache creation costs full price (once). Cached tokens on subsequent reads cost 10% of input token price. If your knowledge base is 2,500 tokens and you query it 50 times per day, monthly savings = (50 × 30 × 2,500 × 0.90 × $3 per 1M) = ~$10,000/month.
When to use this: When you have stable, reusable context (product docs, company knowledge bases, reference materials) that appears in 10+ queries per month.
When this fails: If your context changes frequently (daily), cache invalidation overhead exceeds the benefit. Also: GPT-4o doesn’t support prompt caching yet, so this only helps if you’re using Claude.
Technique 4: Batch Processing for Non-Latency-Sensitive Tasks
Not everything needs a response in 400 milliseconds.
If you’re processing 10,000 customer feedback entries for monthly sentiment analysis, or analyzing a backlog of 5,000 support tickets, batch APIs (available from Anthropic and OpenAI as of late 2024) offer 50% cost reduction on input tokens.
Working batch example with Claude Batch API:
import anthropic
import json
from datetime import datetime
client = anthropic.Anthropic(api_key="your-api-key")
# Prepare your batch requests
tickets = [
{"id": "TICKET-001", "text": "The login button doesn't work on mobile"},
{"id": "TICKET-002", "text": "Feature request: dark mode"},
{"id": "TICKET-003", "text": "API documentation is outdated"},
# ... 4,997 more tickets
]
# Format as batch requests
requests = []
for ticket in tickets:
requests.append({
"custom_id": ticket["id"],
"params": {
"model": "claude-3-5-sonnet-20241022",
"max_tokens": 100,
"messages": [
{
"role": "user",
"content": f"Classify this support ticket as: BUG, FEATURE_REQUEST, or DOCS_ISSUE. Ticket: {ticket['text']}"
}
]
}
})
# Submit batch
with open("batch_requests.jsonl", "w") as f:
for req in requests:
f.write(json.dumps(req) + "\n")
with open("batch_requests.jsonl", "rb") as f:
batch_response = client.beta.messages.batches.create(
requests=f
)
print(f"Batch ID: {batch_response.id}")
print(f"Status: {batch_response.processing_status}")
# Check status after a few hours
status = client.beta.messages.batches.retrieve(batch_response.id)
print(f"Succeeded: {status.request_counts.succeeded}")
print(f"Errors: {status.request_counts.errored}")
if status.processing_status == "ended":
# Process results
for event in client.beta.messages.batches.results(batch_response.id):
if event.result.type == "succeeded":
ticket_id = event.custom_id
classification = event.result.message.content[0].text
print(f"{ticket_id}: {classification}")
# Cost: 10,000 requests × ~80 tokens each = 800,000 input tokens
# Batch price: $1.50 per 1M input = $1.20
# Regular price (same volume): $2.40
# Savings: 50% ($1.20 saved)
The tradeoff: Batches complete within 24 hours, typically faster. You don’t get responses immediately. This is fine for analytics, classification, historical analysis — anything not customer-facing in real-time.
When to use this: Monthly reporting, bulk content analysis, backlog processing, historical data enrichment. Minimum volume should be 1,000+ requests to justify the setup.
When this fails: Real-time applications (chatbots, live analysis, customer-facing features). Also: if your request volume is inconsistent (some days you have 100 requests, other days 10,000), batching creates scheduling complexity.
Technique 5: Output Token Optimization Through Constrained Formats
You’re paying for tokens the model didn’t need to generate.
When you ask for “JSON output,” the model still considers all possible ways to format JSON. It might add comments, extra fields, or verbose keys before settling on the most concise version. This costs tokens.
Structured output modes (available in Claude and GPT-4o as of 2024) constrain the model to generate only valid outputs matching your schema. This typically reduces output token count by 15–25%.
Bad approach (unstructured):
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=200,
messages=[{
"role": "user",
"content": "Extract name, email, and phone from this text: John Smith, john@example.com, 555-0123. Return as JSON."
}]
)
# Typical response:
# {
# "name": "John Smith",
# "email": "john@example.com",
# "phone": "555-0123"
# }
# Output tokens used: ~45
Improved approach (structured output):
from anthropic import Anthropic
client = Anthropic(api_key="your-api-key")
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=100,
messages=[{
"role": "user",
"content": "Extract contact information: John Smith, john@example.com, 555-0123"
}],
tools=[
{
"name": "contact_extractor",
"description": "Extract contact details from text",
"input_schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"email": {"type": "string"},
"phone": {"type": "string"}
},
"required": ["name", "email", "phone"]
}
}
]
)
# Response uses tool_use, constrained to schema
# Output tokens used: ~32
# Savings per extraction: ~13 tokens × $15 per 1M = $0.0002
# At 10,000 extractions/month: $2/month saved
# Doesn't sound like much, but multiply across all endpoints...
Real savings math: At scale, structured outputs reduce output token consumption by 15–25%. If you’re processing 100,000 requests per month with average 150-token outputs, that’s 15M output tokens monthly. A 20% reduction = 3M fewer tokens. At GPT-4o pricing ($15 per 1M output), that’s $45/month saved per endpoint. Across 5 endpoints, that’s $225/month — real money.
When to use this: Any task with a predictable output schema (JSON extraction, classification, entity recognition).
When this fails: Free-form generation (content, summaries, explanations). Structured outputs force the model into a box. If the task requires creative output, this constrains quality.
The Full Stack: Integrated Optimization Workflow
These techniques don’t work in isolation. They compound.
Here’s the complete system we use at AlgoVesta:
- Input preprocessing: Strip unnecessary context locally before API calls. (Saves 40–60% input tokens)
- Task routing: Send simple tasks to Sonnet, complex reasoning to GPT-4o. (Saves 30% on model costs)
- Cache reusable context: Use prompt caching for stable knowledge bases. (Saves 90% on repeated queries)
- Batch non-urgent work: Process analytics and bulk operations overnight. (Saves 50% on input tokens)
- Constrain outputs: Use structured formats for extraction. (Saves 15–25% on output tokens)
- Monitor and adjust: Track which techniques save the most for your workload. (Saves an additional 10–15% through continuous refinement)
Applied together to a typical workload (40% extraction, 30% classification, 20% analysis, 10% generation):
- Input token reduction: 45% (preprocessing + routing)
- Model cost reduction: 25% (smart routing)
- Cache hit rate: 20% of queries use cached context (90% savings on those)
- Output optimization: 18% reduction
- Total combined savings: 58–62% of baseline costs
Measurement: What Actually Moved Your Bill
Without measurement, optimization is guessing.
Set up cost tracking immediately. Assign tags to each API request so you can slice by:
- Task type (extraction, classification, generation, analysis)
- Model used
- Whether cache was hit
- Whether batch or standard processing
- Department or product team
Implementation:
import anthropic
import json
from datetime import datetime
client = anthropic.Anthropic(api_key="your-api-key")
class APIMetrics:
def __init__(self):
self.metrics = []
def track_call(self, response, task_type: str, model: str, used_cache: bool = False):
metric = {
"timestamp": datetime.utcnow().isoformat(),
"task_type": task_type,
"model": model,
"used_cache": used_cache,
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"cost_usd": self.calculate_cost(model, response.usage)
}
self.metrics.append(metric)
return metric
def calculate_cost(self, model: str, usage) -> float:
pricing = {
"claude-3-5-sonnet-20241022": {"input": 3, "output": 15},
"claude-3-5-haiku-20241022": {"input": 0.8, "output": 4},
"gpt-4o": {"input": 5, "output": 15},
"grok-2": {"input": 2, "output": 10},
}
rates = pricing.get(model, {"input": 5, "output": 15})
input_cost = (usage.input_tokens / 1_000_000) * rates["input"]
output_cost = (usage.output_tokens / 1_000_000) * rates["output"]
return input_cost + output_cost
def summarize(self):
total_cost = sum(m["cost_usd"] for m in self.metrics)
by_type = {}
for m in self.metrics:
task_type = m["task_type"]
if task_type not in by_type:
by_type[task_type] = {"count": 0, "cost": 0, "avg_input": 0, "avg_output": 0}
by_type[task_type]["count"] += 1
by_type[task_type]["cost"] += m["cost_usd"]
by_type[task_type]["avg_input"] += m["input_tokens"]
by_type[task_type]["avg_output"] += m["output_tokens"]
for task_type in by_type:
count = by_type[task_type]["count"]
by_type[task_type]["avg_input"] = by_type[task_type]["avg_input"] / count
by_type[task_type]["avg_output"] = by_type[task_type]["avg_output"] / count
return {
"total_cost": total_cost,
"request_count": len(self.metrics),
"cost_per_request": total_cost / len(self.metrics) if self.metrics else 0,
"by_task_type": by_type
}
metrics = APIMetrics()
# Use it:
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=200,
messages=[{"role": "user", "content": "Classify this as positive/negative: I love this!"}]
)
metrics.track_call(response, task_type="classification", model="claude-3-5-sonnet-20241022")
# After a week:
print(json.dumps(metrics.summarize(), indent=2))
# Output:
# {
# "total_cost": 124.50,
# "request_count": 523,
# "cost_per_request": 0.238,
# "by_task_type": {
# "classification": {
# "count": 245,
# "cost": 23.50,
# "avg_input": 45,
# "avg_output": 12
# },
# ...
# }
# }
With this data, you can identify which task types are spending the most and target them for optimization next.
What to Do This Week
Pick one optimization. Don’t implement all five at once — that’s how you break production and learn nothing.
If your biggest bottleneck is input costs: Start with Technique 1 (selective summarization). Extract only the relevant sections before sending to the API. Measure the token reduction. Most teams see 40–50% savings in a single week here.
If you have diverse workloads: Implement Technique 2 (model routing). Define three task buckets (simple, medium, complex) and assign models. Run a parallel comparison for one week — your cheaper model against your current model — on the same 100 tasks. Log error rates and costs. The data will show you safe places to route to cheaper models.
If you process bulk data: Set up Technique 4 (batch processing) for your non-urgent workloads. It takes two hours to implement but delivers immediate 50% savings on input tokens for that workload. Start with one task (e.g., daily sentiment analysis) and expand once you’re comfortable with the API.
Then measure. Use the tracking code above. After two weeks of data, you’ll know which optimization delivers the best ROI for your specific usage pattern. Expand there.