You hit a token limit yesterday. Not because your prompt was verbose, but because you didn’t understand how tokens work. You sent a 3,000-word document to Claude and watched the response cut off mid-sentence. The issue wasn’t your request. It was that you didn’t account for output tokens, system prompts, and how different models count the same text differently.
Tokenization is how LLMs break text into chunks before processing. Understand it, and you’ll stop wasting API credits. Ignore it, and every integration you build will behave unexpectedly at scale.
What Tokens Actually Are
A token is not a word. This matters.
In most English text, one token ≈ 4 characters or 0.75 words. But that ratio breaks down fast. Punctuation, whitespace, numbers, code, and non-English text all tokenize differently. A comma might be one token. A number like 1,234,567 could be three or four. A common word like “the” is a single token, while a rarer word like “tokenization” may split into several pieces. The acronym CPU is sometimes one token, sometimes three, depending on the model.
Different models use different tokenizers. GPT-4o uses a different tokenizer than Claude 3.5 Sonnet. OpenAI’s tokenizers (cl100k_base for GPT-4, o200k_base for GPT-4o) count text one way. Anthropic’s tokenizer counts it another. The same prompt can cost 150 tokens in GPT-4o and 140 tokens in Claude, or vice versa.
This inconsistency is why you can’t estimate token counts in your head. You need to measure.
Input vs. Output Tokens (and Why Both Matter)
Token limits have two sides: what you send in and what comes back out.
Your input includes the system prompt, the user message, any conversation history, and any context you’ve added. Your output is the model’s response. Most platforms price them separately — output tokens often cost more than input tokens. GPT-4o, for example, charges $5 per 1M input tokens and $15 per 1M output tokens. That 3:1 ratio changes how you should structure your requests.
If you’re summarizing documents, those tokens are input-heavy. If you’re generating code, output tokens dominate. If you’re doing a multi-turn conversation where you’re repeating context each turn, you’re paying input token costs repeatedly for the same information.
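At the rates quoted above (pinned to this article’s numbers; check your provider’s current pricing before relying on them), a small helper makes the input/output asymmetry concrete:

```python
# Per-token rates derived from the article's quoted GPT-4o pricing:
# $5 per 1M input tokens, $15 per 1M output tokens.
INPUT_RATE = 5.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the rates above."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Summarization: input-heavy. 10,000 tokens in, 500 out.
print(f"${request_cost(10_000, 500):.4f}")   # $0.0575, input dominates

# Code generation: output-heavy. 500 tokens in, 10,000 out.
print(f"${request_cost(500, 10_000):.4f}")   # $0.1525, output dominates
```

Same total token count in both cases, but the output-heavy request costs well over twice as much.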
Token Limits in Production: Where They Break
Context windows have grown — Claude 3.5 Sonnet supports 200k tokens, GPT-4o supports 128k. But growth doesn’t mean your limits are gone. It means your limits are different now.
Here’s the real pattern: as your window grows, your prompts grow to fill it. Engineers start including entire codebases instead of snippets. Analysts include full datasets instead of samples. Lawyers include entire documents instead of excerpts. The cost per request climbs. Response latency increases. And somewhere around 80-85% of your available context, most models start performing worse — not better. They lose focus in long contexts.
The practical limit is not the technical maximum. It’s the point where cost or latency becomes unacceptable, or where model accuracy drops.
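One way to enforce that practical limit in code is a self-imposed utilization cap. The 80% threshold below is an assumption drawn from the pattern described above, not a documented model property; tune it against your own latency and accuracy measurements:

```python
def within_practical_budget(prompt_tokens: int,
                            max_output_tokens: int,
                            context_window: int,
                            utilization_cap: float = 0.80) -> bool:
    """Return True if the request stays under a self-imposed fraction of
    the model's context window. The 0.80 default mirrors the observation
    that quality tends to degrade past roughly 80% utilization."""
    return prompt_tokens + max_output_tokens <= context_window * utilization_cap

# Against a 128k window: a 100k-token prompt plus 4k of reserved output
# exceeds the 80% budget, while 90k + 4k fits.
print(within_practical_budget(100_000, 4_000, 128_000))  # False
print(within_practical_budget(90_000, 4_000, 128_000))   # True
```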
How to Count Tokens Without Guessing
Use the model’s own tokenizer. Don’t estimate.
For OpenAI models, use the tiktoken library:
```python
import tiktoken

# Load the tokenizer that matches the target model.
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Your prompt text here"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")
# $5 per 1M input tokens -> $0.000005 per token.
print(f"Estimated cost (input): ${len(tokens) * 0.000005:.4f}")
```
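Chat requests also carry per-message formatting overhead beyond the raw text (role markers, separators). The sketch below takes any encode function, so you can pass `enc.encode` from tiktoken for real counts; the 4-tokens-per-message overhead is an assumed placeholder you should verify against your model, not a documented constant:

```python
from typing import Callable, Iterable, Mapping

def chat_token_count(messages: Iterable[Mapping[str, str]],
                     encode: Callable[[str], list],
                     per_message_overhead: int = 4) -> int:
    """Count tokens for a chat-style request, including an assumed
    fixed overhead per message on top of the message text itself."""
    total = 0
    for msg in messages:
        total += per_message_overhead
        total += len(encode(msg["content"]))
    return total

# Demo with a stand-in encoder (one "token" per whitespace-separated chunk).
# Swap in a real tokenizer's encode for actual counts.
fake_encode = str.split
msgs = [{"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count my tokens please"}]
print(chat_token_count(msgs, fake_encode))  # 2*4 overhead + 5 + 4 = 17
```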
For Anthropic models, use the count_tokens API or their Python library:
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# count_tokens returns the input token count without running the model.
message = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Your prompt text here"}
    ],
)

print(f"Token count: {message.input_tokens}")
```
This is not optional. When you ship an integration, you need exact counts, not approximations. The difference between “roughly 100 tokens” and “106 tokens” is the difference between staying under your batch limit and failing at 3 a.m.
Strategies That Actually Save Tokens
Token-saving is not about writing shorter prompts. It’s about structure.
1. Don’t repeat context in multi-turn conversations. If you’re building a chatbot, send the system prompt once and rely on conversation memory. Each turn you repeat it, you’re wasting tokens. When building a customer service workflow where you’re processing many separate requests, yes, include context each time — but keep it minimal.
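One minimal way to bound multi-turn input costs is to trim the oldest turns once the running history exceeds a token budget. This is a sketch of one possible policy (drop-oldest-first, system prompt always kept), not the article’s prescribed method; the encoder is injected so you can pass a real tokenizer’s encode:

```python
from typing import Callable, List, Mapping

def trim_history(history: List[Mapping[str, str]],
                 encode: Callable[[str], list],
                 budget: int) -> List[Mapping[str, str]]:
    """Drop the oldest turns until the remaining history fits the budget.
    The system prompt (first message) is always kept."""
    system, turns = history[0], list(history[1:])

    def count(msgs):
        return sum(len(encode(m["content"])) for m in msgs)

    while turns and count([system] + turns) > budget:
        turns.pop(0)  # drop the oldest user/assistant turn first
    return [system] + turns

# Demo with a stand-in whitespace encoder; real use would pass enc.encode.
fake_encode = str.split
hist = [{"role": "system", "content": "Be brief"},
        {"role": "user", "content": "first question here"},
        {"role": "assistant", "content": "first answer"},
        {"role": "user", "content": "second question"}]
print(len(trim_history(hist, fake_encode, budget=8)))  # oldest turn dropped
```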
2. Use examples strategically. Few-shot prompts (including examples) are cheap in the moment but expensive over time. One conversation with examples costs more upfront than zero-shot. But if those examples reduce error rates by 30%, the token cost is worth it. If they change the output by 2%, they’re wasted tokens. Measure the trade-off.
Bad approach:
You are an email classifier. Classify each email as "sales", "support", or "spam".
Example 1: "Check out our new product!" → sales
Example 2: "Your order is ready for pickup" → support
Example 3: "Click here for free money" → spam
Example 4: "Special offer inside" → sales
Example 5: "Your password was reset" → support
Classify this email: [user input]
Better approach (for a single classification):
Classify this email as "sales", "support", or "spam": [user input]
The second saves roughly 80 tokens per request. If you’re classifying thousands of emails, that’s significant cost reduction. Add examples back only if your error rate justifies the cost.
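Whether the examples earn their keep can be framed as a per-request break-even check. Every number below (error-rate reduction, cost of handling a misclassified email) is a hypothetical placeholder you would measure for your own workload:

```python
def examples_pay_off(extra_tokens_per_request: float,
                     token_price: float,
                     error_reduction: float,
                     cost_per_error: float) -> bool:
    """True if few-shot examples save more (in avoided errors) than
    they cost in extra input tokens, per request."""
    token_cost = extra_tokens_per_request * token_price
    error_savings = error_reduction * cost_per_error
    return error_savings > token_cost

# Assumptions: 80 extra tokens at $5/1M; each misclassified email costs
# an estimated $0.01 to handle downstream.
print(examples_pay_off(80, 5 / 1_000_000, 0.30, 0.01))   # 30% reduction: True
print(examples_pay_off(80, 5 / 1_000_000, 0.003, 0.01))  # tiny reduction: False
```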
3. Use summarization to compress context. If you need to pass document history or prior analysis into a new request, summarize it first. The summary costs input tokens upfront but saves tokens on every downstream request that needs that context.
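A rough break-even sketch for the summarization trade-off, in raw token counts. It deliberately ignores the input/output price difference (the summary is output, the savings are input), so treat it as a first approximation rather than an exact cost model:

```python
def summarize_break_even(doc_tokens: int,
                         summary_tokens: int,
                         reuse_count: int) -> int:
    """Net tokens saved by summarizing once versus resending the full
    document. One-time cost: reading the doc plus writing the summary.
    Each downstream request then sends the summary instead of the doc."""
    one_time = doc_tokens + summary_tokens
    per_request_saving = doc_tokens - summary_tokens
    return per_request_saving * reuse_count - one_time

# An 8,000-token document compressed to a 400-token summary, reused 5 times:
print(summarize_break_even(8_000, 400, 5))  # 29600 tokens saved net
```

With a single reuse the result goes negative: summarization only pays off when the compressed context is reused enough times.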
4. Batch similar requests. Instead of sending five separate API calls for five different prompts, combine them when possible. One request to classify ten emails costs fewer tokens than ten requests to classify one email each.
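A trivial sketch of the batching idea: the shared instruction header is paid for once per batch instead of once per email. The prompt format here is illustrative, not a recommended template:

```python
def batched_prompt(emails: list[str]) -> str:
    """Combine several classification requests into one prompt so the
    shared instructions are sent once instead of once per email."""
    header = 'Classify each email as "sales", "support", or "spam".\n'
    body = "\n".join(f"{i + 1}. {email}" for i, email in enumerate(emails))
    return header + body

emails = ["Check out our new product!", "Your order is ready for pickup"]
print(batched_prompt(emails))
```

In practice you would also instruct the model to return one label per numbered line, so the batched response stays parseable.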
What to Do Today
Pick one integration or workflow you built in the last month. Run the actual prompts through your model’s token counter. Calculate the real input and output token costs per request. If you’re running this at scale, multiply by your request volume and your cost per token. You’ll find either that you have room to add context and improve quality, or that you need to cut ruthlessly. Most people discover they’ve been massively over-tokenizing because they were guessing counts.
Start measuring. Everything else follows from there.