You sent a 12,000-token prompt to Claude and got back a response that cut off mid-sentence. Or you built a system that worked fine in testing, then started failing in production because real user input pushed you over the limit. Token limits aren’t edge cases — they’re structural constraints you have to architect around.
Tokens aren’t words. That’s the first thing that breaks people’s intuition.
What Tokens Actually Are
A token is a chunk of text that a language model processes as a unit. One token can be a single character, part of a word, a whole word, or punctuation. The exact breakdown depends on the tokenizer — the algorithm that splits text into pieces before the model sees it.
English text averages about 1.3 tokens per word, but that’s just an average. Code is denser — often 1.7+ tokens per word because operators and brackets tokenize separately. JSON is even worse. A single space or newline can be its own token.
This matters because you’re charged per token, and your context window is measured in tokens, not words. If you think you have 128K tokens of room and you’re storing text at 1.5 tokens per word, you actually have about 85,000 words — not 128,000.
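The conversion is worth wiring into a helper rather than redoing by hand. A minimal sketch (the 1.5 tokens-per-word ratio is the same rough heuristic as above, not a measured value):

```python
def usable_words(context_tokens: int, tokens_per_word: float = 1.5) -> int:
    """Rough word capacity of a token budget, given an assumed density."""
    return int(context_tokens / tokens_per_word)

# A 128K-token window holds far fewer words than 128,000
print(usable_words(128_000))       # 85333
print(usable_words(128_000, 1.7))  # denser text (e.g. code) fits even less
```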
Most models publish a single context-window limit that covers input + output. Claude 3.5 Sonnet has a 200K-token context window. That means your prompt (input tokens) plus the model's response (output tokens) together cannot exceed 200,000. If your prompt is 150K tokens, you have roughly 50K tokens of window left for the response, though in practice the response is also capped separately by the max_tokens parameter you set on the request.
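You can sanity-check the budget before making a call. In the sketch below, the 8,192-token output cap is an assumption based on this model generation's documented max_tokens limit; verify it against your model's docs:

```python
CONTEXT_WINDOW = 200_000  # Claude 3.5 Sonnet: input + output share this window
MAX_OUTPUT = 8_192        # assumed per-request output cap; model-specific

def output_budget(prompt_tokens: int) -> int:
    """Tokens actually available for the response."""
    remaining = CONTEXT_WINDOW - prompt_tokens
    return max(0, min(remaining, MAX_OUTPUT))

# A 150K-token prompt leaves 50K of window, but the output cap binds first
print(output_budget(150_000))  # 8192
print(output_budget(199_000))  # 1000
```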
Why This Breaks Your Actual Plans
The most common failure: you design a system that works with a 10K-token prompt in isolation, then add RAG retrieval, conversation history, system instructions, and user input all stacked together. Now you’re at 45K tokens per request, and either you hit limits or your costs spike 4–5x what you estimated.
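The stacking is easy to see with back-of-envelope numbers. The component sizes below are hypothetical, chosen only to show how a 10K-token test prompt becomes a 45K-token production request:

```python
# Hypothetical per-request components, in tokens
components = {
    "system_instructions": 1_500,
    "rag_retrieval": 25_000,
    "conversation_history": 12_000,
    "user_input": 6_500,
}

total = sum(components.values())
print(total)            # 45000
print(total / 10_000)   # 4.5x what you tested with
```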
The second failure: you stuff everything into the context because you can, then the model’s output quality drops. Long contexts hurt reasoning. That’s not hyperbole — it’s measurable. Claude’s performance on tasks degrades noticeably beyond about 100K tokens, even though it can handle 200K.
The third failure: you don’t account for output tokens. You calculate your input cost, ship the system, and then discover the model’s responses are longer than expected. A 100-token prompt might generate an 800-token response if you’re asking for detailed analysis. Suddenly your per-request cost is 900 tokens, not 100.
Calculating Your Actual Token Usage
Stop guessing. Measure it.
Use the model provider’s tokenizer before you deploy anything. For Claude, use the token-counting API in the anthropic Python package. For GPT models, use tiktoken. Run your actual prompts through these and log the token counts.
from anthropic import Anthropic

client = Anthropic()

# Your prompt
system_prompt = """You are an analyst. Extract key metrics from the provided data.
Be concise. Format as JSON."""

user_input = """Here's Q3 financial data for Acme Corp...
[4000 words of actual data]
"""

# Count tokens BEFORE calling the API
token_count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system=system_prompt,
    messages=[{"role": "user", "content": user_input}],
).input_tokens

print(f"Your prompt: {token_count} tokens")

# Now make the call
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    system=system_prompt,
    messages=[{"role": "user", "content": user_input}],
)

output_tokens = response.usage.output_tokens
print(f"Model response: {output_tokens} tokens")
print(f"Total cost: {token_count + output_tokens} tokens")
This isn’t optional. You need the actual numbers before you design the system architecture.
Structural Approaches to Stay Under Limits
Compress your system prompt. Unnecessary instructions add tokens without adding value. Compare:
# Bad system prompt (287 tokens)
You are a helpful customer service representative. You work for TechCorp,
a software company. When customers contact you, it is important that you
be polite, professional, and helpful. You should try to understand their
problems and help them find solutions. Always be respectful and patient.
Never be rude. You can provide technical information about our products.
Make sure to ask clarifying questions when needed. If you don't know the
answer, tell the customer you'll look into it.
# Good system prompt (89 tokens)
You are TechCorp customer support. Be direct and professional.
Ask clarifying questions. If you don't know, say so.
Provide technical product information. Stay focused on solving the issue.
Both convey the same instructions. The second is 69% smaller.
Use pagination for large documents. Don’t load all 50 pages of a document into one prompt. Split it into sections, retrieve only the relevant chunks via search or semantic matching, and pass those. This is why RAG systems exist — they’re token-efficient by design.
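A minimal chunking sketch, using the rough tokens-per-word heuristic from earlier as a stand-in for a real tokenizer (in production, count with the provider's tokenizer instead):

```python
def chunk_by_token_budget(text: str, max_tokens: int = 2_000,
                          tokens_per_word: float = 1.3) -> list[str]:
    """Split text on word boundaries into chunks that fit a token budget."""
    words_per_chunk = int(max_tokens / tokens_per_word)
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

doc = "word " * 5_000  # stand-in for a 5,000-word document
chunks = chunk_by_token_budget(doc, max_tokens=2_000)
print(len(chunks))  # 4 chunks of ~1,538 words each
```

Retrieval then picks one or two relevant chunks per request instead of shipping all four.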
Limit conversation history. Keep the last 5–10 messages in a multi-turn conversation, not the entire chat. For most applications, older context adds noise, not signal, and costs tokens you don’t need to spend.
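A sliding-window sketch of that policy. Note that most chat APIs expect the first message in the list to be a user turn, so a production version should also drop any leading assistant message after trimming:

```python
def trim_history(messages: list[dict], keep_last: int = 8) -> list[dict]:
    """Keep only the most recent conversation turns."""
    return messages[-keep_last:]

# 30 alternating turns; only the last 8 get sent with the next request
history = [{"role": "user" if i % 2 == 0 else "assistant",
            "content": f"turn {i}"} for i in range(30)]
trimmed = trim_history(history)
print(len(trimmed), trimmed[0]["content"])  # 8 turn 22
```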
Structure output format from the start. If you want JSON, say it in the system prompt, not in the user message. If you want exactly 3 bullet points, specify that. Explicit formatting saves the model from generating fluff, which reduces output tokens.
What to Do Right Now
Pick one of your active prompts — something you’re using in production or testing regularly. Measure its actual token count using the provider’s tokenizer. Include the system prompt, the user input, and estimate the response length.
Calculate your total: input + output tokens. Now multiply by your usage volume over a month. If that number surprises you, compress your system prompt using the patterns above, then re-measure. You’ll often find 20–30% token savings from removing redundant instructions.
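To turn that monthly token total into dollars, multiply by your provider's per-token rates. The rates below are placeholders for illustration, not current pricing; look up your model's actual rates:

```python
# Assumed example rates, in dollars per million tokens (verify against
# your provider's current pricing page)
INPUT_RATE = 3.00    # $/1M input tokens
OUTPUT_RATE = 15.00  # $/1M output tokens

def monthly_cost(input_tokens: int, output_tokens: int,
                 requests_per_month: int) -> float:
    """Dollar cost for a month of traffic at fixed per-request sizes."""
    per_request = (input_tokens * INPUT_RATE +
                   output_tokens * OUTPUT_RATE) / 1_000_000
    return per_request * requests_per_month

# 10K-token prompt, 800-token response, 50K requests/month
print(round(monthly_cost(10_000, 800, 50_000), 2))  # 2100.0
```

Re-run the same calculation after compressing your system prompt to see the savings in dollars rather than tokens.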