
Tokenization Explained: Why Token Limits Matter and How to Work Within Them

Tokens are not words, and they're counted differently by every model. Learn exactly how tokenization works, where limits break in production, and concrete techniques to stay under budget without sacrificing quality.


You hit a token limit yesterday. Not because your prompt was verbose, but because you didn’t understand how tokens work. You sent a 3,000-word document to Claude, watched it process the first 2,800 words, then got back a truncated response. The issue wasn’t your request. It was that you didn’t account for output tokens, system prompts, and how different models count the same text differently.

Tokenization is how LLMs break text into chunks before processing. Understand it, and you’ll stop wasting API credits. Ignore it, and every integration you build will behave unexpectedly at scale.

What Tokens Actually Are

A token is not a word. This matters.

In most English text, one token ≈ 4 characters or 0.75 words. But that ratio breaks down fast. Punctuation, whitespace, numbers, code, and non-English text all tokenize differently. A comma might be one token. A number like 1,234,567 could be three or four. A short word like “the” is one token, while a longer word like “tokenization” may split into two or more. The acronym “CPU” is sometimes one token, sometimes three, depending on the model.
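You can see this variability directly. Here is a minimal sketch using OpenAI’s tiktoken library (more on counting below); the sample strings are arbitrary, and the counts will differ under Anthropic’s tokenizer:

import tiktoken

# Load the encoding tiktoken associates with gpt-4o.
enc = tiktoken.encoding_for_model("gpt-4o")

# Arbitrary sample strings; run this to see how unevenly they split.
for text in ["Hello, world!", "1,234,567", "tokenization", "CPU"]:
    token_ids = enc.encode(text)
    print(f"{text!r} -> {len(token_ids)} token(s)")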

Different models use different tokenizers. GPT-4o uses a different tokenizer than Claude 3.5 Sonnet. OpenAI’s o200k_base tokenizer, which GPT-4o uses, counts text one way. Anthropic’s tokenizer counts it another. The same prompt can cost 150 tokens in GPT-4o and 140 tokens in Claude, or vice versa.

This inconsistency is why you can’t estimate token counts in your head. You need to measure.

Input vs. Output Tokens (and Why Both Matter)

Token limits have two sides: what you send in and what comes back out.

Your input includes the system prompt, the user message, any conversation history, and any context you’ve added. Your output is the model’s response. Most platforms price them separately — output tokens often cost more than input tokens. GPT-4o, for example, charges $5 per 1M input tokens and $15 per 1M output tokens. That 3:1 ratio changes how you should structure your requests.
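To make the ratio concrete, here is a small sketch that prices a single request at the GPT-4o rates quoted above; check current pricing before relying on the numbers:

def request_cost(input_tokens, output_tokens,
                 input_price_per_m=5.00, output_price_per_m=15.00):
    # Dollars per request at the quoted GPT-4o rates.
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# 2,000 input tokens and 500 output tokens:
print(f"${request_cost(2000, 500):.4f}")  # $0.0175

Notice that the 500 output tokens cost almost as much as the 2,000 input tokens. That is the 3:1 ratio at work.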

If you’re summarizing documents, those tokens are input-heavy. If you’re generating code, output tokens dominate. If you’re doing a multi-turn conversation where you’re repeating context each turn, you’re paying input token costs repeatedly for the same information.

Token Limits in Production: Where They Break

Context windows have grown — Claude 3.5 Sonnet supports 200k tokens, GPT-4o supports 128k. But growth doesn’t mean your limits are gone. It means your limits are different now.

Here’s the real pattern: as your window grows, your prompts grow to fill it. Engineers start including entire codebases instead of snippets. Analysts include full datasets instead of samples. Lawyers include entire documents instead of excerpts. The cost per request climbs. Response latency increases. And somewhere around 80-85% of your available context, most models start performing worse — not better. They lose focus in long contexts.

The practical limit is not the technical maximum. It’s the point where cost or latency becomes unacceptable, or where model accuracy drops.
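One way to encode that practical limit is a budget check before each request. This is a sketch, assuming GPT-4o’s 128k window and the roughly 80% dropoff point described above; the threshold is a judgment call, not an API constant:

import tiktoken

CONTEXT_WINDOW = 128_000   # GPT-4o, per the numbers above
PRACTICAL_FRACTION = 0.8   # stay below the long-context dropoff zone

enc = tiktoken.encoding_for_model("gpt-4o")

def within_budget(prompt, max_output_tokens):
    # Budget the response space too, not just the input.
    used = len(enc.encode(prompt)) + max_output_tokens
    return used <= CONTEXT_WINDOW * PRACTICAL_FRACTION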

How to Count Tokens Without Guessing

Use the model’s own tokenizer. Don’t estimate.

For OpenAI models, use the tiktoken library:

import tiktoken

# encoding_for_model resolves the tokenizer for a given model
# (o200k_base for gpt-4o).
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Your prompt text here"
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
# $5 per 1M input tokens = $0.000005 per token
print(f"Estimated cost (input): ${len(tokens) * 0.000005:.4f}")

For Anthropic models, use the count_tokens endpoint via their Python SDK:

from anthropic import Anthropic

client = Anthropic()

# count_tokens reports the input token count for a prospective request,
# system prompt included, without generating a response.
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Your prompt text here"}
    ]
)
print(f"Token count: {count.input_tokens}")

This is not optional. When you ship an integration, you need exact counts, not approximations. The difference between “roughly 100 tokens” and “106 tokens” is the difference between staying under your batch limit and failing at 3 a.m.

Strategies That Actually Save Tokens

Token-saving is not about writing shorter prompts. It’s about structure.

1. Don’t repeat context in multi-turn conversations. If you’re building a chatbot, set the instructions once in the system prompt and append only the new turn each time; don’t paste the same instructions into every user message. Each turn you repeat them, you’re wasting tokens. When building a customer service workflow that processes many separate requests, yes, include context each time, but keep it minimal. A minimal chat-loop sketch follows.
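Here is one way that looks in practice, sketched with Anthropic’s Python SDK; the system prompt text is a placeholder:

from anthropic import Anthropic

client = Anthropic()
SYSTEM = "You are a support assistant."  # instructions live here, once
history = []

def chat(user_text):
    # Append only the new turn; instructions are never duplicated
    # inside user messages.
    history.append({"role": "user", "content": user_text})
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=SYSTEM,
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply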

2. Use examples strategically. Few-shot prompts (including examples) are cheap in the moment but expensive over time. One conversation with examples costs more upfront than zero-shot. But if those examples reduce error rates by 30%, the token cost is worth it. If they improve results by only 2%, they’re wasted tokens. Measure the trade-off.

Bad approach:

You are an email classifier. Classify each email as "sales", "support", or "spam".

Example 1: "Check out our new product!" → sales
Example 2: "Your order is ready for pickup" → support
Example 3: "Click here for free money" → spam
Example 4: "Special offer inside" → sales
Example 5: "Your password was reset" → support

Classify this email: [user input]

Better approach (for a single classification):

Classify this email as "sales", "support", or "spam": [user input]

The second saves roughly 80 tokens per request. If you’re classifying emails at scale, that adds up: at one million classifications, 80 extra tokens per request is 80 million input tokens, about $400 at GPT-4o’s input rate. Add examples back only if your error rate justifies the cost.

3. Use summarization to compress context. If you need to pass document history or prior analysis into a new request, summarize it first. The summary costs input tokens upfront but saves tokens on every downstream request that needs that context.
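A sketch of that pattern, again with Anthropic’s SDK; the word limit and prompt wording are arbitrary choices:

def compress_history(client, history):
    # Condense earlier turns into a short summary that replaces the
    # full transcript in later requests.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=300,  # cap the summary's own length
        messages=[{
            "role": "user",
            "content": "Summarize this conversation in under 150 words, "
                       "keeping decisions and open questions:\n\n" + transcript,
        }],
    )
    return response.content[0].text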

4. Batch similar requests. Instead of sending five separate API calls for five different prompts, combine them when possible. One request to classify ten emails costs fewer tokens than ten requests to classify one email each.
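For example, the email classifier above can take ten emails in one prompt. A sketch, with illustrative instruction wording:

def batch_classify_prompt(emails):
    # One request for N emails instead of N requests for one email each;
    # the fixed instruction cost is paid once.
    numbered = "\n".join(f"{i + 1}. {e}" for i, e in enumerate(emails))
    return ('Classify each email as "sales", "support", or "spam". '
            "Reply with one label per line, in order.\n\n" + numbered)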

What to Do Today

Pick one integration or workflow you built in the last month. Run the actual prompts through your model’s token counter. Calculate the real input and output token costs per request. If you’re running this at scale, multiply by your request volume and your cost per token. You’ll find either that you have room to add context and improve quality, or that you need to cut ruthlessly. Most people discover they’ve been massively over-tokenizing because they were guessing counts.
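If it helps, here is a small audit sketch along those lines, using tiktoken and the GPT-4o input price quoted earlier; swap in your own prompts and request volume:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def audit(prompts, requests_per_month, input_price_per_m=5.00):
    # Average input tokens per request, projected to a monthly cost.
    avg = sum(len(enc.encode(p)) for p in prompts) / len(prompts)
    monthly_tokens = avg * requests_per_month
    print(f"Avg input tokens per request: {avg:.0f}")
    print(f"Projected monthly input cost: "
          f"${monthly_tokens * input_price_per_m / 1_000_000:.2f}")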

Start measuring. Everything else follows from there.
