Every LLM interaction comes with a hidden tax. You hit “send” on a prompt, the model processes it, generates a response, and you pay by the token. Not by the word. Not by the character. By the token—a unit that doesn’t map cleanly to anything you can see on your screen.
Most people building with LLMs don’t actually know what a token is. They see a price per token, multiply it out, and move on. Then their costs explode for reasons they can’t explain. Or they bump into context window limits and wonder why their 8,000-word document got cut off mid-processing.
Tokenization isn’t just a billing detail. It’s the boundary condition of everything you build with language models. Understand it, and you unlock efficiency gains that compound across every system you run. Ignore it, and you’ll waste money, hit limits at frustrating moments, and build worse products.
What a Token Actually Is
A token is the smallest unit a language model works with. It’s not a word. It’s not a character. It’s something in between, and the exact boundaries shift depending on which model you use and which tokenizer encodes your text.
Think of it like this: a tokenizer is a lookup table. Raw text goes in one side. On the other side comes a sequence of integers—token IDs. The model sees those integers, processes them, and generates new integers back out.
Here’s what happens in practice:
- The word “hello” = 1 token
- The word “unfortunately” = 2 tokens (un + fortunately, in many tokenizers)
- The punctuation mark “.” = 1 token
- A leading space usually fuses with the word that follows (“ API” = 1 token, not 2)
- The sequence “\n\n” (paragraph break) = 1 token
Short, common words compress into single tokens. Longer or rarer words split across multiple tokens. Special characters, numbers, and whitespace all have their own encoding rules.
Different models use different tokenizers. OpenAI’s GPT-4 Turbo uses the cl100k_base tokenizer, while GPT-4o uses the newer o200k_base. Anthropic’s Claude models use their own tokenizer. Llama 3 70B uses yet another. This matters: the same sentence tokenizes to different lengths depending on which model processes it.
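You can check this divergence yourself with OpenAI’s open-source tiktoken library. A minimal sketch comparing two OpenAI encodings (tiktoken only ships OpenAI’s tokenizers, so it can’t reproduce Claude’s or Llama’s exact counts):
import tiktoken

text = "Every LLM interaction comes with a hidden tax."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")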
Why Token Count Isn’t Intuitive
You write 500 words. You assume that’s roughly 500 tokens, maybe 700 if you count formatting. Then you run it through a tokenizer and get 820 tokens. Or 1,240. Or something else entirely.
The mismatch happens because tokens don’t follow word boundaries. Here’s a real example:
# Input text
"The ChatGPT API doesn't return token counts automatically."
# Word count: 8
# Token count (OpenAI cl100k_base): 11
Why 11? Break it down:
- “The” = 1 token
- “ Chat” + “GPT” = 2 tokens (the brand name splits)
- ” API” = 1 token (space + word)
- “doesn” = 1 token
- “‘t” = 1 token (contraction splits)
- ” return” = 1 token
- ” token” = 1 token
- ” counts” = 1 token
- ” automatically” = 1 token
- “.” = 1 token
Contractions split. Brand names split. Whitespace attaches to neighboring words. Your gut assumption about token density doesn’t hold.
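You don’t have to take a breakdown like this on faith. A quick sketch with tiktoken prints every token exactly as the model sees it:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The ChatGPT API doesn't return token counts automatically."
for token_id in enc.encode(text):
    print(token_id, repr(enc.decode([token_id])))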
This matters practically because:
- You can’t budget costs accurately without knowing your actual token density
- You can’t design efficient prompts if you’re guessing token counts
- You’ll hit context limits unexpectedly if you misjudge how much content fits
Context Windows: The Hard Limits
Every LLM has a context window—the maximum number of tokens it can process in a single interaction. This includes your prompt (input tokens) plus the model’s response (output tokens).
| Model | Context Window | Max Output |
|---|---|---|
| Claude 3.5 Sonnet | 200,000 tokens | 8,192 tokens |
| GPT-4o | 128,000 tokens | 16,384 tokens |
| GPT-4 Turbo | 128,000 tokens | 4,096 tokens |
| Llama 3 70B | 8,192 tokens | N/A (varies) |
| Mistral 7B | 32,768 tokens | N/A (varies) |
Claude 3.5 Sonnet’s 200,000-token window is genuinely large. That’s roughly 150,000 words of input—an entire technical manual or a year of email. GPT-4o at 128,000 tokens handles most document-scale tasks. Llama 3 70B, if you’re running it locally, maxes out at 8,192 tokens—roughly 6,000 words.
The context window is a hard ceiling. If your input + desired output exceeds the window, the model either truncates your input (losing information) or returns an error. No graceful degradation. No overflow buffer. You hit the limit and your request fails.
This is why token counting matters before you build. If you’re processing documents and your workflow adds system prompts, retrieval context, few-shot examples, and the actual document itself, you can easily exceed the window.
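A pre-flight check is cheap insurance. Here’s a minimal sketch assuming a GPT-4o-class model; the window size and output reservation are parameters you’d set for whatever model you actually use:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def fits_in_window(parts, max_output, window=128_000):
    # Sum tokens across every input piece: system prompt,
    # retrieval context, few-shot examples, the document itself.
    input_tokens = sum(len(enc.encode(part)) for part in parts)
    return input_tokens + max_output <= window

print(fits_in_window(["You are a terse analyst.", "Summarize: ..."], max_output=4_096))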
How to Measure Tokens Efficiently
You have three options: count before you send (with a local tokenizer or a counting endpoint), read usage from the API response afterward, or estimate.
Option 1: Count Before You Send
For OpenAI models, use the official tokenizer:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Your prompt goes here. This is a test."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
# Output: Token count: 10
For Anthropic Claude models, there’s no public local tokenizer; instead, the SDK exposes a token-counting endpoint:
import anthropic
client = anthropic.Anthropic()
text = "Your prompt goes here. This is a test."
response = client.messages.count_tokens(
model="claude-3-5-sonnet-20241022",
messages=[{"role": "user", "content": text}]
)
print(f"Token count: {response.input_tokens}")
# Prints the count as measured by the API, including message formatting overhead
Both approaches are accurate. tiktoken runs entirely offline; the Claude endpoint requires a network round trip but no generation, so it’s fast and cheap. Use them before building anything at scale.
Option 2: Check Tokens Via API Response
Most API calls return token usage in the response. GPT models return usage like this:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total: {response.usage.total_tokens}")
Claude returns the same information in the message response, under slightly different names (sketch below). Always capture this data when you’re building: it’s the ground truth for your actual usage.
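For Claude the field names differ slightly, and there’s no precomputed total; a sketch:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")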
Option 3: Estimate (When You Can’t Count)
If you need a rough estimate without running code, the rule of thumb is:
- English text: 1 token ≈ 0.75 words (so 100 words ≈ 133 tokens)
- Code: 1 token ≈ 0.5 words (code tokenizes less efficiently)
- JSON: similar to code (structural characters add overhead)
This is approximate. If precision matters—and it should in production—measure directly instead of estimating.
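If you want the rule of thumb as code, here’s a minimal sketch; the ratios are the rough heuristics above, not measured values:
def estimate_tokens(word_count, kind="english"):
    # Rough heuristics: ~0.75 words per token for English,
    # ~0.5 words per token for code and JSON.
    words_per_token = {"english": 0.75, "code": 0.5, "json": 0.5}
    return round(word_count / words_per_token[kind])

print(estimate_tokens(100))          # 133
print(estimate_tokens(100, "code"))  # 200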
Efficient Prompt Design Under Token Constraints
Given that tokens are limited and token counts aren’t intuitive, how do you design prompts that stay under budget while still being effective?
Technique 1: Prioritize Input Over Verbosity
Your system prompt doesn’t need to explain everything. Give it the rules that matter, cut the rest.
Bad approach:
You are an expert financial analyst with deep knowledge of
investment strategies, market trends, and risk management.
Your role is to analyze financial data carefully and provide
insights. You should always be thorough, considerate, and precise
in your analysis. Think step-by-step about the data and provide
comprehensive explanations for your conclusions.
Better approach:
Analyze financial data. Provide specific insights with risk
assessment. Be precise.
The second version uses a fraction of the tokens (measure both with the sketch below) and actually constrains the model more clearly. Remove adjectives. Remove reassurances. Keep instructions.
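Don’t trust anyone’s arithmetic, including mine: measure. A sketch using tiktoken; swap in the tokenizer for your own model:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

verbose = (
    "You are an expert financial analyst with deep knowledge of "
    "investment strategies, market trends, and risk management. "
    "Your role is to analyze financial data carefully and provide "
    "insights. You should always be thorough, considerate, and precise "
    "in your analysis. Think step-by-step about the data and provide "
    "comprehensive explanations for your conclusions."
)
compact = "Analyze financial data. Provide specific insights with risk assessment. Be precise."

for name, prompt in [("verbose", verbose), ("compact", compact)]:
    print(f"{name}: {len(enc.encode(prompt))} tokens")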
Technique 2: Use Few-Shot Examples Selectively
Few-shot prompting (giving examples) improves output quality but costs tokens for every example you add. Use them strategically:
- Skip few-shot for simple tasks (classification, straightforward extraction). The model knows these patterns.
- Add one example for medium-complexity tasks (conditional logic, format requirements). One example ≈ 30–50 tokens depending on length.
- Add two examples only for complex tasks (edge cases, rare patterns, specific style requirements).
Test it: run your prompt with zero examples, measure output quality. If quality is acceptable, you’ve saved tokens. Only add examples if the output degrades noticeably.
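The token cost of an example is easy to quantify before you commit to it. A sketch with tiktoken; the example text here is hypothetical:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
example = "Input: 'I was charged twice for my subscription.'\nLabel: billing\n"
print(f"Tokens per few-shot example: {len(enc.encode(example))}")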
Technique 3: Compress Context Before Sending
If you’re processing long documents, extract the relevant parts before passing them to the model. This is where RAG (Retrieval-Augmented Generation) saves money—you retrieve only the relevant passages, not the entire document.
import anthropic

client = anthropic.Anthropic()

query = "What is the maximum operating temperature?"

# Instead of sending a 50,000-token document
full_document = "... entire technical manual ..."

# Extract the relevant section first (your own retrieval logic, not an LLM call).
# A trivial stand-in: keep paragraphs that share words with the query.
def extract_relevant_section(document, query):
    words = set(query.lower().split())
    return "\n\n".join(
        p for p in document.split("\n\n") if words & set(p.lower().split())
    )

relevant_section = extract_relevant_section(full_document, query)

# Send only the relevant part
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Context: {relevant_section}\n\nQuestion: {query}"
    }]
)
print(response.content[0].text)
This works because you’re controlling what the model sees. Most of the document stays on disk. Only the relevant 500 tokens get processed. Same answer. Different cost.
Technique 4: Reuse System Prompts Across Batches
By default you’re charged for the system prompt tokens on every request, even when the prompt is identical across calls. If your system prompt is large, look at prompt caching: with the Anthropic API, cached prompt tokens are billed at roughly 10% of the normal input price on subsequent uses, while writing to the cache costs slightly more than uncached input. A sketch follows.
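Marking the system block with cache_control asks the API to cache it, so later requests that reuse the identical prefix bill those tokens at the reduced rate. The system prompt here is a placeholder, prompts below the cacheable minimum length won’t be cached, and older SDK versions may require a beta flag:
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "... your large, stable instruction set ..."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # Cache this block; subsequent identical prefixes read from
        # cache at a fraction of the normal input price.
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "First question..."}]
)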
For AlgoVesta, we standardize on short, reusable system prompts. Instead of a 300-token prompt sent with every request, we use a 40-token instruction set. The savings across millions of inferences are substantial.
Token Costs Across Models: The Comparison
Tokens are the currency, but the actual price per token varies wildly. Here’s real pricing as of March 2025:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Practical Cost Per 10K Input |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.03 |
| GPT-4o | $5.00 | $15.00 | $0.05 |
| GPT-4 Turbo | $10.00 | $30.00 | $0.10 |
| Llama 3 70B (self-hosted) | $0 (your infra) | $0 (your infra) | $0 (your infra) |
Of the API models listed, Claude 3.5 Sonnet is the cheapest for input. GPT-4o costs about 67% more per input token. GPT-4 Turbo costs over 3x as much. If you’re running millions of inferences, token efficiency compounds directly into cost savings.
Self-hosted models (Llama 3 70B, Mistral, etc.) shift the cost to your infrastructure. No per-token billing, but you pay for compute upfront. The break-even point depends on your volume and infrastructure costs.
Common Token Limit Pitfalls and How to Avoid Them
Pitfall 1: Forgetting System Prompt Tokens Count
Your available context isn’t the full window minus your input. It’s the window minus your system prompt minus your input. A 1,000-token system prompt isn’t free.
With GPT-4o (128,000-token window), a 1,000-token system prompt + 2,000-token input + 4,000-token desired output = 7,000 tokens used. You have 121,000 left for actual content. That’s still plenty, but if you’re using this math carelessly across many layers of prompts, you lose buffer fast.
Pitfall 2: Not Accounting for Repeated Tokens
If you’re building a system where a user asks multiple questions in sequence, don’t assume you can fit N questions into the context window. Each message adds metadata tokens. Each turn of conversation uses tokens for formatting and structure.
In practice, a 10-turn conversation with brief messages might use 2x the tokens of the raw text alone due to formatting overhead.
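You can approximate that overhead locally. This sketch borrows the per-message constants from OpenAI’s cookbook recipe for recent GPT models (roughly 3 formatting tokens per message plus 3 to prime the reply); treat the constants as estimates, not guarantees:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count_conversation_tokens(messages, per_message=3, reply_primer=3):
    # Content tokens plus estimated formatting overhead per turn.
    total = reply_primer
    for message in messages:
        total += per_message
        for value in message.values():
            total += len(enc.encode(value))
    return total

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"}
]
print(count_conversation_tokens(messages))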
Pitfall 3: Truncating Long Inputs Naively
If you hit a token limit, don’t just cut text at position N and hope you didn’t lose critical information. Explicitly select the sections you need, or use RAG to surface relevant content.
Naive truncation causes hallucinations. The model tries to make sense of partial information and manufactures missing context.
Pitfall 4: Assuming All Whitespace Is Free
Blank lines, indentation, and extra spaces all tokenize. If you’re formatting a prompt with lots of whitespace for readability, you’re burning tokens on invisible characters.
For user-facing prompts, readability matters. For internal system prompts, compress: remove extra newlines, use single spaces, keep formatting minimal.
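To see what your formatting actually costs, encode it. A quick sketch:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
for sample in ["word", "    word", "word\n\n\n\nword"]:
    print(repr(sample), "->", len(enc.encode(sample)), "tokens")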
Practical Workflow: Designing a Token-Efficient System
Here’s the step-by-step approach I use when building new LLM features:
Step 1: Measure your baseline. Write your prompt the way that makes sense to you. Measure token count. Note it.
Step 2: Identify what tokens actually matter. System prompt. Input. Examples. Which bucket is consuming the most? If it’s examples, cut them. If it’s input, you need better document selection. If it’s system prompt, trim instructions.
Step 3: Set a token budget and stay in it. Decide the maximum tokens you can afford (cost) or tolerate (latency). Include buffer for variation.
Step 4: Test quality at budget. Run your prompt at the token limit with real inputs. Does output quality degrade? If yes, you’re cutting too much. If no, cut more.
Step 5: Monitor in production. Log actual token usage. If real-world usage consistently exceeds your estimate by 20%, adjust your budget. If it comes in under by 30%, you have room to add value (more context, better examples, clearer instructions).
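For step 5, a minimal logging sketch; the logger name and feature tag are assumptions to adapt to whatever telemetry you already run:
import logging

logger = logging.getLogger("llm.usage")

def log_openai_usage(response, feature):
    # Record ground-truth usage per feature so real costs can be
    # compared against the budget set in step 3.
    usage = response.usage
    logger.info(
        "feature=%s input=%d output=%d total=%d",
        feature, usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
    )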
Do This Today
Pick one prompt or system you’re using regularly. Run the official tokenizer for your model on it. Get the actual number. Then measure the word count and calculate your token-to-word ratio. Most people will be surprised it’s not 1:1. Understanding that delta—how your actual usage diverges from your assumption—is where efficiency begins.
If you’re building a multi-message system (chat, conversation, retrieval), measure a full interaction end to end. See what formatting overhead looks like in your specific workflow. That overhead is roughly constant for a given message format; capture it once and reuse it for every future estimate.