Every LLM interaction comes with a hidden tax. You hit “send” on a prompt, the model processes it, generates a response, and you pay by the token. Not by the word. Not by the character. By the token—a unit that doesn’t map cleanly to anything you can see on your screen.
Most people building with LLMs don’t actually know what a token is. They see a price per token, multiply it out, and move on. Then their costs explode for reasons they can’t explain. Or they bump into context window limits and wonder why their 8,000-word document got cut off mid-processing.
Tokenization isn’t just a billing detail. It’s the boundary condition of everything you build with language models. Understand it, and you unlock efficiency gains that compound across every system you run. Ignore it, and you’ll waste money, hit limits at frustrating moments, and build worse products.
What a Token Actually Is
A token is the smallest unit a language model works with. It’s not a word. It’s not a character. It’s something in between, and the exact boundaries shift depending on which model you use and which tokenizer encodes your text.
Think of it like this: a tokenizer is a lookup table. Raw text goes in one side. On the other side comes a sequence of integers—token IDs. The model sees those integers, processes them, and generates new integers back out.
Here’s what happens in practice:
- The word “hello” = 1 token
- The word “unfortunately” = 2 tokens (un + fortunately, in many tokenizers)
- The punctuation mark “.” = 1 token
- A leading space usually fuses with the word that follows (“ API” = 1 token, not 2)
- The sequence “\n\n” (paragraph break) = 1 token
Short, common words compress into single tokens. Longer or rarer words split across multiple tokens. Special characters, numbers, and whitespace all have their own encoding rules.
Different models use different tokenizers. OpenAI’s GPT-4 Turbo uses the cl100k_base tokenizer, while GPT-4o uses the newer o200k_base. Anthropic’s Claude models use their own tokenizer. Llama 3 70B uses yet another. This matters: the same sentence tokenizes to different lengths depending on which model processes it.
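You can check this divergence yourself with OpenAI’s open-source tiktoken library. A minimal sketch comparing two OpenAI encodings (tiktoken only ships OpenAI’s tokenizers, so it can’t reproduce Claude’s or Llama’s exact counts):
import tiktoken

text = "Every LLM interaction comes with a hidden tax."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")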
Why Token Count Isn’t Intuitive
You write 500 words. You assume that’s roughly 500 tokens, maybe 700 if you count formatting. Then you run it through a tokenizer and get 820 tokens. Or 1,240. Or something else entirely.
The mismatch happens because tokens don’t follow word boundaries. Here’s a real example:
# Input text
"The ChatGPT API doesn't return token counts automatically."
# Word count: 8
# Token count (OpenAI cl100k_base): 11
Why 11? Break it down:
- “The” = 1 token
- “ Chat” + “GPT” = 2 tokens (the brand name splits)
- ” API” = 1 token (space + word)
- “doesn” = 1 token
- “‘t” = 1 token (contraction splits)
- ” return” = 1 token
- ” token” = 1 token
- ” counts” = 1 token
- ” automatically” = 1 token
- “.” = 1 token
Contractions split. Brand names split. Whitespace attaches to neighboring words. Your gut assumption about token density doesn’t hold.
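You don’t have to take a breakdown like this on faith. A quick sketch with tiktoken prints every token exactly as the model sees it:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The ChatGPT API doesn't return token counts automatically."
for token_id in enc.encode(text):
    print(token_id, repr(enc.decode([token_id])))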
This matters practically because:
- You can’t budget costs accurately without knowing your actual token density
- You can’t design efficient prompts if you’re guessing token counts
- You’ll hit context limits unexpectedly if you misjudge how much content fits
Context Windows: The Hard Limits
Every LLM has a context window—the maximum number of tokens it can process in a single interaction. This includes your prompt (input tokens) plus the model’s response (output tokens).
| Model | Context Window | Max Output |
|---|---|---|
| Claude 3.5 Sonnet | 200,000 tokens | 8,192 tokens |
| GPT-4o | 128,000 tokens | 16,384 tokens |
| GPT-4 Turbo | 128,000 tokens | 4,096 tokens |
| Llama 3 70B | 8,192 tokens | N/A (varies) |
| Mistral 7B | 32,768 tokens | N/A (varies) |
Claude 3.5 Sonnet’s 200,000-token window is genuinely large. That’s roughly 150,000 words of input—an entire technical manual or a year of email. GPT-4o at 128,000 tokens handles most document-scale tasks. Llama 3 70B, if you’re running it locally, maxes out at 8,192 tokens—roughly 6,000 words.
The context window is a hard ceiling. If your input + desired output exceeds the window, the model either truncates your input (losing information) or returns an error. No graceful degradation. No overflow buffer. You hit the limit and your request fails.
This is why token counting matters before you build. If you’re processing documents and your workflow adds system prompts, retrieval context, few-shot examples, and the actual document itself, you can easily exceed the window.
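A pre-flight check is cheap insurance. Here’s a minimal sketch assuming a GPT-4o-class model; the window size and output reservation are parameters you’d set for whatever model you actually use:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def fits_in_window(parts, max_output, window=128_000):
    # Sum tokens across every input piece: system prompt,
    # retrieval context, few-shot examples, the document itself.
    input_tokens = sum(len(enc.encode(part)) for part in parts)
    return input_tokens + max_output <= window

print(fits_in_window(["You are a terse analyst.", "Summarize: ..."], max_output=4_096))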
How to Measure Tokens Efficiently
You have three options: count before you send (with a local tokenizer or a counting endpoint), read usage from the API response afterward, or estimate.
Option 1: Count Before You Send
For OpenAI models, use the official tokenizer:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Your prompt goes here. This is a test."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
# Output: Token count: 10
For Anthropic Claude models, there’s no public local tokenizer; instead, the SDK exposes a token-counting endpoint:
import anthropic
client = anthropic.Anthropic()
text = "Your prompt goes here. This is a test."
response = client.messages.count_tokens(
model="claude-3-5-sonnet-20241022",
messages=[{"role": "user", "content": text}]
)
print(f"Token count: {response.input_tokens}")
# Prints the count as measured by the API, including message formatting overhead
Both approaches are accurate. tiktoken runs entirely offline; the Claude endpoint requires a network round trip but no generation, so it’s fast and cheap. Use them before building anything at scale.
Option 2: Check Tokens Via API Response
Most API calls return token usage in the response. GPT models return usage like this:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total: {response.usage.total_tokens}")
Claude returns the same information in the message response, under slightly different names (sketch below). Always capture this data when you’re building: it’s the ground truth for your actual usage.
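For Claude the field names differ slightly, and there’s no precomputed total; a sketch:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")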
Option 3: Estimate (When You Can’t Count)
If you need a rough estimate without running code, the rule of thumb is:
- English text: 1 token ≈ 0.75 words (so 100 words ≈ 133 tokens)
- Code: 1 token ≈ 0.5 words (code tokenizes less efficiently)
- JSON: similar to code (structural characters add overhead)
This is approximate. If precision matters—and it should in production—measure directly instead of estimating.
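If you want the rule of thumb as code, here’s a minimal sketch; the ratios are the rough heuristics above, not measured values:
def estimate_tokens(word_count, kind="english"):
    # Rough heuristics: ~0.75 words per token for English,
    # ~0.5 words per token for code and JSON.
    words_per_token = {"english": 0.75, "code": 0.5, "json": 0.5}
    return round(word_count / words_per_token[kind])

print(estimate_tokens(100))          # 133
print(estimate_tokens(100, "code"))  # 200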
Efficient Prompt Design Under Token Constraints
Given that tokens are limited and token counts aren’t intuitive, how do you design prompts that stay under budget while still being effective?
Technique 1: Prioritize Input Over Verbosity
Your system prompt doesn’t need to explain everything. Give it the rules that matter, cut the rest.
Bad approach:
You are an expert financial analyst with deep knowledge of
investment strategies, market trends, and risk management.
Your role is to analyze financial data carefully and provide
insights. You should always be thorough, considerate, and precise
in your analysis. Think step-by-step about the data and provide
comprehensive explanations for your conclusions.
Better approach:
Analyze financial data. Provide specific insights with risk
assessment. Be precise.
The second version uses a fraction of the tokens (measure both with the sketch below) and actually constrains the model more clearly. Remove adjectives. Remove reassurances. Keep instructions.
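Don’t trust anyone’s arithmetic, including mine: measure. A sketch using tiktoken; swap in the tokenizer for your own model:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

verbose = (
    "You are an expert financial analyst with deep knowledge of "
    "investment strategies, market trends, and risk management. "
    "Your role is to analyze financial data carefully and provide "
    "insights. You should always be thorough, considerate, and precise "
    "in your analysis. Think step-by-step about the data and provide "
    "comprehensive explanations for your conclusions."
)
compact = "Analyze financial data. Provide specific insights with risk assessment. Be precise."

for name, prompt in [("verbose", verbose), ("compact", compact)]:
    print(f"{name}: {len(enc.encode(prompt))} tokens")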
Technique 2: Use Few-Shot Examples Selectively
Few-shot prompting (giving examples) improves output quality but costs tokens for every example you add. Use them strategically:
- Skip few-shot for simple tasks (classification, straightforward extraction). The model knows these patterns.
- Add one example for medium-complexity tasks (conditional logic, format requirements). One example ≈ 30–50 tokens depending on length.
- Add two examples only for complex tasks (edge cases, rare patterns, specific style requirements).
Test it: run your prompt with zero examples, measure output quality. If quality is acceptable, you’ve saved tokens. Only add examples if the output degrades noticeably.
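The token cost of an example is easy to quantify before you commit to it. A sketch with tiktoken; the example text here is hypothetical:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
example = "Input: 'I was charged twice for my subscription.'\nLabel: billing\n"
print(f"Tokens per few-shot example: {len(enc.encode(example))}")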
Technique 3: Compress Context Before Sending
If you’re processing long documents, extract the relevant parts before passing them to the model. This is where RAG (Retrieval-Augmented Generation) saves money—you retrieve only the relevant passages, not the entire document.
import anthropic

client = anthropic.Anthropic()

query = "What is the maximum operating temperature?"

# Instead of sending a 50,000-token document
full_document = "... entire technical manual ..."

# Extract the relevant section first (your own retrieval logic, not an LLM call).
# A trivial stand-in: keep paragraphs that share words with the query.
def extract_relevant_section(document, query):
    words = set(query.lower().split())
    return "\n\n".join(
        p for p in document.split("\n\n") if words & set(p.lower().split())
    )

relevant_section = extract_relevant_section(full_document, query)

# Send only the relevant part
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Context: {relevant_section}\n\nQuestion: {query}"
    }]
)
print(response.content[0].text)
This works because you’re controlling what the model sees. Most of the document stays on disk. Only the relevant 500 tokens get processed. Same answer. Different cost.
Technique 4: Reuse System Prompts Across Batches
By default you’re charged for the system prompt tokens on every request, even when the prompt is identical across calls. If your system prompt is large, look at prompt caching: with the Anthropic API, cached prompt tokens are billed at roughly 10% of the normal input price on subsequent uses, while writing to the cache costs slightly more than uncached input. A sketch follows.
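Marking the system block with cache_control asks the API to cache it, so later requests that reuse the identical prefix bill those tokens at the reduced rate. The system prompt here is a placeholder, prompts below the cacheable minimum length won’t be cached, and older SDK versions may require a beta flag:
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "... your large, stable instruction set ..."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        # Cache this block; subsequent identical prefixes read from
        # cache at a fraction of the normal input price.
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "First question..."}]
)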
For AlgoVesta, we standardize on short, reusable system prompts. Instead of a 300-token prompt sent with every request, we use a 40-token instruction set. The savings across millions of inferences are substantial.
Token Costs Across Models: The Comparison
Tokens are the currency, but the actual price per token varies wildly. Here’s real pricing as of March 2025:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Practical Cost Per 10K Input |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.03 |
| GPT-4o | $5.00 | $15.00 | $0.05 |
| GPT-4 Turbo | $10.00 | $30.00 | $0.10 |
| Llama 3 70B (self-hosted) | $0 (your infra) | $0 (your infra) | $0 (your infra) |
Of the API models listed, Claude 3.5 Sonnet is the cheapest for input. GPT-4o costs about 67% more per input token. GPT-4 Turbo costs over 3x as much. If you’re running millions of inferences, token efficiency compounds directly into cost savings.
Self-hosted models (Llama 3 70B, Mistral, etc.) shift the cost to your infrastructure. No per-token billing, but you pay for compute upfront. The break-even point depends on your volume and infrastructure costs.
Common Token Limit Pitfalls and How to Avoid Them
Pitfall 1: Forgetting System Prompt Tokens Count
Your available context isn’t the full window minus your input. It’s the window minus your system prompt minus your input. A 1,000-token system prompt isn’t free.
With GPT-4o (128,000-token window), a 1,000-token system prompt + 2,000-token input + 4,000-token desired output = 7,000 tokens used. You have 121,000 left for actual content. That’s still plenty, but if you’re using this math carelessly across many layers of prompts, you lose buffer fast.
Pitfall 2: Not Accounting for Repeated Tokens
If you’re building a system where a user asks multiple questions in sequence, don’t assume you can fit N questions into the context window. Each message adds metadata tokens. Each turn of conversation uses tokens for formatting and structure.
In practice, a 10-turn conversation with brief messages might use 2x the tokens of the raw text alone due to formatting overhead.
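You can approximate that overhead locally. This sketch borrows the per-message constants from OpenAI’s cookbook recipe for recent GPT models (roughly 3 formatting tokens per message plus 3 to prime the reply); treat the constants as estimates, not guarantees:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def count_conversation_tokens(messages, per_message=3, reply_primer=3):
    # Content tokens plus estimated formatting overhead per turn.
    total = reply_primer
    for message in messages:
        total += per_message
        for value in message.values():
            total += len(enc.encode(value))
    return total

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"}
]
print(count_conversation_tokens(messages))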
Pitfall 3: Truncating Long Inputs Naively
If you hit a token limit, don’t just cut text at position N and hope you didn’t lose critical information. Explicitly select the sections you need, or use RAG to surface relevant content.
Naive truncation causes hallucinations. The model tries to make sense of partial information and manufactures missing context.
Pitfall 4: Assuming All Whitespace Is Free
Blank lines, indentation, and extra spaces all tokenize. If you’re formatting a prompt with lots of whitespace for readability, you’re burning tokens on invisible characters.
For user-facing prompts, readability matters. For internal system prompts, compress: remove extra newlines, use single spaces, keep formatting minimal.
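To see what your formatting actually costs, encode it. A quick sketch:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
for sample in ["word", "    word", "word\n\n\n\nword"]:
    print(repr(sample), "->", len(enc.encode(sample)), "tokens")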
Practical Workflow: Designing a Token-Efficient System
Here’s the step-by-step approach I use when building new LLM features:
Step 1: Measure your baseline. Write your prompt the way that makes sense to you. Measure token count. Note it.
Step 2: Identify what tokens actually matter. System prompt. Input. Examples. Which bucket is consuming the most? If it’s examples, cut them. If it’s input, you need better document selection. If it’s system prompt, trim instructions.
Step 3: Set a token budget and stay in it. Decide the maximum tokens you can afford (cost) or tolerate (latency). Include buffer for variation.
Step 4: Test quality at budget. Run your prompt at the token limit with real inputs. Does output quality degrade? If yes, you’re cutting too much. If no, cut more.
Step 5: Monitor in production. Log actual token usage. If real-world usage consistently exceeds your estimate by 20%, adjust your budget. If it comes in under by 30%, you have room to add value (more context, better examples, clearer instructions).
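For step 5, a minimal logging sketch; the logger name and feature tag are assumptions to adapt to whatever telemetry you already run:
import logging

logger = logging.getLogger("llm.usage")

def log_openai_usage(response, feature):
    # Record ground-truth usage per feature so real costs can be
    # compared against the budget set in step 3.
    usage = response.usage
    logger.info(
        "feature=%s input=%d output=%d total=%d",
        feature, usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
    )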
Do This Today
Pick one prompt or system you’re using regularly. Run the official tokenizer for your model on it. Get the actual number. Then measure the word count and calculate your token-to-word ratio. Most people will be surprised it’s not 1:1. Understanding that delta—how your actual usage diverges from your assumption—is where efficiency begins.
If you’re building a multi-message system (chat, conversation, retrieval), measure a full interaction end to end. See what formatting overhead looks like in your specific workflow. That overhead is roughly constant for a given message format; capture it once and reuse it for every future estimate.