Learning Lab · 9 min read

Tokens: Why They Cost Money and How to Count Them

Tokens are the unit LLMs use to process language, but they don't map to words or characters. This guide explains what tokens are, why context windows matter, how to measure them accurately, and practical techniques to stay efficient under constraints.

Token Counting and Context Window Limits Explained

Every LLM interaction comes with a hidden tax. You hit “send” on a prompt, the model processes it, generates a response, and you pay by the token. Not by the word. Not by the character. By the token—a unit that doesn’t map cleanly to anything you can see on your screen.

Most people building with LLMs don’t actually know what a token is. They see a price per token, multiply it out, and move on. Then their costs explode for reasons they can’t explain. Or they bump into context window limits and wonder why their 8,000-word document got cut off mid-processing.

Tokenization isn’t just a billing detail. It’s the boundary condition of everything you build with language models. Understand it, and you unlock efficiency gains that compound across every system you run. Ignore it, and you’ll waste money, hit limits at frustrating moments, and build worse products.

What a Token Actually Is

A token is the smallest unit a language model works with. It’s not a word. It’s not a character. It’s something in between, and the exact boundaries shift depending on which model you use and which tokenizer encodes your text.

Think of it like this: a tokenizer is a lookup table. Raw text goes in one side. On the other side comes a sequence of integers—token IDs. The model sees those integers, processes them, and generates new integers back out.

Here’s what happens in practice:

  • The word “hello” = 1 token
  • The word “unfortunately” = 2 tokens (un + fortunately)
  • The punctuation mark “.” = 1 token
  • A space before a word = 1 token (usually)
  • The sequence “\n\n” (paragraph break) = 1 token

Short, common words compress into single tokens. Longer or rarer words split across multiple tokens. Special characters, numbers, and whitespace all have their own encoding rules.
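To make the lookup-table idea concrete, here is a toy greedy longest-match tokenizer. The vocabulary and IDs are invented for illustration; real tokenizers use BPE vocabularies learned from large corpora, so actual splits differ.

```python
# Toy tokenizer: greedy longest-match against a tiny hand-made vocabulary.
# Real tokenizers (BPE) learn their vocabularies from data; this only
# illustrates the text -> integer-IDs mapping.

VOCAB = {
    "hello": 0, "un": 1, "fortunately": 2, " ": 3, ".": 4,
}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Take the longest substring starting at i that is in the vocabulary
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

# A short common word is one token; a longer word splits into pieces.
print(tokenize("hello unfortunately."))
```

The point of the sketch: the model never sees "unfortunately" as a word, only the integer IDs for its pieces.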

Different models use different tokenizers. OpenAI’s GPT models (GPT-4o, GPT-4 Turbo) use the cl100k_base tokenizer. Anthropic’s Claude models use a different tokenizer. Llama 3 70B uses yet another. This matters: the same sentence tokenizes to different lengths depending on which model processes it.

Why Token Count Isn’t Intuitive

You write 500 words. You assume that’s roughly 500 tokens, maybe 700 if you count formatting. Then you run it through a tokenizer and get 820 tokens. Or 1,240. Or something else entirely.

The mismatch happens because tokens don’t follow word boundaries. Here’s a real example:

# Input text
"The ChatGPT API doesn't return token counts automatically."

# Word count: 8
# Token count (OpenAI cl100k_base): 10

Why 10? Break it down:

  • “The” = 1 token
  • “ChatGPT” = 1 token
  • “ API” = 1 token (space + word)
  • “doesn” = 1 token
  • “’t” = 1 token (contraction splits)
  • “ return” = 1 token
  • “ token” = 1 token
  • “ counts” = 1 token
  • “ automatically” = 1 token
  • “.” = 1 token

Contractions split. Brand names compress. Whitespace counts. Your gut assumption about token density doesn’t hold.

This matters practically because:

  • You can’t budget costs accurately without knowing your actual token density
  • You can’t design efficient prompts if you’re guessing token counts
  • You’ll hit context limits unexpectedly if you misjudge how much content fits

Context Windows: The Hard Limits

Every LLM has a context window—the maximum number of tokens it can process in a single interaction. This includes your prompt (input tokens) plus the model’s response (output tokens).

Model               Context Window    Max Output
Claude 3.5 Sonnet   200,000 tokens    4,096 tokens
GPT-4o              128,000 tokens    4,096 tokens
GPT-4 Turbo         128,000 tokens    4,096 tokens
Llama 3 70B         8,192 tokens      N/A (varies)
Mistral 7B          32,768 tokens     N/A (varies)

Claude 3.5 Sonnet’s 200,000-token window is genuinely large. That’s roughly 150,000 words of input—an entire technical manual or a year of email. GPT-4o at 128,000 tokens handles most document-scale tasks. Llama 3 70B, if you’re running it locally, maxes out at 8,192 tokens—roughly 6,000 words.

The context window is a hard ceiling. If your input + desired output exceeds the window, the model either truncates your input (losing information) or returns an error. No graceful degradation. No overflow buffer. You hit the limit and your request fails.

This is why token counting matters before you build. If you’re processing documents and your workflow adds system prompts, retrieval context, few-shot examples, and the actual document itself, you can easily exceed the window.
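This budget arithmetic is simple enough to enforce in code. A minimal sketch, using the window sizes from the table above; the `fits_context` name and model keys are illustrative, not part of any SDK:

```python
# Context windows in tokens (illustrative values from the table above)
CONTEXT_WINDOWS = {
    "claude-3-5-sonnet": 200_000,
    "gpt-4o": 128_000,
    "llama-3-70b": 8_192,
}

def fits_context(model: str, input_tokens: int, max_output_tokens: int) -> bool:
    """True if the prompt plus the desired output fits inside the model's window."""
    return input_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]

# A ~6,000-word document (~8,000 tokens) plus a 1,024-token answer
# fits GPT-4o comfortably but overflows a local Llama 3 70B.
print(fits_context("gpt-4o", 8_000, 1_024))        # True
print(fits_context("llama-3-70b", 8_000, 1_024))   # False
```

Run a check like this before every request that assembles system prompt, retrieval context, and document together; failing fast beats a truncated input.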

How to Measure Tokens Efficiently

You have three options: count up front with a tokenizer tool, read usage from the API response after the fact, or estimate.

Option 1: Count Up Front With Tokenizer Tools

For OpenAI models, use the official tokenizer:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "Your prompt goes here. This is a test."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")

For Anthropic Claude models there is no public local tokenizer, but the API exposes a token-counting endpoint:

import anthropic

client = anthropic.Anthropic()
text = "Your prompt goes here. This is a test."

response = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": text}]
)
print(f"Token count: {response.input_tokens}")

Both approaches give exact counts. tiktoken runs locally and is fast; the Claude endpoint requires a network call, so cache its results if you count often. Use one of them before building anything at scale.

Option 2: Check Tokens Via API Response

Most API calls return token usage in the response. GPT models return usage like this:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total: {response.usage.total_tokens}")

Claude returns the same information in the message response. Always capture this data when you’re building—it’s the ground truth for your actual usage.

Option 3: Estimate (When You Can’t Count)

If you need a rough estimate without running code, the rule of thumb is:

  • English text: 1 token ≈ 0.75 words (so 100 words ≈ 133 tokens)
  • Code: 1 token ≈ 0.5 words (code tokenizes less efficiently)
  • JSON: similar to code (structural characters add overhead)

This is approximate. If precision matters—and it should in production—measure directly instead of estimating.
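The rule of thumb above translates directly into a small estimator. The ratios are the approximations stated in the list, nothing more precise:

```python
# Rough token estimator based on the rule-of-thumb ratios above.
# These ratios are approximations; measure with a real tokenizer in production.
TOKENS_PER_WORD = {
    "english": 1 / 0.75,  # ~1.33 tokens per word
    "code": 1 / 0.5,      # ~2 tokens per word
    "json": 1 / 0.5,      # structural characters add overhead
}

def estimate_tokens(text: str, kind: str = "english") -> int:
    """Ballpark token count from a word count and a per-kind ratio."""
    words = len(text.split())
    return round(words * TOKENS_PER_WORD[kind])

# 9 English words -> roughly 12 tokens
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))
```

Useful for sizing a design on a napkin; switch to a real tokenizer the moment money or limits are involved.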

Efficient Prompt Design Under Token Constraints

Given that tokens are limited and token counts aren’t intuitive, how do you design prompts that stay under budget while still being effective?

Technique 1: Prioritize Input Over Verbosity

Your system prompt doesn’t need to explain everything. Give it the rules that matter, cut the rest.

Bad approach (roughly 70 tokens):

You are an expert financial analyst with deep knowledge of 
investment strategies, market trends, and risk management. 
Your role is to analyze financial data carefully and provide 
insights. You should always be thorough, considerate, and precise 
in your analysis. Think step-by-step about the data and provide 
comprehensive explanations for your conclusions.

Better approach (roughly 15 tokens):

Analyze financial data. Provide specific insights with risk 
assessment. Be precise.

The second version uses about 80% fewer tokens and actually constrains the model more clearly. Remove adjectives. Remove reassurances. Keep instructions.

Technique 2: Use Few-Shot Examples Selectively

Few-shot prompting (giving examples) improves output quality but costs tokens for every example you add. Use them strategically:

  • Skip few-shot for simple tasks (classification, straightforward extraction). The model knows these patterns.
  • Add one example for medium-complexity tasks (conditional logic, format requirements). One example ≈ 30–50 tokens depending on length.
  • Add two examples only for complex tasks (edge cases, rare patterns, specific style requirements).

Test it: run your prompt with zero examples, measure output quality. If quality is acceptable, you’ve saved tokens. Only add examples if the output degrades noticeably.
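Budgeting examples is just arithmetic. A sketch, using made-up numbers; the per-example cost varies widely, so measure your real examples with a tokenizer first:

```python
# How many few-shot examples fit a prompt budget? Pure arithmetic sketch;
# all figures below are illustrative placeholders.

def examples_that_fit(budget: int, base_prompt: int, per_example: int,
                      reserved_output: int) -> int:
    """Max number of examples that fit after the base prompt and output reserve."""
    remaining = budget - base_prompt - reserved_output
    return max(0, remaining // per_example)

# 1,000-token budget, 200-token base prompt, ~40 tokens per example,
# 500 tokens reserved for the model's answer:
print(examples_that_fit(1_000, 200, 40, 500))
```

Reserving output tokens up front is the part people forget; an example budget that ignores the answer's share of the window overruns in production.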

Technique 3: Compress Context Before Sending

If you’re processing long documents, extract the relevant parts before passing them to the model. This is where RAG (Retrieval-Augmented Generation) saves money—you retrieve only the relevant passages, not the entire document.

import anthropic

client = anthropic.Anthropic()

# The user's question (placeholder for illustration)
query = "How do I restart the service?"

# Instead of sending a 50,000-token document
full_document = "... entire technical manual ..."

# Extract relevant section first (your own logic, not LLM)
relevant_section = extract_relevant_section(full_document, query)

# Send only the relevant part
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Context: {relevant_section}\n\nQuestion: {query}"
    }]
)

print(response.content[0].text)

This works because you’re controlling what the model sees. Most of the document stays on disk. Only the relevant 500 tokens get processed. Same answer. Different cost.
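The `extract_relevant_section` helper in the snippet above is left to you. A naive keyword-overlap version might look like this; production retrieval usually uses embeddings, but the principle — score chunks against the query, keep only the best — is the same:

```python
def extract_relevant_section(document: str, query: str, top_k: int = 2) -> str:
    """Return the top_k paragraphs sharing the most words with the query.

    Naive keyword overlap, for illustration; real systems use embeddings.
    """
    query_words = set(query.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    scored = sorted(
        paragraphs,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return "\n\n".join(scored[:top_k])

doc = (
    "Restart the service with `systemctl restart app`.\n\n"
    "Billing is handled monthly.\n\n"
    "The service logs live in /var/log/app."
)
print(extract_relevant_section(doc, "how do I restart the service", top_k=1))
```

Even this crude scorer keeps the billing paragraph off the wire, which is the whole economic point.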

Technique 4: Reuse System Prompts Across Batches

If you make multiple API calls with the same system prompt, you’re charged for those tokens every time. For large system prompts, some providers offer prompt caching: with the Anthropic API, reading a cached prompt costs 10% of the base input price (writing to the cache costs 25% more than base), so a big reused prefix gets far cheaper after the first call.

For AlgoVesta, we standardize on short, reusable system prompts. Instead of a 300-token prompt that gets sent with every request, we use a 40-token instruction set. The savings across millions of inferences are substantial.

Token Costs Across Models: The Comparison

Tokens are the currency, but the actual price per token varies wildly. Here’s real pricing as of March 2025:

Model                       Input (per 1M tokens)    Output (per 1M tokens)    Cost per 10K input
Claude 3.5 Sonnet           $3.00                    $15.00                    $0.03
GPT-4o                      $5.00                    $15.00                    $0.05
GPT-4 Turbo                 $10.00                   $30.00                    $0.10
Llama 3 70B (self-hosted)   $0 (your infra)          $0 (your infra)           $0 (your infra)

Claude 3.5 Sonnet is the cheapest for input. GPT-4o costs about 67% more per input token ($5 vs $3), and GPT-4 Turbo more than 3x as much. If you’re running millions of inferences, token efficiency directly compounds to cost savings.

Self-hosted models (Llama 3 70B, Mistral, etc.) shift the cost to your infrastructure. No per-token billing, but you pay for compute upfront. The break-even point depends on your volume and infrastructure costs.
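The break-even arithmetic is easy to sketch. Prices per million input tokens come from the table above; the $2,000/month self-hosting figure is a made-up placeholder you’d replace with your own infrastructure cost:

```python
# Compare API cost against a fixed self-hosting cost at a given monthly volume.
# API prices are per 1M input tokens (from the table above); the self-hosting
# figure is an illustrative placeholder, not a real quote.
API_PRICE_PER_1M = {"claude-3-5-sonnet": 3.00, "gpt-4o": 5.00, "gpt-4-turbo": 10.00}
SELF_HOST_MONTHLY = 2_000.00

def monthly_api_cost(model: str, input_tokens_per_month: int) -> float:
    return API_PRICE_PER_1M[model] * input_tokens_per_month / 1_000_000

volume = 500_000_000  # 500M input tokens/month
for model in API_PRICE_PER_1M:
    cost = monthly_api_cost(model, volume)
    print(f"{model}: ${cost:,.2f}/month vs ${SELF_HOST_MONTHLY:,.2f} self-hosted")
```

At this hypothetical volume, Claude 3.5 Sonnet input alone runs $1,500/month, under the placeholder self-hosting cost; output tokens and ops overhead shift the comparison either way, so plug in your real numbers.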

Common Token Limit Pitfalls and How to Avoid Them

Pitfall 1: Forgetting System Prompt Tokens Count

Your available context isn’t the full window minus your input. It’s the window minus your system prompt minus your input. A 1,000-token system prompt isn’t free.

With GPT-4o (128,000-token window), a 1,000-token system prompt + 2,000-token input + 4,000-token desired output = 7,000 tokens used. You have 121,000 left for actual content. That’s still plenty, but if you’re using this math carelessly across many layers of prompts, you lose buffer fast.

Pitfall 2: Not Accounting for Repeated Tokens

If you’re building a system where a user asks multiple questions in sequence, don’t assume you can fit N questions into the context window. Every message carries a few tokens of formatting metadata, and each new turn re-sends the entire conversation history, so billed input tokens grow with every exchange.

In practice, a 10-turn conversation with brief messages can consume several times the tokens of the raw text alone, between per-message overhead and re-sent history.
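Each API call in a conversation re-sends the full history, so billed input tokens compound across turns. A sketch of that compounding, assuming a 4-token per-message overhead (the real figure depends on the API’s message formatting):

```python
# Total billed input tokens across a multi-turn chat, assuming the full
# history is re-sent every turn. PER_MESSAGE_OVERHEAD is an assumption;
# measure your API's actual formatting overhead.
PER_MESSAGE_OVERHEAD = 4

def billed_input_tokens(message_tokens: list[int]) -> int:
    total = 0
    history = 0
    for tokens in message_tokens:
        history += tokens + PER_MESSAGE_OVERHEAD
        total += history  # each turn re-sends everything so far
    return total

turns = [20] * 10  # ten brief 20-token messages
raw = sum(turns)
print(f"raw text: {raw} tokens, billed input: {billed_input_tokens(turns)} tokens")
```

Under these assumptions, 200 tokens of raw text bill as 1,320 input tokens across the conversation; the multiplier grows with every additional turn.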

Pitfall 3: Truncating Long Inputs Naively

If you hit a token limit, don’t just cut text at position N and hope you didn’t lose critical information. Explicitly select the sections you need, or use RAG to surface relevant content.

Naive truncation causes hallucinations. The model tries to make sense of partial information and manufactures missing context.

Pitfall 4: Assuming All Whitespace Is Free

Blank lines, indentation, and extra spaces all tokenize. If you’re formatting a prompt with lots of whitespace for readability, you’re burning tokens on invisible characters.

For user-facing prompts, readability matters. For internal system prompts, compress: remove extra newlines, use single spaces, keep formatting minimal.
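For machine-consumed prompts, the compression step can be a two-line regex pass. A minimal sketch; apply it only to internal prompts, never to user-facing text:

```python
import re

def compress_whitespace(prompt: str) -> str:
    """Collapse runs of blank lines and spaces in internal prompts.

    Keep user-facing text readable; apply this only to machine-consumed prompts.
    """
    prompt = re.sub(r"\n{3,}", "\n\n", prompt)  # at most one blank line in a row
    prompt = re.sub(r"[ \t]+", " ", prompt)     # collapse runs of spaces/tabs
    return prompt.strip()

verbose = "Rules:\n\n\n\n  - be   brief\n\n\n  - cite    sources\n\n"
print(repr(compress_whitespace(verbose)))
```

Be careful with prompts that embed code or whitespace-significant content — this pass would mangle Python indentation, for example, so exempt those sections.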

Practical Workflow: Designing a Token-Efficient System

Here’s the step-by-step approach I use when building new LLM features:

Step 1: Measure your baseline. Write your prompt the way that makes sense to you. Measure token count. Note it.

Step 2: Identify what tokens actually matter. System prompt. Input. Examples. Which bucket is consuming the most? If it’s examples, cut them. If it’s input, you need better document selection. If it’s system prompt, trim instructions.

Step 3: Set a token budget and stay in it. Decide the maximum tokens you can afford (cost) or tolerate (latency). Include buffer for variation.

Step 4: Test quality at budget. Run your prompt at the token limit with real inputs. Does output quality degrade? If yes, you’re cutting too much. If no, cut more.

Step 5: Monitor in production. Log actual token usage. If real-world usage consistently exceeds your estimate by 20%, adjust your budget. If it comes in under by 30%, you have room to add value (more context, better examples, clearer instructions).
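Step 5 reduces to a small comparison you can run on every logged request. The thresholds mirror the 20%-over and 30%-under guidance above; the function name and messages are illustrative:

```python
# Flag drift between estimated and actual token usage, per the thresholds
# above (20% over -> adjust budget; 30% under -> room to add value).

def budget_status(estimated: int, actual: int) -> str:
    ratio = actual / estimated
    if ratio > 1.20:
        return "over: raise your budget or trim the prompt"
    if ratio < 0.70:
        return "under: room to add context or examples"
    return "on target"

print(budget_status(1_000, 1_300))  # 30% over the estimate
print(budget_status(1_000, 600))    # 40% under the estimate
print(budget_status(1_000, 1_050))
```

Wire this to the `usage` fields captured in Option 2 and you have production monitoring for nearly free.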

Do This Today

Pick one prompt or system you’re using regularly. Run the official tokenizer for your model on it. Get the actual number. Then measure the word count and calculate your token-to-word ratio. Most people will be surprised it’s not 1:1. Understanding that delta—how your actual usage diverges from your assumption—is where efficiency begins.

If you’re building a multi-message system (chat, conversation, retrieval), measure a full interaction end-to-end. See what formatting overhead looks like in your specific workflow. That overhead is stable for a given model and message format—measure it once and reuse it in every future estimate.

Batikan
