Your API call completes. Claude or GPT-4o returns an answer. But somewhere in the middle of your 8,000-word document, it stopped paying attention. Not because the model broke — because you ran out of context window.
A context window is the maximum number of tokens an LLM can process in a single request. Claude 3.5 Sonnet handles 200,000 tokens, GPT-4o handles 128,000, and Llama 3 70B handles 8,000. Go over the limit and your request fails. Stay under it but cram in too much, and the model’s attention degrades on material buried in the middle — a phenomenon called the “lost in the middle” problem.
This isn’t a theoretical limitation. It breaks real production systems: customer support chatbots that can’t remember early conversation turns, document analysis pipelines that miss critical sections, and research workflows that choke on PDFs.
How Context Window Actually Works
Every word, number, punctuation mark, and whitespace gets converted to tokens before the model processes it. One token ≈ 4 characters in English, but varies by language and structure.
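The four-characters-per-token rule of thumb is enough for a quick budget estimate before you call the API. A minimal sketch of that heuristic (not an exact tokenizer — OpenAI's tiktoken library gives exact counts; the function name here is illustrative):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the rule of thumb above: ~4 characters per
    # English token. Exact counts vary by language and structure.
    return len(text) // 4

# A 400-character English passage is roughly 100 tokens
estimate_tokens("a" * 400)  # → 100
```

Use the estimate for budgeting and alerting; switch to an exact tokenizer before you rely on it near the hard limit.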
A 200,000-token Claude Sonnet window breaks down like this:
- System prompt: 500 tokens
- User input (your document): 150,000 tokens
- Conversation history: 30,000 tokens
- Reserved for output: 19,500 tokens
You’ve got 19,500 tokens left for the model’s response. If you need a detailed analysis, that’s enough. If you need multiple reasoning steps, you’re cutting it close.
The math is rigid: input tokens + output tokens ≤ context window. Exceed it, and most API providers reject the request with a 400 error rather than silently truncating your input.
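That constraint can be checked before you send a request. A minimal sketch using the 200,000-token budget from the breakdown above (the function name is illustrative):

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 200_000) -> bool:
    # Input plus the output you reserve must fit inside the window
    return input_tokens + max_output_tokens <= context_window

# The breakdown above: 500 + 150,000 + 30,000 = 180,500 input tokens,
# with 19,500 reserved for output — exactly fills a 200K window
fits_context(180_500, 19_500)   # → True
fits_context(180_500, 20_000)   # → False
```

Running this check client-side turns a 400 error into a decision point: summarize, chunk, or prune before sending.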
The Lost in the Middle Problem Is Real
In mid-2023, researchers at Stanford tested whether LLMs actually use all the context they claim to support. They inserted a key fact at different positions in a long document and asked the model to retrieve it.
The finding: models perform best on information at the beginning and end of context. Information in the middle — positions 40–60% through the document — gets processed with 25–35% lower accuracy than the same information at the start.
This isn’t a Claude or GPT-4o problem specifically — it affects all transformer-based models. The likely reason: attention is biased toward the start of the sequence (an artifact of positional encodings and training data) and toward the most recent tokens, so material in the middle receives comparatively weak attention.
Practical impact: if your customer support bot carries a long, multi-turn conversation, the middle turns get degraded treatment. If your document analyzer processes a 50-page PDF, pages 20–30 effectively become invisible.
Technique 1: Summarize Before Processing
Instead of sending the entire document, compress it first.
# Bad approach: send full document
User: "Analyze this 30-page contract. What are the key obligations?"
[send entire 30-page contract as input]
The model uses valuable context window on boilerplate sections that don’t matter.
# Improved approach: two-stage process
Step 1: Summarize the document
Prompt: "Summarize this contract in 500 tokens. Keep obligations, timeline, and payment terms. Remove boilerplate."
[send full contract]
Output: 500-token summary
Step 2: Analyze the summary
Prompt: "Based on this summary, list all counterparty obligations and which party bears each risk."
[send the 500-token summary]
Output: Structured analysis
Why this works: you use context window on the first call to extract signal, then process only the signal on the second call. The second call is faster, cheaper, and more accurate because the model works with distilled information.
Real token savings: the 30-page contract above (≈15,000 tokens) becomes a 500-token summary, so the second analysis call drops from roughly 15,500 tokens to about 1,000.
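Under the hood, the two-stage flow is just two sequential model calls, with only the first seeing the full document. A minimal sketch, where `llm` is a placeholder for whatever chat-completion call you use (it is not a real API — inject your own client):

```python
def summarize_then_analyze(llm, document: str) -> str:
    """Two-stage flow: compress the document, then analyze only the summary."""
    # Stage 1: the only call that pays for the full document
    summary = llm(
        "Summarize this contract in 500 tokens. Keep obligations, "
        "timeline, and payment terms. Remove boilerplate.\n\n" + document
    )
    # Stage 2: works on distilled signal, not raw text
    return llm(
        "Based on this summary, list all counterparty obligations "
        "and which party bears each risk.\n\n" + summary
    )
```

Because `llm` is injected, the flow is model-agnostic and easy to unit-test with a stub before wiring in a real client.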
Technique 2: Chunk and Re-Rank for Conversation History
Long conversations are the hardest context problem because every new message appends to history. After 15 exchanges, you’ve consumed 8,000–15,000 tokens just on conversation memory.
# Problem: conversation history bloats
Conversation turn 20:
System: [original system prompt]
User: [turn 1]
Assistant: [response]
User: [turn 2]
Assistant: [response]
... [turns 3–19] ...
User: [turn 20] <- new message
Assistant: [model responds]
By turn 20, the model has seen 15+ irrelevant exchanges before reaching the current question. By turn 50, context is mostly dead weight.
Solution: use a re-ranking approach.
After every 8–10 turns, score each historical message by relevance to the current conversation thread using embeddings or a lightweight language model. Keep only the top 5–7 most relevant past turns, plus the 2 most recent turns. Discard the rest.
from openai import OpenAI
import numpy as np

client = OpenAI()

def prune_conversation_history(history, current_message, max_turns=7):
    """Keep the past turns most relevant to the current message,
    plus the 2 most recent turns, in chronological order."""
    # Track where each past user message sits in the full history,
    # so relevance scores map back to the right turns
    user_indices = [i for i, m in enumerate(history) if m["role"] == "user"]
    past_messages = [history[i]["content"] for i in user_indices]

    # Embed all past user messages and the current message in one call
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=past_messages + [current_message],
    )
    vectors = np.array([item.embedding for item in response.data])
    current_vec, past_vecs = vectors[-1], vectors[:-1]

    # Cosine similarity between the current message and each past turn
    scores = past_vecs @ current_vec / (
        np.linalg.norm(past_vecs, axis=1) * np.linalg.norm(current_vec)
    )

    # Keep the top-N most relevant turns plus the last 2 turns,
    # deduplicated and restored to chronological order
    keep = {user_indices[i] for i in np.argsort(scores)[-max_turns:]}
    keep.update(range(max(0, len(history) - 2), len(history)))
    return [history[i] for i in sorted(keep)]
Why this works: a 50-turn conversation (≈12,000 tokens) shrinks to at most 9 relevant turns (≈2,000 tokens). The model still has context but focuses on the turns that matter to the current question.
Failure mode: if all historical messages are equally relevant (e.g., iterative refinement of a design), re-ranking doesn't help. In that case, use rolling windows instead — keep only the last 10 turns, discard older ones entirely.
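The rolling-window fallback is a few lines. A minimal sketch that keeps the system prompt (if any) plus the last N turns, assuming the usual list-of-role/content-dicts message format:

```python
def rolling_window(history, max_turns=10):
    """Keep the system message plus only the most recent max_turns messages."""
    # Never discard the system prompt: it carries standing instructions
    system = [m for m in history if m["role"] == "system"]
    recent = [m for m in history if m["role"] != "system"][-max_turns:]
    return system + recent
```

It is cruder than re-ranking but has zero extra cost — no embedding calls — which makes it the right default when every turn is equally relevant.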
Technique 3: Strategic Chunking for Documents
For PDFs, research papers, or long documents, split into sections and process each independently, then synthesize results.
# Workflow: chunk → analyze → synthesize
Document: research_paper.pdf (80 pages, ≈40,000 tokens)
Step 1: Chunk by section
[Abstract] → 500 tokens
[Introduction] → 3,000 tokens
[Methods] → 4,000 tokens
[Results] → 5,000 tokens
[Discussion] → 4,000 tokens
[Conclusion] → 1,000 tokens
Step 2: Analyze each chunk
For each chunk:
Prompt: "Extract key findings from this section."
Input: [one section only]
Output: 200-token summary
Step 3: Synthesize
Prompt: "Here are key findings from each section of a research paper. What is the paper's central contribution?"
Input: [all 6 section summaries, ≈1,200 tokens total]
Output: Final analysis
Cost vs. context: instead of one 40,000-token input, you make 6 smaller calls (each ≤5,000 tokens) plus one synthesis call. That's more API calls, but each one uses far less context, keeping Claude Sonnet and GPT-4o in their accuracy sweet spot. Total token consumption is similar, but accuracy improves because the model focuses on one section at a time.
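The chunk → analyze → synthesize workflow is a small map-reduce over sections. A minimal sketch, where `llm` again stands in for your chat-completion call of choice (not a real API):

```python
def chunk_and_synthesize(llm, sections):
    """Map: extract findings per section. Reduce: synthesize one answer."""
    # Map step: each call sees exactly one section, well inside the
    # model's accuracy sweet spot
    findings = [
        llm("Extract key findings from this section.\n\n" + s)
        for s in sections
    ]
    # Reduce step: synthesize from the compact per-section findings
    return llm(
        "Here are key findings from each section of a research paper. "
        "What is the paper's central contribution?\n\n"
        + "\n\n".join(findings)
    )
```

The map calls are independent, so in production you can run them concurrently to offset the extra round-trips.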
What to Do Today
Pick one document or conversation you're processing with an LLM right now. Measure its token count using OpenAI's tiktoken library (Python) or Anthropic's token-counting API. If it exceeds 50% of your model's context window, apply one technique from this article: summarize before analyzing, re-rank old conversation turns, or chunk by section. Run both versions side by side for a week, tracking cost and accuracy, and you'll see which approach fits your use case.