Your API call completes. Claude or GPT-4o returns an answer. But somewhere in the middle of your 8,000-word document, it stopped paying attention. Not because the model broke — because you ran out of context window.
A context window is the maximum number of tokens an LLM can process in a single request. Claude 3.5 Sonnet handles 200,000 tokens, GPT-4o handles 128,000, and Llama 3 70B handles 8,000. Go over the limit and your request fails. Stay under it but cram in too much, and the model’s attention degrades on material buried in the middle — a phenomenon called the “lost in the middle” problem.
This isn’t a theoretical limitation. It breaks real production systems: customer support chatbots that can’t remember early conversation turns, document analysis pipelines that miss critical sections, and research workflows that choke on PDFs.
How Context Window Actually Works
Every word, number, punctuation mark, and whitespace gets converted to tokens before the model processes it. One token ≈ 4 characters in English, but varies by language and structure.
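The four-characters-per-token rule of thumb is enough for a quick budget estimate before you call the API. A minimal sketch of that heuristic (not an exact tokenizer — OpenAI's tiktoken library gives exact counts; the function name here is illustrative):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the rule of thumb above: ~4 characters per
    # English token. Exact counts vary by language and structure.
    return len(text) // 4

# A 400-character English passage is roughly 100 tokens
estimate_tokens("a" * 400)  # → 100
```

Use the estimate for budgeting and alerting; switch to an exact tokenizer before you rely on it near the hard limit.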
A 200,000-token Claude Sonnet window breaks down like this:
- System prompt: 500 tokens
- User input (your document): 150,000 tokens
- Conversation history: 30,000 tokens
- Reserved for output: 19,500 tokens
You’ve got 19,500 tokens left for the model’s response. If you need a detailed analysis, that’s enough. If you need multiple reasoning steps, you’re cutting it close.
The math is rigid: input tokens + output tokens ≤ context window. Exceed it, and most API providers reject the request with a 400 error rather than silently truncating your input.
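That constraint can be checked before you send a request. A minimal sketch using the 200,000-token budget from the breakdown above (the function name is illustrative):

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 200_000) -> bool:
    # Input plus the output you reserve must fit inside the window
    return input_tokens + max_output_tokens <= context_window

# The breakdown above: 500 + 150,000 + 30,000 = 180,500 input tokens,
# with 19,500 reserved for output — exactly fills a 200K window
fits_context(180_500, 19_500)   # → True
fits_context(180_500, 20_000)   # → False
```

Running this check client-side turns a 400 error into a decision point: summarize, chunk, or prune before sending.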
The Lost in the Middle Problem Is Real
In mid-2023, researchers at Stanford tested whether LLMs actually use all the context they claim to support. They inserted a key fact at different positions in a long document and asked the model to retrieve it.
The finding: models perform best on information at the beginning and end of context. Information in the middle — positions 40–60% through the document — gets processed with 25–35% lower accuracy than the same information at the start.
This isn’t a Claude or GPT-4o problem specifically — it affects all transformer-based models. The likely reason: attention is biased toward the start of the sequence (an artifact of positional encodings and training data) and toward the most recent tokens, so material in the middle receives comparatively weak attention.
Practical impact: if your customer support bot carries a long, multi-turn conversation, the middle turns get degraded treatment. If your document analyzer processes a 50-page PDF, pages 20–30 effectively become invisible.
Technique 1: Summarize Before Processing
Instead of sending the entire document, compress it first.
# Bad approach: send full document
User: "Analyze this 30-page contract. What are the key obligations?"
[send entire 30-page contract as input]
The model uses valuable context window on boilerplate sections that don’t matter.
# Improved approach: two-stage process
Step 1: Summarize the document
Prompt: "Summarize this contract in 500 tokens. Keep obligations, timeline, and payment terms. Remove boilerplate."
[send full contract]
Output: 500-token summary
Step 2: Analyze the summary
Prompt: "Based on this summary, list all counterparty obligations and which party bears each risk."
[send the 500-token summary]
Output: Structured analysis
Why this works: you use context window on the first call to extract signal, then process only the signal on the second call. The second call is faster, cheaper, and more accurate because the model works with distilled information.
Real token savings: the 30-page contract above (≈15,000 tokens) becomes a 500-token summary, so the second analysis call drops from roughly 15,500 tokens to about 1,000.
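Under the hood, the two-stage flow is just two sequential model calls, with only the first seeing the full document. A minimal sketch, where `llm` is a placeholder for whatever chat-completion call you use (it is not a real API — inject your own client):

```python
def summarize_then_analyze(llm, document: str) -> str:
    """Two-stage flow: compress the document, then analyze only the summary."""
    # Stage 1: the only call that pays for the full document
    summary = llm(
        "Summarize this contract in 500 tokens. Keep obligations, "
        "timeline, and payment terms. Remove boilerplate.\n\n" + document
    )
    # Stage 2: works on distilled signal, not raw text
    return llm(
        "Based on this summary, list all counterparty obligations "
        "and which party bears each risk.\n\n" + summary
    )
```

Because `llm` is injected, the flow is model-agnostic and easy to unit-test with a stub before wiring in a real client.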
Technique 2: Chunk and Re-Rank for Conversation History
Long conversations are the hardest context problem because every new message appends to history. After 15 exchanges, you’ve consumed 8,000–15,000 tokens just on conversation memory.
# Problem: conversation history bloats
Conversation turn 20:
System: [original system prompt]
User: [turn 1]
Assistant: [response]
User: [turn 2]
Assistant: [response]
... [turns 3–19] ...
User: [turn 20] <- new message
Assistant: [model responds]
By turn 20, the model has seen 15+ irrelevant exchanges before reaching the current question. By turn 50, context is mostly dead weight.
Solution: use a re-ranking approach.
After every 8–10 turns, score each historical message by relevance to the current conversation thread using embeddings or a lightweight language model. Keep only the top 5–7 most relevant past turns, plus the 2 most recent turns. Discard the rest.
from openai import OpenAI
import numpy as np

client = OpenAI()

def prune_conversation_history(history, current_message, max_turns=7):
    """Keep the past turns most relevant to the current message,
    plus the 2 most recent turns, in chronological order."""
    # Track where each past user message sits in the full history,
    # so relevance scores map back to the right turns
    user_indices = [i for i, m in enumerate(history) if m["role"] == "user"]
    past_messages = [history[i]["content"] for i in user_indices]

    # Embed all past user messages and the current message in one call
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=past_messages + [current_message],
    )
    vectors = np.array([item.embedding for item in response.data])
    current_vec, past_vecs = vectors[-1], vectors[:-1]

    # Cosine similarity between the current message and each past turn
    scores = past_vecs @ current_vec / (
        np.linalg.norm(past_vecs, axis=1) * np.linalg.norm(current_vec)
    )

    # Keep the top-N most relevant turns plus the last 2 turns,
    # deduplicated and restored to chronological order
    keep = {user_indices[i] for i in np.argsort(scores)[-max_turns:]}
    keep.update(range(max(0, len(history) - 2), len(history)))
    return [history[i] for i in sorted(keep)]
Why this works: a 50-turn conversation (≈12,000 tokens) shrinks to at most 9 relevant turns (≈2,000 tokens). The model still has context but focuses on the turns that matter to the current question.
Failure mode: if all historical messages are equally relevant (e.g., iterative refinement of a design), re-ranking doesn't help. In that case, use rolling windows instead — keep only the last 10 turns, discard older ones entirely.
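The rolling-window fallback is a few lines. A minimal sketch that keeps the system prompt (if any) plus the last N turns, assuming the usual list-of-role/content-dicts message format:

```python
def rolling_window(history, max_turns=10):
    """Keep the system message plus only the most recent max_turns messages."""
    # Never discard the system prompt: it carries standing instructions
    system = [m for m in history if m["role"] == "system"]
    recent = [m for m in history if m["role"] != "system"][-max_turns:]
    return system + recent
```

It is cruder than re-ranking but has zero extra cost — no embedding calls — which makes it the right default when every turn is equally relevant.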
Technique 3: Strategic Chunking for Documents
For PDFs, research papers, or long documents, split into sections and process each independently, then synthesize results.
# Workflow: chunk → analyze → synthesize
Document: research_paper.pdf (80 pages, ≈40,000 tokens)
Step 1: Chunk by section
[Abstract] → 500 tokens
[Introduction] → 3,000 tokens
[Methods] → 4,000 tokens
[Results] → 5,000 tokens
[Discussion] → 4,000 tokens
[Conclusion] → 1,000 tokens
Step 2: Analyze each chunk
For each chunk:
Prompt: "Extract key findings from this section."
Input: [one section only]
Output: 200-token summary
Step 3: Synthesize
Prompt: "Here are key findings from each section of a research paper. What is the paper's central contribution?"
Input: [all 6 section summaries, ≈1,200 tokens total]
Output: Final analysis
Cost vs. context: instead of one 40,000-token input, you make 6 smaller calls (each ≤5,000 tokens) plus one synthesis call. That's more API calls, but each one uses far less context, keeping Claude Sonnet and GPT-4o in their accuracy sweet spot. Total token consumption is similar, but accuracy improves because the model focuses on one section at a time.
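The chunk → analyze → synthesize workflow is a small map-reduce over sections. A minimal sketch, where `llm` again stands in for your chat-completion call of choice (not a real API):

```python
def chunk_and_synthesize(llm, sections):
    """Map: extract findings per section. Reduce: synthesize one answer."""
    # Map step: each call sees exactly one section, well inside the
    # model's accuracy sweet spot
    findings = [
        llm("Extract key findings from this section.\n\n" + s)
        for s in sections
    ]
    # Reduce step: synthesize from the compact per-section findings
    return llm(
        "Here are key findings from each section of a research paper. "
        "What is the paper's central contribution?\n\n"
        + "\n\n".join(findings)
    )
```

The map calls are independent, so in production you can run them concurrently to offset the extra round-trips.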
What to Do Today
Pick one document or conversation you're processing with an LLM right now. Measure its token count using OpenAI's tiktoken library (Python) or Anthropic's token-counting API. If it exceeds 50% of your model's context window, apply one technique from this article: summarize before analyzing, re-rank old conversation turns, or chunk by section. Run both versions side by side for a week, tracking cost and accuracy, and you'll see which approach fits your use case.