Learning Lab · 3 min read

Context Window Management: Processing Long Docs Without Losing Data

Context window limits break production AI systems. Learn three concrete techniques to handle long documents and conversations without losing data or burning API costs.

Context Window Management for Long Documents & Conversations

Your API call completes. Claude or GPT-4o returns an answer. But somewhere in the middle of your 8,000-word document, it stopped paying attention. Not because the model broke — because you ran out of context window.

Context window is the maximum number of tokens an LLM can process in a single request. Claude 3.5 Sonnet handles 200,000 tokens; GPT-4o handles 128,000; Llama 3 70B handles 8,192. Go over the limit and your request fails. Stay under it but cram in too much, and the model’s attention degrades on material buried in the middle — a phenomenon called the “lost in the middle” problem.

This isn’t a theoretical limitation. It breaks real production systems: customer support chatbots that can’t remember early conversation turns, document analysis pipelines that miss critical sections, and research workflows that choke on PDFs.

How Context Window Actually Works

Every word, number, punctuation mark, and run of whitespace gets converted to tokens before the model processes it. One token ≈ 4 characters in English, though the ratio varies by language and structure.

A 200,000-token Claude Sonnet window breaks down like this:

  • System prompt: 500 tokens
  • User input (your document): 150,000 tokens
  • Conversation history: 30,000 tokens
  • Reserved for output: 19,500 tokens

You’ve got 19,500 tokens left for the model’s response. If you need a detailed analysis, that’s enough. If you need multiple reasoning steps, you’re cutting it close.

The math is rigid: input tokens + output tokens ≤ context window. Exceed it, and most API providers reject the request with a 400 error. Some services queue it. None of them silently truncate.
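That budget math is worth checking before every call. Here is a minimal preflight sketch using the rough 4-characters-per-token heuristic from above — the function names and the 200,000-token default are illustrative, not from any SDK; for exact counts, swap in a real tokenizer such as tiktoken.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, max_output_tokens: int,
                    context_window: int = 200_000) -> bool:
    """Check input tokens + reserved output tokens against the context window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window
```

Run the check before the API call; when it returns False, fall back to summarization or chunking instead of letting the request fail.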

The Lost in the Middle Problem Is Real

In mid-2023, researchers at Stanford and UC Berkeley tested whether LLMs actually use all the context they claim to support. They inserted a key fact at different positions in a long document and asked the model to retrieve it.

The finding: models perform best on information at the beginning and end of context. Information in the middle — positions 40–60% through the document — gets processed with 25–35% lower accuracy than the same information at the start.

This isn’t a Claude or GPT-4o problem specifically. It affects all transformer-based models. The likely reason: attention in these models shows strong primacy and recency effects — tokens at the very start and very end of the context receive the most weight, while tokens in the middle compete for what’s left.

Practical impact: if your customer support bot carries a 50-message conversation, the middle turns get degraded treatment. If your document analyzer processes a 50-page PDF, pages 20–30 are the most likely to be missed.

Technique 1: Summarize Before Processing

Instead of sending the entire document, compress it first.

# Bad approach: send full document
User: "Analyze this 30-page contract. What are the key obligations?"
[send entire 30-page contract as input]

The model uses valuable context window on boilerplate sections that don’t matter.

# Improved approach: two-stage process
Step 1: Summarize the document
Prompt: "Summarize this contract in 500 tokens. Keep obligations, timeline, and payment terms. Remove boilerplate."
[send full contract]
Output: 500-token summary

Step 2: Analyze the summary
Prompt: "Based on this summary, list all counterparty obligations and which party bears each risk."
[send the 500-token summary]
Output: Structured analysis

Why this works: you use context window on the first call to extract signal, then process only the signal on the second call. The second call is faster, cheaper, and more accurate because the model works with distilled information.

Real token savings: a 30-page contract (≈15,000 tokens) becomes a 500-token summary. Your second analysis call drops from 15,500 tokens to 1,000.
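The two-stage flow can be wired up as a small pipeline. In this sketch, `call_llm` is a placeholder for whatever client call you actually use (Anthropic, OpenAI, or a local model); the prompts are the ones from the steps above.

```python
from typing import Callable

def two_stage_analysis(document: str, call_llm: Callable[[str], str]) -> str:
    """Summarize the full document first, then analyze only the summary."""
    # Call 1: spend context extracting signal from the full document
    summary = call_llm(
        "Summarize this contract in 500 tokens. Keep obligations, "
        "timeline, and payment terms. Remove boilerplate.\n\n" + document
    )
    # Call 2: analyze only the distilled summary
    return call_llm(
        "Based on this summary, list all counterparty obligations "
        "and which party bears each risk.\n\n" + summary
    )
```

Injecting `call_llm` as a parameter also makes the pipeline trivial to test with a stub before pointing it at a paid API.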

Technique 2: Chunk and Re-Rank for Conversation History

Long conversations are the hardest context problem because every new message appends to history. After 15 exchanges, you’ve consumed 8,000–15,000 tokens just on conversation memory.

# Problem: conversation history bloats
Conversation turn 20:
System: [original system prompt]
User: [turn 1]
Assistant: [response]
User: [turn 2]
Assistant: [response]
... [turns 3–19] ...
User: [turn 20] <- new message
Assistant: [model responds]

By turn 20, the model has seen 15+ irrelevant exchanges before reaching the current question. By turn 50, context is mostly dead weight.

Solution: use a re-ranking approach.

After every 8–10 turns, score each historical message by relevance to the current conversation thread using embeddings or a lightweight language model. Keep only the top 5–7 most relevant past turns, plus the 2 most recent turns. Discard the rest.

import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

def prune_conversation_history(history, current_message, max_turns=7):
    """Keep the most relevant past turns, always including the 2 most recent."""
    if len(history) <= max_turns:
        return history  # nothing worth pruning yet

    # Embed every past message together with the current one
    texts = [h["content"] for h in history] + [current_message]
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small",
    )
    embeddings = np.array([d.embedding for d in response.data])

    current_embedding = embeddings[-1:]   # shape (1, dim)
    past_embeddings = embeddings[:-1]     # shape (len(history), dim)

    # Score each past message by similarity to the current message
    relevance_scores = cosine_similarity(current_embedding, past_embeddings)[0]

    # Always keep the 2 most recent turns, then fill by relevance
    keep = set(range(len(history) - 2, len(history)))
    for i in np.argsort(relevance_scores)[::-1]:
        if len(keep) >= max_turns + 2:
            break
        keep.add(int(i))

    # Preserve chronological order
    return [history[i] for i in sorted(keep)]

Why this works: a 50-turn conversation (≈12,000 tokens) shrinks to 7–9 relevant turns (≈2,000 tokens). The model still has context but focuses on turns that matter to the current question.

Failure mode: if all historical messages are equally relevant (e.g., iterative refinement of a design), re-ranking doesn't help. In that case, use rolling windows instead — keep only the last 10 turns, discard older ones entirely.
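The rolling-window fallback is nearly a one-liner; `max_turns=10` matches the suggestion above.

```python
def rolling_window(history: list, max_turns: int = 10) -> list:
    """Keep only the most recent turns; drop everything older entirely."""
    return history[-max_turns:]
```

It sacrifices long-range recall for predictable token usage, which is the right trade when every turn is equally relevant.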

Technique 3: Strategic Chunking for Documents

For PDFs, research papers, or long documents, split into sections and process each independently, then synthesize results.

# Workflow: chunk → analyze → synthesize
Document: research_paper.pdf (80 pages, ≈40,000 tokens)

Step 1: Chunk by section
[Abstract] → 500 tokens
[Introduction] → 3,000 tokens
[Methods] → 4,000 tokens
[Results] → 5,000 tokens
[Discussion] → 4,000 tokens
[Conclusion] → 1,000 tokens

Step 2: Analyze each chunk
For each chunk:
  Prompt: "Extract key findings from this section."
  Input: [one section only]
  Output: 200-token summary

Step 3: Synthesize
Prompt: "Here are key findings from each section of a research paper. What is the paper's central contribution?"
Input: [all 6 section summaries, ≈1,200 tokens total]
Output: Final analysis

Cost vs. context: instead of one 40,000-token input, you make 6 smaller calls (each ≤5,000 tokens) plus one synthesis call. More API calls, but each uses less context, hitting Claude Sonnet or GPT-4o in their accuracy sweet spot. Total token consumption is similar, but accuracy improves because the model focuses on one section at a time.
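The chunk → analyze → synthesize workflow can be sketched as a loop plus one synthesis call. As before, `call_llm` is a placeholder for your client call, and the section names come from whatever splitter you run on the PDF.

```python
from typing import Callable, Dict

def analyze_by_section(sections: Dict[str, str],
                       call_llm: Callable[[str], str]) -> str:
    """Analyze each section independently, then synthesize the summaries."""
    summaries = []
    for name, text in sections.items():
        # One small call per section keeps each input in the accuracy sweet spot
        finding = call_llm("Extract key findings from this section.\n\n" + text)
        summaries.append(f"[{name}] {finding}")
    # Final call sees only the distilled summaries, not the full paper
    return call_llm(
        "Here are key findings from each section of a research paper. "
        "What is the paper's central contribution?\n\n" + "\n".join(summaries)
    )
```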

What to Do Today

Pick one document or conversation you're processing with an LLM right now. Measure its token count using OpenAI's tiktoken library (Python) or Anthropic's token-counting endpoint (`client.messages.count_tokens`). If it exceeds 50% of your model's context window, apply one technique from this article: summarize before analyzing, re-rank old conversation turns, or chunk by section. Run both versions side-by-side for one week. Track cost and accuracy. You'll see which approach fits your use case.

Batikan