AI Literacy Faculty Fellows
University of New Mexico · 2026
College of Arts & Sciences
Section 1
A stage you set, not a bucket you fill
Every conversation with an LLM takes place on a finite stage. What you put on that stage — and where you put it — shapes everything the model can say.
When you send a message to an LLM, the model doesn't simply receive your words and respond. It processes a larger construction — a context window — that may include your current message, the history of your conversation, instructions set by the platform, retrieved documents, and more. Everything the model can "see" when generating its response lives within this window.
The context window is measured in tokens — the units of text that LLMs process internally, covered in detail in the tokenization explainer. A token is roughly three-quarters of an English word on average. Current models vary enormously in how large a context window they support: from a few thousand tokens to over a million.
But the window is not simply "your question." It is an assembled space with multiple contributors — and the model's response has to fit within whatever budget remains after all those contributors have taken their share.
What fills a context window
[Illustrative figure: a stacked bar dividing the window among the system prompt, platform memory, conversation history, retrieved documents, your current message, and the response space.]
Proportions are illustrative. Actual distribution varies by platform, conversation length, and task.
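To make the budget arithmetic concrete, here is a minimal sketch of how a platform might assemble a window from those contributors. Every name in it is hypothetical, and real platforms differ in what they include and how they trim; the point is simply that all the pieces draw on one shared budget, and in this sketch it is the conversation history that gives way when the budget runs out.

```python
# Illustrative sketch only: assembling a context window from several
# contributors under a fixed token budget. Names are hypothetical.

def count_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per English token.
    # A real system would use the model's own tokenizer.
    return max(1, len(text) // 4)

def assemble_context(system_prompt, memory, history, documents, user_message,
                     window_size=8192, reserve_for_response=1024):
    budget = window_size - reserve_for_response
    fixed = [system_prompt, memory, *documents, user_message]
    used = sum(count_tokens(part) for part in fixed)

    # Whatever budget remains after the fixed contributors goes to history.
    # When it runs out, the oldest turns are dropped first.
    kept_history = []
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept_history.insert(0, turn)
        used += cost

    return [system_prompt, memory, *kept_history, *documents, user_message], used
```

The sketch is not any platform's actual implementation; it exists to show the arithmetic. Every contributor spends from the same budget, and something has to give when the total exceeds the window.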
This artifact explores what happens when that finite space is pushed to its limits — and what happens even when it isn't.
Dig deeper: Token counts vary by language
The tokenization artifact covered this in depth, but it is worth recalling here: a context window that holds 8,000 English tokens may effectively hold far fewer words in morphologically complex languages like Finnish or Turkish, or in logographic writing systems like Chinese or Japanese. The window is the same size in tokens; the amount of meaningful content it can hold is not.
This means a user writing in Japanese is working with a functionally smaller context than an English-speaking user with identical settings — not because the platform discriminates, but because tokenization efficiency varies by language. Context window limitations are not language-neutral.
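You can observe this directly with a tokenizer. The sketch below uses OpenAI's open-source tiktoken library (pip install tiktoken) and a pair of roughly equivalent sentences we wrote for illustration; exact counts will differ by tokenizer, but the broad pattern generally holds.

```python
# Compare how many tokens roughly equivalent sentences take in two languages.
# The exact counts depend on the tokenizer; the broad pattern does not.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":  "The context window is a finite stage.",
    "Japanese": "コンテキストウィンドウは有限の舞台です。",
}

for language, text in samples.items():
    tokens = enc.encode(text)
    print(f"{language}: {len(text)} characters -> {len(tokens)} tokens")
```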
Dig deeper: Context window sizes across models
Context window sizes have grown dramatically. Early transformer models (2017–2019) typically supported 512–2,048 tokens. By 2023, 8,000–32,000 token windows were common. By 2025–2026, leading models offer 200,000 tokens (Claude) to over 1 million tokens (Gemini). The practical question is not just whether content fits, but whether the model can effectively use content distributed across such a large window — which is what Sections 2 and 3 examine.
Section 2
Size and its limits
When a document exceeds the context window, the model reads the end — not the beginning. This is almost always the opposite of what users expect.
The most intuitive failure mode for context windows is simple overflow: you give the model more text than it can hold. But how overflow is handled is counterintuitive. When a document exceeds the context window, many models and platforms retain the most recent tokens and discard the earliest. The model reads the end of the document, not the beginning.
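A minimal sketch of that front-truncation behavior, assuming the input has already been converted to a list of token ids; the function name is ours, not any runtime's actual API, and individual runtimes may handle overflow differently.

```python
# Minimal sketch of front-truncation: when the input exceeds the window,
# keep the most recent tokens and drop the earliest. This mirrors what we
# observed in the demonstrations below; other systems may behave differently.

def truncate_to_window(token_ids: list[int], window_size: int) -> list[int]:
    if len(token_ids) <= window_size:
        return token_ids
    # Keep the tail: the last window_size tokens survive, the opening is lost.
    return token_ids[-window_size:]

# A 5,215-token poem squeezed into a 2,048-token window keeps only the
# final ~39% (by the token counts below, roughly Parts IV through VI).
poem_tokens = list(range(5215))          # stand-in for the real token ids
visible = truncate_to_window(poem_tokens, 2048)
print(len(visible), visible[0])          # 2048 tokens, starting at index 3167
```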
Our demonstration text
The Ballad of Reading Gaol (Oscar Wilde, 1898)² — written after Wilde's release from Reading Gaol, where he was imprisoned for "gross indecency" rooted in the criminalization of his homosexuality. The poem is approximately 5,215 tokens long, divided into six unequal parts. Its varied structure and rich content make it possible to observe what different context window sizes allow a model to see and say.
Token distribution across the poem's six parts:
Part I: 747 · Part II: 625 · Part III: 1,803 · Part IV: 1,085 · Part V: 814 · Part VI: 141
We asked a local LLM (Qwen3:14b via Ollama) to summarize each of the poem's six sections in one sentence, at three context window sizes.
2,048 tokens · ~39% of poem in context (end) · Confabulated
"Part I: The speaker describes the harsh and dehumanizing conditions of prison life… Part II: The speaker reflects on the cruelty of the legal system…"
Six plausible-sounding summaries invented from near-zero poem content. Only the ending was in context; the early sections were fabricated.
4,096 tokens · ~79% of poem in context (end) · Shifted
"I. The speaker reflects on the harsh reality of prison life… III. The speaker portrays the grim aftermath of the execution…"
Content present but section attributions shifted by approximately one section. What it calls Part III is actually Part IV's material.
8,192 tokens · Full poem with room to spare · Accurate
"Part I: The speaker describes the execution of a man who killed his lover… Part III: The speaker and fellow prisoners endure the torment of waiting…"
Accurate section-by-section summary. Full poem available; all six parts correctly identified.
The 2,048 response is not obviously wrong to someone unfamiliar with the poem. It reads as competent literary analysis. This is the characteristic failure mode: confident confabulation that sounds right. The model fills the gaps with plausible-sounding material rather than acknowledging what it doesn't have.
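For readers who want to reproduce runs like these, here is a sketch using the ollama Python client (pip install ollama) against a locally pulled qwen3:14b model. The poem filename is a stand-in for wherever you keep the text; num_ctx is the Ollama option that sets the context window size. As noted in the process section, we disabled thinking mode and ran each question in a fresh chat so history never consumed budget.

```python
# Sketch of one demonstration run, assuming the ollama Python client and a
# local qwen3:14b model. The filename is a hypothetical local copy of the poem.
import ollama

POEM = open("ballad_of_reading_gaol.txt").read()
QUESTION = "Summarize each of the poem's six parts in one sentence."

for num_ctx in (2048, 4096, 8192):
    response = ollama.chat(
        model="qwen3:14b",
        messages=[{"role": "user", "content": f"{POEM}\n\n{QUESTION}"}],
        options={"num_ctx": num_ctx},   # context window size for this run
    )
    print(f"--- num_ctx = {num_ctx} ---")
    print(response["message"]["content"])
```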
Dig deeper: The character identification failure
We also asked the same model, at each window size, to trace how the speaker's relationship to the condemned man develops across the poem. The 8,192 response correctly identified the arc and retrieved the key line deep in Part III: "And all the woe that moved him so / That he gave that bitter cry, / And the wild regrets, and the bloody sweats, / None knew so well as I."
The 2,048 response fabricated a line that does not exist in the poem: "I passed, with a heavy heart, the place where he was led / And saw the dead man's face, and the living man's head." Delivered without any uncertainty marker.
The 4,096 response made a different error: it described the speaker as "directly addressing the condemned man, now referred to as 'the man in red who reads the Law.'" In fact, the man in red is the judge — the figure who sentences the condemned man to death. The poem's power dynamic was inverted. This error appears in fluent analytical prose and would likely pass casual reading.
Dig deeper: Thematic argument and impoverished interpretation
We asked the model, at each window size, to describe the argument the poem makes about the prison system. All three responses produced coherent literary analysis — but at 2,048 tokens, the response was built almost entirely from Part V material (the most explicitly argumentative section: the "brackish water," the "poison weeds," the "bricks of shame").
What the 2,048 response lacked was the narrative of Parts I–IV that gives Part V its emotional weight. It analyzed the conclusion without access to the story that earns that conclusion. The failure is not factual error but impoverishment — a reading that is technically defensible but misses what the poem is doing. This is harder to catch than a fabricated quote, and in some ways more insidious.
Section 3
Where you put things matters
Once the full document fits in the context window, you might expect the model to have reliable access to all of it. It doesn't.
In 2023, researchers at Stanford published a finding¹ that became known as the "lost in the middle" effect. Across multiple models and tasks, they found that performance was highest when relevant information appeared at the very beginning or end of the context, and degraded significantly when it sat in the middle — even for models with explicitly extended context windows. The relationship follows a U-shape: strong at the edges, weak in the center.
Dig deeper: Primacy and recency effects in human memory
The U-shaped pattern will be familiar to anyone who has studied human memory. The serial position effect — first documented by Hermann Ebbinghaus in the late nineteenth century — describes how people tend to remember items at the beginning of a list (the primacy effect) and at the end (the recency effect) better than items in the middle. The effect is robust across a wide range of tasks and populations.
The parallel between human and model memory is pedagogically useful, though the underlying mechanisms are likely quite different. Human primacy and recency effects arise from the interplay of long-term memory consolidation and the limits of working memory. LLM attention patterns emerge from training dynamics and architectural choices. The surface resemblance may or may not reflect a deeper connection.
Dig deeper: Why this might happen in LLMs
One plausible explanation involves how instruction fine-tuning works. These models are trained on datasets where the task specification is almost always placed at the beginning of the input — so models may weight early content more heavily, a pattern baked in through training rather than architectural necessity.
It is worth noting that the effect appears to be diminishing in the most recent frontier models. Gemini 2.5 Flash, for instance, shows high accuracy in needle-in-a-haystack tasks regardless of document position. The finding describes a real and documented phenomenon, but the degree to which it affects any given model in any given task is an empirical question — not a settled law.
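If you want to probe the effect on a local model yourself, a compact version of the standard needle-in-a-haystack test looks like the sketch below. The needle sentence, filler text, model name, and depths are ours and purely illustrative; a careful replication would use many needles, many documents, and repeated runs per position.

```python
# Compact position sweep in the spirit of the lost-in-the-middle studies:
# plant one "needle" fact at different depths in filler text and ask for it.
import ollama

NEEDLE = "The librarian's bicycle is painted cadmium yellow."
FILLER = "The rain fell on the empty courtyard and nobody counted the hours. " * 400
QUESTION = "What color is the librarian's bicycle? Answer in one sentence."

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):   # 0.0 = start of context, 1.0 = end
    cut = int(len(FILLER) * depth)
    document = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    response = ollama.chat(
        model="qwen3:14b",
        messages=[{"role": "user", "content": f"{document}\n\n{QUESTION}"}],
        options={"num_ctx": 16384},
    )
    print(f"depth {depth:.2f}: {response['message']['content'][:80]}")
```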
Our confirmation: all four window sizes failed
We tested the lost-in-the-middle effect using a factual retrieval question about Part III — the night before the execution, which sits deep in the middle of the poem's longest section. The correct answer contains the Chaplain "robed in white," the Sheriff "stern with gloom," the Governor "with the yellow face of Doom," and the hangman slipping through the padded door.
[Poem map: Parts I–VI in sequence, with the target passage marked deep inside Part III.]
2,048 tokens · Wrong
Retrieved Part VI burial imagery — never had access to Part III.
4,096 tokens · Wrong
Retrieved a Part IV morning scene — a plausible dawn, but the wrong one.
8,192 tokens · Wrong
Retrieved Part I: "the prison walls suddenly seemed to reel…" — full poem in context; wrong scene retrieved.
16,384 tokens · Wrong
Exact same wrong answer as 8,192 — word for word. ~11,000 tokens of headroom above poem length.
Key result
At 16,384 tokens, the model had the entire poem in context with approximately 11,000 tokens of headroom. It still retrieved the wrong scene — a Part I passage with similar emotional register to the correct Part III passage. The correct answer sits in the middle of the poem's longest section. Even with abundant context, the model bypassed it.
The interactive: six questions, four windows
The explorer below lets you compare model responses across six questions targeting different positions in the poem, at four different context window sizes. Notice how questions about the early poem fail at small window sizes (front-truncation), while Q3 fails even at the largest window (lost in the middle).
Factual Retrieval Explorer — The Ballad of Reading Gaol
[Interactive: choose a context window size for the left and right columns and compare responses side by side. A poem map of Parts I–VI shades the region in context for each column; solid outline marks the left column, dashed the right.]
Question — Part I (token ~200 of 5,215)
What does the voice behind the speaker whisper when he is wondering about the condemned man?
2,048 tokens · Wrong
The voice whispers: "All men kill the thing they love."
Part VI material — the poem's conclusion. The 2,048 window retained the end of the poem; the opening was never in context.
16,384 tokens · Correct
The voice whispers: "That fellow's got to swing."
Correct. Full poem in context with ample headroom. Part I material reliably accessible.
Section 4
The stage you don't see
Platforms construct context on your behalf, invisibly, through memory systems you may not know exist — and the decisions about what to retain and how to represent it are not yours to make.
What our demonstrations didn't show
The demonstrations in Sections 2 and 3 isolated context window size as cleanly as possible. Real interactions are messier. The context window includes the platform's system prompt, any memory constructed from previous conversations, retrieved information, and the conversation history itself. Each of these consumes tokens from the same finite budget.
The demonstrations also couldn't show the effect of where you place information in your own prompts. A user who front-loads key constraints and context places that information in a position the model weights more heavily. A user who buries the most important instruction in the middle of a long prompt may find it under-weighted — for the same positional reasons Section 3 documented. How you design your context matters. That is a topic for a future artifact in this series, in the guides subseries on effective prompting.
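As a small illustration of that principle (not a recipe, and not the full treatment the prompting guide will give it), here is one way to edge-load a prompt. The file name, constraint, and task wording are placeholders.

```python
# Illustrative only: placing the key constraint at the edges of the prompt,
# where Section 3 suggests the model attends most reliably. Whether this
# helps for a given model and task is an empirical question.
constraint = "Cite only passages that appear verbatim in the document."
task = "Summarize the document's argument in three sentences."
document = "...full text of the source document..."   # placeholder

prompt = (
    f"{constraint}\n{task}\n\n"      # key instruction and task up front
    f"{document}\n\n"
    f"Reminder: {constraint}"        # restated at the end, near the response
)
```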
Platform memory: the context you didn't write
Modern AI platforms construct persistent context — memory — that carries across sessions. This memory occupies space in the context window, and the decisions about what to remember, how to summarize it, and when to surface it are made by the platform, not the user.
The Claude platform offers a useful worked example because it makes these systems relatively explicit:
User preferences · scope: all conversations
A manually editable space for persistent preferences — tone, format, areas of expertise. Intended as a user-controlled layer. Always present in the context window.
General memory · scope: general chats
Automatically generated summaries of conversations outside of Projects. The system distills what it considers salient. Users typically cannot see exactly what has been retained or how it was summarized.
Project memory · scope: per project
Each Project maintains its own memory context, isolated from general memory. Rich project-specific context can accumulate without carrying over to other conversations.
When a platform summarizes your conversation history and injects that summary into future context windows, it is making editorial choices — about what is salient, what can be compressed, what can be dropped. Those choices are not neutral, and they are largely invisible to users. The model you interact with tomorrow is working from a stage that was set, in part, by an automated process you didn't observe.
Dig deeper: Memory systems and potential bias
Platform memory systems make choices about what to retain and how to characterize users. If those choices systematically misrepresent certain users — by over-emphasizing particular attributes, by failing to capture cultural or linguistic context, by applying categories that don't fit — the effect compounds over time. Each conversation begins from a slightly distorted representation.
We are not aware of dedicated empirical research on whether current platform memory systems introduce systematic bias by demographic group, linguistic background, or cultural context. The opacity of these systems makes such research difficult to conduct. This is an open question, not a settled finding — but it is a question worth asking.
Dig deeper: Retrieval-augmented generation (RAG)
Many production AI systems extend the effective context window through Retrieval-Augmented Generation (RAG)³: rather than loading an entire document corpus into context, the system retrieves relevant passages on demand and injects them. This allows systems to work with knowledge bases far larger than any context window could hold.
RAG systems have their own positional dynamics: retrieved passages are injected at a specific location in the context window, and their placement affects retrieval quality in the same ways the lost-in-the-middle research documents. A well-designed RAG system places the most relevant retrieved content at the edges of the context window, not the middle.
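A minimal sketch of the pattern, with simple word-overlap scoring standing in for the learned embeddings and vector index a real system would use; every function name here is ours, and the edge-first ordering follows the positional findings above rather than any particular library's behavior.

```python
# Minimal RAG sketch: retrieve the passages most relevant to the query and
# place them at the edges of the assembled prompt. Word overlap stands in
# for a real embedding model so the example stays self-contained.

def score(query: str, passage: str) -> int:
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    passages = retrieve(query, corpus)
    if len(passages) > 2:
        # Most relevant passage first, second most relevant last,
        # the remainder in the middle.
        passages = [passages[0]] + passages[2:] + [passages[1]]
    context = "\n\n".join(passages)
    return f"Use only the passages below to answer.\n\n{context}\n\nQuestion: {query}"
```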
Next in the series: A different axis of what shapes the model's output — not where information sits in context, but how the model chooses between possible next tokens — is explored in the temperature explainer.
References
1. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. doi.org/10.48550/arXiv.2307.03172
2. Wilde, O. (1898). The Ballad of Reading Gaol. Leonard Smithers. Poetry Foundation.
3. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33. doi.org/10.48550/arXiv.2005.11401
How this was made
Built through vibe-coding — iterative natural-language collaboration with Claude (Anthropic) generating the HTML, CSS, and JavaScript. All model responses in Sections 2 and 3 were generated using qwen3:14b running locally via Ollama on consumer hardware (RTX 4070, 32GB RAM). Context window sizes were set manually, thinking mode was disabled, and each question was run in a clean chat to prevent history from consuming context budget.
An honest note on process: early in testing, we assumed context overflow would discard the end of the input rather than the beginning. The data immediately contradicted this — small context windows produced correct answers about the poem's ending but wrong answers about its beginning. The reversal changed how we interpreted every Section 2 result. We've mentioned this here because catching one's own misreading of the data is itself the kind of critical engagement this program aims to develop. Data generated April 2026.
A full walkthrough of the design and build process for all artifacts in this series is available on the series process page.