AI Literacy Faculty Fellows
University of New Mexico · 2026
College of Arts & Sciences
Section 1
The static problem
The previous artifact showed how words become vectors. Here is the problem: one word, one vector — always, regardless of context.
The embedding table assigns every token a fixed vector, learned during training and thereafter unchanged. The word bank gets one vector. Whether it appears in a sentence about rivers or a sentence about mortgages, the model starts with the same representation. The geometric structure explored in the previous artifact — the relationships between words, the directions that encode meaning — is all computed from those static vectors.
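A minimal sketch of what "static" means in practice, using PyTorch's nn.Embedding as a stand-in for the embedding table (the toy vocabulary and dimensions are invented for illustration): the lookup depends only on the token's ID, never on the surrounding words.

```python
# Minimal sketch: a static embedding table is just a lookup.
# The vocabulary and dimensions here are invented for illustration.
import torch
import torch.nn as nn

vocab = {"the": 0, "bank": 1, "river": 2, "mortgage": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

river_sentence = torch.tensor([vocab["the"], vocab["river"], vocab["bank"]])
money_sentence = torch.tensor([vocab["the"], vocab["bank"], vocab["mortgage"]])

# "bank" gets exactly the same vector in both contexts.
bank_in_river = embedding(river_sentence)[2]
bank_in_money = embedding(money_sentence)[1]
print(torch.equal(bank_in_river, bank_in_money))  # True
```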
This matters more than it might seem. Language is saturated with context-dependence — not just obvious cases like bank, but ordinary pronouns, adjectives, and verbs whose meaning shifts depending on what surrounds them. Consider this sentence, which we'll return to throughout this artifact:
The trophy did not fit in the suitcase because it was too big.
What does it refer to? A human resolves this immediately — the trophy was too big, not the suitcase. But the static embedding for "it" carries no such information. It is the same vector whether "it" refers to the trophy, the suitcase, or something three sentences earlier. The embedding geometry has no mechanism to resolve this. Something else has to.
Static embedding: "it" = one vector. The same regardless of whether "it" refers to the trophy, the suitcase, or something mentioned three sentences earlier.

After attention: "it" = a mixture. Updated by attending to other tokens, carrying information about what "it" most likely refers to in this specific sentence.
The mechanism that produces this update is called attention. It allows every token to consult every other token in the sequence, weighting them by relevance, and incorporate their information into an updated representation.
Attention is what makes the same model that assigned "bank" a static vector produce outputs that correctly distinguish "river bank" from "savings bank." The static embedding is where processing begins; attention is how context enters.
Dig deeper · The history: from RNNs to attention
Before attention, language models processed sequences using recurrent neural networks (RNNs) — architectures that read tokens one at a time, maintaining a "hidden state" that accumulated information from the sequence so far. The problem: by the time a long sentence was fully processed, early information had been compressed, overwritten, or lost through a bottleneck.
The first attention mechanism (Bahdanau et al., 2015)1 was a targeted fix: instead of forcing the decoder to work from a single compressed vector, let it look back at all the encoder's hidden states simultaneously and weight them by relevance. This produced measurable improvements in translation quality, especially on long sentences.
The transformer architecture (Vaswani et al., 2017, "Attention Is All You Need")2 went further: eliminate the recurrence entirely, and build representations using only attention. This meant every token could directly attend to every other token in a single parallel operation — no sequential bottleneck, no vanishing gradients through time. Modern language models are all transformer-based descendants of that design.
Section 2
Queries, keys, and values
Every token simultaneously asks a question, posts an advertisement, and prepares a contribution.
Attention is computed using three projections of each token's representation, called the query, the key, and the value. These aren't separate inputs — they're three different views of the same token, each serving a distinct function.
Query — what am I looking for?
The query encodes what kind of information this token needs from its context. A pronoun like it has a query that effectively asks: "which earlier noun is my referent?"
Key — what do I advertise?
The key encodes what kind of queries this token is a good answer for. A noun like trophy has a key that signals: "I am a concrete, physical object that could be someone's referent."
Value — what do I contribute when called upon?
The value encodes what information this token actually provides — primarily its semantic content, shaped by everything the model has learned about how that content is useful in context.
To compute attention, the model takes the dot product of each token's query against every other token's key. Tokens whose keys closely match the query receive high attention weights; others receive low weights. After a softmax normalization (making the weights sum to 1), those weights are used to compute a weighted average of the value vectors.
The result at each position is not a selection — it is a mixture. The output for it isn't "trophy" or "suitcase"; it's a blend of all value vectors in the sentence, weighted toward whichever tokens the query-key matching identified as most relevant. For a pronoun with a clear referent, that mixture is heavily weighted toward the referent. For an ambiguous pronoun, the weights may be split.
A common misconception: attention weights show where the model "focused." More precisely, they show how information was mixed — a high weight on token X means X's value contributed substantially to the updated representation. Multiple tokens can have high weights simultaneously. This is a weighted average, not a spotlight.
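A toy numerical sketch of that mixing, with invented vectors standing in for the query of "it" and the keys and values of a few other tokens (real models learn all of these; the numbers below are purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["trophy", "suitcase", "because", "big"]
# Invented 4-dimensional keys and values, one row per token.
keys = np.array([
    [1.0, 0.2, 0.0, 0.5],   # trophy
    [0.9, 0.1, 0.0, 0.4],   # suitcase
    [0.0, 0.0, 1.0, 0.0],   # because
    [0.1, 0.8, 0.0, 0.0],   # big
])
values = np.random.randn(4, 4)

query_it = np.array([1.0, 0.3, 0.0, 0.6])   # what "it" is looking for

scores = keys @ query_it                     # one dot product per token
weights = softmax(scores)                    # nonnegative, sums to 1
updated_it = weights @ values                # a blend of all value vectors

print(dict(zip(tokens, weights.round(3))))
```

The printed weights sum to 1, and more than one token can carry substantial weight: the output is a blend, not a single selected token.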
Why are query and key separate from value?
It might seem simpler to just use one vector per token. The separation matters because matching (what you're looking for, what you advertise) and contributing (what information you provide) can usefully differ. A token might be highly relevant to many queries while having a narrow, specific semantic contribution. The three-way decomposition gives the model more flexibility to learn these distinctions.
Dig deeper · The mathematics
For a sequence of T tokens, each already embedded as a vector of dimension d_model, the attention computation begins by projecting every token into three spaces via learned weight matrices W_Q, W_K, W_V:

Q = X · W_Q    K = X · W_K    V = X · W_V

Q and K are each T × d_k matrices (one row per token) — they must share the same dimension so that QKᵀ is defined. V is T × d_v, where d_v can differ from d_k in principle, though in practice the original transformer sets d_v = d_k for each head. The attention output is:

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

QKᵀ is a T × T matrix of all pairwise query-key dot products. Dividing by √d_k keeps those dot products from growing large enough to push the softmax into near-zero-gradient regions. Softmax is applied row-wise, producing a probability distribution per token. Multiplying by V yields T output vectors — one updated representation per token, incorporating context from across the sequence.
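The same computation as a minimal NumPy sketch, with randomly initialized matrices standing in for the learned W_Q, W_K, W_V and arbitrary dimensions:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d_model, d_k = 6, 16, 8          # sequence length and dimensions (arbitrary)
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))   # stand-in token embeddings
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))   # d_v = d_k here, as in the original transformer

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)     # T x T matrix of scaled dot products
weights = softmax_rows(scores)      # each row is a probability distribution
output = weights @ V                # one updated vector per token

print(weights.sum(axis=1))          # every row sums to 1
print(output.shape)                 # (6, 8)
```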
One important property: this computation has no intrinsic sense of word order. If you shuffled all tokens randomly, you would get the same outputs, just rearranged — the operation is permutation-equivariant. Positional information is injected separately, as a "positional encoding" vector added to each token's embedding before any attention computation, making word-order sensitivity a separable, engineered addition to an otherwise orderless mechanism.
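As a sketch of one such scheme, here is the sinusoidal positional encoding from the original transformer paper (other models learn positional vectors or use different mechanisms entirely; this is just one concrete instance):

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """Sinusoidal positional encoding from Vaswani et al. (2017)."""
    positions = np.arange(T)[:, None]             # (T, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before attention, so that otherwise
# identical tokens at different positions receive different inputs.
T, d_model = 6, 16
X = np.random.randn(T, d_model)                   # stand-in embeddings
X_with_position = X + sinusoidal_positional_encoding(T, d_model)
```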
Dig deeper · Syntax, semantics, and the Q/K/V decomposition
A useful (if imperfect) gloss on the Q/K/V decomposition: queries and keys tend to track syntactic relationships — which tokens need to consult which other tokens, based on grammatical structure. Values tend to carry semantic content — what information actually flows when a relationship is established.
This isn't designed in — it emerges from training. But it makes intuitive sense: the question "does this adjective modify this noun?" is structurally determined, while the information that flows from noun to adjective (or vice versa) is semantic.
The decomposition is imperfect in practice. Some heads appear to do primarily syntactic work; others primarily semantic work; others something less interpretable. The query-key similarity determines whether information flows; the value determines what information flows. But these aren't orthogonal — a token's value may encode information about its syntactic role, and its key may encode semantic properties that determine relevance.
Section 3
Attention in action
What does it actually look like when "it" finds "trophy"?
The visualizations below show real attention weights from BERT,3 a transformer model trained by Google. Select a sentence, then click any token to see which other tokens it attends to most strongly.
A note on the model: BERT differs from the models you use day-to-day (like Claude or ChatGPT) in one important way: it reads sentences in both directions simultaneously. Most production language models read left-to-right only, which affects attention patterns slightly. For the linguistic relationships shown here, the distinction doesn't change what you see — but it's worth knowing. The mechanism is the same; the directionality differs.
Dig deeper · BERT vs. GPT: two architectural choices
BERT (Bidirectional Encoder Representations from Transformers, Devlin et al. 2018) and GPT (Generative Pre-trained Transformer, Radford et al. 2018)4 represent two distinct uses of the transformer architecture, developed almost simultaneously.
BERT is a bidirectional encoder: it processes the entire sequence at once, allowing every token to attend to every other token in both directions. This produces rich contextual representations well-suited for tasks that require understanding complete sentences — classification, named entity recognition, reading comprehension. BERT is not suited to generating text autoregressively: it was trained to fill in masked tokens using context from both directions, not to predict the next token from a left-to-right prefix.
GPT is an autoregressive decoder: it processes tokens left-to-right, with each token only able to attend to previous tokens (future tokens are masked during training). This constraint is what enables text generation — the model predicts each next token based only on what came before. In principle, a bidirectional model could generate text by jointly predicting all tokens in a fixed-length window simultaneously, but this is computationally unwieldy and not used in practice.
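The masking is a small modification to the attention computation sketched in the mathematics section: before the softmax, every position to the right of the current token is set to negative infinity, so it receives zero weight. A minimal sketch:

```python
import numpy as np

T = 5
scores = np.random.randn(T, T)     # stand-in for the scaled query-key scores

# Causal mask: token i may only attend to tokens 0..i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))        # upper triangle is all zeros: no attention to the future
```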
The models you interact with (Claude, ChatGPT, Gemini) use GPT-style autoregressive architectures. BERT-style encoders are more common in specialized NLP applications where generation isn't required.
Dig deeper · Equity: whose sentences get visualized?
The sentences chosen for this artifact reflect a particular set of linguistic phenomena — pronoun resolution, adjective-noun modification, subject-verb agreement — that are well-studied in the English-language NLP literature. These are real and important phenomena, but they're not the only phenomena worth understanding, and they reflect assumptions about what "interesting" linguistic structure looks like.
Languages with different syntactic structures — verb-final languages like Japanese or Turkish, languages with grammatical gender agreement like French or Arabic, languages with complex evidentiality systems like many indigenous American languages — would present different attention patterns and different challenges. A pronoun disambiguation example presupposes a pronominal system like English's; many languages handle reference tracking very differently.
BERT itself was trained primarily on English Wikipedia and BookCorpus. Its attention patterns reflect the statistical structure of that corpus. Multilingual BERT (mBERT) extends this to 104 languages but with much less training data per language. The attention phenomena visualized here are real; whether they generalize to other languages and linguistic structures is an empirical question, not an assumption.
Section 4
Many heads, many relationships
One attention pattern isn't enough. Language has syntax and semantics and reference all operating simultaneously.
A single attention operation can only capture one set of token relationships at a time. But in any given sentence, multiple relationships matter simultaneously — which noun is the pronoun's referent, what verb does this adverb modify, what is the subject of this passive construction. Multi-head attention addresses this by running several independent attention operations in parallel, each with its own learned query, key, and value projections.
The outputs of all heads are concatenated and projected back to the original dimension. The model learns, through training, to use each head for different purposes — though nobody designed that specialization. It emerges because having multiple parallel attention operations allows the network to simultaneously track different structural relationships, and the training objective rewards doing so.
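A sketch of that bookkeeping, reusing the single-head attention from the mathematics section (head count, dimensions, and initialization are arbitrary):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax_rows(Q @ K.T / np.sqrt(W_K.shape[1]))
    return weights @ V

T, d_model, n_heads = 6, 16, 4
d_k = d_model // n_heads                 # each head works in a smaller subspace
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d_model))

# One independent set of learned projections per head.
heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X, W_Q, W_K, W_V))

# Concatenate the head outputs and project back to the original dimension.
W_O = rng.normal(size=(n_heads * d_k, d_model))
output = np.concatenate(heads, axis=-1) @ W_O
print(output.shape)                      # (6, 16): same shape as the input
```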
Seeing specialization
The panels below show three different attention heads operating on the same sentence — Mary told Jane that she would be late. Each head is looking at the same tokens but producing different attention patterns, reflecting different aspects of the sentence's structure.
The labels shown are post-hoc human interpretations, not designed behaviors. Notice that the first panel is labeled "next-token" rather than "verb→object" — because on close inspection, this head tracks immediate sequence neighbors rather than syntactic roles. The verb-object relationship and next-token proximity happen to coincide here. Distinguishing these requires looking at many sentences, not just one. This is one of the central challenges of mechanistic interpretability: the model works, but what it's actually computing often resists clean description.
The autoregressive constraint — and how thinking models address it
All of this sophisticated attention machinery still operates under a fundamental constraint: modern language models generate text one token at a time, left-to-right, without the ability to revise earlier tokens in light of later context.
When you receive a response, each word was generated based only on everything that came before it — not on what came after. A sentence that builds toward a surprise, or a paragraph that opens with a deliberate ambiguity resolved at the end, is genuinely harder for these models than for humans who plan ahead. The model is always committing to the next word without seeing the full structure it's building.
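A minimal greedy-decoding sketch makes the constraint concrete. It uses GPT-2 via the HuggingFace Transformers library (not how production chat models are served, but the same left-to-right commitment): each new token is chosen from the prefix alone and is never revised.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The trophy did not fit in the suitcase because",
                return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits           # (1, sequence_length, vocab_size)
    next_id = logits[0, -1].argmax()         # greedy: most likely next token,
                                             # chosen from the prefix alone
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```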
Thinking models (like Claude's extended thinking mode) substantially mitigate this constraint. Before producing visible output, the model generates a hidden reasoning trace — working through the problem autoregressively, but in a scratchpad that becomes part of the context when generating the final response. The visible output can then attend back to the completed reasoning, producing responses that feel more forward-planned. The autoregressive constraint isn't removed — it's worked around by giving the model more context to look back at.
BERT, the model used in this artifact's visualizations, is bidirectional: it can attend in both directions simultaneously. That is why it should not be taken as representative of the production models discussed here. What you see in Section 3 is what attention can look like when the directionality constraint is lifted. The models you use day-to-day operate under the left-to-right constraint, with thinking modes as the partial workaround.
Dig deeper · Mechanistic interpretability: reading the model from the inside
The field of mechanistic interpretability attempts to understand what neural networks are actually computing — not just what outputs they produce, but what internal mechanisms produce them. Attention weight visualization is one tool in this effort, but it has known limitations: Jain and Wallace (2019)5 showed that attention weights do not always correspond to which tokens are most influential for the final output, complicating simple "the model attended to X therefore X mattered" interpretations.
More recent work has identified specific circuits — small subgraphs of attention heads and feedforward layers — that implement recognizable computations: indirect object identification, factual retrieval, in-context learning. Anthropic's interpretability team has published work on "superposition" (how models pack more features than they have dimensions by using overlapping representations) and "circuits" (how specific capabilities are implemented across layers and heads).6
The honest state of the field: we can identify some things some heads do in some models under some conditions. We cannot reliably read off what a model knows or believes from its internal representations. The model works; its workings remain substantially opaque. This is both a scientific challenge and a governance challenge — systems with capabilities we can't fully audit are harder to verify and harder to trust.
Dig deeper · Equity: whose language structures does attention learn?
Multi-head attention learns to track linguistic relationships from training data. The relationships it learns best are those most consistently represented in that data. For a model trained predominantly on English text, this means English syntactic structures — subject-verb-object order, English pronoun systems, English patterns of modification — are well-represented, while structures specific to other languages may be learned poorly or not at all.
This has practical consequences. A model that has learned robust attention patterns for English pronoun resolution may perform that task reliably. The same model handling a language with different pronoun systems, or a language where reference is tracked through verb morphology rather than pronouns, may produce attention patterns that don't map onto the relevant structure. The problem isn't that attention can't represent these structures — it can — but that learning them requires training data in those languages, and that data is systematically scarcer for most of the world's languages.
The result is models that are, in a precise technical sense, better at some languages than others — not because of architectural limitations but because of data distribution choices. This is a version of the tokenization equity problem discussed in the tokenization explainer, operating at a deeper level of the architecture.
The multi-head structure makes this more precise. A multilingual model needs to develop attention heads that track syntactic relationships across all its training languages simultaneously. Some heads may specialize effectively for one language's structural patterns while being less useful for another's. A head that reliably tracks subject-verb agreement in English — where subjects typically precede verbs — may be less useful for verb-final languages like Japanese or Turkish, where the structural cues appear in different positions. The total number of heads is fixed; the specialization that emerges reflects what the training data rewarded. Languages with more training data get more specialized, more reliable head behavior. The architecture is flexible enough to represent any language's structure in principle; the training distribution determines which structures actually get learned.
Next in the series: Attention updates each token by mixing information from other tokens in the same sequence. But how much context can the model see at once? That constraint — the context window — is explored next.
References
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of ICLR 2015. doi.org/10.48550/arXiv.1409.0473
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. doi.org/10.48550/arXiv.1706.03762
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019. doi.org/10.48550/arXiv.1810.04805
Built through vibe-coding — iterative natural-language collaboration with Claude (Anthropic) generating the HTML, CSS, and JavaScript. Attention weights were extracted from bert-base-uncased using the HuggingFace Transformers library (output_attentions=True). The model produces 12 layers × 12 heads of attention matrices per sentence — 144 distinct attention patterns. The heads shown were selected by searching for layer/head combinations where specific token pairs showed the strongest, most interpretable attention signals.
BERT was chosen over production models (Claude, GPT) because attention weights are not exposed via those models' APIs. BERT's bidirectionality means its patterns differ somewhat from what those models produce internally, but the mechanism — query-key matching, softmax weighting, value mixing — is the same. Data generated April 2026.
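A sketch of that extraction (the layer and head indices below are illustrative placeholders, not the ones selected for the visualizations):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The trophy did not fit in the suitcase because it was too big."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: 12 layers, each of shape (batch, heads, tokens, tokens)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 8, 10                       # illustrative indices only
weights = outputs.attentions[layer][0, head]

it_position = tokens.index("it")
for token, weight in zip(tokens, weights[it_position]):
    print(f"{token:>12s}  {weight:.3f}")  # how strongly "it" attends to each token
```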
A full walkthrough of the design and build process for all artifacts in this series is available on the series process page.