AI Literacy Faculty Fellows
University of New Mexico · 2026
College of Arts & Sciences
Section 1
The lookup problem
Tokens are just numbers. But numbers don't have meaning. Something has to happen between the number and the understanding.
The previous artifact showed how text gets broken into tokens — chunks assigned integer IDs. A model seeing the word trophy sees something like the number 23,847. The number is a label, not a description. It carries no information about what a trophy is, what it's used for, or how it relates to other words.
Before any language processing can happen, those integers need to become something richer. The solution is called an embedding — and understanding it changes how you think about what language models are actually doing.
The naive approaches — and why they fail
The simplest idea: just feed the integer directly into the model. But a model treating token 23,847 as a number would think it's "close to" token 23,846 and "far from" token 1. Token IDs are arbitrary labels — their numeric proximity means nothing.
A better idea: one-hot encoding. Create a vector with one slot per vocabulary item (~100,000 slots). Put a 1 in the slot for the current token, 0 everywhere else.
cat → index 4,832
kitten → index 18,201
democracy → index 62,445
One-hot encoding is unambiguous — every token gets a unique representation. But it's also enormous (100,000 dimensions), maximally sparse (one non-zero value), and relational in exactly the wrong way: cat and kitten are no more similar to each other than either is to democracy. Every word is equally distant from every other word.
What an embedding actually is
An embedding replaces the one-hot vector with a much smaller dense vector — typically 512 to 4,096 numbers — where the values are learned during training. Instead of a 100,000-slot array with one non-zero entry, each token gets a compact list of real numbers, all potentially non-zero.
These numbers are not assigned by hand. They start as random noise and get adjusted, through billions of training examples, toward values that make the model better at predicting language. The meaning of each dimension isn't specified in advance — it emerges.
The result is a table — one row per token in the vocabulary, one column per embedding dimension. When the model sees the token for trophy, it looks up that row and retrieves a dense vector of learned numbers. That vector is what the rest of the model actually works with.
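A minimal NumPy sketch of that lookup. The table values here are random stand-ins (real models learn them during training), the token ID for trophy is the illustrative number from above, and the embedding dimension is shrunk so the example runs light:

```python
import numpy as np

VOCAB_SIZE = 100_000  # one row per token in the vocabulary
EMBED_DIM = 64        # one column per dimension (4,096 in the model used
                      # for this artifact; shrunk here to keep memory small)

# In a real model these values are learned during training;
# random noise is exactly how they start out.
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(VOCAB_SIZE, EMBED_DIM))

token_id = 23_847  # the illustrative ID for "trophy" from the text
vector = embedding_table[token_id]  # the "lookup" is plain row retrieval

print(vector.shape)  # (64,): the dense vector the rest of the model works with
```

The sketch makes the point concrete: an embedding layer is not a computation at all. It is indexing into a learned array.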
Dig deeper · A brief history: from symbolic to distributed representations
Before embeddings, the dominant approach to computational language was symbolic: words were atomic symbols, and meaning was encoded in hand-crafted rules and ontologies (think: WordNet, which manually encoded relationships like "cat is a type of animal"). This worked reasonably well for narrow tasks but was brittle and required enormous human effort to build and maintain.
The shift to distributed representations — the idea that meaning could be encoded as a pattern across many numerical dimensions rather than as a discrete symbol — began in earnest in the 1980s with connectionist models. The key insight, often attributed to Hinton (1986), was that distributed representations could capture similarity relationships automatically from data rather than requiring explicit encoding.1
The modern embedding era is usually dated to Mikolov et al.'s Word2Vec (2013), which showed that simple training objectives on large corpora produced embedding spaces with surprisingly coherent geometric structure.2 The king − man + woman ≈ queen result from that paper became the emblem of a new way of thinking about what machines could learn about language.
The embeddings in this artifact were generated using Qwen3-embedding:8b, a 2024 model from Alibaba — a full decade after Word2Vec, with far more training data and a much larger embedding dimension (4,096 vs. Word2Vec's typical 300). The geometric structure is qualitatively similar; the scale is not.
Dig deeper · The mathematics of one-hot and why density helps
A one-hot vector for a vocabulary of size V is a member of ℝ^V with exactly one non-zero coordinate. The dot product between any two distinct one-hot vectors is always zero — meaning all words are perfectly orthogonal to each other. There is no geometric basis for similarity.
An embedding of dimension d replaces this with a learned matrix E of shape V × d (for a 100k vocabulary at 4,096 dimensions, that's ~400 million parameters just for the embedding table). The embedding for token i is simply row i of E.
Because the values are real numbers optimized by gradient descent, the geometry is not constrained to be orthogonal. Words that appear in similar contexts end up with similar vectors — their dot products and cosine similarities become meaningful quantities rather than always-zero artifacts.
Cosine similarity — the dot product of two unit-normalized vectors — is the standard measure of embedding proximity. A value of 1.0 means the vectors point in exactly the same direction (identical usage contexts); 0.0 means orthogonal (no shared context); negative values are possible but rare in practice.
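Both properties are easy to check numerically. A minimal sketch, using the illustrative vocabulary indices from Section 1; the dense vectors here are synthetic stand-ins, not learned embeddings:

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Distinct one-hot vectors are always orthogonal: dot product 0,
# so "cat" is exactly as far from "kitten" as from "democracy".
cat = one_hot(4_832, 100_000)
kitten = one_hot(18_201, 100_000)
print(cat @ kitten)  # 0.0

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dense vectors are not constrained to be orthogonal; their cosine
# similarity is a graded, meaningful quantity.
rng = np.random.default_rng(0)
u = rng.normal(size=4_096)
w = u + 0.5 * rng.normal(size=4_096)  # a nearby vector
print(cosine_similarity(u, w))        # well above 0: similarity is graded
```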
Section 2
The geometry of meaning
Words that are used similarly end up in similar places. That sounds simple. The implications are not.
Because embedding vectors are learned from co-occurrence patterns in text, words used in similar contexts end up with similar vectors — measurable as a high cosine similarity. Doctor and physician appear in nearly identical contexts; their vectors are close. Doctor and democracy appear in very different contexts; their vectors are far apart.
But the geometry goes further than simple similarity. It encodes relationships — and those relationships can be manipulated arithmetically.
The analogy explorer
Select an analogy below to see the arithmetic in action. The result vector is computed by adding and subtracting embedding vectors, then finding the nearest words from a curated vocabulary of approximately 490 words. These are not the nearest neighbors in the full model lexicon — they are the closest matches within a selected set designed to cover the relevant semantic territory.
These results come from a real embedding model (Qwen3-embedding:8b, 4,096 dimensions) running on a local machine. The arithmetic was performed on the actual vectors; no results were cherry-picked or adjusted. The similarity scores shown are cosine similarities. Note that the Paris example above intentionally shows a "wrong" answer — Milan rather than Rome — because the geometry reflects usage patterns, not just geographic facts.
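For readers who want the arithmetic spelled out, here is a sketch of the computation the explorer describes. The vectors dict is a hypothetical stand-in for the curated vocabulary mapped to its embeddings; excluding the query words from the candidates is a common convention in analogy evaluation, not necessarily what the explorer itself does:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(vectors, positive, negative, exclude=(), top_k=5):
    """Add the positive vectors, subtract the negative ones, then rank
    every word in the curated vocabulary by cosine similarity to the result."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    ranked = sorted(
        ((word, cosine(target, vec)) for word, vec in vectors.items()
         if word not in exclude),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]

# king − man + woman, assuming `vectors` maps words to NumPy arrays:
# analogy(vectors, positive=["king", "woman"], negative=["man"],
#         exclude=["king", "man", "woman"])
```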
What this means — and what it doesn't
The fact that king − man + woman ≈ queen is not magic, and it's not designed in. No one told the model that kings and queens are gendered counterparts. The model learned this because kings and queens appear in overlapping contexts that differ along a consistent dimension — the same dimension that separates "man" contexts from "woman" contexts.
Three important caveats before you get too comfortable with this picture:
It's approximate, not algebraic. The arithmetic works well for some relationships and fails for others. Try the Paris − France + Italy example in the explorer: it returns Milan before Rome, because Paris and Milan share a cultural association (fashion, arts) that the geographic relationship doesn't fully override. The geometry reflects usage patterns, not facts.
Dimensions aren't interpretable labels. It's tempting to imagine that one dimension means "animacy" and another means "royalty." In practice, semantic features are distributed across many dimensions simultaneously, and the axes aren't aligned with human categories. What's interpretable is directions in the space — and those require looking at many vectors at once, not individual coordinates.
This is a static embedding. Each word has exactly one vector regardless of context. The word bank — whether riverbank or financial institution — gets the same vector. This is a fundamental limitation we'll address in the next artifact.
Dig deeper · How does training actually produce this geometry?
Most embedding models are trained on a prediction task: given the surrounding words in a sentence, predict the word in the middle (or vice versa). The embedding vectors are the parameters of a small neural network trained to do this prediction well.
The key insight is distributional: two words are interchangeable in prediction — have similar "surrounding word contexts" — roughly to the extent that they mean similar things. Doctor and physician appear after "the", before "said", in sentences about hospitals and patients, and so on. The prediction task forces their vectors to converge.
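A sketch of how the raw training signal is assembled, in the spirit of Word2Vec's skip-gram setup; the whitespace tokenization and window size are simplifying assumptions:

```python
def training_pairs(tokens, window=2):
    """Yield (center, context) pairs, the raw material of the prediction task.
    Words that appear in many of the same contexts (doctor/physician)
    generate near-identical pairs, which pushes their vectors together."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the doctor said the patient was recovering".split()
for center, context in training_pairs(sentence):
    print(center, "->", context)
# First pairs: the -> doctor, the -> said, doctor -> the, doctor -> said ...
```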
The geometric regularity (king − man + woman ≈ queen) falls out because the training data contains consistent parallel structures: sentences about kings and sentences about queens differ in the same ways that sentences about men and women differ. The model finds it useful to represent this as a consistent direction in the vector space.
This also explains why the results depend on the training corpus. A model trained only on medical literature would have very different geometry for everyday words than a model trained on the web. The geometry is not a property of language in the abstract — it's a property of the specific text the model learned from.
Dig deeper · Linguistic note: what kinds of relationships get encoded?
Not all linguistic relationships survive the embedding process equally. The analogy examples above show several types:
Lexical gender pairs (king/queen, emperor/empress) are among the cleanest, because gendered variants appear in highly parallel contexts. Languages with grammatical gender (French, Spanish, Russian, Arabic) produce particularly strong gender directions — every noun has a gendered form, creating systematic parallelism throughout the corpus.
Geographic relationships (capital/country) work well when the relationship is consistent in the training data. They degrade when polysemy interferes — "Paris" appears in fashion, film, and geography contexts, creating a blended vector that doesn't sit cleanly in the "European capitals" neighborhood. The Paris→Milan result is a direct illustration of this.
Morphological relationships (walk/walking, good/better) are often the cleanest of all, because verb inflection and comparative adjective formation are highly systematic. The "-ing direction" is strikingly consistent across verbs in the explorer above.
What tends not to be encoded cleanly: metaphorical meaning, pragmatic implication, and culturally specific associations that vary across the training corpus. The embedding captures the statistical center of how a word is used, not the full range of its uses.
Section 3
What the geometry reveals — and hides
If meaning is geometry, then bias is geometry too. And geometry can be measured.
The same arithmetic that finds queen from king − man + woman can be pointed at more uncomfortable questions. How close is "engineer" to "man" versus "woman"? Do those distances differ? Do they reflect something real about how language is used — or something real about how society is structured?
Occupation and gender: a sorted distance chart
The chart below shows, for a range of occupations, the cosine similarity to "man" and to "woman", sorted from most male-proximate to most female-proximate. Both bars use the same absolute scale, so bar lengths are directly comparable across rows.
[Chart: similarity to "man" vs. similarity to "woman", one pair of bars per occupation]
Several occupations show near-zero gaps — professor (gap: +0.003), judge (+0.001), teacher (−0.005). These near-ties are as interesting as the strong gaps. They may reflect genuine change in how these professions are discussed — or they may reflect conflicting biases that cancel each other out. The geometry doesn't tell you which.
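For the mechanics, here is a sketch of how numbers like these could be computed, reusing the hypothetical vectors dict and cosine() helper from the analogy sketch above. The occupation list is illustrative, not the chart's full set:

```python
# Reuses the hypothetical `vectors` dict (word -> NumPy array) and the
# cosine() helper from the analogy sketch above.
occupations = ["engineer", "pilot", "professor", "judge", "teacher", "nurse"]

rows = []
for job in occupations:
    to_man = cosine(vectors[job], vectors["man"])
    to_woman = cosine(vectors[job], vectors["woman"])
    rows.append((job, to_man, to_woman, to_man - to_woman))

# Sort from most male-proximate to most female-proximate, as in the chart.
for job, m, w, gap in sorted(rows, key=lambda r: r[3], reverse=True):
    print(f"{job:>10}  man={m:.3f}  woman={w:.3f}  gap={gap:+.3f}")
```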
The analogy as probe: pilot → stewardess
The gender-occupation bias also shows up in the analogy arithmetic. When we compute pilot − man + woman, the nearest result is stewardess — with flight attendant also in the top five. The female-associated role in the same industry floats to the top.
Note that "pilot" itself has one of the stronger male-proximate gaps in the chart above (0.048). The analogy result compounds this — subtracting male-gendering and adding female-gendering to an already male-proximate occupation produces a female-coded job in the same field rather than a female pilot. The geometry has learned something about how these professions have been discussed — not how they need to be.
Dig deeper · The research this replicates
The occupation-gender bias probe above is a partial replication of findings from two landmark papers:
Bolukbasi et al. (2016)3 demonstrated that word2vec embeddings trained on Google News encoded systematic gender-occupation associations that tracked cultural stereotypes. Their finding that computer programmer − man + woman ≈ homemaker became a widely cited example of learned bias. The same pattern appears here with a 2024 model.
Caliskan et al. (2017)4 showed that embedding-space associations paralleled human implicit association test (IAT) results — including racial and gender associations — suggesting that language models learn not just linguistic patterns but the full structure of cultural association present in their training data.
The results above use a 2024 model (Qwen3-embedding:8b) rather than the original word2vec. The gender gaps are present but generally smaller than in older models — possibly reflecting more balanced training data, deliberate mitigation efforts, or both. The direction of the biases is consistent with the earlier findings.
An honest caveat: embedding-space bias does not map directly onto output bias in production models. Fine-tuned models like Claude or GPT-4 undergo additional training (RLHF — reinforcement learning from human feedback) that can suppress certain output patterns even when the underlying geometry still encodes them. The relationship between geometric bias and behavioral bias is real but not simple.
Beyond gender: co-occurrence and what it does — and doesn't — mean
Before looking at a more charged example, consider this: the word "teacher" has similarly high cosine similarity with both "educated" (0.707) and "uneducated" (0.691) — nearly tied. "Judge" sits almost equidistant from "man" and "woman." These near-ties arise because enormous amounts of text discuss these topics from many angles simultaneously. High similarity between two words doesn't mean the model has formed a judgment about their relationship; it means those words appear together in the training data, for whatever reason and in whatever context.
The same logic applies — with higher stakes — to the relationship between demographic terms and words like "criminal." The chart below shows cosine similarity between "criminal" and several demographic and socioeconomic anchors.
The geometric signal is real. But what does it mean? Discussions of criminality and discussions of certain groups co-occur in training data for many reasons simultaneously: news reporting on incarceration statistics, academic analysis of racial bias in policing, historical documentation of racial injustice, and unfortunately also racialized discourse that invokes these associations directly. A static embedding model has no mechanism to distinguish these contexts. All of it feeds the same co-occurrence signal.
And here is the critical practical consequence: a text generation model built on static embeddings will perpetuate those co-occurrences regardless of context. When generating text that involves either "black" or "criminal," the model's geometry will push these words toward each other — not because the model has a racist intent, but because the co-occurrence is encoded in its representations. A sentence discussing racial bias in the criminal justice system will be shaped by the same geometry as a sentence making a racial stereotype. The geometry does not know the difference.
This is why the bias findings of Bolukbasi, Caliskan, and others matter beyond academic interest. The embedding geometry is not a neutral substrate that downstream systems can freely reinterpret. It shapes what gets generated, what associations get reinforced, and whose language the model treats as the unmarked default.
Modern fine-tuned models (Claude, GPT-4) apply additional training that suppresses some of these output patterns. But that suppression is incomplete and context-dependent. The geometric bias is still present in the underlying representations — it has been partially masked, not removed.
Dig deeper · Equity implications and the limits of debiasing
Once researchers demonstrated that embedding spaces encode cultural biases, an obvious next question was: can you remove them? Several debiasing approaches have been proposed — projecting out the "gender direction," rebalancing training data, adding neutralization objectives during training.
The results have been mixed and contested. Gonen and Goldberg (2019)5 argued that many debiasing approaches create an illusion of fairness by moving biased associations slightly without eliminating them — the geometry shifts but the underlying structure remains detectable with more sensitive probes. The debate over what counts as genuine debiasing — versus cosmetic change — is ongoing.
A deeper critique: the framing of "bias" as something to be removed from an otherwise neutral system assumes that an unbiased representation of language is possible and well-defined. Meshell Sturgis (Race and Communication, UNM) has raised a version of this point in our ALFF group: if AI systems are built on text that reflects existing social structures, "debiasing" within that framework is different from questioning the framework itself. Efficiency at representation is not the same as equity of representation.
These questions don't have settled answers. What embedding geometry gives us is a measurable entry point into discussions that are otherwise hard to quantify — which is both its value and its risk.
Section 4
The limit of this picture — and what comes next
Everything you have just seen is static. "Bank" has one vector. That is about to change.
The embedding table maps each token to a fixed vector, learned during training and thereafter unchanged. When the model sees bank in "the river bank was steep," it retrieves the same vector it uses for bank in "the bank approved the loan." The geometry you've been exploring doesn't distinguish them.
This is not a minor limitation. Many of the most interesting things about language — disambiguation, metaphor, irony, reference — depend on context. A model that assigns fixed vectors to tokens can approximate context-sensitivity by training on enormous amounts of text, but it cannot represent the same word meaning different things in different sentences.
Static embedding: "bank" = one vector. The same regardless of whether it's a riverbank, a financial institution, or a blood bank.
Contextual representation: "bank" = many vectors. Updated by surrounding tokens through attention; different sentences produce genuinely different representations.
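The static half of that contrast is mechanical enough to state in code. A sketch, with a random stand-in table and a hypothetical token ID for "bank":

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(100_000, 64))  # stand-in static table

bank_id = 7_215  # hypothetical token ID for "bank"

# "the river bank was steep"  vs.  "the bank approved the loan":
# a static lookup retrieves the same row either way.
riverbank_vec = embedding_table[bank_id]
finance_vec = embedding_table[bank_id]
print(np.array_equal(riverbank_vec, finance_vec))  # True: one vector, every context
```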
The mechanism that produces contextual representations — attention — is the subject of the next artifact. What's worth noting here is the relationship between the two: the static embedding is where every token starts. Attention is how each token's representation gets updated by what surrounds it.
The embedding geometry you've seen is real, but it's the geometry of the input layer. By the time a modern transformer has processed a sentence through a dozen attention layers, the representations have been transformed far beyond their starting points. What emerges from those layers is contextual, relational, and considerably harder to visualize — but it's built on top of the foundation you now understand.
Dig deeper · ELMo and the bridge to contextual embeddings
The problem of context-insensitive embeddings was well recognized before transformers. The first widely adopted solution was ELMo (Embeddings from Language Models, Peters et al. 2018),6 which produced different embeddings for the same word in different sentences by running a bidirectional language model and using its internal states as representations.
ELMo demonstrated that contextual representations dramatically outperformed static ones on a range of NLP tasks — the same word really does mean different things in different contexts, and a model that can represent that distinction performs better. It was an important proof of concept for the approach that transformers would take further.
The difference between ELMo and transformers is architectural: ELMo used recurrent networks (LSTMs) to build contextual representations, which meant sequential processing and limited parallelism. Transformers replaced this with attention, which can be computed in parallel across the entire sequence — enabling the scale of training that makes modern language models possible.
Dig deeper · Exploring embedding spaces yourself
If you want to explore embedding spaces beyond this artifact, several accessible tools exist:
The TensorFlow Embedding Projector (projector.tensorflow.org) allows visualization of pre-trained embeddings in 2D and 3D using dimensionality reduction. It's interactive and requires no coding.
The Gensim Python library provides access to pre-trained Word2Vec and GloVe vectors with simple APIs for similarity queries and analogy arithmetic. The Word2Vec Google News vectors are those used in the original Bolukbasi et al. bias research.
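A sketch of the kind of query Gensim supports; the model name comes from Gensim's downloadable-data catalog, and the first call downloads a file over a gigabyte in size:

```python
import gensim.downloader as api

# Downloads the Google News word2vec vectors on first use, then caches them.
vectors = api.load("word2vec-google-news-300")

# Analogy arithmetic and similarity queries:
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(vectors.similarity("doctor", "physician"))
```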
If you have a local LLM running via Ollama, embedding models (such as qwen3-embedding:8b, the model used in this artifact) can be queried directly via API to produce embeddings for any text you choose. The data in this artifact was generated that way.
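A minimal sketch of that workflow, assuming a local Ollama server on its default port; the endpoint path and response shape reflect Ollama's embeddings API as commonly documented and may differ across versions:

```python
import requests  # assumes a local Ollama server on the default port

def embed(text, model="qwen3-embedding:8b"):
    """Request an embedding for `text` from a locally running Ollama server."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["embedding"]  # a list of floats (4,096 for this model)

print(len(embed("trophy")))
```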
A caution for independent exploration: the results you find will depend heavily on which model and which training data you use. Comparing embedding geometries across models can be as informative as examining any single model in isolation.
Next in the series: The mechanism that updates static embeddings into contextual representations is explored in the attention explainer. Embeddings and attention are best read as a pair.
References
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. doi.org/10.1038/323533a0
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint. doi.org/10.48550/arXiv.1301.3781
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29. doi.org/10.48550/arXiv.1607.06520
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186. doi.org/10.1126/science.aal4230
Gonen, H., & Goldberg, Y. (2019). Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. Proceedings of NAACL-HLT 2019. doi.org/10.48550/arXiv.1903.03862
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL-HLT 2018. doi.org/10.48550/arXiv.1802.05365
How this was made
Built through vibe-coding — iterative natural-language collaboration with Claude (Anthropic) generating the HTML, CSS, and JavaScript. The embedding data was generated by running qwen3-embedding:8b locally via Ollama against a curated vocabulary of approximately 490 words. Cosine similarities and analogy arithmetic were computed in Python using NumPy.
The bias data represents a partial replication of Bolukbasi et al. (2016) and Caliskan et al. (2017) using a 2024 embedding model. All numbers shown are from actual model outputs; nothing was adjusted to produce cleaner results. The Paris → Milan result in the analogy explorer is intentionally left as-is. Data generated April 2026.
A full walkthrough of the design and build process for all artifacts in this series is available on the series process page.