LLM literacy series · 01 of 07
How LLMs read — a tokenization explorer
AI Literacy Faculty Fellows
University of New Mexico · 2026
College of Arts & Sciences

LLMs don't read words. They read chunks.

Before a language model can process a single sentence, it must first break that sentence into pieces — and those pieces are not words.

When you send a message to an AI, the first thing that happens is your text is split into tokens — small chunks that might be whole words, parts of words, punctuation marks, or even individual bytes. The model never sees your actual characters; it sees a sequence of numbers representing these chunks.

Here is the sentence "I would like to buy a train ticket to Tokyo, please." as GPT-4 sees it. Each color block is one token:

I would like to buy a train ticket to Tokyo, please.
13 tokens
52 characters
0.25 tokens per character
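
These counts can be reproduced with OpenAI's open-source tiktoken library, which exposes the same encodings the GPT models use. A minimal sketch (it assumes tiktoken is installed; the printed count should match the 13-token figure above):

    # pip install tiktoken
    import tiktoken

    # Load the GPT-4 tokenizer (the "cl100k_base" encoding discussed below)
    enc = tiktoken.get_encoding("cl100k_base")

    text = "I would like to buy a train ticket to Tokyo, please."
    token_ids = enc.encode(text)                    # the integers the model actually receives
    pieces = [enc.decode([t]) for t in token_ids]   # the text chunk each integer stands for

    print(len(token_ids))   # number of tokens for this sentence
    print(token_ids)        # a list of integers
    print(pieces)           # the chunks, e.g. whole words with their leading spaces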

For English, tokens map neatly onto words and punctuation. Each word gets its own token; spaces are folded into the following token. This is tidy and efficient.

Now look at the same sentence in French:

Je voudrais acheter un billet de train pour Tokyo, s'il vous plaît.
19 tokens (GPT-4)
15 tokens (GPT-5)

Notice that voudrais splits into three tokens (v · oud · rais), billet into two (bil · let), and plaît into two (pla · ît). The tokenizer is fragmenting words it encounters less frequently, or whose accented characters are rarer in its vocabulary.

Key insight: Token boundaries are not linguistic — they are statistical. The tokenizer learned which character sequences are common enough to deserve their own slot in the vocabulary, based on its training data. English text was vastly over-represented in that training data.

Dig deeper What exactly is a token, technically?

Modern LLM tokenizers use an algorithm called Byte Pair Encoding (BPE).1 The process: start with individual bytes (the atomic units of any text file), then repeatedly merge the most frequent adjacent pair of symbols into a new symbol. Repeat this tens or hundreds of thousands of times, and you end up with a vocabulary of ~50,000–200,000 tokens that efficiently encodes the most common patterns in the training corpus.
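
The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration of the idea, operating on characters rather than bytes and on a tiny word list rather than a training corpus, not the production algorithm:

    from collections import Counter

    def bpe_train(corpus_words, num_merges):
        """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
        words = [list(w) for w in corpus_words]   # start from single characters (stand-ins for bytes)
        merges = []
        for _ in range(num_merges):
            # Count every adjacent pair of symbols across the corpus
            pairs = Counter()
            for w in words:
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += 1
            if not pairs:
                break
            best = max(pairs, key=pairs.get)      # the most frequent pair gets a new vocabulary slot
            merges.append(best)
            merged = best[0] + best[1]
            # Rewrite the corpus using the new merged symbol
            new_words = []
            for w in words:
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        out.append(merged)
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                new_words.append(out)
            words = new_words
        return merges, words

    # e.g. bpe_train(["voudrais", "vous", "voudrez", "voudrais"], num_merges=5)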

The vocabulary is fixed at training time. If a character sequence is rare in the training data, it never gets its own token — it remains a sequence of byte-level tokens, each representing a single byte of UTF-8 encoding. This is why uncommon characters can cost 2–3 tokens each.

The GPT-4 tokenizer (cl100k_base) has a vocabulary of 100,277 tokens. The newer GPT-5/O1 tokenizer (o200k_base) has 200,019 tokens — double the vocabulary, allowing more diverse character sequences to have dedicated tokens.
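
Both vocabulary sizes, and the byte-level fallback for rare characters, can be checked directly with tiktoken. A small sketch (the exact token cost of each character depends on the vocabulary):

    import tiktoken

    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        print(name, enc.n_vocab)    # vocabulary size reported by tiktoken

    # A character that was rare in the training data falls back to byte-level tokens,
    # so one on-screen character can cost several tokens.
    enc = tiktoken.get_encoding("cl100k_base")
    for ch in ["a", "é", "я", "ق", "票"]:    # Latin, accented Latin, Cyrillic, Arabic, CJK
        print(repr(ch), len(ch.encode("utf-8")), "bytes,", len(enc.encode(ch)), "tokens")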

The same idea, in seven languages

The sentence "I would like to buy a train ticket to Tokyo, please." — translated as faithfully as possible — produces very different token counts depending on the language.

Explore each below. The token count differences are not small, and they are not neutral: they reflect whose language was abundant in training data, and whose was not.2

I would like to buy a train ticket to Tokyo, please.
13 GPT-4 tokens
13 GPT-5 tokens
13 Claude tokens

English is the baseline. Every word gets its own token; punctuation gets its own token. This is the most efficient possible outcome — the tokenizer was optimized on enormous quantities of English text.

Je voudrais acheter un billet de train pour Tokyo, s'il vous plaît.

19 GPT-4 tokens
15 GPT-5 tokens
20 Claude tokens

French is close to English but accented characters and longer inflected verb forms create splits. Voudrais (would like) fragments into three tokens in GPT-4 (v · oud · rais), and billet (ticket) into two (bil · let). GPT-5's expanded vocabulary resolves these — it has seen enough French to give common verb forms their own tokens. Claude behaves similarly to GPT-4 here.

Me gustaría comprar un boleto de tren para Tokio, por favor.

17 GPT-4 tokens
13 GPT-5 tokens
19 Claude tokens

Spanish tokenizes similarly to French — accented vowels and inflected forms create extra splits in older models. Note that Tokio (the Spanish spelling of Tokyo) splits into two tokens, while the English "Tokyo" stays whole — proper nouns in non-English spellings are underrepresented in training data.

Я хотел бы купить билет на поезд до Токио, пожалуйста.

28 GPT-4 tokens
14 GPT-5 tokens
27 Claude tokens

Russian illustrates the Cyrillic script penalty in older tokenizers. Almost every word fragments into 2–3 tokens. The Cyrillic alphabet was substantially underrepresented in GPT-4's vocabulary — characters that any Russian reader sees as simple, whole units get split into byte sequences. GPT-5 halved the token count, representing one of the largest improvements of the new tokenizer. Claude shows the same pattern as GPT-4.

Dig deeper Why does script matter so much?

UTF-8, the standard way of encoding text for computers, uses variable-length byte sequences. ASCII characters (basic English letters) each use 1 byte. Cyrillic, Arabic, and most other non-Latin scripts use 2 bytes per character. CJK characters (Chinese, Japanese, Korean) use 3 bytes.
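
The byte counts are easy to verify in Python. A small sketch (the characters are just one example from each script):

    # UTF-8 length, in bytes, of a single character from each script
    samples = {"Latin 'a'": "a", "Cyrillic 'я'": "я", "Arabic 'ق'": "ق",
               "Han '票'": "票", "Hiragana 'の'": "の"}
    for label, ch in samples.items():
        print(label, "->", len(ch.encode("utf-8")), "byte(s)")
    # Latin: 1 byte; Cyrillic and Arabic: 2 bytes; Han and hiragana: 3 bytes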

When a tokenizer hasn't seen a character sequence enough times to give it a dedicated vocabulary slot, it falls back to encoding that sequence byte by byte. For a Russian word using Cyrillic characters, this can mean 2 tokens per character rather than 1 token per word. The result is dramatic inflation — Russian text that means the same thing as an English sentence might cost 2–3× as many tokens.

Token costs directly translate to API costs, response length limits, and processing time. A Russian speaker using GPT-4 was, in effect, paying 2–3× as much per meaningful exchange as an English speaker. This is an invisible structural inequity baked into the architecture.

أريد أن أشتري تذكرة قطار إلى طوكيو لو سمحت.
34 GPT-4 tokens
19 GPT-5 tokens
32 Claude tokens

Arabic shows heavy fragmentation in GPT-4 — 34 tokens for a 43-character sentence, a ratio worse than Russian. Arabic's right-to-left script, connected letterforms, and multi-byte Unicode encoding all contributed to poor vocabulary coverage in earlier training sets. GPT-5 improved substantially (19 tokens), though still not at parity with European languages. Claude is slightly better than GPT-4 here but in the same range.

Note: the token boundary visualization above is approximate for Arabic — the right-to-left display makes exact boundary representation in HTML complex. The token counts are verified; the visual boundaries are illustrative.

Dig deeper Arabic script and tokenization challenges

Arabic presents several unique challenges for tokenizers trained primarily on Latin-script text. First, the script is right-to-left, which means word boundaries appear in reverse order. Second, Arabic letters change form depending on their position in a word (initial, medial, final, or isolated forms) — the same letter has up to four visual variants. Third, Arabic is highly inflected, with roots modified by prefixes and suffixes that compound complexity. Fourth, the Unicode encoding of Arabic uses multi-byte sequences for each character.

The result is that GPT-4's tokenizer, trained predominantly on English and Romance-language web text, treats Arabic as a largely unfamiliar sequence of byte patterns rather than a structured language with predictable morphology. The improvement in GPT-5 reflects deliberate effort to include more Arabic text in the vocabulary-building phase.

A further complication: Arabic has significant variation between Modern Standard Arabic (MSA) — used in writing, news, and formal contexts — and the many spoken dialects, which vary substantially by region. The sentence used here is MSA, the appropriate register for a written educational example, but a speaker is unlikely to use this form of the language for something as mundane as buying a train ticket. This register gap has no real equivalent in English, and tokenizers trained on internet text may handle formal and colloquial Arabic very differently. The translation used here was verified by Heather Sweetser, Senior Lecturer of Arabic at UNM.

我想买一张去东京的火车票。

15 GPT-4 tokens
12 GPT-5 tokens
14 Claude tokens

Mandarin Chinese is a notable exception to the non-English penalty — it tokenizes efficiently across all three models, close to one token per character. The reason is likely the sheer volume of Mandarin text on the internet: Chinese is one of the most common languages online, so tokenizer vocabularies have good coverage of common characters. Each character tends to get its own token, which is efficient given that each character carries substantial meaning.

Note: the original sentence ended with 请 (please), a direct translation of the English. Native speaker Liping Yang (Geography, UNM) noted that Mandarin speakers would not typically end this kind of request with 请 — a thank-you at the end would be more natural. The sentence above reflects her correction. This is itself a small illustration of how direct translation can produce unnatural results — something LLMs also struggle with.

Dig deeper Why does Mandarin tokenize better than Russian?

This seems counterintuitive — Mandarin uses a completely different script (hanzi/Chinese characters) that requires 3 bytes per character in UTF-8, while Russian Cyrillic only needs 2 bytes. So why does Mandarin fare better?

The answer is training data composition. Mandarin Chinese has an enormous internet presence — it is consistently among the most widely used languages online. This means GPT-4's vocabulary-building phase saw enough Chinese character sequences to assign dedicated tokens to common characters. Russian Cyrillic, despite being used by hundreds of millions of people, had comparatively less representation in the training corpus.

This illustrates that tokenization efficiency is not primarily about linguistic or script complexity — it is about whose language was represented in the training data. It is a proxy for whose internet activity was included in the model's construction.

東京までの電車の切符を一枚買いたいのですが。

28 GPT-4 tokens
17 GPT-5 tokens
22 Claude tokens

Japanese is the most complex case — and Claude performs noticeably better here than GPT-4 (22 vs 28 tokens), though both are worse than GPT-5 (17). Japanese mixes three scripts: kanji (Chinese-derived characters), hiragana (phonetic syllable script), and katakana (phonetic script for foreign words). The hiragana grammatical particles and common verb endings (まで, の, を, が, ます) appear to have better vocabulary coverage in Claude's tokenizer, while kanji still tend to cost one token per character. The mixed-script nature means Japanese text inherits the challenges of both CJK characters and phonetic script encoding.

Dig deeper Japanese mixed script and tokenization

Standard written Japanese uses three scripts simultaneously in a single sentence. Kanji carry core semantic meaning; hiragana handle grammatical structure (particles, verb conjugations, etc.); katakana are used primarily for foreign loanwords and emphasis. A tokenizer must handle all three, plus the transitions between them.

In GPT-4, kanji tend to tokenize as individual characters (one token per character), while common hiragana sequences that appear frequently — like grammatical particles の, を, が, は — may get dedicated tokens due to their high frequency. This explains why Claude, which apparently has better hiragana coverage, achieves 22 tokens vs GPT-4's 28: the common grammatical endings are being recognized as units rather than byte sequences.
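
Claude's tokenizer is not published for direct inspection (the Claude counts on this page come from the count_tokens API), but the OpenAI side of this claim can be probed with tiktoken by encoding individual particles and kanji. A small sketch; which items get a single dedicated token is a property of each vocabulary, so run it rather than trusting the description above:

    import tiktoken

    enc_gpt4 = tiktoken.get_encoding("cl100k_base")
    enc_gpt5 = tiktoken.get_encoding("o200k_base")

    # Grammatical particles, a polite verb ending, and two kanji words from the example sentence
    for piece in ["の", "を", "が", "ます", "切符", "東京"]:
        print(piece,
              "cl100k:", len(enc_gpt4.encode(piece)), "token(s);",
              "o200k:", len(enc_gpt5.encode(piece)), "token(s)")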

This also means that the tokenization "penalty" for Japanese varies significantly depending on the text content. Academic or technical text heavy in kanji tokenizes worse than casual conversational text heavy in hiragana.

The numbers, side by side

Three tokenizers, seven languages, one sentence. The pattern is not subtle.

All figures are for the sentence "I would like to buy a train ticket to Tokyo, please." (or its closest translation). Token counts for Claude are adjusted to remove API overhead; raw Claude API counts are 7 higher.

Language      GPT-4 (cl100k_base)    GPT-5 / O1 (o200k_base)    Claude Sonnet 4.6
English                13                        13                      13
French                 19                        15                      20
Spanish                17                        13                      19
Russian                28                        14                      27
Arabic                 34                        19                      32
Mandarin               15                        12                      14
Japanese               28                        17                      22

GPT-4 and Claude Sonnet show nearly identical patterns — both imposing roughly 2–2.5× the English token cost for Russian, Arabic, and Japanese. Arabic is the highest-cost language in our data at 34 GPT-4 tokens for a 43-character sentence. GPT-5's expanded vocabulary (200k vs 100k tokens) substantially reduced the gap for all three languages. This is not a small technical detail: it represents a concrete, measurable improvement in equity of access — though what counts as genuine equity, as opposed to incremental improvement within an English-dominant framework, remains an open question.

Dig deeper A note on Qwen and multilingual tokenizers

The comparisons above are limited to OpenAI and Anthropic models. It is worth noting that Chinese-developed models like the Qwen family (Alibaba) use tokenizers explicitly optimized for CJK languages. Published benchmarks suggest Qwen tokenizes Mandarin Chinese at substantially better efficiency than GPT-4 — potentially approaching 1 token per character even for less-common characters, and handling Japanese kanji similarly well.

This reflects a broader principle: tokenizer design choices are not neutral. A model built by a team whose primary language is Mandarin will likely produce a tokenizer better suited to Mandarin. The dominance of English-language AI development has produced tokenizers that systematically advantage English speakers — and that advantage is only beginning to be corrected in newer model generations.

If you have access to a locally running Qwen model (via Ollama or similar), it is worth running the same test sentence in Mandarin and Japanese to compare token counts directly.
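
One way to do a comparable check without running a model at all is to load just the Qwen tokenizer from the Hugging Face hub. A sketch under assumptions: the checkpoint name Qwen/Qwen2.5-7B-Instruct is an example, and the resulting counts are not figures this page has verified:

    # pip install transformers  -- downloads only the tokenizer files, not the model weights
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")   # example checkpoint

    sentences = {
        "English":  "I would like to buy a train ticket to Tokyo, please.",
        "Mandarin": "我想买一张去东京的火车票。",
        "Japanese": "東京までの電車の切符を一枚買いたいのですが。",
    }
    for lang, s in sentences.items():
        ids = tok.encode(s, add_special_tokens=False)
        print(f"{lang}: {len(ids)} tokens")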

Why does this matter?

Tokenization is invisible infrastructure — but invisible infrastructure is never neutral.

Cost: A Russian or Arabic speaker using GPT-4 paid roughly twice as many tokens per exchange as an English speaker — for the same meaning.

Context window: With a fixed context window, a non-English user effectively has half the conversational memory available — their messages consume more of the available space.
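
As a back-of-the-envelope illustration of the context-window point (the window size here is a hypothetical round number; the inflation factor is the Russian-vs-English GPT-4 ratio from the table above):

    context_window = 128_000            # hypothetical context window, in tokens
    inflation = 28 / 13                 # Russian vs English GPT-4 token counts from the table

    effective_window = context_window / inflation
    print(round(effective_window))      # roughly 59,000 "English-equivalent" tokens of room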

Three practical implications

1. Prompt engineering advice is language-dependent. Guidance like "keep your prompts concise" was developed by English-speaking users for English-speaking users. For a Russian or Arabic speaker, "concise" already costs more. The efficiency strategies that work for English may not transfer.

2. Performance benchmarks may be systematically biased. If a model is evaluated primarily on English-language tasks, its performance on Russian or Arabic at equivalent token budgets may look worse — not because the model is worse at those languages, but because it is operating at a structural disadvantage imposed by the tokenizer.

3. The "democratization" narrative requires scrutiny. AI is often described as a democratizing technology — equally available to all. Tokenization inequity is one concrete mechanism by which that claim is complicated. Access is not equal if the cost structure systematically advantages one language community over others.

Dig deeper Progress, its limits, and the broader AI equity landscape

The GPT-5/O1 tokenizer (o200k_base) measurably reduced the token count for Russian, Arabic, and Japanese compared to GPT-4 — roughly halving the gap in some cases. This is real progress. But it is worth asking: progress toward what, and on whose terms?

The improvements were driven partly by commercial incentives — tokenization inefficiency for non-English languages is a competitive disadvantage as AI products expand to global markets — and partly by deliberate effort from researchers who recognized the equity problem. Both motivations are real, and they are not mutually exclusive. But framing this as a problem being "solved" risks obscuring what remains unchanged: the fundamental architecture was designed with English-dominant assumptions, and incremental vocabulary expansion works within that architecture rather than questioning it.

Meshell Sturgis (Race and Communication, UNM) raised a sharper version of this point in response to this artifact: if the problem is structural, then improvement within the existing structure — becoming more efficient at the dominant use case — is not the same as re-evaluating the values that created the structure.

There is also a related risk of treating "AI" as synonymous with "large language models," and then treating LLMs as if their current form is the natural or inevitable shape of the technology. In reality, LLMs are a specific design choice, trained in specific ways, by specific organizations, reflecting specific priorities. "Efficiency" is not a culturally neutral goal, and the tokenization patterns we observe are not laws of nature — they are consequences of choices. Whether AI systems should aim for equal treatment (same cost per token regardless of language) or something more equitable (outcomes that account for structural disadvantage) is a values question, not a technical one — and it is one the field has not yet seriously grappled with.

Claude's tokenizer appears not to have undergone a similar expansion yet — Claude and GPT-4 show nearly identical patterns across all seven languages in our data. The Qwen family of models, developed by Alibaba with explicit attention to CJK languages, demonstrates that different design priorities produce measurably different outcomes. Other AI systems — computer vision, recommendation algorithms, medical diagnostic tools, hiring screeners — carry their own embedded assumptions and their own equity implications. Tokenization is a useful entry point because it is concrete and measurable. But the critical lens it invites applies much more broadly.

Next in the series: Once text has been chunked into tokens, each token is mapped to a high-dimensional vector that captures something about its meaning. That process is explored in the embeddings explainer.

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. doi.org/10.48550/arXiv.1508.07909
  2. Petrov, A., La Malfa, E., Torr, P. H. S., & Bibi, A. (2023). Language model tokenizers introduce unfairness between languages. Advances in Neural Information Processing Systems, 36. doi.org/10.48550/arXiv.2305.15425

How this was made

Built through vibe-coding — iterative natural-language collaboration with Claude (Anthropic) generating the HTML, CSS, and JavaScript. Token counts were empirically verified in April 2026: GPT-4 and GPT-5/O1 figures from platform.openai.com/tokenizer; Claude figures from direct API calls to the count_tokens endpoint. Token boundary visualizations are illustrative approximations; the counts themselves are precise. Translation accuracy was verified by native and expert speakers — attributed in context where relevant.
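
For the Claude column, the verification step looks roughly like the sketch below, assuming the anthropic Python SDK and an API key; the model name is an example placeholder:

    # pip install anthropic  -- requires ANTHROPIC_API_KEY in the environment
    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.count_tokens(
        model="claude-sonnet-4-5",   # example model name; substitute the model you are testing
        messages=[{"role": "user",
                   "content": "I would like to buy a train ticket to Tokyo, please."}],
    )
    # The endpoint counts the whole request, so the fixed per-request overhead
    # (7 tokens in the counts reported above) is subtracted to isolate the sentence itself.
    print(resp.input_tokens - 7)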

A full walkthrough of the design and build process for all artifacts in this series is available on the series process page.