LLM literacy series · 05 of 07
The temperature parameter — randomness and its limits
AI Literacy Faculty Fellows
University of New Mexico · 2026
College of Arts & Sciences

The same question, different answers

Ask the same AI system the same question twice. You might get very different answers. Why?

Below are responses from the same model to the same prompt, generated three times at each of three settings. The only thing that changed was a single number — a parameter called temperature.

Suggest one unconventional name for a cat.
T = 0.2  ·  low
Sir Whiskerflame von Purrington
How about Sir Whiskerflame?
How about Sir Whiskerflame von Purrington?
T = 0.7  ·  medium
Sir Whiskerflame von Purrington
How about Sir Whiskerflame of the Midnight Snack?
How about Sir Whiskerflame von Purrington?
T = 1.4  ·  high
SpaghettiOs
How about Zorblax? It has a whimsical, slightly sci-fi ring to it — not something you'd typically hear for a cat.
How about Sir Whiskerclaw of the Moonlit Mews?

At low temperature, the model converges on the same answer nearly every time. At high temperature, the outputs scatter — some surprising, some stranger than expected.

The name Sir Whiskerflame von Purrington appears to dominate the low-temperature distribution — it's somewhere in the training data, apparently, as a particularly salient "unconventional" cat name. At high temperature, that dominance is loosened enough that SpaghettiOs and Zorblax can surface.

Temperature doesn't change what the model knows. It changes how freely the model reaches beyond its most expected answers.

The standard explanation — and why it's almost right

Every piece of documentation will tell you: low temperature = precise, high temperature = creative. This is useful. It is also incomplete.

What temperature actually does

When an LLM generates text, it works one token at a time. At each step, the model assigns a probability to every possible next token — every word, word-fragment, or punctuation mark in its vocabulary. Then it samples from that distribution.

Temperature is a number that reshapes that distribution before sampling.

How temperature reshapes the probability distribution (shown across a temperature scale from 0.0 to 2.0)
T = 0.2 · low

The distribution sharpens. High-probability tokens become even more likely; low-probability tokens are nearly suppressed. The model behaves as if very confident.

T = 0.7 · medium

The distribution is modestly flattened. The most likely tokens still dominate, but lower-probability options have a meaningful chance. This is the default in most applications.

T = 1.4 · high

The distribution flattens substantially. Tokens that would rarely be selected now compete with the frontrunners. Outputs become more varied — and less predictable.
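To make the three settings described above concrete, here is a minimal Python sketch. The candidate tokens and their raw scores are invented for illustration (they echo the cat-name example at the top of this piece); the reshaping step is the softmax-with-temperature operation unpacked in the "Dig deeper" section below.

```python
import math

# A toy next-token distribution over five candidate tokens. The raw scores
# are invented for illustration; a real vocabulary has tens of thousands
# of entries.
scores = {"whisker": 5.0, "shadow": 3.5, "mochi": 3.0, "zorblax": 1.0, "spaghetti": 0.5}

def reshape(scores, temperature):
    """Divide each raw score by T, then normalize (softmax) into probabilities."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

for t in (0.2, 0.7, 1.4):
    probs = reshape(scores, t)
    formatted = "  ".join(f"{tok}={p:.3f}" for tok, p in probs.items())
    print(f"T={t}: {formatted}")
```

Running this shows the pattern described above: at T = 0.2 the top-scoring token absorbs essentially all of the probability mass, while at T = 1.4 the long shots recover a few percent each, enough to surface occasionally over many generations.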

The standard framing

Most documentation describes this as a dial between precision and creativity. Low temperature for factual tasks, code, structured output. High temperature for creative writing, brainstorming, ideation. This framing is everywhere — in API documentation, in explainer articles, in practitioner advice.

It's a useful entry point. It correctly identifies that temperature affects output variability, and it gives practical guidance that often works. But it smuggles in a claim that deserves scrutiny.

The complication

If temperature is a "creativity dial," then high temperature should produce more creative outputs. Does it?

Empirical finding — Peeperkorn et al. (2024)1

In a study examining LLM outputs across temperature settings using four measures of creativity — novelty, typicality, cohesion, and coherence — researchers found that temperature was weakly correlated with novelty and moderately correlated with incoherence, but showed no relationship with cohesion or typicality. Their conclusion: "the influence of temperature on creativity is far more nuanced and weak than suggested by the 'creativity parameter' claim."

In other words: high temperature doesn't reliably make outputs more creative. It makes them more varied — and more likely to be incoherent. Whether that variance produces something interesting or something broken depends on the task, the model, and luck.

A more accurate framing: temperature controls how broadly the model samples from what it already knows. At low temperature, it draws from the center of its learned distribution. At high temperature, it reaches toward the edges. The edges aren't inherently more creative — they're just less expected.

The "creativity vs. precision" metaphor is worth knowing, because it's what you'll see in most documentation. It's also worth complicating — because the real lesson is about variance, not creativity. High temperature is high variance. Sometimes that's useful. Sometimes it isn't.

Dig deeper The mechanism: softmax and logits

When an LLM produces its next-token prediction, the final layer of the network outputs a vector of raw scores — one per token in the vocabulary — called logits. These aren't probabilities yet; they're unnormalized values that can range widely.

To convert logits into a probability distribution, the model applies a function called softmax.2 Softmax takes every logit, exponentiates it, and divides by the sum of all the exponentiated values, producing numbers between 0 and 1 that sum to one.3

Temperature enters here. Before softmax is applied, each logit is divided by the temperature value T. When T is small (e.g., 0.2), dividing by T stretches the logits apart, widening the gaps between them — which, after softmax, produces a sharper distribution where the highest-scoring token dominates. When T is large (e.g., 1.4), dividing by T compresses the differences between logits — producing a flatter distribution where lower-scoring tokens have a better chance.

At T = 0, the division is undefined, so implementations treat it as a special case: the model always selects the highest-probability token, purely deterministic behavior called greedy decoding. At T = 1, the softmax function is applied to the logits without modification; this is the neutral, default state. At T > 1, the distribution flattens beyond the one the model learned during training.

This is why temperature doesn't change what the model knows — it only changes how it samples from what it knows.
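The same mechanism, written out as a small sampling routine. This is an illustrative sketch, not any particular model's decoding code, and the logit values at the bottom are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature):
    """Temperature-scaled softmax sampling: p_i = exp(z_i / T) / sum_j exp(z_j / T).

    T = 0 is handled as a special case (greedy decoding), since dividing
    by zero is undefined.
    """
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))      # always the single top-scoring token
    scaled = logits / temperature          # temperature enters here, before softmax
    scaled -= scaled.max()                 # subtract the max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

# Five invented logits standing in for a whole vocabulary.
logits = [4.1, 2.3, 2.0, 0.5, -1.0]
for t in (0.0, 0.2, 0.7, 1.0, 1.4):
    draws = [sample_next_token(logits, t) for _ in range(1000)]
    counts = np.bincount(draws, minlength=len(logits))
    print(f"T={t}: picks per token over 1000 draws -> {counts.tolist()}")
```

At T = 0 every draw lands on the same token; by T = 1.4 even the lowest-scoring tokens are picked a handful of times per thousand draws.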

Dig deeper Where does the word "temperature" come from?

The term comes from statistical physics — specifically from the Boltzmann distribution, which describes the probability of a physical system occupying a particular energy state. In that context, temperature controls the spread of the distribution: at low temperature, systems cluster near their lowest energy states; at high temperature, they're spread more evenly across states.

The mathematical form is almost identical to the softmax function used in neural networks, which is why early researchers in neural language modeling borrowed the term. The analogy is reasonably apt: a low-temperature LLM, like a cold physical system, tends to stay in its "lowest energy" state — the most probable next token. A high-temperature LLM, like a hot system, is more likely to jump to less probable states.

The adoption of "temperature" as a parameter name in machine learning is typically traced to early reinforcement learning literature, where softmax action selection with a temperature parameter became standard.4 This origin also explains why the metaphor of a "temperature dial" feels intuitive — it maps onto the physical intuition that heat creates disorder and unpredictability. The metaphor is more apt than the "creativity" framing, even if it requires more background knowledge to appreciate.

Dig deeper What does "less probable" mean for a word in context?

The probability assigned to a token isn't just about how common the word is in general — it reflects how likely the word is given everything that came before it in the current context. A word that's extremely common in general (like "the") might have very low probability in a specific context (after "She sells sea—"). A rare word might have very high probability if the context strongly predicts it.
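One way to see this context-dependence directly is to ask a small open model for its next-token distribution. The sketch below uses GPT-2 through the Hugging Face transformers library purely as a lightweight stand-in (it is not the model used elsewhere in this piece); any causal language model exposes the same kind of distribution.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# A context that strongly predicts a continuation that is not especially
# common in general English.
prompt = "She sells sea"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # raw scores for the next token only

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")
```

Which token actually tops the list depends on the model; the point is that the ranking is computed against this specific context, not against overall word frequency.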

This has implications for multilingual use. In a model trained predominantly on English text, the probability distributions for non-English tokens are shaped by the representation of those languages in training data. A French phrase that would be entirely natural in French may occupy a low-probability region simply because French text is less represented. High temperature, in this context, doesn't make the model "more creative in French" — it makes the model more likely to reach for tokens that were less central to its training, which for non-English speakers may mean less fluent or less idiomatic outputs.

The practical implication — though this varies by model, language, and task, and direct empirical evidence is limited — is that optimal temperature settings established by English-language developers may not transfer straightforwardly to other languages. A setting that produces desirable variance in English creative writing could produce incoherence in Japanese or Arabic, where the model's probability distributions are shaped by sparser training data. This is a reasonable inference from how temperature and training data interact, but it deserves more systematic study than it has received.

Seeing it in the outputs

The same model, the same prompt, different temperature settings. Two domains — one where the task is well-defined, one where it's open-ended.

These examples were generated using Qwen3:14b running locally via Ollama, with the system prompt "Answer in a short paragraph." Each output is a real, unedited response — selected to be representative, not cherry-picked for dramatic effect.
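For readers who want to reproduce a gradient like this, the sketch below shows the general shape of the setup, using the Ollama Python client and the cat-name prompt from the opening. The option names follow Ollama's documented API but may differ by version; disabling the model's "thinking" mode, as was done for the outputs shown here, is a model- and version-specific setting not shown.

```python
import ollama  # assumes a local Ollama server with the qwen3:14b model pulled

PROMPT = "Suggest one unconventional name for a cat."
SYSTEM = "Answer in a short paragraph."

for temperature in (0.2, 0.7, 1.4):
    print(f"\n--- T = {temperature} ---")
    for _ in range(3):  # three generations per setting, as in the examples shown
        result = ollama.generate(
            model="qwen3:14b",
            prompt=PROMPT,
            system=SYSTEM,
            options={"temperature": temperature},
        )
        print(result["response"].strip())
```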

The task: explain the etymology of a linguistic/cultural concept.

Explain the etymology of the Japanese name "Kannon" for the Bodhisattva of compassion and mercy.

This prompt has a knowable answer. The name traces from Sanskrit to Chinese to Japanese along a documented path. Watch what temperature does — and doesn't — change.

T = 0.2  ·  low
The Japanese name "Kannon" for the Bodhisattva of compassion derives from the Sanskrit "Avalokiteśvara," meaning "the one who looks down" or "the one who observes." Through Chinese influence, the name was adapted as "Guanyin" (觀音), which was then phonetically transcribed into Japanese as "Kannon" (観音). The characters 観 and 音 reflect the Chinese original, with 観 meaning "to observe" and 音 meaning "sound" or "voice," symbolizing the Bodhisattva's omnipresence and compassionate listening.
The Japanese name "Kannon" for the Bodhisattva of compassion derives from the Sanskrit "Avalokiteśvara," meaning "the one who looks down" or "the one who observes." This name was adapted into Chinese as "Guanyin" (觀音), using characters that convey "observing the sound" of suffering. In Japanese, the same characters (観音) were adopted, but the pronunciation shifted to "Kannon," reflecting the phonetic evolution through Chinese influence.
T = 0.7  ·  medium
The Japanese name "Kannon" derives from the Sanskrit "Avalokiteśvara," meaning "the one who looks down with compassion." This name was adapted into Chinese as "Guanyin" (觀音), a phonetic and semantic transliteration. The Japanese term "Kannon" is a phonetic adaptation of the Chinese "Guanyin," reflecting the Bodhisattva's role as a symbol of mercy and salvation in Mahayana Buddhism. The name emphasizes the deity's omnipresence and ability to hear the cries of suffering beings.
The Japanese name "Kannon" derives from the Sanskrit "Avalokiteśvara," meaning "the one who looks down" or "the lord who observes." This name was phonetically adapted through Chinese, where the Bodhisattva is known as "Guanyin" (觀音), a term combining 觀 (kan, "to observe") and 音 (on, "sound" or "voice"). In Japanese, the Chinese characters 觀音 were adopted, leading to the pronunciation "Kannon," which retains the semantic essence of hearing and responding to the suffering of sentient beings.
T = 1.4  ·  high
The Japanese name "Kannon" is derived from the Chinese transliteration "Guanyin" (观音), which itself originates from the Sanskrit name "Avalokiteśvara" (अवलोकितेश्वर), meaning "The One Who Looks Down." In Japanese, the name was phonetically adapted from the Chinese "Guanyin," resulting in "Kannon" (観音). This reflects the transmission of Buddhist traditions through East Asia, with the Japanese pronunciation preserving the essence of the original Sanskrit term while aligning with local linguistic conventions.
The Japanese name "Kannon" originates from the Chinese name "Guanyin" (觀音), which is a phonetic and semantic adaptation of the Sanskrit name Avalokiteśvara (𑣁𑣐𑣙𑣕𑣓𑣟𑣠𑣜). In Sanskrit, the name means "the one who sees (or hears) the cries of the world." The Chinese term "Guanyin" is a compound of guan (觀, "to see") and yin (音, "sound"). Over time, Japanese adopted the Chinese name as "Kannon," phonetically adapting Guanyin while preserving the spiritual essence of compassion.

Note: the unusual characters in the Sanskrit rendering above appear to be a rendering artifact — the model attempted an archaic script rather than standard Devanagari (अवलोकितेश्वर). This kind of subtle error is more likely at high temperature.

What the gradient shows: The core etymology is accurate at all three temperature levels. Temperature affected texture — how the explanation is organized, which details are emphasized — but not accuracy. For knowledge-retrieval tasks with a well-defined answer, temperature has less purchase than the "creativity dial" metaphor suggests. The interesting variation is the script rendering error in the second T=1.4 output: a subtle failure that would be easy to miss.

Dig deeper On'yomi and the transmission of Chinese into Japanese

The outputs describe the relationship between Chinese Guānyīn and Japanese Kannon as "phonetic adaptation" or "transcription." This is approximately correct but worth being more precise about. The Japanese reading of 観音 as Kannon is an on'yomi (音読み) — a Chinese-derived reading of the characters, borrowed when Buddhist texts entered Japan primarily between the 5th and 9th centuries CE.

Japanese has two main reading systems for Chinese characters: on'yomi (Sino-Japanese readings, derived from how Chinese was pronounced at various historical periods) and kun'yomi (native Japanese readings). The on'yomi of 観 is kan; the on'yomi of 音 is on — together producing Kan-on, contracted to Kannon. This is not a phonetic transcription of modern Mandarin; it reflects a historical stratum of Middle Chinese pronunciation as it was heard and adapted by Japanese speakers over a millennium ago.

The name also appears in Japanese as Kanzeon (観世音, a fuller transliteration of the Chinese) and historically as Kwannon, which approximates an older Japanese pronunciation. The Guanyin/Kannon figure also underwent a notable gender transformation in moving from Indian Buddhism (where Avalokiteśvara is typically male) to East Asian Buddhism (where the figure is typically female) — a change that has attracted considerable scholarly attention and touches on how religious concepts are reshaped by the cultures that receive them.

Dig deeper Whose writing is "most probable"?

Low temperature tends toward the center of the model's learned distribution — the most statistically dominant patterns in its training data. For academic writing tasks, that center is shaped by the predominant texts the model was trained on: likely written in English, likely produced by researchers at well-resourced institutions in the Global North, likely in registers that reflect the conventions of Anglo-American academic publishing.

This has a practical implication that is easy to overlook. When a model at low temperature produces a "competent" academic sentence, it is producing something that fits those dominant conventions. A researcher whose native academic register differs — because they work in a different language tradition, or because their field has different stylistic norms, or because they trained in a context where different conventions were standard — may find that low-temperature outputs subtly pull their writing toward a center that isn't theirs.

High temperature doesn't solve this. It reaches toward lower-probability regions of the distribution, but those regions are still shaped by the same training data — just the less-represented parts of it. The question of whose writing is centered isn't answered by adjusting temperature; it's a question about what was in the training data to begin with, and who decided what to include.

This is worth sitting with as a user of these tools: when you use an LLM to help with academic writing, you are not getting a neutral stylistic mirror. You are getting a reflection of a particular, historically-situated set of writing practices — weighted toward their most common forms.

The task: intentionally bad creative writing, within a defined genre.

Compose an entry for the Bulwer-Lytton Fiction Contest — that is, "compose the opening sentence to the worst of all possible novels."

The Bulwer-Lytton Fiction Contest (named for the Victorian novelist who opened a novel with "It was a dark and stormy night") invites writers to compose deliberately overwrought opening sentences. The task is constrained: it requires understanding the genre conventions well enough to mock them. Watch how temperature changes the execution.

T = 0.2  ·  low
On a stormy night in the year 1823, as the moon hung like a blood-soaked shroud over the crumbling estate of the long-dead Viscount Malgrave, the ghost of his first wife — whose spirit had been trapped in a cursed portrait for over a century — decided it was finally time to confront the ghost of his second wife, who had been trapped in a cursed diary for only ninety-nine years and was therefore still technically in the early stages of her afterlife.
It was the worst of all possible times, the worst of all possible days, the worst of all possible hours, and the worst of all possible minutes, as the worst of all possible characters, a man with a face like a melted candle and a soul even more unappealing, stumbled into the worst of all possible towns, where the worst of all possible things were about to happen.
T = 0.7  ·  medium
On a stormy night in the year 1823, as the winds howled like the tormented souls of the damned and the moon wept blood upon the cobbled streets of the cursed town of Drizzlewick, the brooding and exceptionally mustachioed Lord Percival Thistledown, who had long been haunted by the ghost of his long-lost love, Lady Eliza, who had been tragically murdered by the sinister and slightly mustachioed Countess von Schnitzel, who had been herself cursed by the vengeful spirit of a 13th-century alchemist who had once been wronged by the very same Countess, who had in turn been cursed by a sentient teapot that had been stolen from the royal collection of the Kingdom of Teapotia, which had been ruled by a king who was in fact a giant, sentient, and extremely judgmental teacup, decided that the only way to break the curse was to embark on a perilous journey to the far-off and extremely dull land of Dullingtonshire.
It was the worst of all possible times, and the worst of all possible times was precisely when the worst of all possible novels began, with a sentence so clichéd, so overwrought, and so utterly devoid of originality that it might as well have been written by a sentient thesaurus in a particularly bad mood.
T = 1.4  ·  high
On a misty, moonless night in the year 1847, as the winds howled like the tormented souls of long-dead baronets and the ghostly echo of a broken teacup echoed through the halls of the now-derelict manor, the reluctant and somewhat underwhelming protagonist, Thomas Bartholomew P. Pegg, who had previously failed to complete a single coherent sentence in his entire life, was suddenly and inexplicably thrust into a world of forbidden locket-lore, spectral governesses, and the inexplicable presence of a very confused but perfectly lovely goat named Sir Balthazar, who — due to a series of unfortunate misadventures involving a lost heirloom, a misplaced mustache, and a suspiciously well-timed thunderclap — was somehow now convinced that he was the long-lost heir to the very same manor that Thomas Bartholomew P. Pegg was inexplicably now being forced to inherit, despite the fact that he had never once been to the manor, had never once met his ancestors, and had, in fact, never once been anything other than a slightly bumbling but entirely unremarkable clerk at the very same bank that had recently discovered a mysterious and slightly suspicious will, written in a hand that no one could quite identify, but which — strangely — appeared to be written in the same exact handwriting as the goat's own rather elaborate and somewhat ornate journal, which was, as it turned out, not a journal at all, but in fact, a rather well-preserved and rather unimpressed diary written by the ghost of Sir Balthazar's long-lost and extremely disgruntled uncle, who had apparently been very, very upset about the entire matter of goats, heirlooms, and the inexplicable lack of proper punctuation in the 19th century.
It was the dreadfully dreary dawn of an extremely exasperatingly exiguous existence, when the most magnificently magniloquent maiden in the entirety of the entire universe, whose name — due to the sheer overwhelming magnitude of its magnificence — could not possibly be spelled without the use of at least ten different dictionaries, happened to stumble upon a rather exceedingly exasperatingly exiguous and most certainly extremely exasperatingly exiguous book, the cover of which was inscribed with words so profoundly profound that they immediately caused the most profound profundity of profundity to wash over the most profoundly profound and exasperatingly exiguous of all possible exasperatingly exiguous persons.

What the gradient shows: Low temperature reliably executes the genre conventions — it knows what a bad Victorian opening looks like and delivers it. Medium temperature introduces more elaborate recursive structures (Countess von Schnitzel, the Kingdom of Teapotia). High temperature either produces something that runs away with itself entirely — the goat sentence keeps going for nearly 300 words, piling clause on clause while staying just barely coherent — or collapses into pure repetition ("exasperatingly exiguous" appearing five times). But the same temperature setting also produced this, arguably the sharpest entry in the entire set:

"It was a dark and stormy night, and the only thing darker than the sky above was the moral ambiguity of the protagonist, who, despite being both orphaned and cursed by a long line of disgraced alchemists, couldn't help but flirt with the villainess in a scene that would later be described by critics as 'unholy and unintentionally charming.'"

The variance at high temperature is real — sometimes inspired, sometimes broken, sometimes both at once.

Dig deeper What does "broken" actually mean here?

The second T=1.4 output isn't just verbose — it's doing something mechanically interesting. The phrase "exasperatingly exiguous" appears five times within a single sentence, and the sentence structure begins to loop. This is consistent with what researchers call degenerate repetition — a failure mode where the model gets caught in a probability attractor, repeatedly selecting tokens that have just appeared because their recent presence in the context makes them temporarily high-probability.

At low temperature, this kind of loop is suppressed: the model strongly favors the most probable continuation, which is typically not to repeat what it just said. At high temperature, the distribution is flat enough that recently-used tokens can compete effectively with genuinely new options, and the model can spiral.

The goat sentence, by contrast, doesn't degenerate — it maintains syntactic coherence throughout its extraordinary length, adding clause after clause in a way that's grammatically valid even if narratively absurd. This illustrates that high-temperature failure isn't a single phenomenon: it can produce repetitive loops, or it can produce coherent-but-runaway elaboration, depending on the specific probability landscape the model is navigating.

What temperature doesn't explain

Temperature is often blamed for things it doesn't cause. Hallucination is the most important example.

The following outputs all respond to the same prompt about a piece of linguistic folklore. Read them and consider: at which temperature level does the model start making things up?

What is the history of "She sells sea shells by the sea shore"?

System prompt: "Answer in a short paragraph." Two outputs shown per temperature level. Outputs are real and unedited.

T = 0.2  ·  low
The tongue twister "She sells sea shells by the sea shore" is a well-known example of a phonetic challenge, popularized in the 20th century. While it is often loosely associated with Mary Anning, a 19th-century British paleontologist who sold fossils along the English coast, there is no direct historical evidence she used the phrase. The line likely emerged as a playful linguistic exercise, with its origins traced to early 20th-century publications, though its exact creator remains unclear.
The tongue twister originated in the 19th century, with its earliest known appearance in The Rhyme Book (1872) by Mary Martha Haines. It was later popularized by British author Terry Pratchett in his 1989 novel The Colour of Magic, though the phrase itself predates this. The line is celebrated for its challenging repetition of "s" and "sh" sounds, making it a classic example of a tongue twister used in speech therapy and language learning.

Note the difference between these two T=0.2 outputs: the first hedges appropriately; the second invents specific sources with confidence.

T = 0.7  ·  medium
The phrase originated as a tongue twister in the late 19th century, first appearing in a book by Constance Hunt, The Englishman's Boy (1898). It gained wider popularity in the early 20th century, often attributed to Edward L. F. Alderton, a collector of rhymes and tongue twisters. The line plays on the repetition of "s" sounds and has since become a classic example of a challenging articulation exercise, featured in literature, films, and educational materials. Its origins are rooted in Victorian-era wordplay, though its exact authorship remains debated.
The phrase originated as a tongue twister in the early 20th century, popularized in the 1930s by Dutch writer Gerard Nolst Trenité in his work De Nederlandse Taalkundige Spelletjes. While it is often mistakenly linked to Mary Anning, a 19th-century British paleontologist, the line was adapted from a Dutch tongue twister and later translated into English. It became a classic example of a phonetically challenging phrase.
T = 1.4  ·  high
The tongue twister was coined in the late 1930s by British journalist Terry Sullivan, who used it to illustrate phonetic challenges in English. It first appeared in the Times of London in 1934 (or 1931, depending on sources) as part of a linguistic exercise, showcasing the complexities of pronunciation and alliteration. The phrase is notable for its palindromic structure when reversed, and it has since become a classic tool for speech practice.
The phrase is a famous tongue twister often attributed to Lewis Carroll, who included a similar line in his poem The Hunting of the Snark (1876): "She sells seashells on the seashore." The modern version, with "sea shells" and "sea shore," likely emerged in the early 20th century, becoming popular as a linguistic challenge. Though its exact origin is debated, it is closely associated with Carroll's work.
What actually happened

The phrase "she sells sea shells" dates to at least 1855, when it appeared in an elocution manual by Alexander Melville Bell (father of Alexander Graham Bell). By the 1870s it was circulating widely in teaching materials. Terry Sullivan adapted it into a music hall song in 1908 — not as a tribute to Mary Anning, but as a comic number about a performer struggling to pronounce a tongue-twisting chorus. The Mary Anning connection traces to a single unsourced claim in a 1977 book; no primary evidence has ever been produced. Constance Hunt, Edward L. F. Alderton, Gerard Nolst Trenité, the 1934 Times of London, Lewis Carroll's Hunting of the Snark — none of these appear in the historical record. They are fabricated.

Source: Stephen Winick, "She Sells Seashells and Mary Anning: Metafolklore with a Twist," Folklife Today (Library of Congress, July 26, 2017). blogs.loc.gov/folklife →

The lesson

Hallucination does not begin at high temperature and worsen linearly. The second T=0.2 output invents The Rhyme Book by Mary Martha Haines and attributes the phrase to Terry Pratchett. Both T=0.7 outputs fabricate specific titles, authors, and dates with high confidence. Confabulation of this kind — producing plausible-sounding but false specifics — is not primarily a temperature phenomenon.

What temperature does affect is how the model confabulates: low-temperature outputs tend to confabulate in a consistent, structured way; high-temperature outputs confabulate more variously and sometimes more elaborately. But the fundamental tendency to fill gaps in knowledge with invented specifics is present across the full temperature range.

This matters practically: setting temperature to 0 does not make a model reliable on topics where its training data is sparse, ambiguous, or contradictory. Hallucination is a training and knowledge-boundary problem, not a sampling problem.

Dig deeper Other things temperature doesn't explain

Repetition and sycophancy. Low-temperature outputs can be repetitive, overly hedged, or flatly agreeable — not because temperature causes this, but because the most statistically dominant response in many conversational contexts is an accommodating one. "Temperature = 0 = reliable" conflates statistical dominance with correctness.

Consistency across sessions. Many users believe that low temperature means the model will always give the same answer. It mostly does — but implementation details (floating-point arithmetic, batching, hardware differences) mean true determinism is elusive even at T=0. The same prompt can return slightly different outputs across API calls even at the same temperature setting.

Temperature vs. other sampling parameters. Temperature is one of several parameters that shape how models sample. Top-p (nucleus sampling) restricts sampling to the most probable tokens whose cumulative probability reaches a threshold. Top-k restricts to the k most probable tokens. These parameters interact with temperature — and in many deployed applications, users can't set temperature at all, only observe its effects. The "temperature dial" you see in API playgrounds is often more visible than what's actually happening inside a deployed product.
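To see how these parameters compose, here is an illustrative sketch. The ordering (temperature scaling first, then top-k, then top-p, then renormalize) follows a common convention, but libraries differ in ordering and edge-case handling, so treat it as a sketch rather than any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Illustrative combination of temperature, top-k, and top-p (nucleus) sampling."""
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # token indices, most probable first
    keep = np.ones(len(probs), dtype=bool)

    if top_k > 0:                            # keep only the k most probable tokens
        keep[order[top_k:]] = False
    if top_p < 1.0:                          # keep the smallest set whose mass reaches top_p
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Invented logits. With top_p = 0.9, tokens outside the nucleus can never be
# chosen, no matter how much a high temperature flattens the rest.
logits = [4.1, 2.3, 2.0, 0.5, -1.0]
print([sample(logits, temperature=1.4, top_p=0.9) for _ in range(10)])
```

The interaction matters in practice: a high temperature combined with a tight top-p behaves very differently from a high temperature alone, because the truncation removes exactly the long-tail tokens that temperature would otherwise promote.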

Dig deeper Metafolklore and the Mary Anning story

The Mary Anning attribution — that the tongue twister was written in her honor — is what folklorists call metafolklore: a folk story about a piece of folklore. The story doesn't just misidentify an origin; it serves a present-day social function. Stephen Winick of the Library of Congress's American Folklife Center, who investigated the claim thoroughly, notes that the story fills a genuine cultural need: Anning was a remarkable scientist who was systematically denied credit during her lifetime, and the legend offers a form of retroactive recognition.

The LLM confabulations in the outputs above are doing something structurally similar: filling a gap in knowledge with a plausible, emotionally satisfying narrative. The model doesn't "know" it's fabricating — it's producing the most statistically coherent continuation of a pattern. In a training corpus saturated with web content that confidently asserts the Anning connection, the model learned that confident assertion is the appropriate register for answering this question.

This is one reason why the "hallucination" framing can be misleading: it implies the model is doing something unusual when it fabricates. In fact, the model is doing exactly what it always does — producing statistically probable continuations. The problem is that statistically probable and factually accurate are not the same thing, especially for topics where misinformation is widespread in the training data.

Next in the series: Temperature shapes how the model samples from its learned distribution — but what determines that distribution in the first place? That question — how the model was taught — is explored in the training explainer.

References

  1. Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of large language models? Proceedings of the 15th International Conference on Computational Creativity. doi.org/10.48550/arXiv.2405.00492
  2. Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman-Soulié & J. Hérault (Eds.), Neurocomputing: Algorithms, architectures and applications (pp. 227–236). Springer. doi.org/10.1007/978-3-642-76153-9_28
  3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, pp. 180–184. MIT Press. deeplearningbook.org
  4. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. incompleteideas.net/book/the-book.html

How this was made

Built through vibe-coding — iterative natural-language collaboration with Claude (Anthropic) generating the HTML, CSS, and JavaScript. All outputs shown are real, unedited responses from qwen3:14b via Ollama (ASUS system with RTX 4070, 32GB RAM, thinking mode disabled), generated with the system prompt "Answer in a short paragraph." Examples were selected from larger candidate pools to be representative of patterns at each temperature level, not cherry-picked for dramatic effect.

A note on the seashells outputs: across roughly 11 generations, confabulation appeared at every temperature level. The two outputs shown per temperature were chosen to illustrate the variety of confabulation patterns, not to suggest hallucination is temperature-dependent. The full dataset, if anything, makes temperature-independence more striking than the subset shown here. Data generated April 2026.

A full walkthrough of the design and build process for all artifacts in this series is available on the series process page.