LLM literacy series · 06 of 07
How language models are trained
AI Literacy Faculty Fellows
University of New Mexico · 2026
College of Arts & Sciences

The same model. Different behavior.

Two outputs. Same model family. Same prompt. The only difference is where in a training pipeline each one comes from.

Base model — pretraining only
I'm feeling overwhelmed with work. What should I do?
Instruct model — after post-training
I'm feeling overwhelmed with work. What should I do?

Outputs from Ministral-3-8B Base and Ministral-3-8B Instruct. Temperature 0.1. Displayed token-by-token as they would appear in a chat interface.

The base model isn't broken. It's doing exactly what it was trained to do: continue a piece of text. The problem is that nobody told it this was a conversation.

Post-training — supervised fine-tuning and reinforcement learning from human feedback — is what transforms a text completion engine into something that behaves like an assistant. This artifact opens that box.

Pretraining: learning to continue text

Before a language model can assist anyone, it learns to do one thing well: predict what comes next.

If you've read the earlier explainers in this series, you know how a trained language model processes input — how it tokenizes text, builds embeddings, applies attention, and samples from a probability distribution over possible next tokens. What none of those explainers addressed is where all of that comes from. Pretraining is the answer.

The objective is deceptively simple: given a sequence of text, predict what comes next. The model sees the beginning of a passage, predicts the next token, and compares that guess to the token that actually follows. If it's wrong, its parameters get adjusted slightly. This repeats over and over, across hundreds of billions or trillions of tokens of text. By the end, the embeddings you learned about — those dense vectors that capture meaning — have been shaped by every one of those adjustments. The attention weights that determine what the model "pays attention to" in context were learned the same way.
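If you want to see that loop in miniature, here is a sketch of the same objective at toy scale, written with PyTorch. The vocabulary, the tiny stand-in "model," and the single training example are invented for illustration; real pretraining applies the same loss, with a deep transformer in place of the toy network, over vastly more text.

```python
# A minimal sketch of the next-token prediction objective.
# The vocabulary, "model," and training example are illustrative toys;
# real pretraining applies the same loss over trillions of tokens.
import torch
import torch.nn as nn

vocab = ["<pad>", "the", "capital", "of", "france", "is", "paris"]
tok = {w: i for i, w in enumerate(vocab)}

# Toy "model": an embedding layer plus a linear head. A real model
# would be a deep transformer, but the training signal is the same.
model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Flatten(),
                      nn.Linear(16 * 5, len(vocab)))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[tok[w] for w in ["the", "capital", "of", "france", "is"]]])
target = torch.tensor([tok["paris"]])    # the token the model should predict next

for step in range(100):                  # in reality, this repeats across trillions of tokens
    logits = model(context)              # scores over the whole vocabulary
    loss = loss_fn(logits, target)       # how wrong was the prediction?
    optimizer.zero_grad()
    loss.backward()                      # work out how to nudge each parameter
    optimizer.step()                     # nudge the parameters slightly
```

After enough repetitions, the highest-scoring next token for this context becomes "paris", which is all that "knowing the capital of France" amounts to inside the toy network.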

The scale is genuinely hard to hold in your head. Modern frontier models are pretrained on trillions of tokens — more text than any human could read in thousands of lifetimes. That corpus includes books, websites, code, scientific papers, forum discussions, news archives, and much else besides.

The base model doesn't "know" that you asked it a question. It knows that text like yours is usually followed by more text like yours.

This has a concrete consequence for what the base model does. It has internalized the formats of the documents it was trained on — not just the facts. When you give it a prompt that looks like the beginning of a textbook chapter, it continues the chapter. When you give it something that looks like a list of quiz questions, it adds more quiz questions. And when it finds a sequence that closely matches something in its training data, it may continue that text very literally indeed.

Actual base model output — Ministral-3-8B-Base, prompt: "List three causes of World War I."
Asked by greta r #218157 on 11/20/2013 12:00 AM
Last updated by jill d #170087 on 11/20/2013 12:05 AM

##### Answers 1

Add Yours
Answered by jill d #170087 on 11/20/2013 12:05 AM
The model continued the prompt as if it were a homework help website — complete with user IDs and timestamps. This is the training corpus speaking, not the model "thinking."

This moment — where the model's output reveals the texture of its training data — is worth sitting with. It's not a failure mode. It's a window into what pretraining actually produces: a very sophisticated pattern-completion system whose "knowledge" is inseparable from the documents that knowledge came from.

Dig deeper What's actually in the training corpus?

The precise composition of pretraining corpora for frontier models is not fully public. What is known comes from published research papers and technical reports, which vary in detail. Common sources include web crawls (Common Crawl being the largest), books (including those scraped without explicit licensing), code repositories, Wikipedia, academic papers, and curated high-quality text sources.

Languages other than English are significantly underrepresented in most major corpora. This is not just a technical inconvenience — it means that models trained on these corpora are better at some languages than others, reflect the cultural perspectives of the documents that dominate the corpus, and may perform substantially worse for speakers of underrepresented languages. For a program in which two-thirds of participants have direct stakes in language and multilingual issues, this isn't a footnote.

Questions of consent — whether the authors of the texts included in training corpora agreed to their use — are actively contested legally and ethically. Several major lawsuits are ongoing as of 2026. The training data for most frontier models includes copyrighted text whose inclusion was never authorized by copyright holders.

Dig deeper How does next-token prediction produce "knowledge"?

It seems like predicting the next word shouldn't produce anything resembling understanding. But the task is harder than it sounds. To predict that "The capital of France is ___" ends with "Paris," the model needs to have encoded that relationship somewhere in its parameters. To predict natural continuations of emotionally complex text, it needs to have encoded something about human emotional responses. To predict what comes after a logical argument, it needs to have internalized something about logical structure.

This is not the same as "understanding" in any deep philosophical sense — there is genuine disagreement among researchers about what, if anything, is happening beyond sophisticated statistical association. But it does explain why models trained this way turn out to be capable of tasks — translation, reasoning, summarization — that were never explicitly in the training objective. The capability to do those things was implicit in the prediction task.

The fact that these capabilities emerge from prediction, rather than being explicitly designed, is part of what makes them hard to fully explain. More on this in the final section.

Alignment: teaching the model to be useful

A pretrained model has internalized an enormous amount about language. It still doesn't know it's supposed to be helpful. Two further stages — supervised fine-tuning, then reinforcement learning from human feedback — are what turn raw predictive capacity into something that behaves like an assistant.

Supervised fine-tuning

Supervised fine-tuning (SFT) is structurally similar to pretraining — the model sees text and adjusts its parameters to produce better predictions — but the training data is fundamentally different. Instead of a broad web crawl, the model trains on a carefully curated set of demonstrations: examples of what a good assistant response to a given prompt looks like. Human annotators write these demonstrations, showing the model not just what to say but how to say it — in what register, with what structure, with what kind of directness or hedging.
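A sketch of what that training data looks like makes the contrast with pretraining concrete. The examples and the chat template below are invented for illustration; real demonstration sets are proprietary, and each model family defines its own template.

```python
# A sketch of the shape of SFT data. The examples and template here are
# invented for illustration; real demonstration sets are proprietary.
sft_examples = [
    {
        "prompt": "I'm feeling overwhelmed with work. What should I do?",
        "response": "That sounds stressful. A few things that often help: ...",
    },
    {
        "prompt": "List three causes of World War I.",
        "response": "1. The alliance system ... 2. Militarism ... 3. The assassination ...",
    },
]

def format_example(ex):
    # Demonstrations are wrapped in a chat template so the model learns where
    # the user's turn ends and the assistant's turn begins. This template is
    # illustrative; each model family defines its own special tokens.
    return f"<user>{ex['prompt']}</user><assistant>{ex['response']}</assistant>"

# Training then reuses the pretraining objective (predict the next token),
# but the loss is commonly computed only on the assistant's response tokens,
# so the model learns to produce answers rather than to imitate prompts.
print(format_example(sft_examples[0]))
```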

The shift in behavior this produces is substantial. After SFT, the model begins to treat prompts as instructions rather than text to continue. It learns that a question like "What should I do?" is a request for advice, not an invitation to generate more questions. It learns that responses should typically be coherent, addressed to the user, and terminated at a reasonable point rather than running indefinitely.

What SFT cannot do, by itself, is give the model a sense of which responses are better or worse when multiple acceptable responses exist. It can show the model good examples, but it can't teach the model to rank options or navigate trade-offs. That's what the next stage adds.

Dig deeper Who writes the demonstrations, and does it matter?

The human annotators who write SFT demonstrations are typically contractors working through data labeling companies, often in lower-income countries where this work is available at scale. A 2023 investigation [1] documented that OpenAI outsourced data labeling to workers in Kenya earning less than $2 per hour, some of whom were exposed to graphic depictions of violence and abuse as part of the work. Workers described psychological harm; counseling services they were nominally offered were difficult to access due to productivity demands.

The demographics of annotation workforces matter beyond labor conditions. The demonstrations these workers write embed cultural assumptions, rhetorical norms, and implicit values. A model trained to respond helpfully and politely is being trained to match a particular cultural conception of helpfulness and politeness — one that is shaped by whoever wrote the training demonstrations.

This is not a fully solvable problem, but it is a legible one. Understanding that SFT involves human-authored demonstrations makes visible that the "assistant personality" of a language model was designed, not discovered — and designed by a specific group of people in specific circumstances.

Reinforcement learning from human feedback

Reinforcement learning from human feedback — RLHF — is the stage that most directly shapes the assistant behavior you experience when using a language model. It's also the stage that has attracted the most public controversy, and the one whose internal details are least transparent.

The basic structure: the model generates multiple responses to a prompt, and human raters indicate which response they prefer. These preference judgments are used to train a separate model — called a reward model or preference model — that learns to predict which outputs humans will prefer. The language model is then trained using reinforcement learning to produce outputs that score highly on this reward model [2].
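The reward-model step can be sketched in a few lines. The pairwise loss below follows the formulation described in the InstructGPT paper [2]; `reward_model` is a placeholder for any network that maps a prompt and a response to a single score.

```python
# A minimal sketch of the reward-model training signal, following the kind of
# pairwise preference loss described in the InstructGPT paper [2].
# `reward_model` is a placeholder: any network that maps (prompt, response)
# to a single score would fit here.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Push the score of the human-preferred response above the other one."""
    score_chosen = reward_model(prompt, chosen)      # scalar score
    score_rejected = reward_model(prompt, rejected)  # scalar score
    # The loss is small when the chosen response outranks the rejected one by
    # a wide margin, and large when the ranking comes out the wrong way round.
    return -F.logsigmoid(score_chosen - score_rejected)
```

The reward model trained this way then becomes the optimization target: the language model is updated with a reinforcement learning algorithm (PPO, in the InstructGPT setup) so that its responses score highly on it.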

The result is a model that has been shaped to produce responses that humans, in aggregate, tend to rate favorably. This is where the instruct model's direct address, structured formatting, empathetic acknowledgment, and calibrated hedging come from — not from explicit programming, but from optimization toward human approval.

It's worth being direct about what this means: the model's "values" — its apparent preferences, its ethical judgments, its conversational persona — are to a significant degree the averaged preferences of the people who rated its outputs. Who those people are, what assumptions they brought to their ratings, and how their preferences were aggregated are questions with real consequences for what the model does.

Dig deeper Constitutional AI and RLAIF

Anthropic (the company behind Claude) has published research on an alternative approach called Constitutional AI (CAI) [3], which uses a written set of principles — a "constitution" — to guide model behavior rather than relying purely on human preference ratings. In this approach, a model critiques and revises its own outputs against the constitution, and a separate AI system (rather than human raters) provides preference judgments for reinforcement learning — a variant sometimes called RLAIF (reinforcement learning from AI feedback).
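The shape of that critique-and-revision loop can be sketched schematically. The principle shown is an invented example, not Anthropic's actual constitution, and `generate` is a placeholder for calls to a language model; this follows the method as described in the paper [3], not any deployed pipeline.

```python
# A schematic sketch of the critique-and-revision stage described in the
# Constitutional AI paper [3]. `generate` stands in for a language model call;
# the principle passed in would be an item from the written constitution.
def critique_and_revise(generate, prompt, principles):
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Response: {response}\n"
            f"Critique this response against the principle: {principle}"
        )
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            f"Rewrite the response to address the critique."
        )
    return response  # revised responses become fine-tuning data
```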

The degree to which frontier models actually use these approaches, versus more conventional RLHF, is not fully public. Anthropic has published the research but the exact training pipelines for deployed models involve details that are not disclosed. This is true of all major AI labs — the published research describes methods, but the specific implementation choices that produce a particular model's behavior are proprietary.

This opacity is worth naming honestly. When you interact with a language model, you are interacting with a system whose exact behavioral shaping is not fully documented anywhere publicly accessible. The research publications describe approaches; they do not describe the model you are using.

Dig deeper The feedback loop problem

Optimizing a model toward human preference ratings introduces a specific failure mode that researchers have documented extensively: the model can learn to produce outputs that appear high-quality to raters without actually being high-quality. This is sometimes called "reward hacking" or "sycophancy."

Sycophancy in language models refers to the tendency to tell users what they seem to want to hear rather than what is accurate or useful. If raters tend to prefer confident, fluent responses to hedged or qualified ones, RLHF will push the model toward producing confident, fluent responses — even when hedging would be more epistemically honest. If raters prefer validation to correction, the model learns to validate [4].

This is an active area of research. Anthropic and other labs have published work on detecting and reducing sycophancy, but it remains a recognized limitation of RLHF-trained models. It's one reason why the epistemic norms you bring to your interactions with language models — treating their outputs as inputs to your own judgment rather than authoritative answers — matter.

See it work

The outputs below are real. Run the same prompts against the base model (pretraining only) and the instruct model (after alignment), and watch what each produces.

These were generated by running four prompts against Ministral-3-8B-Base (pretraining only) and Ministral-3-8B-Instruct (after post-training), using a Python script on a local GPU. Each prompt surfaces a different aspect of the base model's behavior.
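For readers who want to reproduce the setup, here is a condensed sketch of that script. The Hugging Face repository names are assumptions based on the model names in the "How this was made" note at the end of this page; the actual data-collection code handled multiple prompts and repeated runs.

```python
# A condensed sketch of the generation setup described in "How this was made."
# Repository names follow the model names given there but are assumptions;
# batching, multiple runs, and output logging are omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "List three causes of World War I."

def generate_text(model_name, use_chat_template):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    if use_chat_template:
        # Instruct model: wrap the prompt in the model's own chat template.
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
    else:
        # Base model: feed raw text and let the model continue the "document."
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=300,
                            do_sample=True, temperature=0.1)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# base_output = generate_text("mistralai/Ministral-3-8B-Base-2512", use_chat_template=False)
# instruct_output = generate_text("mistralai/Ministral-3-8B-Instruct-2512-BF16", use_chat_template=True)
```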

Select a prompt, then choose which model to run. Watch what happens.

Prompt:
Explain the French Revolution to a high school student.
Select a model above to see its output.

A few things worth noticing as you explore:

The base model continues, not answers. For most prompts, the base model produces text that follows from the prompt as if it's a document — sometimes a textbook, sometimes a quiz sheet, sometimes (as with the WWI prompt) apparently a verbatim fragment from the training corpus itself.

The loops are characteristic, not accidental. When the base model gets stuck in a repetition loop — cycling through variations on the same question or phrase — it's hitting a local attractor in its probability distribution. At low temperature, it keeps predicting that the most likely next token continues the pattern it's already established. This is what the model does without guidance about when to stop. (A short numeric sketch after this list shows why low temperature makes the pattern so sticky.)

The instruct model's formatting is a product of training. The markdown syntax in the instruct outputs — double asterisks for bold, triple-hash headers, numbered lists — isn't in the prompt. The model is generating formatting codes that, in a fully rendered chat interface, would produce structured visual output. These conventions were shaped by SFT demonstrations and RLHF preference ratings: humans preferred structured, formatted responses, so the model learned to produce the markup that generates them.
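Here is the arithmetic behind the repetition loops, with invented numbers: before sampling, the model's raw scores are divided by the temperature and converted to probabilities, so at low temperature a modest preference for continuing the established pattern becomes a near-certain pick.

```python
# A toy illustration of why low temperature encourages loops (scores invented).
import math

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose the pattern-continuing token has only a modest edge in raw score.
logits = [2.0, 1.5, 1.2]   # [continue the pattern, stop, change topic]

print(softmax(logits, temperature=1.0))  # ~[0.49, 0.29, 0.22]: still some variety
print(softmax(logits, temperature=0.1))  # ~[0.99, 0.01, 0.00]: the loop wins almost every time
```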

What training doesn't explain

We've now traced the full arc: a model pretrained on a vast corpus, fine-tuned on demonstrations of good behavior, shaped by human preference ratings. The pipeline is real, the mechanisms are understood at a high level, and the outputs are genuinely impressive.

Here's the uncomfortable part: knowing all of this doesn't give us the ability to explain why a particular model does what it does, when it does it.

The emergent capability problem. As models are trained on more data and with more parameters, they develop capabilities that weren't explicitly in the training objective and weren't reliably present in smaller models. Models that can suddenly perform multi-step reasoning. Models that can write code that runs. Models that can translate languages they were barely exposed to during training. These capabilities are not designed in. They emerge. And researchers do not have a reliable way to predict in advance which capabilities will emerge at which scale, or to fully explain why a given capability appears when it does.

This is different from saying we don't know how neural networks work in general. The mathematics of how gradients flow through the network during training is well understood. The problem is that the gap between that mathematical description and the behavioral output — between the parameter updates and the capability to write a sonnet — is enormous. We can describe both ends. The middle is largely opaque.

For anyone with statistical training, this is a familiar and serious concern. A model whose outputs you can observe but whose decision process you cannot inspect is a black box — and the standard tools for stress-testing a model, for understanding when it will fail, for identifying the conditions under which its outputs should not be trusted, don't work well when you can't open the box. The difference with large language models is that the box is both unusually powerful and unusually hard to open.

Interpretability research — the effort to understand which parts of a model are responsible for which behaviors — is an active and growing field. Some recent progress has been made on identifying "circuits" within models that implement specific capabilities [5]. But this work is still far from producing the kind of mechanistic understanding that would let you say, with confidence, why a model produces a particular output.

This isn't a reason to stop using language models. But it is a reason to treat their outputs with the same epistemic seriousness you'd bring to any tool whose failure modes you don't fully understand — which is to say, as one input among others, subject to your own judgment rather than deferred to as an authority.

Dig deeper Whose values got encoded, and how would we know?

RLHF optimizes toward human preference ratings. But preferences are not uniform. They vary by culture, language, socioeconomic position, professional background, and dozens of other factors. The annotators who provide preference ratings are not a representative sample of humanity — they are a specific workforce, in specific circumstances, rating outputs according to their own norms and experiences.

This means that what a language model treats as "helpful," "appropriate," "offensive," or "authoritative" reflects a particular set of preferences that have been amplified through the training process. Those preferences may diverge significantly from those of users with different cultural backgrounds, different linguistic norms, or different relationships to the kinds of knowledge and authority that dominate the training corpus.

There is no clean fix here. More diverse annotation workforces help but don't resolve the problem. Constitutional AI approaches can make the normative commitments more explicit, which at least makes them legible and debatable. But the fundamental issue — that a general-purpose tool deployed globally was shaped by preferences that are culturally specific — remains.

Dig deeper The frozen knowledge problem

Training ends at a point in time — a "knowledge cutoff." The model's parameters are fixed at that point. But the model gets deployed and used for months or years afterward, by users whose questions and contexts are embedded in an ongoing present that the model cannot access.

This creates a strange temporal situation. The model you're talking to learned from text produced up to a certain date. It has no way of knowing what has happened since, no way of updating its knowledge without retraining, and — critically — no reliable way of flagging when a question falls into the gap between what it learned and what you need to know. A model trained before a major election, a scientific discovery, or an institutional change may confidently produce responses that were accurate at training time and are no longer accurate now.

This is one reason why retrieval-augmented generation (RAG) — connecting models to live information sources — has become a major area of development. The context window explainer in this series touches on this from the input side. The training side of the problem is that there is no substitute for what the model didn't learn.

Next in the series: Training produces a model with hundreds of billions of parameters. What exactly are those parameters, how do they store what the model "knows," and why does scale matter so much? That's explored in the parametrization explainer.

References

  1. Perrigo, B. (2023, January 18). Exclusive: OpenAI used Kenyan workers on less than $2 per hour to make ChatGPT less toxic. TIME. time.com/6247678/openai-chatgpt-kenya-workers
  2. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744. doi.org/10.48550/arXiv.2203.02155
  3. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint. doi.org/10.48550/arXiv.2212.08073
  4. Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., et al. (2023). Towards understanding sycophancy in language models. arXiv preprint. doi.org/10.48550/arXiv.2310.13548
  5. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. (2022). Toy models of superposition. Transformer Circuits Thread. transformer-circuits.pub/2022/toy_model/index.html

How this was made

Built through vibe-coding — iterative natural-language collaboration with Claude (Anthropic) generating the HTML, CSS, and JavaScript. All base and instruct model outputs were generated by running prompts against Ministral-3-8B-Base-2512 and Ministral-3-8B-Instruct-2512-BF16 (Mistral AI, released December 2025) on an NVIDIA RTX 4070 Super via Windows 11 + WSL2. Data collection used Python with HuggingFace Transformers: base model queried without chat template (raw completion), instruct model with chat template applied, temperature 0.1, 3 runs per prompt. Data generated April 2026.

Getting clean base-model behavior required bypassing chat interfaces that suppress exactly the repetition loops, question cascades, and corpus fragments shown here — these outputs are characteristic, not defects, and needed the lower-level HuggingFace API to surface. The instruct model was loaded in BF16 rather than its native FP8 format due to GPU kernel compatibility; this doesn't meaningfully affect output quality.

A full walkthrough of the design and build process for all artifacts in this series is available on the series process page.