Process documentation
AI Literacy Faculty Fellows
University of New Mexico · 2026
College of Arts & Sciences
Opening
How these artifacts were made
A process companion to the seven explainers in the LLM literacy series. The decisions, methods, and dead ends that produced them.
This artifact is not a tutorial. It is not advice. It is a record of one specific project: a single faculty member building seven educational artifacts about large language models, with Claude as a collaborator, over two weeks of intensive work in April 2026, for the University of New Mexico AI Literacy Faculty Fellows program.
We've written this for faculty who are considering similar work — building AI-assisted educational content, individually or with colleagues. The most useful thing we can offer such a reader is an honest picture of what the work was actually like: what the design decisions cost, what got tried and rejected, what failed and how the failure was caught, and where the human-AI division of labor actually fell. We've tried to provide that.
Each of the seven explainers gets its own section. Each section follows the same structure — origin, build, honest moments, labor, what we'd do differently. After the seven, a coda on reconciliation: the work of taking seven artifacts with divergent design choices and making them feel like a series. After the coda, a brief closing on what this artifact can and cannot tell you.
A note on effort. Throughout, we report rough context-budget consumption as a proxy for effort. Context budget is the rough analog of session length when working with Claude — how much of a conversation's available context was used. We use this rather than wall-clock time because actual time-to-completion varied with available evening hours and isn't well-recalled. The estimates are approximate: the conversations that produced this material expose no token counters, so the figures are grounded in turn counts and content length instead.
Section 1
Tokenization
Why some languages cost more than others when you talk to an LLM — and what that means for who gets fluent service from these tools.
Origin
Tokenization wasn't the obvious choice for the first artifact. Temperature was a serious alternative — and ended up being built second. The first-position decision turned on a feature of this specific cohort: roughly two-thirds of the ALFF group had direct personal stakes in how their language is handled by AI systems. Instructors of Russian, Arabic, Spanish, French; a linguist; a developer with a Japanese-language background. Tokenization passed three filters at once — demonstrable without a live API, high "aha" potential for a non-technical audience, and a genuine equity story rooted in the specific languages spoken in the room.
An earlier version of the project had aimed at a single "how LLMs work" survey artifact. We rejected that early. A survey would either be too shallow to build real intuition or too long to engage with. The decision to go narrow — one concept, done well — is what made the artifact possible, and what created the series. The "series" framing emerged directly from rejecting the survey: if one artifact can't carry the curriculum, several focused artifacts can.
Build
The worked examples in this artifact are tokenization counts, not model outputs. We collected real data for a single sentence — "I would like to buy a train ticket to Tokyo, please" — translated into seven languages and run through three tokenizers (GPT-4's cl100k_base, GPT-5's o200k_base, and Claude Sonnet 4.6).
The OpenAI tokenizer counts came from a public web tool. The Claude counts required direct API calls with several rounds of debugging — JSON encoding issues with non-Latin scripts, and a calibration step to subtract message-structure overhead from the raw counts so figures were comparable across tokenizers. None of this surfaced in the artifact, but the data wouldn't have been trustworthy without it.
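For readers who want the shape of that pipeline, here is a minimal sketch of the counting-and-calibration pattern. It assumes tiktoken for the OpenAI encodings (the artifact's own OpenAI counts came from the public web tool, not from code) and the Anthropic SDK's token-counting endpoint; the model name and the one-token baseline are illustrative assumptions, not the original script.

```python
import tiktoken
import anthropic

SENTENCE = "I would like to buy a train ticket to Tokyo, please"

# OpenAI-family tokenizers: counts come straight from the encoding.
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(SENTENCE)))

# Claude counts: the Messages API reports tokens for a whole message, which
# includes structural overhead beyond the sentence itself.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def claude_count(text: str, model: str = "claude-sonnet-4-5") -> int:
    resp = client.messages.count_tokens(
        model=model, messages=[{"role": "user", "content": text}]
    )
    return resp.input_tokens

# Calibration: estimate the fixed message overhead from a string whose own
# count we treat as one token (a crude assumption), then subtract it so the
# Claude figures are comparable to the raw encodings above.
overhead = claude_count("a") - 1
print("claude", claude_count(SENTENCE) - overhead)
```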
The most distinctive methodology piece was native-speaker review of the translations. Three of the seven were reviewed by colleagues with relevant expertise: Heather Sweetser (Arabic) noted the original was grammatically correct MSA but unlikely for a mundane transaction. Liping Yang (Mandarin) pointed out that the closing 请 was a direct-translation artifact that feels unnatural to native speakers. Irina Meier (Russian) confirmed the Russian as acceptable. Each revision required re-running the tokenization counts.
Honest moments
Two errors caught in review are worth naming, because they map onto patterns that recurred throughout the series.
Confabulated visualization. The French token-boundary visualization was initially wrong — Claude had reasoned from assumptions about how French tokenization "should" work rather than from the verified data. The token counts were correct because they came from the tokenizer tool directly; the visual splits within them were Claude's reconstruction. A screenshot from the OpenAI tool revealed the actual splits were different. The recurring lesson: data Claude reconstructs from "what makes sense" should be cross-checked against ground truth, even when surrounding data is verified.
Confabulated attribution. Claude described one ALFF colleague as being in "Women, Gender, and Sexuality Studies." Her own institutional page describes her as Assistant Professor of Race and Communication. Caught when the developer noticed the attribution didn't match what he knew. We later applied a more conservative attribution policy — a colleague's published ideas could be referenced, but not extensively interpreted under their name without their review.
Labor — and what continued after
The developer's contributions were the topic choice, the equity framing as organizing spine, the decision to collect verified multilingual data, and substantive error-catching. Claude's contributions were the artifact structure, the design system, the prose, and the technical debugging on data collection. The most important moves emerged from interaction: the framing of tokenization inefficiency as a structural equity issue (not a technical curiosity) came out of a back-and-forth about who the artifact was for.
The most important thing to say to faculty considering similar work, though, isn't captured by the build itself. The artifact continued to change after generation — native-speaker review prompted re-runs, a colleague's response prompted a structural reframe, attribution decisions were revised. The build and the review were interleaved, not sequential. The AI can generate the artifact; it cannot do that review for you. Plan for that work as part of the project, not after it.
Context cost
This artifact was built across a single long conversation, not multiple sessions — but the conversation accumulated substantial prior context (program goals, audience profiling, design principles) before any code was generated. By the time the artifact reached near-final state, context compaction was a real concern. This is what prompted the practice of writing a design brief card — a distilled summary of design decisions, audience profile, and series context — to prime new conversations for the next artifact without relying on the previous session's accumulated context. For anyone considering a multi-artifact project: that pattern is worth establishing early.
Section 2
Embeddings
What it means for words to have positions in space, what that geometry encodes, and where it gets uncomfortable.
Origin
The embedding artifact was not originally planned. The initial series outline went directly from tokenization to attention, treating embeddings as background. The decision to add it emerged when we were working through the conceptual foundations of attention and noticed that attention's payoff — meaningful contextual representations — depends entirely on what the input representations are. Without that grounding, attention becomes mechanism without motivation.
Once we committed to the artifact, the framing question was where to lead from. Two candidates: the geometric legibility of the embedding space (king − man + woman ≈ queen reveals real learned structure) or the bias story (the same geometry encodes cultural associations that affect model behavior). We considered leading with bias on the grounds that it's more immediately consequential to a humanities audience. We rejected this on pedagogical grounds — the bias findings require the geometric intuition to land properly, and leading with them before building that intuition risks the audience experiencing the result as alarming rather than illuminating. Mechanism before application; the near-parity teacher/judge comparisons before the more charged criminal and demographic ones.
Build
Analogy arithmetic results, nearest-neighbor sets, and bias similarity scores were generated by running Qwen3-embedding:8b locally via Ollama. The first script queried embeddings for about 77 words; nearest-neighbor results were too thin (the candidate set was too small to surface anything interesting), so the vocabulary was expanded to roughly 490 words and the script re-run. The bias section required a second phase: cosine similarities between demographic anchor words and occupation terms, which produced a noisier signal than the gender-occupation results — racial and ethnic terms are highly polysemous and harder to interpret cleanly.
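The scripts themselves were short. A sketch of the pattern, using Ollama's local embeddings endpoint; the vocabulary, helper names, and selection logic here are illustrative, not the originals:

```python
import requests
import numpy as np

URL = "http://localhost:11434/api/embeddings"
MODEL = "qwen3-embedding:8b"

def embed(word: str) -> np.ndarray:
    resp = requests.post(URL, json={"model": MODEL, "prompt": word})
    v = np.array(resp.json()["embedding"])
    return v / np.linalg.norm(v)  # unit length, so a dot product is cosine similarity

vocab = ["king", "queen", "man", "woman", "paris", "france",
         "italy", "rome", "milan"]  # the real runs used ~77, then ~490 words
vectors = {w: embed(w) for w in vocab}

# Analogy arithmetic: king - man + woman, then nearest neighbors by cosine.
query = vectors["king"] - vectors["man"] + vectors["woman"]
query /= np.linalg.norm(query)

neighbors = sorted((w for w in vocab if w not in ("king", "man", "woman")),
                   key=lambda w: -float(query @ vectors[w]))
print(neighbors[:3])  # with too small a candidate set, nothing interesting surfaces
```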
The data was baked into the artifact as JavaScript objects; no live API calls. This is the recurring pattern across the series: pre-compute everything, ship a static file. Live API calls would require keys, network access, and ongoing model availability — none of which is appropriate for an artifact meant to outlive any specific provider's terms of service.
Honest moments
The Paris→Milan result was retained on purpose. The geometry returns Milan before Rome for paris − france + italy because Paris and Milan share fashion and cultural associations, not because the model is making a geographic error. We considered cutting it. We kept it. An artifact about embeddings that hides the inconvenient features of embeddings would be teaching the wrong lesson. The framing in the artifact is honest about why the result is what it is.
The bias chart's scaling was wrong twice before it was right. The first version scaled bars relative to each other within rows, exaggerating gaps. The second anchored the scale at the minimum observed similarity (~0.53) rather than at zero, amplifying differences further. Neither error was intentional; both came from translating a correct conceptual instruction ("use a fixed scale") into code that implemented something subtler. The lesson is concrete: layout decisions for data visualizations are easy to specify abstractly and easy to implement wrong, and the only reliable check is opening the rendered output and looking at it.
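For concreteness, the three scalings side by side, with made-up values in the band the data actually occupied; only the last is honest:

```python
# Cosine similarities in the bias data sat in a narrow band well above zero.
values = [0.55, 0.60, 0.66]  # illustrative, but in the observed range

# Wrong, version 1: scale within the row, so the largest value fills the bar.
row_max = max(values)
widths = [v / row_max * 100 for v in values]            # exaggerates the gaps

# Wrong, version 2: anchor the scale at the observed minimum instead of zero.
lo, hi = min(values), max(values)
widths = [(v - lo) / (hi - lo) * 100 for v in values]   # amplifies them further

# Right: a fixed scale anchored at zero, shared across every row.
widths = [v / 1.0 * 100 for v in values]                # cosine similarity tops out at 1.0
```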
An example referenced in the text but missing from the interactive. The Paris analogy was initially omitted from the explorer but referenced in the explanatory text — creating a state where the prose discussed something the reader couldn't see. Caught when the developer reviewed the rendered artifact rather than the planning notes. This kind of mismatch happens easily when build and review aren't tightly interleaved.
Labor
The developer's contributions: the decision to add the artifact (mid-series, on judgment), the pedagogical sequencing (geometry before bias), the choice to use a local model rather than a third-party API, the call to retain the Paris→Milan result honestly, and the substantive error-catching. Claude's contributions: the four-section structure, the framing of the "aha" moments, the prose, the analogy curation (fourteen tested, seven kept), and the data-generation scripts. The teacher/judge-before-criminal sequencing emerged from a planning back-and-forth that neither party fully resolved until building. The stacked-bar layout for the bias chart wasn't specified in advance — it surfaced as the solution after the two failed implementations.
Context cost
This artifact and the next one (attention) shared a single very long conversation that covered the theoretical grounding for both, planning for both, data generation for both, and iterative build-and-review cycles for both. By the embedding build phase, context compression had probably already started. One known consequence: a planning decision about bias-content sequencing was not carried into the first build, and had to be restored on review. Faculty considering similar work should expect this kind of soft loss from long sessions and plan to review against explicit planning notes rather than relying on continuity.
Section 3
Attention
How models reach across a sentence to figure out what each word is doing — and what real attention patterns look like up close.
Origin
Attention was planned as a companion to embeddings from early in the project. The framing question was where to lead from. Two competing entry points: attention as the solution to the static-embedding problem (every token has one fixed vector regardless of context), and attention as the architectural innovation that replaced recurrent networks. We chose the first because it connects directly to the artifact the reader has just finished, and because the pronoun resolution example — "The trophy did not fit in the suitcase because it was too big" — is more immediately intuitive than the RNN bottleneck.
The trophy/suitcase sentence is a Winograd schema, a standard NLP example for testing referent resolution. That gave it credibility as a non-arbitrary choice — we were using a sentence the field already uses for this purpose, not inventing one to make a point.
Build
Attention weights were extracted from bert-base-uncased via HuggingFace Transformers with output_attentions=True, run locally on a standard laptop (no GPU required for four short sentences). A Python script processed all four sentences, exporting the full 12-layer × 12-head attention matrices as JSON. A second script systematically searched for layer/head combinations where specific token pairs showed strong signals — most heads attend either diffusely or heavily to the special [CLS]/[SEP] tokens, neither of which corresponds to recognizable linguistic relationships. From this analysis, nine head-sentence combinations were selected for the artifact.
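The extraction itself is a few lines of standard HuggingFace usage; the head-search heuristic below is a simplified sketch of the second script's logic, with illustrative thresholds:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The trophy did not fit in the suitcase because it was too big"
inputs = tok(sentence, return_tensors="pt")
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple of 12 layers, each (batch, 12 heads, seq, seq).
attn = torch.stack(out.attentions).squeeze(1)  # -> (12, 12, seq, seq)

# Search for heads where a specific pair ("it" -> "trophy") carries real weight,
# skipping heads that dump most of their attention on [CLS] (token 0).
i, j = tokens.index("it"), tokens.index("trophy")
for layer in range(12):
    for head in range(12):
        w = attn[layer, head]
        if w[i, j] > 0.4 and w[:, 0].mean() < 0.3:  # thresholds are illustrative
            print(f"L{layer}H{head}: it->trophy = {w[i, j]:.2f}")
```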
One framing decision worth naming: BERT is bidirectional, attending in both directions simultaneously, while production models like Claude and GPT are autoregressive and attend only leftward. We disclose this honestly in the artifact rather than papering over it. The choice to use BERT despite this is practical — production models don't expose attention weights through their APIs, and BERT's HuggingFace implementation does. We treat the disclosure as a feature: it creates an opportunity to explain the BERT/GPT architectural distinction, which is itself pedagogically valuable.
Honest moments
The L2H0 reframing. The most substantive correction in this artifact. We initially described layer 2 head 0 as an "adjective→noun" head (in the old man sentence) and a "verb→object" head (in the Mary sentence). Closer examination during review revealed that L2H0 is an adjacency head — it attends to the immediately following token with near-certainty. The adjective-noun and verb-object relationships that appeared to be encoded were coincidental: in English, adjectives typically precede their nouns; in the Mary sentence, "jane" happens to immediately follow "told." We rewrote the framing to describe L2H0 honestly as a next-token head, observing that adjacency tracking does much of the work of adjective-noun encoding precisely because of standard English word order. The corrected framing is more intellectually honest and arguably more interesting — it shows that some syntactic patterns get encoded cheaply while others require genuine long-distance attention.
A token indexing bug. The annotation for L2H0 in the Mary sentence was keyed to token index 3 ("jane") rather than token index 2 ("told"). Clicking "jane" showed an annotation about what "told" does; clicking "told" showed nothing. Caught during the developer's review of the rendered artifact. Trivial to fix once identified, easy to miss if you only look at the code rather than at the output.
Overstated ambiguity. A description of one attention result as showing "genuine ambiguity" overstated what the data showed — the attention weights leaned clearly toward one resolution but not conclusively. Revised to "some genuine ambiguity remains," which is what the data actually supported.
Labor
The developer's contributions: the choice to use BERT (a practical judgment about tooling), the four sentences (selected to demonstrate different phenomena — pronoun resolution, modification, subject-verb agreement, ambiguous reference), the decision to retain the "messy" heads rather than swap them for cleaner alternatives, and substantive error-catching including the token bug and the L2H0 mischaracterization. Claude's contributions: structure, prose, head-selection methodology, all annotation text, all code. The honest L2H0 reframing emerged from interaction — the developer flagged L2H0 as "attend to next token"; Claude wrote the reframing connecting this to English word-order patterns.
Context cost
The attention build inherited a context window that was already substantially consumed by the embeddings build that preceded it. At least one planning decision was a soft miss: an "everything you learned about embeddings is now broken" framing for Section 1 had been established in early planning but didn't make it into the first build. The fix was light — one added paragraph — but it illustrates a pattern. Long conversations spanning planning, data generation, and multiple build cycles will lose some decisions to compression. Mid-conversation review against explicit planning notes is more reliable than assuming continuity.
Section 4
Context window
What "running out of context" actually looks like — and how the failure modes are weirder, and more useful to understand, than they sound.
Origin
The framing question for this artifact wasn't whether to build it — its place in the series was fixed early — but what it should actually be about. The obvious framing (context windows are finite, bigger is better, here's what happens at the limit) was considered and rejected as boring. It would produce a "here are some numbers" artifact rather than a conceptual one.
The framing that stuck came from a brainstorm exchange: a stage you set, not a bucket you fill. This reframe shifts the mental model from a passive container (you fill it until it's full) to an active design space (you choose what goes on the stage and where). It also points toward the artifact's two main empirical findings before the reader has seen any data: that what you put in matters, and that where you put it matters. The reframe was Claude's proposal during a structured brainstorm, but it landed in the first exchange where it appeared and didn't require iteration.
Build
Examples were generated using Qwen3:14b running locally via Ollama on a consumer GPU (an RTX 4070, with 32 GB of system RAM). Context window sizes were set manually; each test ran in a clean chat to prevent conversation history from consuming context budget. Thinking mode was disabled throughout — an early run with thinking enabled introduced unpredictable token consumption that would have confounded the variable being tested.
The demonstration text was Oscar Wilde's The Ballad of Reading Gaol: long enough at 5,215 tokens to stress multiple context window sizes, internally structured (six unequal parts) so truncation effects would be visible, public domain, and substantive enough that responses could be evaluated for accuracy by someone who knew the poem.
The test battery evolved over multiple iterations — three context window sizes and a few task types initially, expanding to five tasks, then to a six-question position-by-position retrieval battery at four window sizes (added after the front-truncation discovery suggested a position-by-position demonstration would be more informative). One task type was run and excluded — biographical grounding, where the model's training knowledge about Wilde compensated for limited textual access in ways that erased the variable being tested.
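A sketch of what one battery run looked like. The num_ctx option and the /api/generate endpoint are Ollama's; the think flag is how we recall disabling thinking mode, and the window sizes, file handle, and question list are illustrative:

```python
import json
import requests

URL = "http://localhost:11434/api/generate"
POEM = open("reading_gaol.txt").read()   # the full text, 5,215 tokens
QUESTIONS = [                            # six in the real battery, spanning the poem
    "What happens in the opening stanzas?",
    "How does the poem end?",
    # ...
]

results = {}
for num_ctx in (2048, 4096, 8192, 16384):
    for q in QUESTIONS:
        resp = requests.post(URL, json={
            "model": "qwen3:14b",
            "prompt": f"{POEM}\n\nQuestion: {q}",
            "options": {"num_ctx": num_ctx},  # the window size under test
            "think": False,                   # keep token consumption predictable
            "stream": False,
        })
        # Each request is its own clean "chat": no history consumes the window.
        results[f"{num_ctx}|{q}"] = resp.json()["response"]

json.dump(results, open("battery.json", "w"), indent=2)
```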
Honest moments
The truncation direction error. The most consequential error in the entire build. The initial interpretation of early-poem questions failing at small context windows was that Ollama was truncating from the end of the input — cutting off later sections. This was offered confidently and was internally consistent with the data as initially read. The error was caught when the developer noticed that small-context-window models were consistently answering questions about the poem's ending correctly while failing on the beginning. Documentation confirmed: Ollama front-truncates, retaining the most recent tokens. Early sections were being dropped, not later ones. This reframing affected the interpretation of every result in the demonstration section.
The lesson is portable: plausible explanations for model failures should be verified against documentation rather than accepted on their internal coherence. Internally consistent stories can be wrong.
The overconfident categorization. After all task types had been run, the developer asked Claude to guess which responses came from which context window sizes — as a check on whether the differences were really distinguishable. Claude performed this with expressed confidence and got the biographical task wrong. The response Claude judged "most likely to reflect training-data compensation" was in fact the full-context response; the heuristic Claude had constructed (heavy biographical identification = less textual access) didn't hold in this specific case. The developer flagged this explicitly as a case of an LLM constructing plausible stories about data without checking them against evidence — a failure mode the artifact itself is partly designed to help readers recognize.
The character identification mischaracterization. Claude's initial summary of one task described a 4,096-token response as "competent but thinner" — a characterization that understated a significant factual error. The response had inverted the poem's power dynamic, identifying the judge as the condemned man. Caught in review.
Labor
The developer's contributions: all empirical testing, the choice of demonstration text, the catch of the truncation direction error, the catch of the character identification mischaracterization, and the decision to exclude the biographical grounding task. Claude's contributions: the central reframe, the initial five-task battery, the proposal for the position-by-position battery after the front-truncation discovery, the section structure, and the code. The most important contributions weren't the most visible ones — the developer's most valuable contribution was catching errors; Claude's most valuable contribution was the reframe that gave the artifact its conceptual spine.
Context cost
Among the longest single-artifact builds in the series. The session included extended brainstorming, five distinct task types run and reported back, a six-question battery at four window sizes, a self-appraisal exercise on uncategorized results, the front-truncation discovery worked through in detail, a summary card generated to preserve findings before context degraded, and two complete drafts of the HTML (the second after sharing the existing series artifacts for stylistic consistency). The summary-card practice was a deliberate hedge against context degradation — Claude wrote a distilled summary of findings while it still had clean access to the raw response data, so the build phase wouldn't have to reconstruct conclusions from increasingly distant context.
Section 5
Temperature
A parameter widely described as a "creativity dial" — and an artifact that introduces that framing in order to productively unsettle it.
Origin
Temperature was chosen second in the series for two reasons: high "aha" potential, and no requirement for live API calls. Pre-generated examples could carry the demonstration. The decision to lead with intuition and treat the softmax mechanism as a "dig deeper" branch came early — the audience is humanities-leaning and may be math-averse.
The substantive framing decision was what to do with the dominant "creativity vs. precision" metaphor that appears in nearly every LLM tutorial. Three options: use it straightforwardly, avoid it for a better frame, or introduce it and then unsettle it. We chose the third. A search of educational resources found that the "creativity" label is nearly universal while empirical critique of it is rare in practitioner-facing materials. Peeperkorn et al. (2024) — finding that temperature correlates weakly with novelty but moderately with incoherence — became the empirical anchor. Putting the complication in the same section as the standard framing, rather than burying it in a dig-deeper, was deliberate: the complication is part of the core content, not optional enrichment.
Build
An initial attempt at a live in-browser API sandbox failed at runtime due to network restrictions in the artifact environment. The fallback was the same local Ollama setup used for the context window artifact: Qwen3:14b on an RTX 4070. All artifact examples were generated locally, with thinking mode disabled (low-temperature thinking-mode runs produced degenerate output — loops of repeated relevant words rather than coherent responses, an unplanned interaction effect).
The Section 3 demonstration was originally planned as three domains: a factual Q&A anchor for low temperature, a reasoning task for medium, and a creative task for high. This collapsed during testing. Well-settled factual questions showed minimal variation even at temperature 2.0 — the probability distribution over correct answers is too sharply peaked for temperature to do much. We tried several candidates before accepting that the factual domain wasn't going to produce the clean gradient we wanted. The artifact restructured around two domains where the gradient was legible: a Kannon etymology prompt (knowable answer traceable across three languages, giving us ground truth) and the Bulwer-Lytton creative prompt (quality variation legible to any reader, and the outputs are genuinely funny — a non-trivial consideration for something people engage with voluntarily).
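The generation loop itself stayed simple across all those iterations. A sketch, with illustrative temperature values and sample counts; multiple samples per temperature mattered later, when cherry-picking became a live concern:

```python
import requests

URL = "http://localhost:11434/api/generate"
PROMPT = open("bulwer_lytton.txt").read()  # one of the two Section 3 prompts

runs = {}
for temp in (0.2, 0.7, 1.2, 2.0):          # illustrative ladder
    # Several samples per temperature: one draw cannot show a distribution,
    # and showing only one invites cherry-picking.
    runs[temp] = [
        requests.post(URL, json={
            "model": "qwen3:14b",
            "prompt": PROMPT,
            "options": {"temperature": temp},
            "think": False,   # low-temperature thinking runs degenerated into loops
            "stream": False,
        }).json()["response"]
        for _ in range(4)
    ]
```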
Honest moments
The factual domain failure became part of the argument. The original plan had factual Q&A as the low-temperature anchor. The honest admission, after multiple failed attempts, was that for well-defined tasks temperature has limited purchase. This reframed the artifact: rather than a "which temperature for which task" guide, it became a more accurate account in which temperature controls variance, and sometimes there isn't much variance to control.
The Mary Anning attribution rabbit hole. The "She sells sea shells" prompt was originally intended as a low-temperature factual anchor. Instead, it produced confabulation at all temperature levels — fabricated authors, titles, and dates appearing even at temperature 0.2. Verifying the widespread claim that the tongue twister was inspired by Mary Anning required reading a Library of Congress folklife blog, which traced the attribution to a single unsourced 1977 mention. The seashells prompt was repurposed for Section 4's argument that hallucination is not primarily a temperature phenomenon. Multiple runs per temperature level were shown there, partly to avoid cherry-picking — the first low-temperature output was the most accurate of the set, and showing only that one would have misrepresented the pattern.
The grace note iteration. The artifact initially showed two high-temperature Bulwer-Lytton outputs, both illustrating failure modes. The developer correctly noted that a third output was arguably the best entry in the entire set, and that showing only failure modes would misrepresent the high-temperature distribution. That output was added as a grace note in italic Spectral, distinguished visually from the card outputs. Selection bias is easy when the failure modes are dramatic; correcting for it requires actually looking at the distribution.
Labor
The developer's contributions: the topic and series sequence; the decision to unsettle rather than avoid the "creativity" framing; the choice to use pre-generated examples; the decision to exclude multilingual examples (given the group's diversity, any choice would exclude most members); all curation decisions; and the grace note correction. Claude's contributions: the five-section structure; the specific domains tried for Section 3; the multi-run methodology for high-temperature examples; the dig-deeper branches; and all code. The framing of temperature as controlling variance rather than creativity sharpened through discussion of the Peeperkorn finding; the two-domain Section 3 structure emerged from the failed search for a third domain.
Context cost
Among the longer single-session builds. The session included extended framing discussion, the failed API sandbox attempt, an iterative example-generation phase requiring multiple candidate prompts, and verification research on the seashells claim (including reading a 15-page Library of Congress blog). By the end, context pressure was a real concern. The developer chose to continue in the single session rather than start fresh — partly to preserve access to the example outputs already shared in chat. The risk of context-related errors did manifest in at least one instance: a str_replace failure caused by attempting to edit text that had already been modified earlier in the conversation. Faculty considering this approach should expect that artifacts with substantial empirical content will consume context faster than artifacts that are primarily explanatory.
Section 6
Training
Where everything the prior artifacts described actually comes from — the embeddings, the attention weights, the sampling distributions — and what it takes to demonstrate that origin honestly.
Origin
The training artifact was the sixth in the planned series. The earlier five had explained how a trained model processes and generates text; none had opened the box on where any of it comes from. This artifact was explicitly designed to do that retroactive work — to give the reader a framework for understanding where the embeddings came from, why the attention weights are what they are, why the sampling distribution has the shape it does.
The framing question we spent the most time on was the hook. The obvious candidate was a before/after comparison: same prompt, pretrained base model versus instruction-tuned model. This sounds simple but raised an immediate practical problem — most public interfaces only expose instruction-tuned models. Genuine pretrained base models are available only via raw weights, not through chat interfaces. The decision to generate original outputs rather than cite published examples (e.g., Ouyang et al. 2022 InstructGPT) was deliberate: the program values original demonstration over citation alone. That decision created the scope for the substantial data-collection work that followed.
The structural scope was also a real decision point. The training pipeline has at least four distinguishable stages: pretraining, supervised fine-tuning, RLHF, and newer approaches like Constitutional AI. Pretraining and SFT became the core arc; RLHF got a full section; Constitutional AI was placed in a collapsible dig-deeper with explicit acknowledgment of its limited public documentation. The further down the pipeline you go, the less publicly documented the details are, and the artifact's structure should mirror that.
Build
The first attempt at base-model outputs used Qwen2.5-7B-Base, queried without a chat template (raw text completion); the instruct model was queried with the chat template applied. The unexpected result: the base outputs were too coherent. They looked nearly like instruction-following responses. In retrospect this is mechanistically expected — a base model trained on internet-scale text has seen enormous quantities of text-that-follows-questions, and document-completion behavior blends with question-answering behavior when the prompts themselves are question-shaped. The prediction that base models would produce clearly broken outputs on instruction-shaped prompts was wrong, and was caught empirically when the outputs came in.
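The mechanical difference between the two conditions is small and worth seeing. A sketch with HuggingFace Transformers, using the first attempt's model IDs; generation settings are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "List three causes of World War I."

def complete(model_id: str, text: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto")
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=200)
    return tok.decode(out[0], skip_special_tokens=True)

# Base model: raw text completion, no chat template. The model simply
# continues whatever document it believes it is reading.
base_output = complete("Qwen/Qwen2.5-7B", PROMPT)

# Instruct model: the same prompt wrapped in the chat template the model was
# fine-tuned on, with the assistant turn opened for it to fill.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
wrapped = tok.apply_chat_template(
    [{"role": "user", "content": PROMPT}],
    tokenize=False, add_generation_prompt=True)
instruct_output = complete("Qwen/Qwen2.5-7B-Instruct", wrapped)
```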
The second attempt used Ministral-3-8B (Mistral AI, December 2025). The base outputs were dramatically more useful: repetition loops, cascading question lists, and — most strikingly — what appears to be a verbatim fragment from a homework help website (complete with user IDs and timestamps) in response to "List three causes of World War I." That WWI fragment became one of the most pedagogically valuable moments in the artifact.
Getting the Ministral instruct model to actually run required substantial unanticipated infrastructure work: incompatible GPU kernels, a Python version PyTorch doesn't ship wheels for, an FP8 quantization format that needed a separately-published BF16 fallback, VRAM constraints forcing spillover to system RAM. None of this was anticipated. The "vibe coding" framing used elsewhere in the series can hide how much non-AI infrastructure work is required for artifacts with substantial empirical components. That hidden cost is real and worth budgeting for.
Honest moments
The Qwen overconfidence. Already described above. The prediction was wrong; the data caught it. The lesson: empirical predictions about base-model behavior should be tested before being built into an artifact's argument.
The cross-family pair proposal. When the Ministral instruct model was proving difficult to run, Claude suggested using Ministral base outputs with Qwen instruct outputs as a fallback. The developer correctly rejected this. The confound — that any observed difference between cross-family outputs could reflect model family or training corpus rather than training stage — would have directly undermined the artifact's central argument. Claude undersold the downside in proposing it. This is a representative pattern: AI-suggested solutions optimized for convenience can compromise the pedagogical claim being made, and the human partner's willingness to push back is genuinely load-bearing.
The vague-citation correction. A dig-deeper on annotation labor cited "investigative journalism about OpenAI's data labeling practices in Kenya" without specifics. On review this was flagged as unacceptably vague for factual content about corporate practices. A web search confirmed the source: Billy Perrigo, TIME, January 18, 2023. The text was updated. The lesson generalizes: vague invocations of "investigative journalism" or "research has shown" are exactly the kinds of claims that should be grounded in specific citations before publication.
Labor
The developer's contributions: the series framing and the decision that this artifact should explain training; the hook approach; the scope decision (pretraining + SFT as core arc, RLHF as section, Constitutional AI as branch); the decision to generate original data rather than cite published examples; the rejection of the cross-family shortcut; the choice of closing framing (epistemic uncertainty about emergent capabilities); the observation that the WWI output was particularly valuable. Claude's contributions: the specific prompts, the section structure, the token-by-token animation as the demo display mechanism, the dig-deeper branches, most of the prose, and the troubleshooting decisions during data collection (several of which required multiple failed attempts). The closing-section framing was genuinely iterative — the developer chose one option from several proposals and articulated reasoning that sharpened the framing beyond what was originally proposed.
Context cost
This artifact required a single continuous conversation that was long enough to undergo context compaction at least once. The conversation spanned: initial scoping, model research and selection, two complete data-collection cycles (Qwen and Ministral) with extensive troubleshooting across both, artifact drafting, and a revision pass. The data-collection work — environment setup, debugging, overnight runs, reviewing outputs — was the largest driver of session length. Had we used published examples or accepted the cross-family pair, the conversation would have been substantially shorter. Roughly two to three times the context cost of a typical single-topic explainer in this series. The infrastructure cost is real and should be weighed by faculty considering similar work: the value of original demonstration data is genuine, but so is the cost of acquiring it honestly.
Section 7
Parametrization
"Bigger is better" is both true and misleading. The artifact tries to demonstrate the first claim and complicate it without backing away from either.
Origin
The framing question was which complication to build the artifact around. Three candidates surfaced early. Task-model fit: a fine-tuned small model can outperform a larger general model on specific tasks. The emergent-capabilities debate: many apparent phase transitions in capability may be measurement artifacts (Schaeffer et al. 2023). The opacity problem: frontier model parameter counts are now trade secrets, making the metric itself unreliable at the top end. All three are real complications. The question was which could be demonstrated rather than just asserted.
Task-model fit won because it was testable with hardware we actually had. The developer runs Ollama locally with an RTX 4070 and had access to both Qwen3 across a range of sizes (0.6B through 14B) and Qwen2.5 / Qwen2.5-Coder at matched sizes. That meant we could generate a genuine controlled comparison — same architecture, matched parameter count, different training focus — rather than relying on published benchmarks. The emergent-capabilities complication stayed in as a dig-deeper, the right level of prominence for a contested empirical finding the audience wouldn't be equipped to evaluate directly.
One framing element that emerged late: the connection between Section 4 (task-model fit) and Section 5 (local deployment). The initial structure treated these as adjacent topics. During the design conversation, the developer pointed out that they're the same argument — at local-deployment scales, where you can't run a 70B model, the specialist strategy is the practical answer to the capability gap. That reframe tightened the closing arc considerably.
Build
Examples were generated through three batch runs using a Python script that queried Ollama's local API across all five Qwen3 sizes in sequence, saving JSON and Markdown automatically. This replaced the manual workflow of running prompts in a chat interface and copying results — a significant efficiency gain. Seven candidate prompts were tested across two batches; four were selected for the final artifact. The discarded prompts shared a pattern: the capability gradient was real, but the small-model failures were thin rather than outright wrong, or noisy in ways that undermined the "scale buys you" argument. A sketch of the batch architecture follows.
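The model tags below are the five sizes described; the file layout and prompt handling are illustrative:

```python
import json
import pathlib
import requests

URL = "http://localhost:11434/api/generate"
MODELS = ["qwen3:0.6b", "qwen3:1.7b", "qwen3:4b", "qwen3:8b", "qwen3:14b"]
PROMPTS = json.load(open("prompts.json"))  # {"prompt_name": "prompt text", ...}

out_dir = pathlib.Path("batch_results")
out_dir.mkdir(exist_ok=True)

results = []
for model in MODELS:
    for name, prompt in PROMPTS.items():
        resp = requests.post(URL, json={
            "model": model, "prompt": prompt,
            "think": False, "stream": False,
        }).json()
        results.append({"model": model, "prompt": name,
                        "response": resp["response"]})

# JSON for the artifact build, Markdown for human review, in one pass.
json.dump(results, open(out_dir / "results.json", "w"), indent=2)
with open(out_dir / "results.md", "w") as md:
    for r in results:
        md.write(f"## {r['prompt']} ({r['model']})\n\n{r['response']}\n\n")
```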
The hook used the unreliable-narrator passage rather than an alternative we considered, because its 0.6B failure is categorically wrong — the model reads a rationalizing, evasive narrator as exhibiting "restraint" and "emotional detachment," which is almost the opposite of what the text invites. That kind of legible failure earns its place as a hook; a weaker model producing a shorter or less sophisticated answer is less immediately striking.
Honest moments
The cross-generation overreach. The most significant correction in the build. The initial Section 4/5 framing stated, as if it were a finding, that "an older-generation specialist is outperforming a newer-generation generalist at the same approximate size point." The developer correctly flagged that the Qwen2.5 comparison only involved models within the Qwen2.5 family — the cross-generation claim (Qwen2.5-Coder vs. Qwen3) was a hypothesis, not a result. The text was revised to reflect this, and the cross-family validation script was written specifically to test the claim. The result confirmed the hypothesis, and the final artifact states it as a finding. The sequence was correct; stating it as a finding before running the test would have been confabulation. The lesson generalizes: claims framed as findings should correspond to actual tests, and the gap between "this seems likely" and "we tested it" matters.
The out-of-date open-weights frontier. A dig-deeper described the open-weights frontier as "currently around 70B at the high end, with models like Llama 3.1-70B." A web search confirmed that as of April 2026, Kimi K2.5 is a 1-trillion-parameter open-weight model and Qwen3 235B is available directly via Ollama. The frontier had moved substantially since the training data underlying that initial claim — a concrete instance of the knowledge-cutoff problem covered elsewhere in the series.
One uncertainty preserved as uncertainty. The Qwen3:1.7b reasoning result without thinking mode was surprisingly weak — worse than the Qwen2.5:1.5b base on the same task. The most plausible explanation is that Qwen3 was designed around thinking mode and performs below its general capability without it, but this is interpretation rather than confirmed finding. The artifact flags it as a caveat rather than a conclusion. Worth naming because the temptation to round uncertainty down to confidence is exactly the failure mode the series is trying to teach against.
Labor
The developer's contributions: the primary structural argument; the choice to use local models as the core demonstration; the choice to keep task-model fit as the main complication over emergence; the rejection of the cross-generation overreach as a finding before validation; the decision not to re-run the cross-family comparison with thinking mode enabled (correctly judged as making the story unnecessarily complicated); the reframe connecting Sections 4 and 5 as a single practical argument. Claude's contributions: the hook selection (narrator passage), the specific candidate prompts, the two-prompt structure for Section 4, the interactive Section 3 selector, the three-script batch architecture, and most of the prose. The Section 4/5 framing went through at least three versions — initial overreach, hedged hypothesis after pushback, confirmed finding after validation — which is what genuine iteration looks like.
Context cost
The longest single-artifact session in the described series — a single extended session, continuous from framing through final build. Several factors drove length: the example-generation workflow required uploading results files (JSON and Markdown) back into the conversation for review, with three separate batch uploads; two prior series artifacts were also uploaded for content analysis; the HTML grew to roughly 1,400 lines and was edited in place rather than rebuilt, which required re-reading the file before each str_replace in later passes to avoid stale-context errors. Roughly 1.5 to 2 sessions' worth of effective working context, with overflow handled by natural compression. The batch-scripting approach was a significant efficiency gain that partially offset the upload cost.
Coda
Reconciliation
The seven explainers were built one at a time, in separate Claude conversations, over roughly two weeks of intensive work in April 2026. They were not designed to be a series from the start. They became one through a separate process: a single day of dedicated reconciliation work that took seven artifacts with divergent design choices and gave them a unified visual and structural identity.
This coda is about that day.
What needed reconciling
The artifacts had drifted. Not in their content, which was already strong, but in their surfaces. Different accent colors — rust, blue, olive, burgundy. Different layouts — some single-column-scroll, some sticky-sidebar. Different treatment of the recurring "Dig deeper" expandable sections — sometimes labeled, sometimes not, sometimes with arrow indicators, sometimes plain text. Different masthead structures. Different approaches to citations: footnote-style in some, inline-parenthetical in others, absent entirely in others. Different glossary or sidebar conventions. Different conventions for how to fold methodology notes ("How this was made") into the artifact's structure.
None of these inconsistencies were errors. Each artifact had been built carefully on its own terms. But read together, the inconsistencies created friction: a reader moving from one artifact to the next had to recalibrate to a new visual grammar each time. The reconciliation work was to make that recalibration unnecessary.
How the work was structured
Each artifact's reconciliation was treated as its own multi-phase project. Typical pattern: a Phase 1 doing CSS reconciliation (palette unification, layout restructure, masthead conversion, addition of design tokens that didn't yet exist for that artifact); a Phase 2 doing body-content conversion (h2 emphasis, dig-deeper format updates, footnote infrastructure, cross-reference cleanup); and a Phase 3 doing wrap-up (sidebar replacement, JavaScript updates, integrity verification). Larger artifacts split Phase 2 in half. Each phase ended with a save to a backup file, so progress survived possible context loss between sessions.
The most consequential discipline was byte-exact preservation of data. Every reconciliation pass ended with a Python integrity check: were the pre-generated example outputs (model responses, tokenization tables, embedding visualizations) still byte-identical to the source? Were the interactive JavaScript functions still byte-identical? Restructuring presentation while preserving content is the cleanest description of what reconciliation was; the integrity check operationalized that.
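A sketch of the check. The marker comments are hypothetical; the real requirement is only that each artifact delimits its protected blocks in some stable, machine-findable way:

```python
import hashlib
import re

# Assumes each protected block is wrapped in stable marker comments, e.g.
# /* DATA:BEGIN */ ... /* DATA:END */ and /* JS:BEGIN */ ... /* JS:END */
BLOCK = re.compile(r"/\* (DATA|JS):BEGIN \*/(.*?)/\* \1:END \*/", re.DOTALL)

def fingerprints(path: str) -> list[str]:
    html = open(path, encoding="utf-8").read()
    return [hashlib.sha256(m.group(2).encode()).hexdigest()
            for m in BLOCK.finditer(html)]

# Byte-identical content produces identical digests, in identical order.
assert fingerprints("tokenization.backup.html") == fingerprints("tokenization.html"), \
    "a reconciliation pass touched protected content"
```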
What it cost
The reconciliation work itself was completed in a single day — a focused effort across roughly a dozen sessions in one extended Claude conversation. The artifacts varied substantially in reconciliation cost. The simplest (tokenization, embeddings) were nearly straightforward — most of the design system already aligned, and the work was mostly addition of new shared infrastructure. The hardest (training, parametrization) required substantial restructuring: training had to convert from scroll-based navigation to section-switching, change accent palette, and merge two sections into one with rewritten transitions.
That single-day reconciliation sat on top of two weeks of intensive artifact-building work. The seven explainers were drafted and revised in that period, with the author working evenings and weekends; the kickoff for the program that occasioned this work was April 10, the first artifact draft followed the next day, and the final reconciliation was completed on April 24. Two weeks. The cadence is worth naming honestly because it's an unusual one — most faculty considering similar work won't have this kind of concentrated time available, and the speed here was a function of motivation and free time rather than typical workload.
We also occasionally caught Claude (in the reconciliation conversation itself) confabulating. Most notably, in a context-window artifact reconciliation pass, Claude initially fabricated specific data values when rewriting from scratch rather than preserving existing values; we caught this before the output shipped. The byte-exact integrity check became standard practice partly in response to that incident.
What we'd do differently
The honest answer: don't reconcile retroactively. Establish the design system before the first artifact. We didn't — the design system emerged through the first three or four artifacts, settled by the fifth or sixth, and the first two then needed retrofitting. The reconciliation day was a focused, productive day, but it was a day of work that wouldn't have been needed at all had the design tokens been locked at the start.
For someone considering similar work: spend a session at the start sketching the design tokens (colors, typography, layout, recurring element treatments — dig-deepers, callouts, footnotes, sidebar). Capture them in a planning document. Build the first artifact against those tokens, refine the tokens during that build, lock them after. Then the next six artifacts inherit a fixed design system, and reconciliation becomes a verification step rather than a restructuring project.
This is the most concrete piece of advice we can offer faculty considering similar work. Almost everything else in this artifact describes work that turned out well. This is the one part of the process we would change.
Closing
What this can and can't tell you
A faculty member reading this artifact in order to decide whether to attempt similar work deserves to know what's been demonstrated and what hasn't.
What this artifact has shown: that AI-assisted educational content can be built by a single faculty member without a software-development background, in a relatively short period of intensive work, to a quality that we (the author and the AI collaborator) judged worth publishing. That the process is not magical — it involves real errors that require real iteration, and the value of the collaboration depends on the human partner being willing to push back rather than accept convenient solutions. That the substantive intellectual work — choosing what an artifact should be about, deciding which complications to surface, catching when a claim has overreached the evidence — remains the human's. The AI is fast, fluent, and useful. It is not a substitute for editorial judgment.
What this artifact has not shown: that this approach scales beyond the conditions described here. The author had concentrated free time across two weeks, considerable motivation, and a specific community of practice that gave the work shape and stakes. Faculty considering similar work in different conditions — with split attention across the semester, without an immediate audience, with collaborative authorship rather than single authorship — will face problems we didn't have to solve. Some of what worked here might not transfer.
The artifacts in the series are also bounded in ways worth naming. They were built for a specific audience — UNM faculty in the Arts and Sciences, with a particular language composition and a particular set of questions about AI. They are not general-purpose introductions to large language models. They are one faculty member's attempt to give a particular community a way in. Other communities will need other artifacts.
What we would say to a faculty member considering this work: the value is real. The cost is also real — measured in evening hours, in attention, in the willingness to read every output carefully and catch what the AI got wrong. The combination of those two facts is what this artifact has tried to make visible. If you read this and think the cost is worth it for what you want to build, the practical advice scattered through the seven sections and the coda is offered without proprietary feeling. The artifacts themselves are the demonstration; the methodology is meant to be borrowed from.