The Weight of Words — A Model Scale Explorer

Section 1

Something is wrong with this reading

Two models. The same passage. Radically different understandings of what a person is doing with language.

Below is the opening of a short story. Read it once, then look at what two different models make of it.

Passage

"I want to be clear that I never meant for any of it to happen the way it did. Sarah had been difficult for months — anyone who knew her would tell you that — and when I found out about the messages, I did what any reasonable person would have done. I simply made sure she understood the consequences. There's no need to go into the details."

qwen3:0.6b 600 million parameters

The narrator is not telling us the specific details of the messages themselves, which reveals a restraint in their character. By omitting the content of the messages, they are not fully sharing the context of the situation, which could indicate their character's focus on simplicity, emotional detachment, or a desire to avoid responsibility.

This omission suggests that the narrator may be prioritizing their own perspective or the consequences without fully engaging with the details, revealing a character who is either cautious or unwilling to delve into specifics.

qwen3:14b 14 billion parameters

The narrator's language is a masterclass in self-exculpation. Several moves are happening simultaneously: pre-emptive denial of intent ("I never meant"), external attribution ("Sarah had been difficult"), appeal to social consensus ("anyone who knew her"), and invocation of reasonableness to justify unnamed action.

The critical omission is what "made sure she understood the consequences" actually meant. The phrase does real work: it sounds measured, even responsible — but it forecloses the detail that would let us evaluate whether it was. The narrator isn't just withholding information. They're performing the role of someone who has nothing to hide.

The 0.6B model notices an omission. That's not nothing. But it reads the narrator charitably — "cautious," "emotionally detached" — as if the narrator might be a private person rather than someone actively managing a story. The 14B model catches the architecture of what the narrator is doing: the specific rhetorical moves, the way the language itself is doing concealment work.

This difference isn't a matter of opinion or style. It's a reading that requires holding the narrator's framing at arm's length while simultaneously analyzing the structure of that framing. The smaller model can't sustain that double perspective.

What causes this gap? The short answer is parameter count — the number of tunable values stored in the model after training. This artifact is about what that number means, what it actually changes, and where it stops mattering.

Dig Deeper The 1.7b and 4b responses — the gradient in between ›

The 1.7B model takes a step toward the 14B reading: it notices the narrator is "avoiding accountability" and flags a "defensive personality." But it misreads the emotional direction — it interprets the narrator as protecting Sarah, rather than managing their own narrative. The gesture toward subtext is there; the precision isn't.

The 4B model is noticeably sharper. It identifies that "any reasonable person would have done" is doing rhetorical work — that the claim to reasonableness is itself suspicious when the details are withheld. Its thinking trace catches the coercive implication of "made sure she understood the consequences." This is a qualitative jump, not just more words.

The gradient here is real: each size tier adds something to the analysis, not just length. The 0.6B response isn't a shorter version of the 14B response — it's a different kind of reading, one that doesn't model the narrator as a rhetorical agent at all.

Section 2

The accumulated record of training

Parameters are not rules, and they're not stored facts. They're the numerical residue of exposure to vast amounts of text — tuned, iteratively, to improve at prediction.

If you've worked through earlier artifacts in this series, you've already met parameters by name — but without a focused account of what the number actually means at scale. In the embeddings artifact, the 400 million values in an embedding table are parameters — each one a small number, tuned during training, that encodes a word's relationship to every other word. In the attention artifact, the query, key, and value weight matrices are parameters. Every layer of every transformer is dense with them. What this artifact adds is the question of scale: what changes when you go from millions to billions, and what those numbers mean in practice.

A parameter is simply a number — a floating-point value, typically 16 or 32 bits. What makes it significant is that it was learned: initialized randomly, then adjusted billions of times through a training process that compared the model's predictions against actual text. When someone says a model has "7 billion parameters," they mean 7 billion such numbers, collectively encoding whatever the model has internalized about language, knowledge, and reasoning.

Where the numbers live

The parameters of a transformer model are distributed across several components. This rough breakdown is for a typical mid-sized dense model:

Attention layers

~38% of total

Feed-forward layers

~50% of total

Embedding table

~5–10% of total

Layer norms, etc.

<2% of total

The feed-forward layers — which process each token after the attention mechanism runs — contain roughly half the parameters in most architectures. What exactly these layers are computing, and why they dominate the parameter budget, turns out to be a more open question than one might expect.

Dig Deeper What feed-forward layers actually do ›

In a transformer block, each token passes through two main computations. The attention mechanism — covered in the attention explainer — lets tokens gather information from other tokens in the sequence. The feed-forward layer then processes each token individually, applying the same small two-layer neural network to every position independently.

Structurally, a feed-forward layer is remarkably simple: project the token's representation up to a higher-dimensional space (typically 4× the model's hidden dimension), apply a nonlinearity, then project back down. Two weight matrices, one activation function. That's it. But the up-projection is where most of the parameters live: doubling the hidden dimension quadruples the feed-forward parameter count, which is why these layers dominate the parameter budget.

What the feed-forward layers compute is harder to characterize. Unlike attention, which has a readable interpretation (tokens weighting other tokens by relevance), feed-forward activations don't have an obvious semantic structure. Recent interpretability research — including work on "superposition" by Anthropic's interpretability team¹ — suggests these layers act as high-capacity memory stores, packing many overlapping features into the available parameter space. Specific neurons have been identified that activate for specific concepts (a "French language" neuron, a "code syntax" neuron), but most neurons appear to participate in multiple features simultaneously rather than cleanly representing any one thing.

This is part of what makes "what the model knows" so hard to read off its parameters. Attention gives us the input-pattern interpretation; feed-forward gives us the storage-and-retrieval interpretation; but the clean mechanical account of how specific capabilities emerge from specific parameter values remains largely out of reach.

A decade of scale: the numbers in context

The history of large language models is largely a history of parameter count increasing by orders of magnitude over a surprisingly short period.

2017

65 million parameters

Original Transformer (Vaswani et al.)

The architecture that started it all. Trained for machine translation. Its parameter count now looks vanishingly small.

2018

110–340 million parameters

BERT (Google)

Introduced bidirectional pretraining. Demonstrated that a large pretrained model, fine-tuned on a task, outperformed specialist systems trained from scratch.

2019

1.5 billion parameters

GPT-2 (OpenAI)

A landmark: large enough to produce fluent multi-paragraph prose, small enough to download and run locally. OpenAI initially withheld the full model, citing misuse concerns — an early signal of the capability-governance tension that scales up with every generation. GPT-2 is a useful "before times" anchor: still surprising, still limited in ways you can feel immediately.

2020

175 billion parameters

GPT-3 (OpenAI)

A 100× jump from GPT-2 in a single year. In-context learning — giving the model examples in the prompt rather than retraining it — emerged as a capability at this scale. The pace of this jump established that scaling was a viable research strategy, not just an incremental improvement.

2022

70 billion parameters (compute-optimal)

Chinchilla (DeepMind)

Not the biggest model — the most carefully trained. DeepMind showed that most large models, including GPT-3, were trained on too little data for their size. A 70B model trained on 4× more tokens outperformed much larger models. This reframed the question from "how many parameters?" to "parameters trained on how much data?" — a more honest accounting of what makes a model capable.

2023–26

Unknown / not disclosed

Frontier models (GPT-5, Claude, Gemini…)

At the current frontier, parameter counts are mostly trade secrets. OpenAI released GPT-5 in August 2025 and GPT-5.2 in December 2025; neither came with architectural disclosure. Many leading models use Mixture-of-Experts (MoE) designs in which "total parameters" and "parameters active per token" are different numbers — further complicating any headline figure. The metric that once defined the field has become opaque — itself a sign of how competitive and concentrated frontier AI development has become.

Dig Deeper Mixture-of-Experts: when "billion parameters" gets complicated ›

Dense models activate all their parameters for every token. Mixture-of-Experts (MoE) models instead maintain multiple specialized "expert" sub-networks and route each token to only a subset of them — typically 2 out of 8 or 16 experts per layer.

This means an MoE model might have 100 billion total parameters but only use 20 billion for any given token. The total parameter count governs memory requirements; the active parameter count governs compute cost and, partly, quality per inference. Both numbers matter, and they're quite different.

The practical implication: headline parameter counts for MoE models are somewhat misleading as capability comparisons. A 100B MoE model is not necessarily better than a 70B dense model — it depends heavily on the routing architecture, training quality, and what task you're running.

Mixtral (Mistral AI), DeepSeek R1, and likely GPT-5 use MoE architectures. DeepSeek R1 is a public example where the numbers are known: 671 billion total parameters, 37 billion active per token. This distinction — total vs. active — is rarely disclosed clearly for closed models, which makes comparing parameter counts across model families increasingly unreliable as a capability proxy.

Dig Deeper Scaling laws: what the math says about parameters, data, and capability ›

In 2020, researchers at OpenAI found that model loss — how well a model predicts held-out text — follows a remarkably smooth power law as a function of model size, dataset size, and compute budget. These "scaling laws" suggested that bigger models trained on more data would reliably improve, without any obvious ceiling in sight.

The Chinchilla paper (Hoffmann et al., 2022) refined this: for a fixed compute budget, the optimal strategy is to train a smaller model for longer, rather than training a very large model on relatively little data. Most pre-Chinchilla frontier models had been doing it wrong — prioritizing parameter count over training token count.

The practical consequence: Chinchilla-70B, trained on 1.4 trillion tokens, outperformed GPT-3 at 175B parameters trained on ~300 billion tokens. The lesson is that parameter count and training data are jointly what determine capability — neither alone is sufficient.

Scaling laws have since become more contested in their interpretation. They describe smooth improvement in loss — prediction accuracy on text — but loss doesn't map cleanly to task performance, which can look discontinuous even when loss is smooth. This is directly relevant to the "emergent capabilities" debate explored in Section 3.

Dig Deeper Memory and compute: what parameter count means in practice ›

Parameters have to live somewhere. Each parameter in a standard float16 model occupies 2 bytes. A 7B parameter model therefore requires a minimum of 14 GB of memory just to store the weights — before accounting for the computational overhead of inference (the key-value cache, activations, etc.), which can add another 20–50%.

This is why the table in Section 5 shows specific hardware tiers. A consumer GPU with 8 GB VRAM cannot run a 7B model in full precision — but can run it in 4-bit quantization, which reduces each parameter to 4 bits (0.5 bytes), bringing the minimum to ~3.5 GB. Quantization trades some quality for accessibility; at 4-bit, modern quantization techniques (GGUF, GPTQ) preserve most of the capability.

The math: params × bytes_per_param = minimum VRAM. For a 14B model at 4-bit: 14 × 10⁹ × 0.5 = 7 GB. Fits on a 8 GB GPU, barely. For a 70B model at 4-bit: ~35 GB — requires multi-GPU or high-end workstation. This is the hardware constraint that makes parameter count directly relevant to local deployment.

Section 3

The capability gradient is real

Across two very different kinds of task, the same pattern appears: smaller models produce something recognizable but limited; larger models produce something qualitatively different — not just longer, but differently structured.

All responses below were generated from the same model family (Qwen3) at five different sizes. Same architecture. Same training approach. The only variable is scale. Use the controls to explore.

Choose a task

Prompt B — Emotional inference

Read the following exchange and explain what is really going on emotionally between these two people:

A: So you got the promotion.
B: I did. They announced it this morning.
A: That's great. You must be thrilled.
B: Sure. It's good news.
A: I heard they were deciding between you and Marcus.
B: That's what I heard too.

Prompt E — Narrator analysis

"I want to be clear that I never meant for any of it to happen the way it did. Sarah had been difficult for months — anyone who knew her would tell you that — and when I found out about the messages, I did what any reasonable person would have done. I simply made sure she understood the consequences. There's no need to go into the details." After reading this opening, explain what the narrator is not telling us, and what that omission reveals about their character.

Prompt F — Policy reasoning

A small island nation has a highly educated population but almost no natural resources. Its government announces a policy: free university education for all citizens, but graduates must work in the country for at least ten years before emigrating. A critic argues this policy will backfire. Construct the strongest version of the critic's argument.

Prompt D — Conceptual comparison

Explain how a legal trial is like and unlike a scientific experiment. Develop the comparison carefully — identify where the analogy holds, where it breaks down, and what that tells us about how each institution handles uncertainty.

Choose a model

qwen3:0.6b 600 million parameters

← Use the model buttons above to step through the gradient

The 0.6B model reads a positive social interaction; everyone is happy. What it cannot do is hold the gap between what the characters say and what the exchange means — that B's flatness is communicative, that A's Marcus mention is strategic rather than casual. This requires maintaining two frames simultaneously: what is literally said, and what the saying of it reveals.

Dig Deeper Are these capabilities really "emergent"? A contested question ›

Looking at the comparison above, it's tempting to describe capabilities as "emerging" at certain scales — the 4B model can do something the 1.7B model cannot. But a 2023 paper by Schaeffer, Miranda, and Koyejo argues this appearance is partly an artifact of how we measure performance.

Their core finding: when researchers evaluate models on tasks using discontinuous metrics (correct/incorrect, pass/fail), capability appears to switch on suddenly at a certain scale. But when the same tasks are evaluated with continuous metrics that capture partial credit, the improvement looks smooth and gradual — not a phase transition, but a curve that was already progressing at smaller scales.

In plain terms: the apparent "emergence" is partly in the measurement, not only in the model. A model doesn't suddenly acquire the ability to reason at 4B parameters — it improves continuously, but the improvement only crosses a threshold of legibility at a certain scale. The underlying progress was happening all along.

This doesn't mean the differences in the comparisons above are illusory — they're real. But it suggests we should be cautious about treating scale as if it produces sudden, unpredictable capability jumps. The smoother interpretation is probably closer to the truth: scale buys continuous improvement, and tasks with high failure costs make that gradual improvement look like a cliff.

The debate remains active. What counts as a genuine emergent capability versus a measurement artifact is still contested, and the answer may differ by task type.

Section 4

Task-model fit matters as much as size

A 1.5 billion parameter model trained specifically for coding outperforms a larger general-purpose model on coding tasks. The question isn't always "how big?" — it's "trained on what?"

The Qwen model family offers an instructive natural experiment. Qwen2.5 and Qwen2.5-Coder share the same base architecture. At matched parameter counts, the only meaningful difference is training focus: the Coder variant was trained heavily on code, code-adjacent text, and programming-related tasks. Below are side-by-side outputs on a coding task and a general reasoning task, at 1.5B parameters each.

Coding task: base model vs. specialist

Prompt

Write a Python function that takes a list of dictionaries, each with keys 'name' and 'score', and returns the top N entries sorted by score descending. Include a docstring and handle edge cases.

qwen2.5:1.5b General

qwen2.5-coder:1.5b Specialist

def top_students(students_list, num):
    """
    Args:
    students_list (list): ...
    num (int): ...
    Raises:
    ValueError: If num is not
      within [1, len(students_list)].
    """
    if not (1 <= num <= len(students_list)):
        raise ValueError(...)
    sorted_students = sorted(
        students_list,
        key=lambda s: s['score'],
        reverse=True)
    return sorted_students[:num]

Works correctly. Handles the stated edge case (N out of range). Does not handle empty list or N=0. No type hints.

No type hints · Missing edge cases

from typing import List, Dict

def find_top_n_scores(
    scores: List[Dict[str, int]],
    n: int
) -> List[Dict[str, int]]:
    """
    Finds the top N entries based
    on 'score' key.
    Raises ValueError if n >
    len(scores).
    """
    if len(scores) < n:
        raise ValueError(...)
    sorted_scores = sorted(
        scores,
        key=lambda x: x['score'],
        reverse=True)
    return sorted_scores[:n]

def check_function():
    # Includes assertion-based tests

Adds type hints throughout. Includes a separate test function with assertions. Cleaner convention alignment. Output in 3.3s vs 5.8s.

✓ Type hints · Test function · Faster

Reasoning task: the advantage reverses

Prompt

A historian argues that the fall of the Roman Empire was not a collapse but a transformation. Construct the strongest version of this argument, then identify its most significant weakness.

qwen2.5:1.5b General

qwen2.5-coder:1.5b Specialist

The fall of Rome can be best understood as a period of profound transformation rather than a single catastrophic collapse. While the empire faced numerous challenges, these unfolded gradually, leading to significant changes in governance, society, economy, and culture.

Most significant weakness: The argument is overly simplistic by framing the entire period of decline as a single process without acknowledging the multiple causes and factors — economic, military, social, and political — that contributed.

✓ Identifies the actual argumentative weakness

The Roman Empire began in 27 BC with the establishment of the Julio-Claudian dynasty and lasted until AD 476... Rome was home to over 80% of its population, with cities like Milan, Ravenna, and Constantinople serving as centers of power.

Most significant weakness: The argument assumes a comprehensive view but does not account for specific details that led to the decline, such as economic factors like inflation.

Historical errors · Misses the argumentative task

The Coder model's response to the reasoning task is revealing: it produces structured bullet points with plausible-sounding historical detail, but contains errors (Rome's urbanization rate was not 80%), and critically, it never actually constructs a historical argument — it summarizes context instead. This is what a model optimized for structured, procedural output does when given a task that requires interpretive reasoning: it produces the form without the substance.

Scale matters — but so does what a model was trained for, and on what.

We ran a direct cross-family test to check whether the specialization advantage survives a generational gap. Qwen2.5-Coder:1.5b was compared against Qwen3:1.7b — a newer-generation generalist with more parameters — on the same coding task. Qwen3:1.7b produced correct, well-documented code with solid edge case handling, but no type hints and no test function. The Qwen2.5-Coder:1.5b still edges it: type hints throughout, assertion-based tests, and faster output. An older-generation specialist holds its ground on its home task against a newer-generation generalist at a slightly larger size.

One caveat: Qwen3 models are designed around a thinking mode that was disabled for this comparison to match the demonstration setup. Without thinking mode, Qwen3:1.7b on the reasoning task produced a noticeably weaker response than even the Qwen2.5:1.5b base — thin and generic where the base model engaged with the argumentative structure. This likely reflects Qwen3's optimization for reasoning-with-thinking rather than a fundamental capability gap, and is worth bearing in mind when interpreting the cross-family results.

Dig Deeper Fine-tuning: how specialization is actually produced ›

The Coder variant isn't a fundamentally different model — it starts from the same base weights and undergoes additional training on a curated corpus heavy in code, documentation, commit messages, and programming Q&A. This process is called continued pretraining or domain-adaptive pretraining.

What this training does is shift the model's probability distributions toward code-relevant outputs. Given a function signature, a coding model has learned that what follows is likely a docstring, then typed arguments, then a body. A general model knows this too — but less reliably at small parameter counts, where the probability mass is spread more thinly across all domains.

Fine-tuning (training on labeled examples of desired input-output pairs) is a related but distinct process. A fine-tuned model might be trained to always produce a certain output format, or to adopt a specific persona. The distinction matters: continued pretraining shapes what the model "knows"; fine-tuning shapes how it responds.

Dig Deeper Post-training: what parameters alone don't produce ›

The comparisons in Sections 3 and 4 isolate parameter count and training focus. But much of what makes a production model useful — its tendency to be helpful rather than incoherent, to follow instructions, to decline harmful requests — comes from post-training processes that happen after the base pretraining is complete. These are covered in detail in the training explainer.

Reinforcement Learning from Human Feedback (RLHF) and related techniques (RLAIF, DPO) train a model to produce outputs that human raters prefer. This can substantially change a model's behavior without changing its parameter count at all. A base model and an instruction-tuned model built on identical weights can feel dramatically different to use.

This means "parameter count" understates the engineering complexity of what you're using when you interact with Claude or GPT-5. The parameter count describes the substrate; post-training shapes the behavior built on top of it. Scale is necessary but not sufficient — a 70B base model without post-training is much less useful than a 7B model with careful RLHF.

Section 5

What runs where — and why it matters

The gap between what fits on a laptop and what runs in a data center is a capability gap. Understanding it helps you make better choices about which tools to use for which tasks — and raises harder questions about who controls the most capable models.

What your hardware can run

Parameter count determines memory requirements, which determines what hardware is needed. The table below gives practical guidance for common hardware tiers, using 4-bit quantization (the standard approach for consumer hardware).

Hardware Tier	VRAM / RAM	Model Range	Notes
Phone / tablet (on-device AI)	4–8 GB shared	0.5b – 3b	Useful for simple autocomplete, summarization, basic Q&A. Limited reasoning depth.
Laptop, integrated GPU	8–16 GB	1b – 7b	7B models run acceptably at 4-bit. Noticeably slower than discrete GPU. Fine for many tasks.
Consumer GPU (e.g., RTX 4070)	12 GB VRAM	7b – 14b	The sweet spot for local LLM use in 2025–26. 14B models at 4-bit fit comfortably.
High-end consumer (e.g., RTX 4090 / 3090)	24 GB VRAM	14b – 32b	Can run 32B at 4-bit. Noticeably better reasoning than 14B on complex tasks.
Workstation / multi-GPU	48–80 GB+	70b+	70B models approach frontier quality on many tasks. Requires significant investment.
Cloud / data center	Effectively unlimited	100b – unknown	Where frontier models live. Parameter counts often undisclosed. Qualitatively different capability ceiling.

The specialist argument for local deployment

The Section 4 comparison points toward a practical strategy for local use: if your tasks are concentrated in a specific domain, a small specialist model may outperform a larger general model — including, in some cases, models that are both newer-generation and more parameter-rich.

The specialist advantage: confirmed across a generational gap

We tested this directly: Qwen2.5-Coder:1.5b vs. Qwen3:1.7b on the same coding task. Qwen3 is one generation newer with slightly more parameters — a generalist with more recent training. On the coding task, the older specialist still won on professional conventions: type hints, assertion-based tests, faster output. The generational advantage in newer models tends to show up most in broad reasoning; on a narrowly focused task where the specialist has deep training, the architectural novelty doesn't close the gap.

The practical implication is concrete: if you're running models locally for a coding-heavy workflow, a coding-specific model is worth the download. The same logic extends to any domain where specialist models exist — translation, summarization, structured extraction. At local deployment scales, you trade some general capability for reliable on-task performance, and at small parameter counts that tradeoff frequently favors the specialist.

Where this is heading

On-device AI is expanding. Apple, Qualcomm, and others are building neural processing hardware directly into phones and laptops specifically to run small models locally. This shift is driven partly by privacy concerns (your data never leaves the device) and partly by latency (no network round-trip). As of 2026, 3B–7B models are running on flagship phones with respectable quality.

This matters for how you think about the capability table above. The bottom rows will fill in over the next few years — devices that today run 1B models will run 7B models, and the consumer GPU tier will reach 30–70B. The qualitative experience gap between "what you can run locally" and "what the frontier looks like" may narrow substantially for many everyday tasks.

Dig Deeper Energy and water: the costs that scale doesn't advertise ›

Training a large language model consumes enormous amounts of electricity — estimates for models like GPT-4 range into the gigawatt-hour scale, though exact figures are not disclosed. Inference (running the model) is less dramatic per query but adds up at scale: millions of daily queries across data centers represent substantial continuous load.

Water is less discussed but equally real. Data centers require cooling, and many use evaporative cooling systems that consume significant water. A 2023 study estimated that training GPT-3 consumed roughly 700,000 liters of freshwater. Frontier model training today likely requires more.

The shift toward on-device AI changes where this energy consumption occurs, but probably not the total amount — it moves from data centers (visible, measurable, auditable) to distributed consumer devices (invisible, unmeasured, harder to regulate). Whether this represents a genuine reduction in environmental impact or a redistribution of it is an open question that the industry has not answered clearly.

This is worth holding alongside the "local AI is more private" framing: both claims can be true simultaneously, while neither resolves the underlying resource question.

Dig Deeper Open weights vs. closed frontier: who controls the most capable models? ›

The local deployment story — run models you control, on hardware you own — is only possible because some frontier labs release model weights publicly. Meta's Llama series, Mistral, Alibaba's Qwen family, DeepSeek, and others have released weights that anyone can download, modify, and run.

The scale of open-weight models has grown dramatically. As of early 2026, Kimi K2.5 (Moonshot AI) is a 1-trillion-parameter open-weight model; DeepSeek R1 and V3.2 are in the 671–685B range; Qwen3 235B is available directly via Ollama. The gap between the open-weight frontier and the closed frontier — which looked substantial in 2023 — has largely closed on many benchmark evaluations, though differences remain in instruction-following polish and multimodal capability.

The most capable closed models — GPT-5, Claude, Gemini — are not available as open weights. They run exclusively in vendor-controlled data centers, accessible only via API with pricing, rate limits, and usage policies set by the provider. Users have no visibility into the model's weights, cannot modify its behavior, and depend entirely on the vendor's continued access decisions.

Whether this concentration is a problem depends on what you think AI systems are — consumer software, infrastructure, or something more like a public good. That question doesn't have a settled answer, but it's being made implicitly every time a frontier model is withheld from open release — and increasingly, the open-weight ecosystem is a meaningful check on that concentration rather than a distant second.

Dig Deeper Quantization: running bigger models on smaller hardware ›

Full-precision parameters are stored as 32-bit or 16-bit floating point numbers. Quantization reduces this to 8-bit integers (Q8), 4-bit integers (Q4), or even lower. A 4-bit quantized model uses one-quarter the memory of its 16-bit original.

The tradeoff is quality: quantization introduces rounding error, and at lower bit depths, this can measurably degrade performance, particularly on tasks requiring precise numerical reasoning. In practice, well-implemented 4-bit quantization (using formats like GGUF or GPTQ) preserves most of a model's capability — the perplexity increase is typically a few percent.

A quantized larger model often outperforms a smaller full-precision model at the same memory footprint. Running a Q4 14B model (~7 GB) is generally better than running a full-precision 7B model (~14 GB) if your GPU has 8–12 GB VRAM. This is a practical optimization worth knowing.

Tools like Ollama handle quantization automatically — when you pull qwen3:14b, you're getting a quantized version that fits on consumer hardware. The quality cost is real but modest; the accessibility gain is substantial.

Final artifact in the series. This completes the seven-artifact explainer sequence. A companion how this was made artifact — with extended methodology for each piece — is planned as the next step.

References

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. (2022). Toy models of superposition. Transformer Circuits Thread. transformer-circuits.pub/2022/toy_model/index.html ↩

How this was made

Built through vibe-coding — iterative natural-language collaboration with Claude (Anthropic) generating the HTML, CSS, and JavaScript. Section 3's capability-gradient responses and Section 4's specialist comparison were produced by a Python script querying Ollama across five Qwen3 sizes (0.6B, 1.7B, 4B, 8B, 14B) and matched Qwen2.5 / Qwen2.5-Coder pairs, running on an ASUS RTX 4070 (12 GB VRAM, 32 GB system RAM). Results were saved as JSON and Markdown for review; seven candidate prompts were tested, four selected for clear legible gradients.

The Schaeffer et al. (2023) "mirage" paper cited in Section 3's emergent-capabilities dig-deeper was fetched and read during the design conversation rather than cited from memory. Parameter counts for frontier closed models remain undisclosed; claims in this artifact about them are noted as such. Data generated April 2026.

A full walkthrough of the design and build process for all artifacts in this series is available on the series process page.