the weights.
The weights are the substance.
A language model is, mechanically, a long table of numbers. DistilGPT2 holds 82 million of them — a third smaller than GPT-2's 124 million, distilled from it the way a spirit is distilled from wine. The numbers were arrived at by gradient descent over forty gigabytes of internet text, with the full GPT-2 standing over the training as teacher. They will never be arrived at again in exactly that arrangement; the run was stochastic, the data has shifted. What you load below is a fossil of that single passage through the corpus.
Until 2023, running a model like this in a web browser was not really a thing. It required a server, a GPU, an account, an API key, a billing relationship. Then ONNX Runtime got a WebAssembly target, the Hugging Face team published transformers.js, and the door opened: download the weights once, cache them in the browser, do the multiplications locally. No request leaves your machine. The cost of one inference is the cost of the electricity you used to run it.
The model is on your computer now. It was not, before you scrolled.
What happens between the click and the word.
Each token you see appear above is the result of: a forward pass through six attention blocks, a softmax over 50,257 logits, a sample drawn from the resulting distribution, a decode back to bytes, an append to the running sequence. Every new token requires another forward pass. The model has no memory between calls; the entire prefix is in play each time (in practice the attention keys and values of earlier tokens are cached, so only the newest token is recomputed, but conceptually the whole sequence is re-read). What feels like a sentence being thought is a sentence being re-thought, one word longer, fifty times in a row.
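That loop can be sketched in a few lines of plain JavaScript. To be clear about what is and is not real here: `forward`, `promptIds`, and the toy logit arrays are stand-ins invented for illustration — a real forward pass through the six blocks would return 50,257 logits, and none of this is the transformers.js API. But the shape of the loop — softmax, sample, append, repeat — is the shape described above.

```javascript
// Turn raw logits into a probability distribution.
// Temperature below 1 sharpens it; above 1 flattens it.
function softmax(logits, temperature = 1.0) {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Draw one token id from the distribution: walk the probabilities
// until the running total passes a uniform random draw.
function sample(probs, rand = Math.random) {
  let r = rand();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}

// The autoregressive loop. `forward` is a hypothetical stand-in for
// the model's forward pass: it takes the whole token sequence so far
// and returns one logit per vocabulary entry. One full pass per token.
function generate(forward, promptIds, steps, temperature = 1.0) {
  const ids = [...promptIds];
  for (let t = 0; t < steps; t++) {
    const logits = forward(ids);
    const probs = softmax(logits, temperature);
    ids.push(sample(probs)); // the sequence grows by one, then we go again
  }
  return ids;
}
```

With a toy `forward` over a two-token vocabulary — say `(ids) => [0, 10]` — the loop dutifully appends the favored token, step after step. Nothing about the machinery changes when the vocabulary is 50,257 wide; only the table of numbers does.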
Small, sober, present.
DistilGPT2 is not a smart model. It will rhyme nonsense, contradict itself, drift into fragments. It is here as a specimen — a thing small enough to hold in the hand of a webpage, large enough to be unmistakably the same shape as the things that have changed the world. The architecture under the hood is the architecture under GPT-4, Claude, Gemini, Llama. Wider, deeper, trained on more, but the same diagram.
What changes when inference is local: privacy stops being a policy and becomes a property. The page you are reading does not know what you typed into the box. There is no log on a server somewhere. There is also no moat — anyone can do this now, on a laptop, with a few megabytes of JavaScript and a model file from Hugging Face. The interesting question is no longer can it run. It is what is worth running.
The cost of one inference is the cost of the electricity you used to run it.