A transformer reads everything at once
The transformer's one real trick is reading every token at once and letting each decide what matters. We put the whole machine on the bench — embeddings, positions, the residual stream, the feed-forward step — and work out why reading everything at once was such a departure, and why something so architecturally dull keeps getting smarter the more we feed it. With an interactive animation for every piece.
Here's a claim worth testing as you read: the transformer has no real idea what a sentence is. It never reads left to right, never builds meaning up word by word the way you're doing right now. It looks at the whole thing at once, lets every word quietly ask every other word "do you matter to me?", and then repeats that a few dozen times. That's very nearly the whole idea. The part still worth being surprised by is that it turned out to be enough.
A couple of weeks ago I wrote about attention itself: the queries, keys and values, the softmax, the little √dₖ scaling that keeps it trainable. This post zooms out. I want to put the whole machine on the bench, work out why reading everything at once was such a departure from what came before, and why something so architecturally dull keeps getting smarter the more we feed it. We'll build it up one piece at a time, and there's a small animation for each piece so you can poke at it rather than take my word for it.
The thing that was actually new
Before 2017, the obvious way to handle a sentence was to read it in order. A recurrent network (the RNN and its smarter cousin the LSTM) keeps a running summary in a hidden state and updates it one token at a time, passing the state along like a baton. It works. But it has two problems that turn out to be the same problem wearing different hats.
The first is speed. Because token t depends on the state from token t−1, you cannot compute them in parallel; you're stuck walking the sentence end to end. The second is distance. For word 1 to influence word 50, its signal has to survive 49 hops down the chain, and in practice it fades. The path between two tokens is as long as the gap between them.
The transformer throws the baton away. Every token looks at every other token directly, in a single step, so the longest path between any two of them is one hop instead of n. And with no baton to pass, all the positions can be computed at the same time, which happens to be exactly the dense matrix multiply a GPU does best. Press play and watch the difference: the RNN is still crawling along its sentence while the transformer is already done.
That's the trade in one picture. Slide the length up and the RNN's step count climbs with it while the transformer stays at one sequential step, whether the sentence is three words or three thousand. The total work still grows with length, but none of it has to wait its turn.
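To make the contrast concrete, here is a minimal NumPy sketch. It is a toy under stated assumptions: the sizes and the weight matrices `W_h` and `W_x` are made up, and the attention half uses the token vectors directly as queries, keys and values, with no learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # toy sizes: 6 tokens, 4-dim vectors
X = rng.normal(size=(n, d))      # one row per token

# Recurrent style: each state needs the previous one, so n steps in strict order.
W_h = rng.normal(size=(d, d)) * 0.1   # hypothetical recurrent weights
W_x = rng.normal(size=(d, d)) * 0.1   # hypothetical input weights
h = np.zeros(d)
sequential_steps = 0
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)
    sequential_steps += 1

# Attention style: all pairwise scores in one matrix multiply, no loop over time.
scores = X @ X.T / np.sqrt(d)                      # (n, n): every token vs every token
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)      # softmax over each row
mixed = weights @ X                                # (n, d): every position updated at once

print(sequential_steps)   # equals n, so it grows with the sentence
print(scores.shape)       # (n, n)
```

The loop is the baton; the matrix multiply is what replaces it. Note the `(n, n)` score matrix: the work has not vanished, it has just stopped being sequential.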
What they gave up to get this
Reading in order isn't only a cost; it's also a hint. An RNN gets the sequence for free, baked into the shape of the computation. Attention throws that hint away to win the parallelism, so now the model genuinely cannot tell "the cat sat" from "sat cat the": to a bag of weighted averages they're identical. The rest of this post is, in a sense, the price of that decision: we have to hand order back to the model on purpose, and we have to give it somewhere to do its thinking now that the hidden state is gone.
First, words have to become numbers
A transformer never sees letters. Text is chopped into tokens (roughly words, sometimes word-pieces) and each token is looked up in a big table to get a vector, a list of a few thousand numbers. That's the only thing the model reads: a stack of vectors, one per token.
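The lookup really is just row indexing. A minimal sketch, assuming a hypothetical three-word vocabulary and a random table standing in for a learned one:

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}       # hypothetical three-word vocabulary
d_model = 8                                   # toy width; real models are far wider
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in training; random here

token_ids = [vocab[w] for w in "the cat sat".split()]
X = embedding_table[token_ids]                # row lookup: one vector per token
print(X.shape)                                # (3, 8)
```

Everything downstream operates on `X`; the letters are gone after this line.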
The interesting bit is what those vectors learn to mean. Nobody tells the model that a cat is like a dog. But because cats and dogs show up in similar sentences, training nudges their vectors close together, and words used differently drift apart. Meaning becomes geometry. Hover the map below; you'll find the animals huddled in one corner, the numbers in another, the royalty off on their own.
The famous party trick is that directions mean things too. The arrow from man to king is very nearly the same arrow as the one from woman to queen: somewhere in the space there's a direction that means "royalty", and you can walk along it. Hit the button and watch the parallelogram close. Nobody designed that. It falls out of predicting the next word, which I still think is a little bit magic.
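The arithmetic behind the parallelogram can be sketched with hand-built vectors. These are not learned embeddings: the three axes and every number below are invented so the geometry is easy to see, and real embedding spaces only approximate this.

```python
import numpy as np

# Hand-built toy vectors, NOT learned embeddings. Axes: [royalty, gender, animal].
vectors = {
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "cat":   np.array([0.0,  0.0, 1.0]),
    "dog":   np.array([0.0,  0.1, 1.0]),
}

def nearest(target, exclude):
    """Return the word whose vector has the highest cosine similarity to target."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

royalty = vectors["king"] - vectors["man"]           # the "arrow" from man to king
guess = nearest(vectors["woman"] + royalty, exclude={"man", "woman", "king"})
print(guess)   # queen
```

Excluding the three input words from the search is the usual convention for this trick; without it the nearest neighbour is often one of the inputs.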
Then order has to be bolted back on
Now the awkward consequence of the callout above. Attention treats its inputs as a set: shuffle the token vectors and you get the same output vectors back, shuffled the same way. For a model of language that's useless, since "dog bites man" and "man bites dog" had better come out different.
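You can check the shuffle property directly. A small sketch, assuming the simplest possible self-attention (no learned weights, no position information) and random vectors standing in for the three words:

```python
import numpy as np

def self_attention(X):
    """Plain self-attention with no learned weights and no positions (a simplification)."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))          # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]                     # reorder to "man", "bites", "dog"

out = self_attention(X)
out_shuffled = self_attention(X[perm])
print(np.allclose(out_shuffled, out[perm]))   # True
```

Each word gets exactly the same output vector in both orders, so nothing downstream could tell the two sentences apart.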
So we stamp each position with a fingerprint and add it to the token's vector before anything else happens. The original idea was a set of sine and cosine waves at different frequencies; each position gets a unique pattern, and, handily, nearby positions get similar patterns, so "close together" is something the model can read straight off the sum. Drag the slider and watch one position's fingerprint light up.
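Here is a short sketch of the sine-and-cosine fingerprint, following the formula in Vaswani et al. (in the reading list). The function name and the toy sizes are my own choices for illustration.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Fixed sine/cosine fingerprints; d_model is assumed to be even."""
    pos = np.arange(n_positions)[:, None]            # (n, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d/2)
    angles = pos / (10000 ** (2 * i / d_model))      # one frequency per pair of dims
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims: cosine
    return pe

pe = sinusoidal_positions(n_positions=50, d_model=64)
near = pe[10] @ pe[11]     # neighbouring positions
far = pe[10] @ pe[40]      # positions 30 apart
print(near > far)          # neighbours have the more similar fingerprints
```

In use, the first `n` rows of `pe` are simply added to the `n` token vectors before the first block.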
Modern models mostly use a slicker version called RoPE, which rotates pairs of dimensions by an angle proportional to position instead of adding waves. Flip to the RoPE view: the lovely property is that the angle between two tokens depends only on how far apart they are, not where they sit in the sentence. That relative view of position is a big part of why today's models can be stretched to documents longer than the ones they trained on, usually with some rescaling of the rotation speeds, without falling over.
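The distance-only property is easy to verify numerically. A minimal sketch of the rotation, assuming the common base of 10000 and a single 8-dimensional query and key; the function name `rope` is just a label for this example.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by angles proportional to position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # one rotation speed per pair
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (2 positions apart), very different absolute positions.
score_early = rope(q, 3) @ rope(k, 1)
score_late = rope(q, 103) @ rope(k, 101)
print(np.isclose(score_early, score_late))   # True
```

The attention score between the rotated query and key is the same at positions (3, 1) and (103, 101), because only the difference of the two rotation angles survives the dot product.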
Attention: the one place tokens talk
I won't re-derive attention here, since the earlier post does that properly. The one-line version is below for completeness, and then I want to make a single point about where it sits in the machine, because I think that's the part people miss.
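For reference, the standard scaled dot-product form from Vaswani et al., assuming the usual notation: Q, K and V are the query, key and value matrices, and d_k is the key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax runs over each row of the score matrix, so every token ends up with a set of weights over all the tokens that sums to one.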
Here's the live version: each token is a node, and the weights are the flowing connections. The pronoun has to find its noun; a noun wants to be found. Watch which tokens pull hardest on which.
Why that framing helps
Once you see attention as the only mixing step, a lot of the engineering makes sense. The reason long contexts are expensive is that this one step compares every token to every other, so its cost grows with the square of the length. The reason the KV cache exists, and why so much effort goes into shrinking it, is that this is the only step that has to remember the whole past. Find the one quadratic operation and you've found where all the bodies are buried.
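A back-of-the-envelope sketch of both costs. The model shape below is hypothetical, picked only to make the arithmetic concrete, and the two helper functions are names I've made up for this example.

```python
def attention_score_entries(seq_len):
    """Number of token-to-token comparisons in one attention head."""
    return seq_len * seq_len

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values (the factor 2) stored for every layer, head and past token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, not taken from any particular model.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n in (1_000, 10_000):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(n, attention_score_entries(n), round(gib, 2))
```

Ten times the tokens means a hundred times the score entries but only ten times the cache: the quadratic part is the comparing, the linear part is the remembering.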
The residual stream is the real architecture
If attention gets all the headlines, the part that actually makes the thing trainable is quieter, and it's the piece I most wish someone had drawn for me early on. Picture each token riding its own lane straight up through the network. That lane is the residual stream, and the crucial rule is that a layer never overwrites it. A layer reads the current vector, computes something, and adds the result back. One transformer block is just two of those read-and-add steps in a row:
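Written out, with x standing for the residual stream, the two steps look roughly like this. It is a simplified form: practical models also apply a normalisation step around each sublayer, which this post doesn't cover and which is left out here.

```latex
\begin{aligned}
x &\leftarrow x + \mathrm{Attention}(x) \\
x &\leftarrow x + \mathrm{FFN}(x)
\end{aligned}
```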
Drag the depth slider to stack more blocks, and use the toggle to watch each sublayer do its job: attention reaching across the lanes, the feed-forward step working down each lane on its own.
Why adding, not replacing, is the whole trick
Because every block adds to the stream rather than rebuilding it, there's an unbroken additive path from the very first embedding all the way to the output. Gradients flow straight back down that path without having to squeeze through every layer's nonlinearity, which is what lets you stack ninety-something blocks and still train the thing. It also means a block can choose to do almost nothing (add a near-zero vector) and politely get out of the way, which early layers often do. The residual stream is less a pipe and more a shared notepad: each block reads it, scribbles an edit in the margin, and passes it up.
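The "politely get out of the way" behaviour can be shown in a few lines. This is a toy sketch, not a real model: attention has no learned query or key weights, normalisation is omitted, and `out_scale` is a made-up knob that shrinks whatever each sublayer writes back.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_hidden, depth = 5, 8, 32, 4      # toy sizes, not from any real model

def attention(x):
    """Weight-free self-attention: mixes information across tokens (lanes)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ x

def make_block(out_scale):
    """Build one block; out_scale shrinks what each sublayer writes to the stream."""
    W_o = rng.normal(size=(d, d)) * out_scale
    W_1 = rng.normal(size=(d, d_hidden)) * 0.1
    W_2 = rng.normal(size=(d_hidden, d)) * out_scale

    def block(x):
        x = x + attention(x) @ W_o                 # read, compute, add back
        x = x + np.maximum(x @ W_1, 0.0) @ W_2     # same again, per token
        return x

    return block

x0 = rng.normal(size=(n, d))

# With out_scale = 0 every block writes nothing, so the stream passes straight through.
x = x0
for block in [make_block(0.0) for _ in range(depth)]:
    x = block(x)
print(np.allclose(x, x0))   # True
```

Swap `0.0` for a small positive number and each block nudges the stream a little instead. Either way, the original embedding is still in there underneath the edits.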
Where the knowledge actually lives
So attention moves information sideways between tokens. The other half of each block, the feed-forward network, is where a token sits and thinks about what it just gathered. It's almost embarrassingly plain: blow the vector up into a much wider space, apply a nonlinearity that decides which of those wide features switch on, then squash it back down.
Slide the input and watch which hidden units light up. Different inputs fire different combinations, which is the rough mechanism by which facts get stored and retrieved.
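The expand, switch, squash recipe fits in one function. A minimal sketch with random weights, assuming a ReLU nonlinearity, a 4x widening and no bias terms; real models vary on all three.

```python
import numpy as np

def feed_forward(x, W_up, W_down):
    """Expand, gate with a nonlinearity, project back. Applied to each token separately."""
    hidden = np.maximum(x @ W_up, 0.0)    # ReLU: which wide features switch on
    return hidden @ W_down, hidden

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 16                 # the 4x widening is a common choice, assumed here
W_up = rng.normal(size=(d_model, d_hidden))
W_down = rng.normal(size=(d_hidden, d_model))

x = rng.normal(size=(2, d_model))         # two different token vectors
out, hidden = feed_forward(x, W_up, W_down)
print(out.shape)                          # (2, 4)
print((hidden > 0).astype(int))           # each row: which hidden units fired
```

The two rows of ones and zeros are the code-level version of the lit-up units: different inputs, different patterns.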
It looks like the least clever part of the model, and yet roughly two-thirds of the parameters live in these two matrices. When people talk about a model "knowing" that Paris is in France, the best current guess is that the knowing is in here, distributed across which hidden units a "Paris"-ish vector switches on. Attention fetches the right context; the feed-forward step is the lookup table that turns context into content.
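The two-thirds figure is quick to check by counting weights. This sketch assumes the standard layout (four square attention matrices, a feed-forward layer four times wider than the stream) and ignores biases, normalisation and the embedding table.

```python
def block_parameter_split(d_model, expansion=4):
    """Weight counts for one block, ignoring biases and normalisation parameters."""
    attention = 4 * d_model * d_model                  # W_q, W_k, W_v, W_o
    feed_forward = 2 * d_model * expansion * d_model   # up- and down-projection
    return attention, feed_forward

attn, ffn = block_parameter_split(d_model=4096)       # width chosen for illustration
print(ffn / (attn + ffn))                              # about 0.667
```

The ratio doesn't depend on `d_model` at all, only on the expansion factor, which is why the rule of thumb holds across model sizes that share this layout.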
Stack it, then make it enormous
And that's the machine. Embed the tokens, add their positions, then repeat one block (attention to mix, feed-forward to think, both added back to the stream) some number of times, and read a next-token guess off the top. Honestly, written out like that it's a bit of an anticlimax. There's no reasoning engine, no logic module, no place where the cleverness obviously lives. It's the same dull block, over and over.
Which makes the next fact the genuinely strange one. If you take that dull block and make it bigger (more blocks, wider vectors, more data) the loss doesn't plateau where you'd expect. It keeps falling, smoothly, as a power law:
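One common way to write it is the form Kaplan et al. (in the reading list) report for model size, where L is the loss, N the number of parameters, and N_c and α_N are constants fitted to experiments:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
\qquad\Longleftrightarrow\qquad
\log L \approx \alpha_N \log N_c - \alpha_N \log N
```

The second form is the useful one: log-loss is linear in log-size, with slope −α_N.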
A straight line on a log-log plot is a quietly astonishing thing to find in a system this complicated. It says: build it ten times bigger and you'll get a predictable amount better, and we mostly don't know where it stops. Somewhere along that line the model stops merely finishing your sentences and starts doing arithmetic you never trained it on, or following an instruction it's seeing for the first time in the prompt. That last one, in-context learning, still doesn't have a tidy explanation, and it's the bit I find hardest to be blasé about.
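A few lines of NumPy show what "a predictable amount better" means. The constants here are made up for illustration and are not fitted to any real model.

```python
import numpy as np

# Made-up constants for illustration only; not fitted to any real model.
alpha, scale = 0.08, 1e13

def loss(n_params):
    return (scale / n_params) ** alpha

sizes = np.array([1e7, 1e8, 1e9, 1e10])
slope, intercept = np.polyfit(np.log10(sizes), np.log10(loss(sizes)), deg=1)

print(round(slope, 3))            # -0.08: the slope of the line is -alpha
print(loss(1e9) / loss(1e8))      # about 0.83, and the same for every 10x step
```

Every tenfold increase in size multiplies the loss by the same factor, which is exactly what a straight line on log-log axes is saying.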
The bitter lesson
Rich Sutton has an essay I end up re-reading every few months, called The Bitter Lesson. The short version: over and over in AI history, the clever hand-built method that encodes what we know about the problem eventually loses to the dumb general method that just scales with compute. The transformer is the cleanest example yet. We didn't teach it grammar or facts or reasoning. We built something that reads everything at once, made it differentiable end to end so it could be trained at scale, and then made it very, very large. The architecture's job, it turns out, was mostly to get out of the way of the scaling.
Some food for thought, and I genuinely don't know the answer: a model trained only to guess the next token, with no notion of truth or intent, nonetheless has to model whatever produced the text to do that well, and the text was produced by people who do have truth and intent. Does it follow that a good enough next-word guesser ends up with something worth calling understanding, or is that just us seeing faces in clouds? I lean somewhere in the middle on a good day. Worth chewing on, anyway.
From a vector back to a word
One loose end. The top of the stack gives you, per position, a rich vector. To actually say something, the model turns that into a score for every word in its vocabulary and samples one. That's the same softmax from the attention equation, now with a temperature knob, and I covered the sampling knobs in the earlier post, so here's the toy to close the loop:
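In code, the temperature knob is a single division before the softmax. A minimal sketch, assuming a three-word vocabulary with made-up scores; `sample_token` is a name invented for this example.

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Turn per-word scores into probabilities and draw one word index."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature   # temperature must be > 0
    scaled -= scaled.max()                                    # for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.1]                  # hypothetical scores for a 3-word vocabulary
_, cold = sample_token(logits, temperature=0.2)
_, hot = sample_token(logits, temperature=2.0)
print(cold.round(3))   # nearly all the probability on the top word
print(hot.round(3))    # probability spread much more evenly
```

Dividing by a small temperature stretches the gaps between scores, so the softmax becomes almost a hard maximum; dividing by a large one flattens them.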
Turn the temperature down and it plays it safe; turn it up and it gets adventurous and occasionally silly. Then the chosen token gets appended to the input and the whole thing runs again for the next one. That's all "generation" is: this machine, in a loop, one word at a time, which is a funny ending given we started by celebrating that it reads everything at once.
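The loop itself is short enough to write out. In this sketch `next_token_logits` is a stand-in for the entire transformer: a toy rule over a hypothetical five-word vocabulary, there only so the loop has something to call.

```python
import numpy as np

vocab = ["the", "cat", "sat", "down", "."]     # hypothetical toy vocabulary

def next_token_logits(token_ids):
    """Stand-in for the whole transformer: any function from a prefix to vocabulary scores.
    This toy version just favours the word after the last one; it is not a real model."""
    logits = np.zeros(len(vocab))
    logits[(token_ids[-1] + 1) % len(vocab)] = 5.0
    return logits

def generate(prompt_ids, n_new, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = next_token_logits(ids) / temperature      # re-read the whole prefix
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        ids.append(int(rng.choice(len(vocab), p=probs)))   # append, then go round again
    return ids

ids = generate([0], n_new=4, temperature=0.1)
print(" ".join(vocab[i] for i in ids))   # the cat sat down .
```

Inside each pass the model sees the whole prefix at once; across passes it is as sequential as any RNN. Both halves of that sentence are true at the same time.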
So, the whole thing in plain words
A transformer turns words into vectors, tells each vector where it sits, then refines them by letting them look at each other (attention) and think on their own (feed-forward), over and over, adding each refinement to a shared running total. Stack that a few dozen times, make it enormous, and a machine whose only goal is guessing the next word turns out to be able to write code, hold a conversation, and surprise the people who built it. The architecture is simple on purpose. The behaviour is not, and we're still working out why.
Next time I want to look at the other half of the modern AI story, the image generators, which pull off something that sounds impossible when you first hear it: they paint by removing noise that was never there. Different trick, same flavour of "surely that can't work, and yet". Watch this space.
Reading further
- Vaswani et al., Attention Is All You Need (2017): the paper that dropped recurrence and started all of this. Short, and still the canonical reference. arXiv:1706.03762
- Elhage et al., A Mathematical Framework for Transformer Circuits (2021): where the residual-stream-as-shared-notepad picture is made precise. If this post's framing clicked, read this next. transformer-circuits.pub
- Su et al., RoFormer: Rotary Position Embedding (2021): the RoPE trick in the positional-encoding section. arXiv:2104.09864
- Kaplan et al., Scaling Laws for Neural Language Models (2020) and Hoffmann et al., Chinchilla (2022): where the straight line on the log-log plot comes from, and how to spend a compute budget. arXiv:2001.08361, arXiv:2203.15556
- Sutton, The Bitter Lesson (2019): two pages, no equations, and it'll change how you read every result above. incompleteideas.net
- Alammar, The Illustrated Transformer: the diagrams most people picture when they say "attention". A gentler visual companion to all of this. jalammar.github.io