If you've wondered why AI got suddenly, almost suspiciously good in the years after 2017, the answer is a single architecture with a wonderfully arrogant name. Eight researchers, most of them at Google, published a paper called "Attention Is All You Need," and almost everything you've used since (ChatGPT, Claude, Gemini, your email autocomplete, the translator on your phone, many of the image generators) runs on what they described. Transformers. Not the robots. The plumbing under the entire current wave.
I want to explain what they do without the mysticism, because understanding the mechanism is what lets you use these things without being fooled by them.
The problem they killed
Before transformers, networks that handled sequences — language, speech, time series — processed them one step at a time and struggled to hold onto context from earlier in the sequence. Picture reading a novel a word at a time while only being able to remember the last couple of sentences. You'd lose the plot constantly. That, at scale, was the previous generation of models. Researchers bolted on workarounds — LSTMs, GRUs, early attention tricks — but nothing really solved it. Anything longer than a paragraph or two stayed hard.
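To make the forgetting concrete, here is a deliberately crude sketch in Python. It is not an LSTM or any real recurrent network: `run_recurrent` and its `decay` value are made up for illustration, with a single running number standing in for the model's memory. What it does show is the shape of the problem: when every step overwrites part of the state, whatever came first fades geometrically.

```python
def run_recurrent(inputs, decay=0.5):
    """Process inputs strictly in order, carrying one running state.

    `decay` is a hypothetical constant; real recurrent networks learn
    how much of the old state to keep at each step.
    """
    state = 0.0
    for x in inputs:
        # The old context shrinks a little at every step.
        state = decay * state + x
    return state


# The first input is 1.0 and everything after it is 0.0, so the final
# state is exactly what survives of that first input.
after_3 = run_recurrent([1.0] + [0.0] * 3)
after_30 = run_recurrent([1.0] + [0.0] * 30)

print(after_3, after_30)
```

With these toy numbers, an eighth of the first input is left after three more steps. After thirty it is down to about one part in a billion.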
The idea, plainly
The trick is almost embarrassing in hindsight. Instead of crawling through the sequence step by step, a transformer looks at the whole thing at once. For every token, the model asks which other tokens matter for understanding this one, and weights them accordingly. That weighting operation is attention.
Take "the cat sat on the mat because it was warm." When the model processes "it," attention lets the model weigh "mat" and "warm" and work out what "it" refers to. It does this not by following a grammar rule somebody wrote, but by having seen billions of sentences and learned the statistical shape of how "it" gets used. Now stack that mechanism, run many attention operations in parallel, and train it on an enormous pile of text. Out the other end comes something that behaves a great deal like language understanding, built entirely from pattern-matching at scale.
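Here is that weighting step as a minimal Python sketch, under loud assumptions. The two-number vectors for "mat," "warm," and "it" are invented by hand (a real model learns vectors with hundreds or thousands of dimensions), and the same vectors play the query, key, and value roles, where a real transformer would first pass them through three learned projections. The arithmetic itself, dot products scaled by the square root of the vector size and pushed through a softmax, is the scaled dot-product attention the 2017 paper describes.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        row = softmax([dot(q, k) / scale for k in keys])
        weights.append(row)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights


# Hypothetical hand-picked vectors; a trained model would learn these.
tokens = ["mat", "warm", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = attention(vectors, vectors, vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

In this toy setup the row for "it" puts its largest weight on "mat," but only because the hand-picked vectors point the same way. In a trained model that preference comes out of the data, not out of anyone's choice.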
Why it was a genuine break, not hype
Four things, and they compounded.
It scaled — which nobody expected to the degree it did. Throw more data and more compute at a transformer and it kept getting better well past the point where every prior architecture hit diminishing returns. That single property is most of the last decade of AI. It parallelised, because processing the whole sequence at once maps beautifully onto GPUs and TPUs, so training got faster even as models got bigger. It generalised: the same architecture, barely modified, turned out to work for text, images, audio, code, and protein structure — almost unheard of, since architectures are usually married to their domain. And it enabled transfer: train once on a huge corpus, fine-tune cheaply for a hundred specific tasks. That pattern gave us BERT, GPT, and eventually the assistant you're probably using today.
What's happened since, and what it can't do
Eight years on, models have gone from hundreds of millions of parameters to, reportedly, trillions; context windows from a few hundred tokens to, in some models, millions; and certain capabilities (multi-step reasoning, instruction-following, tool use) show up reliably only at scale, which is to say smaller models do them poorly or not at all. We've also learned the failure modes the hard way: they hallucinate, they don't reason reliably, they can be jailbroken, and they inherit their training data's biases.
So be precise about what a transformer is not. It doesn't think — it processes, emitting one token at a time from learned statistical patterns. It doesn't know facts — it models how humans write about facts, which sometimes yields a true statement and sometimes a confident, plausible lie. And it is not general intelligence, regardless of what the marketing implies; it's an extraordinarily capable specialised system for modelling sequences. Holding that distinction in your head is the difference between extracting real value from one of these models and being misled by it.
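The "plausible lie" failure is easy to reproduce in miniature. The sketch below is a toy word-pair counter, not a transformer: it only looks at the previous word, where a real model conditions on the whole context, and the three-sentence corpus is invented for the example. But the loop has the same shape (pick a likely next token, append it, repeat), and it fails the same way.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus, invented for this example.
corpus = [
    "paris is the capital of france",
    "rome is the capital of italy",
    "nice is a city in the south of france",
]

# "Training": count which word follows which.
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1


def generate(prompt, max_new_tokens=5):
    """Extend the prompt one word at a time with the most frequent follower."""
    words = prompt.split()
    for _ in range(max_new_tokens):
        options = following.get(words[-1])
        if not options:
            break  # this word was never followed by anything in the corpus
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


completion = generate("berlin is the capital of")
print(completion)
```

The toy model completes the prompt with "france," fluently and wrongly, because "france" is what most often followed "of" in what it read. Nothing in the loop checks whether the sentence is true.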
What's next
A few threads worth tracking. Mixture-of-experts architectures activate only part of the model per token, making enormous models far cheaper to run. State-space models like Mamba revisit pre-transformer ideas with modern engineering and look strong on very long sequences. Diffusion language models generate text by refining a whole passage over several passes rather than writing strictly left to right (experimental, but interesting). And retrieval-augmented generation gives the model an external knowledge base so it doesn't have to memorise everything, which is where most serious production deployment is actually happening right now.
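The mixture-of-experts idea is the easiest of these to sketch. What follows is a rough illustration in Python, not any particular model's router: the four `experts` are plain functions standing in for large feed-forward blocks, and the `gate_scores` are supplied by hand, where a real model would compute them from the input with a learned gating layer. The point is only that the two best-scoring experts run and the rest are never called.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([gate_scores[i] for i in top])
    return list(zip(top, weights))


# Hypothetical experts: tiny functions standing in for big sub-networks.
experts = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
]


def moe_forward(x, gate_scores, k=2):
    """Run only the chosen experts and blend their outputs."""
    chosen = route(gate_scores, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]


# Hand-picked gate scores; a real router would learn to produce these.
output, used = moe_forward(3.0, gate_scores=[0.1, 2.0, -1.0, 2.0])
print(output, used)
```

Here experts 1 and 3 tie for the top score, so each contributes half of the result and the other two cost nothing. Scale that up to many experts with billions of parameters each and you get the saving: the parameter count grows while the work done per input stays roughly flat.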
The transformer era isn't over. It also isn't the final chapter, and anyone telling you the architecture is settled hasn't been doing this long enough.
One last thing worth sitting with. That 2017 paper was a clean, modest argument about an improvement to existing architectures. By their own later accounts, none of the eight authors expected it to become the foundation of the most significant AI wave in history. Real progress usually looks like that — unremarkable at the time, obvious only in the rear-view mirror. I've watched enough hype cycles to distrust the loud ones. The thing that reshapes everything tends to arrive in a paper nobody's shouting about.