March 02, 2024 · Deep Learning

The quiet revolution called "attention"

Dimple Paratey
Chief Marketing Officer

If you've ever wondered why AI got suddenly, shockingly good around 2017-2018, the answer has a name: transformers. Not the robots. An architecture for neural networks, introduced in a 2017 paper with the beautifully arrogant title "Attention Is All You Need."

Every AI system you've used recently — ChatGPT, Claude, Gemini, the autocomplete in your email, the translator on your phone, the image generators — is built on transformers. They're the quiet pipework running underneath the entire current wave.

Let me try to explain what they are and why they matter, in a way that a friend who doesn't work in tech might understand.

The problem they solved

Before transformers, neural networks that dealt with sequences — language, speech, time series — had a fundamental limitation. They processed sequences one step at a time, and had a hard time keeping track of context from earlier in the sequence.

Imagine reading a novel one word at a time, and only being able to remember the last few sentences. You'd keep losing the thread. That's what earlier models were like, at scale.

Researchers built various workarounds (LSTMs, GRUs, attention mechanisms grafted onto existing architectures), but none of them really cracked the problem. Sequences longer than a paragraph or two were still hard.

The idea

The insight behind transformers is almost embarrassingly simple in retrospect.

Instead of processing the sequence one step at a time, transformers process the whole sequence at once. For each word (or token), the model looks at every other word in the sequence and figures out which ones are relevant for understanding this one. That looking-at-every-other-word operation is called attention.

Reading "the cat sat on the mat because it was warm" — the model, when processing "it," pays attention to "mat" and to "warm" and figures out what "it" is probably referring to. Not by following a rule. By learning, from having seen billions of sentences, which words tend to be referred to by "it" in which contexts.

Now multiply that mechanism. Stack it. Run many parallel attention operations at once. Train the whole thing on a vast amount of text. What you get is a model that develops, through sheer pattern-matching at scale, something that looks very much like language understanding.

Why it was a revolution

A few things made transformers revolutionary:

They scaled beautifully. Unlike earlier architectures, transformers got substantially better as you threw more data and more compute at them. The scaling continued well past the point where researchers expected diminishing returns. This surprised everyone, and is still surprising people.

They parallelised well. Because they process the whole sequence at once, they use modern GPUs and TPUs far more efficiently than sequential architectures. This meant training times dropped even as model sizes grew.

They generalised. The same architecture, with minimal modification, turned out to work for text, images, audio, protein structure, code, and more. That's extremely unusual. Most architectures are specialised for their domain.

They enabled transfer. You could train a transformer on a huge corpus once, then fine-tune it for many specific tasks. This was the pattern that gave us BERT, GPT, and eventually ChatGPT.
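
If a sketch helps, here is the simplest flavour of that pattern in toy NumPy: keep the "pretrained" part frozen and train only a small task-specific head on top. The frozen weights below are random stand-ins rather than a real pretrained transformer, and the data is invented, so read it as a picture of the workflow, not a recipe.

    import numpy as np

    rng = np.random.default_rng(1)

    # Pretend this is a large pretrained model whose weights we keep frozen.
    # Here it is just a random projection; in reality it would be a transformer
    # trained once on a huge corpus.
    W_frozen = rng.normal(size=(100, 16))

    def pretrained_features(x):
        return np.tanh(x @ W_frozen)          # reused as-is, never updated

    # "Fine-tuning": train only a small task-specific head on top.
    X = rng.normal(size=(200, 100))           # toy documents
    y = (X[:, 0] > 0).astype(float)           # toy binary labels
    head = np.zeros(16)

    for _ in range(500):                      # plain gradient descent
        feats = pretrained_features(X)
        preds = 1 / (1 + np.exp(-feats @ head))          # sigmoid head
        head -= 0.5 * feats.T @ (preds - y) / len(y)

    print("toy training accuracy:", ((preds > 0.5) == y).mean())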

What we've learned since

In the years since the transformer paper was published:

  • Models have grown from hundreds of millions of parameters to trillions.
  • Context windows (how much of a conversation the model can "see" at once) have grown from a few hundred tokens to a million or more.
  • Transformers now underlie most frontier work in protein folding, code generation, image generation, and multimodal understanding.
  • We've discovered that certain capabilities emerge only at scale — reasoning, instruction-following, tool use. Smaller models just don't have them.
  • We've also discovered the limitations: they hallucinate, they don't reason reliably, they can be jailbroken, they reflect their training data's biases.

What they don't do

Understanding transformers also helps you understand what they aren't.

  • They don't "think." They process. The output is produced one token at a time, based on statistical patterns learned from training (there's a tiny sketch of that loop just after this list).
  • They don't "know" facts. They model the patterns of how humans write about facts. Sometimes the patterns produce correct facts. Sometimes they produce confident-sounding plausible lies.
  • They're not general intelligence. Despite the marketing. They're extraordinarily impressive specialised systems for modelling sequences.
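
To make the first point concrete, here is a deliberately tiny sketch of what "one token at a time" means. The model below is a fake stand-in (a made-up probability table, not a real transformer), but the loop has the same shape as real generation: predict a distribution over the next token, pick one, append it, repeat.

    import numpy as np

    rng = np.random.default_rng(2)
    vocab = ["the", "cat", "sat", "on", "mat", "because", "it", "was", "warm", "."]

    def fake_next_token_probs(context_tokens):
        # Stand-in for a real model: returns a made-up probability distribution
        # over the vocabulary. A real transformer would compute this from the
        # context using attention, not at random.
        p = rng.random(len(vocab))
        return p / p.sum()

    def generate(prompt, n_tokens=5):
        tokens = prompt.split()
        for _ in range(n_tokens):
            probs = fake_next_token_probs(tokens)
            tokens.append(vocab[int(np.argmax(probs))])  # greedy: most likely next token
        return " ".join(tokens)

    print(generate("the cat"))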

Keeping this in mind is the difference between getting great value from an LLM and being misled by one.

What's next

A few trends worth knowing about:

  • Mixture-of-experts architectures that only activate part of the model for each query, making very large models much cheaper to run.
  • State-space models (like Mamba) that revisit some ideas from pre-transformer architectures with modern tricks. Early results are promising for very long sequences.
  • Diffusion language models that generate text non-linearly, rather than one word at a time. Still experimental, but interesting.
  • Retrieval-augmented generation — giving the model access to an external knowledge base so it doesn't have to remember everything. This is where a lot of practical deployment is happening (a toy sketch of the retrieval step follows).
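
Here is that minimal sketch of the retrieval step. The documents and the embedding function are invented for illustration; a real deployment would use a learned embedding model and a vector database. But the idea, find the passages most similar to the question and hand them to the model alongside the question, is just this.

    import numpy as np

    documents = [
        "Transformers were introduced in the 2017 paper 'Attention Is All You Need'.",
        "Mixture-of-experts models activate only part of the network per query.",
        "State-space models such as Mamba handle very long sequences efficiently.",
    ]

    def embed(text, dim=64):
        # Toy embedding: hash words into a fixed-size vector. A real system
        # would use a learned embedding model here.
        v = np.zeros(dim)
        for word in text.lower().split():
            v[hash(word) % dim] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def retrieve(question, docs, k=1):
        # Rank documents by similarity to the question and keep the top k.
        q = embed(question)
        return sorted(docs, key=lambda d: float(q @ embed(d)), reverse=True)[:k]

    question = "Where do transformers come from?"
    context = retrieve(question, documents)[0]
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    print(prompt)  # this prompt is what gets sent to the language model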

The transformer era isn't over. But it's not the final chapter, either.

Why I find this all quite moving

The 2017 transformer paper was written by eight researchers at Google. It was a clean, well-argued paper about a modest-seeming improvement to existing architectures. None of them, by their own later accounts, expected it to become the foundation of the most significant AI wave in history.

That's the thing about real scientific progress. It often doesn't look like it's happening. Until suddenly, looking back, it was.

If you're building on top of transformers and would like to think through how to do it well, we'd love to help.

Dimple Paratey
Chief Marketing Officer

As CMO of Partech Systems, Dimple Paratey drives technological innovation with over 15 years of digital transformation leadership at major telecom providers. Her expertise in transforming enterprise operations has delivered breakthrough solutions for global telecommunications companies. Recognized for her strategic vision in AI adoption, she champions the intersection of innovation and business growth across multiple industries.