This page used to host one of our earliest public demos: a pair of small APIs that summarised a chunk of text and classified it by topic. Paste in an article about Genghis Khan or the platypus, get back a one-line summary and a set of tags. The demo is long dead. I'm leaving the page up because it's a useful marker for how far the field has moved — and, more usefully, for how far it hasn't.
We were genuinely proud of it. In 2014 this was the state of the art, and getting a summariser to pick out a readable summary took six months and a great deal of careful feature engineering: TF-IDF, sentence centrality scoring, stopword lists, the lot. Today an intern reproduces it before lunch with an API call. I want to be clear-eyed about both of those facts.
What it actually did
Two things. It summarised a paragraph or two using a classical extractive method: score each sentence by position, keyword overlap with the document's topic terms, and length, then keep the highest scorers. No generation, no rewriting — it picked the best sentences you already had. And it classified text by topic with a supervised model trained on Wikipedia articles labelled by category — biology, technology, history, geography. Nothing exotic: multinomial Naive Bayes over TF-IDF features.
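For anyone who wants to see the shape of it, here is a minimal sketch of the two pieces described above. It is an illustration under stated assumptions, not the original code: the stopword list, the equal weighting of the three sentence scores, the 20-word length cap, and the names `summarise` and `NaiveBayes` are all made up for the example. The classifier also uses raw word counts with add-one smoothing rather than TF-IDF features, purely to keep the sketch free of dependencies.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; a real list is far longer).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "it",
             "was", "that", "on", "for", "as", "by", "with"}


def tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarise(text, keep=1, top_terms=5):
    """Extractive summary: score sentences, return the best in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Topic terms: the most frequent non-stopword tokens in the whole document.
    topic = {w for w, _ in Counter(tokens(text)).most_common(top_terms)}

    def score(i):
        words = tokens(sentences[i])
        if not words:
            return 0.0
        position = 1.0 / (i + 1)                  # earlier sentences score higher
        overlap = len(topic & set(words)) / len(topic) if topic else 0.0
        length = min(len(words), 20) / 20         # favour fuller sentences, capped
        return position + overlap + length        # equal weights: an assumption

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.doc_counts = Counter(labels)         # label -> number of documents
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokens(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokens(doc) if w in self.vocab]  # ignore unseen words

        def log_prob(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            lp = math.log(self.doc_counts[label] / total_docs)  # class prior
            for w in words:
                lp += math.log((counts[w] + 1) / denom)         # smoothed likelihood
            return lp

        return max(self.doc_counts, key=log_prob)
```

Calling `summarise(article, keep=2)` returns two of the article's own sentences, untouched, and `NaiveBayes().fit(docs, labels).predict(text)` returns one of the labels it was trained on. Nothing here generates text; it only ranks and counts.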
Both ran on one server. End to end, the whole thing was a few hundred lines of Python. That constraint mattered then and it's worth remembering now, when "a few hundred lines on one box" has become an unusual way to ship anything.
What the transformers ate
Most of it, frankly. A modern LLM does both tasks far better, in any language, with almost no task-specific training. It'll summarise in whatever register you ask for, classify against an arbitrary custom taxonomy you invent on the spot, and explain its reasoning on the way past. The six months of feature engineering we sweated over in 2014 is now a prompt. I've shipped the 2014 system and I've watched the 2020s eat it, and I'm not sentimental about it — that's progress doing exactly what it's supposed to.
What's striking, though, is the continuity underneath the discontinuity. Attention grew out of older alignment ideas in sequence-to-sequence models. Transformers grew out of RNNs and LSTMs. And the core premise our Naive Bayes classifier ran on, that meaning lives in patterns of use across large amounts of text, is precisely the premise a frontier LLM runs on. Same idea. Unimaginably more data and compute, and a far better architecture for exploiting it. The leap is real, but it's a leap along a line, not off a cliff.
What didn't change at all
This is the part I'd actually fight you over, because it's where teams keep getting burned by assuming the new tools repealed the old laws.
- Clean data still beats clever algorithms. Garbage in, garbage out. The model got a thousand times better; this did not move an inch.
- Domain still matters. A summariser tuned on news will flail on medical records; a classifier trained on Wikipedia will trip over your internal jargon. Grounding and fine-tuning are still worth the effort.
- Evaluation is still hard. "Is this summary any good?" has no clean automated answer, even now. Thoughtful human evaluation is still the gold standard, and the teams who skip it ship liabilities.
- The useful applications are still boring. Processing support tickets, summarising contracts, pulling structure out of documents — the unglamorous jobs were the ones that paid in 2014 and they're the ones that pay now. The mechanism changed. The problems didn't.
What we do instead now
We don't host demo APIs any more. Every major cloud ships better ones, often with a free tier, and competing with that would be daft. The job now is helping companies decide which NLP capabilities are actually worth building into a product, how to ground a model in their own domain knowledge, how to evaluate it properly, and how to get it into production without shipping something that becomes a liability the first time it's wrong in front of a customer.
Which, stripped of the tooling, is the same job we were doing in 2014. The tools got extraordinary. The hard questions — what's worth building, how do you know it works, how do you ship it safely — stayed exactly where they were.
If you've arrived here from an old link chasing the original demo, it's gone, along with the server it ran on. If you're working on NLP for your own product and want a second opinion on which parts are worth building, get in touch.