This page, in its first incarnation, hosted one of our earliest public demos — a pair of little APIs that could summarise a chunk of text, or classify it by topic. You could paste in an article about Genghis Khan or the platypus, and out would come a one-line summary and a set of tags.
We were enormously proud of it at the time. In 2014, this was state of the art. Building a summariser that produced readable sentences took six months and a lot of careful feature engineering — TF-IDF, sentence centrality scoring, stopword lists, the works.
I'm leaving this page up as a small archive, and using it as an excuse to reflect on how far the field has come.
What the demo used to do
Two things:
Summarise a paragraph or two of text, extracting the most important sentences into a shorter version. We used a classical extractive approach: score sentences by a combination of position, keyword overlap with the document's topic terms, and length, then pick the highest-scoring ones (there's a sketch of this just after the list).
Classify a piece of text by topic. We used a supervised classifier trained on a dataset of Wikipedia articles labelled by their category — biology, technology, history, geography, and so on. Nothing fancy; just multinomial Naive Bayes with TF-IDF features (also sketched below).
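For flavour, here is a minimal sketch of that kind of extractive scoring. It is not our original code; the weights, the tiny stopword list, and the function names are illustrative only.

```python
import re
from collections import Counter

# A toy stopword list; the real one was much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "was", "for", "on", "as", "with", "by", "at", "be"}

def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarise(text, n_sentences=2):
    # Naive sentence splitting; a real system would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= n_sentences:
        return text

    # The document's "topic terms": its most frequent content words.
    doc_counts = Counter(tokenize(text))
    topic_terms = {w for w, _ in doc_counts.most_common(10)}

    def score(i, sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        overlap = sum(w in topic_terms for w in words) / len(words)
        position = 1.0 / (i + 1)              # earlier sentences score higher
        length = min(len(words), 25) / 25.0   # mild preference for fuller sentences
        return 0.5 * overlap + 0.3 * position + 0.2 * length

    # Take the top-scoring sentences, then restore document order.
    top = sorted(range(len(sentences)),
                 key=lambda i: score(i, sentences[i]), reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```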
Both models ran on a single server. The whole thing, end to end, fit in a few hundred lines of Python.
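The classifier half, redone with today's scikit-learn, fits in a dozen lines. The training documents below are placeholders standing in for the labelled Wikipedia articles:

```python
# TF-IDF features feeding a multinomial Naive Bayes classifier,
# the same recipe the original demo used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "The platypus is a semiaquatic, egg-laying mammal native to Australia.",
    "Genghis Khan united the Mongol tribes and founded an empire in 1206.",
    "The transistor replaced the vacuum tube and revolutionised electronics.",
]
train_labels = ["biology", "history", "technology"]

model = make_pipeline(
    TfidfVectorizer(stop_words="english", sublinear_tf=True),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

print(model.predict(["Marco Polo travelled the Silk Road to the court of Kublai Khan."]))
# With a real training set this would print ['history'].
```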
What's changed in a decade
Everything, and also nothing.
Everything, in the sense that modern LLMs can do both tasks spectacularly better, in any language, with almost no training data, with nuance our classical models couldn't dream of. A modern LLM can summarise a document in the style of your choice, classify it against arbitrary custom taxonomies, and explain its reasoning along the way. What took us six months in 2014 takes six minutes with an API call today.
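To make the comparison concrete, here is roughly what that API call looks like. The snippet assumes the OpenAI Python client purely as an example; the model name is a placeholder, and any provider with a chat-style endpoint would serve equally well.

```python
# Both of the old demo's tasks, done in a single prompt to a hosted LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

article = "..."  # the text you'd have pasted into the old demo

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you have access to
    messages=[{
        "role": "user",
        "content": (
            "Summarise the following text in one sentence, then classify it "
            "under one of: biology, technology, history, geography.\n\n" + article
        ),
    }],
)
print(response.choices[0].message.content)
```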
Nothing, in the sense that the underlying ideas are continuous. Attention mechanisms grew out of older alignment ideas in machine translation. Transformers grew out of RNNs and LSTMs. The principle that "meaning lives in patterns of use across large amounts of text" — the idea that drove our classical models — is the same idea that drives modern LLMs, just realised with unimaginably more data and compute.
What hasn't changed
The things that were true about NLP in 2014 are still true:
- Clean data beats clever algorithms. Garbage in, garbage out, still.
- Domain matters. A summariser trained on news articles will struggle on medical records. A classifier trained on Wikipedia will stumble on internal jargon. Fine-tuning is still worth doing.
- Evaluation is hard. "Is this summary good?" doesn't have an easy automated answer. Even with LLMs, thoughtful human evaluation remains the gold standard.
- NLP at its best is boringly useful. The applications that actually matter in businesses — processing support tickets, summarising contracts, extracting structure from documents — are still the unglamorous ones. The mechanism is new; the problems are the same.
What we do now
We don't host demo APIs any more — the modern equivalents are available in every major cloud, and trying to compete with them would be silly. What we do instead is help companies figure out which NLP capabilities are worth building into their products, how to ground models in their own domain knowledge, how to evaluate them properly, and how to put them in production without shipping a liability.
It's the same job we were doing in 2014, really. Just with better tools.
For old time's sake
If you've landed here from an old link, thank you for reading this reflection instead. The original demo APIs are long gone, as are the domains and servers they ran on. But the work continues — quieter, more capable, and (we hope) a little bit warmer in tone.
If you're thinking about NLP for your own product, come and say hello. We'd love to help.