Every few weeks another frontier model lands, another trillion parameters, another leaderboard, another round of breathless coverage. Fine. Some of it matters. But the most useful AI work my team has shipped in the last year didn't touch a frontier model at all. It ran on an 8B model, fine-tuned on one narrow task, quantised down to run on hardware a customer already owned. No data left the building. Latency under a hundred milliseconds. Cost per query rounding to zero.
Nobody wrote a press release about it. It just worked, and it kept working when the internet didn't.
Here's the bit the size race keeps missing. For an enormous number of real tasks, you don't need a model that can write a sonnet, debug Rust, and explain Ottoman tax policy. You need a model that does one thing — classify this ticket, extract these fields, redact this document, answer questions about this manual — reliably, cheaply, and fast. A general model the size of a small country is overkill for that, and overkill has costs that don't show up on the leaderboard.
The wins from going small are not marginal. They're structural.
Privacy first, because it's the one that closes deals. When the model runs on the device or on a server inside the customer's own walls, the sensitive data never crosses the network. For anyone in healthcare, finance, defence, or any regulated space, "the data never leaves" isn't a feature, it's the entire precondition for the conversation. I've watched procurement processes that would've taken nine months collapse to weeks the moment we could say the inference happens on-premises. You cannot leak what you never send.
Latency next. A round trip to a cloud API can cost a hundred or two hundred milliseconds before the model has thought about anything. On-device, you skip the journey entirely. For anything interactive (a keyboard, a camera, a control loop, a thing a human is waiting on) that gap is the difference between magical and irritating. And a local model doesn't fall over when the connection does. Edge devices in a factory, a vehicle, a remote site, a hospital basement with terrible signal: they all need to keep working regardless. A model that requires a healthy uplink to function is a model that fails exactly when you need it most.
Then cost. Frontier inference billed per token is fine until you're doing it a few million times a day, at which point the bill becomes a strategic problem. A small model on hardware you already paid for has a marginal cost that approaches nothing. At volume that's not a saving, it's a different business model.
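To make the arithmetic concrete, here is a back-of-envelope sketch. Every number in it is a placeholder I made up for illustration (query volume, tokens per query, and the per-token price are all assumptions), so substitute your own before drawing conclusions.

```python
def daily_api_cost(queries_per_day, tokens_per_query, usd_per_million_tokens):
    """Daily spend for per-token billed inference."""
    total_tokens = queries_per_day * tokens_per_query
    return total_tokens * usd_per_million_tokens / 1_000_000


# Placeholder inputs, not real pricing: 3M queries a day,
# 500 tokens per query, 2 USD per million tokens.
per_day = daily_api_cost(3_000_000, 500, 2.0)
per_year = per_day * 365

print(f"per day:  ${per_day:,.0f}")
print(f"per year: ${per_year:,.0f}")
```

The local side of the comparison is hardware you already own plus power and upkeep, which is why the marginal cost per query gets so close to zero.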
Now, the honest part, because going small is not free and anyone who tells you otherwise is selling something. The constraints are real and you have to engineer around them.
Quantisation is the main lever and the main trap. Dropping from 16-bit to 8-bit weights usually costs you almost nothing measurable and roughly halves your memory. Push to 4-bit and you'll often still be fine — but "often" is doing real work in that sentence. On some tasks 4-bit quietly degrades in ways that don't show up until you look closely, and the degradation isn't uniform across what the model can do. You don't get to assume. You measure, on your task, with your data.
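A toy sketch of why the step from 8-bit to 4-bit is not the same kind of step as 16-bit to 8-bit. It assumes the simplest possible scheme, symmetric round-to-nearest with one scale for the whole tensor, applied to synthetic Gaussian weights. Production quantisers use per-group scales and smarter rounding, and weight error is not task accuracy, so treat this as intuition only and not as a substitute for measuring on your task.

```python
import random


def quantise_dequantise(weights, bits):
    """Symmetric round-to-nearest quantisation with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 levels each side for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    restored = []
    for w in weights:
        q = round(w / scale)            # nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        restored.append(q * scale)      # back to floating point
    return restored


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one weight tensor (an assumption, not real model weights).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]

err8 = mean_abs_error(weights, quantise_dequantise(weights, 8))
err4 = mean_abs_error(weights, quantise_dequantise(weights, 4))

print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit mean abs error: {err4:.6f}")
print(f"4-bit error is {err4 / err8:.1f}x the 8-bit error")
```

With only 15 levels to work with, 4-bit has far less room to absorb outliers, which is one reason the damage lands unevenly.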
Memory is the hard wall. An 8B model in 4-bit is four gigabytes of raw weights, and closer to five in practice once quantisation scales and any layers kept at higher precision are counted. That is before you've loaded a single token of context, and context eats memory too. On a phone, a browser tab, an embedded board, that ceiling decides everything. Half of small-model engineering is just making the thing fit and stay fit while it runs.
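The budget is easy to sketch. The raw arithmetic gives 4 GB for 8B weights at 4 bits. Real quantised files usually come out larger because they also store scales and often keep some tensors at higher precision, which is how you land near five. The model shape below (32 layers, 8 key/value heads, head dimension 128, 16-bit cache) is an assumed, illustrative configuration, not a measurement of any particular model.

```python
GB = 1e9


def weight_bytes(n_params, bits_per_weight):
    """Raw weight storage, ignoring quantisation scales and other overhead."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Key/value cache size: one key and one value vector per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


weights = weight_bytes(8_000_000_000, 4)

# Assumed model shape, for illustration only.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)

print(f"weights:  {weights / GB:.2f} GB")
print(f"KV cache: {cache / GB:.2f} GB at 8192 tokens")
print(f"total:    {(weights + cache) / GB:.2f} GB before activations and runtime overhead")
```

The cache grows linearly with context length, so a long prompt can eat memory the weights left free.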
Which is why eval discipline matters more here, not less. With a giant general model you can lean on its slack. A small fine-tuned model has no slack — it's sharp on the task you trained it for and it falls off a cliff just outside that. So you need a real evaluation set drawn from real inputs, and you need to run it every time you change the model, the quantisation, or the fine-tune. The failure mode I see constantly is a team that fine-tuned once, eyeballed a few outputs, declared victory, and never built the harness that would've told them when it broke.
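The harness doesn't need to be elaborate. Here is a minimal sketch, assuming an exact-match classification task. The names `run_eval`, `gate`, and `toy_predict` are hypothetical, and the keyword classifier stands in for whatever model call you actually have.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalResult:
    accuracy: float
    failures: List[Tuple[str, str, str]]  # (input, expected, got)


def run_eval(predict: Callable[[str], str],
             examples: List[Tuple[str, str]]) -> EvalResult:
    """Score a predict function against labelled examples by exact match."""
    if not examples:
        raise ValueError("evaluation set is empty")
    failures = []
    for text, expected in examples:
        got = predict(text)
        if got != expected:
            failures.append((text, expected, got))
    accuracy = 1 - len(failures) / len(examples)
    return EvalResult(accuracy, failures)


def gate(result: EvalResult, baseline_accuracy: float, tolerance: float = 0.01) -> bool:
    """Pass only if accuracy has not dropped more than `tolerance` below the baseline."""
    return result.accuracy >= baseline_accuracy - tolerance


def toy_predict(text: str) -> str:
    # Stand-in for a real model call.
    return "billing" if "refund" in text.lower() else "technical"


examples = [
    ("I want a refund for last month", "billing"),
    ("App crashes on login", "technical"),
    ("Charged twice, please refund", "billing"),
    ("Invoice shows the wrong amount", "billing"),
]

result = run_eval(toy_predict, examples)
print(f"accuracy: {result.accuracy:.2f}")
for text, expected, got in result.failures:
    print(f"FAIL: {text!r} expected {expected!r}, got {got!r}")
print("ship it" if gate(result, baseline_accuracy=0.90) else "regression, do not ship")
```

Run it on every change to the model, the quantisation, or the fine-tune, and keep the failures list: that is where you find the cliff edge.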
The payoff is real: an 8B model fine-tuned on your task beating a frontier model is more common than people expect. On a narrow, well-defined job with good training data, the small specialist regularly outperforms the giant generalist, and it's faster, cheaper, and private while doing it. The generalist knows a little about everything. The specialist knows your thing cold.
So here's the heuristic I actually use when someone asks what size model to reach for. If the task is narrow, repetitive, and you can describe "right" precisely — start small, fine-tune, and only scale up if the evals demand it. If the task is open-ended, varied, needs broad world knowledge or genuine multi-step reasoning across domains you can't enumerate — reach for the big general model and pay for it. And if it sits in between, prototype on the big model to prove the task is even solvable, then distil down to the smallest thing that holds the bar. Prove it works, then make it small.
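Written down as code, the heuristic is only a few lines. This is a sketch of the rule of thumb above and not a formal decision procedure. The flag names are my own labels, and deciding whether each one is true for your task is the hard part that no function will do for you.

```python
from enum import Enum


class Approach(Enum):
    SMALL_FINE_TUNED = "start small, fine-tune, scale up only if the evals demand it"
    LARGE_GENERAL = "use the big general model and pay for it"
    PROTOTYPE_THEN_DISTIL = "prototype on the big model, then distil to the smallest that holds the bar"


def choose_approach(narrow: bool,
                    repetitive: bool,
                    right_is_precisely_defined: bool,
                    open_ended_or_broad: bool) -> Approach:
    """Map the task properties named in the heuristic to a starting point."""
    if narrow and repetitive and right_is_precisely_defined:
        return Approach.SMALL_FINE_TUNED
    if open_ended_or_broad:
        return Approach.LARGE_GENERAL
    return Approach.PROTOTYPE_THEN_DISTIL


# Example: ticket classification with a fixed label set.
choice = choose_approach(narrow=True, repetitive=True,
                         right_is_precisely_defined=True,
                         open_ended_or_broad=False)
print(choice.value)
```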
The frontier will keep climbing and that's fine. But most of the value in the next few years won't be at the frontier. It'll be in thousands of small, sharp, boring models running quietly close to the data. That's where the engineering is interesting and the economics actually close.