A team once showed me a per-request cost they were proud of — a fraction of a cent. I asked what fraction of those requests succeeded first time. They didn't know. We pulled the numbers: a chunk failed, retried, escalated to a bigger model, then landed on a human reviewer. Their real cost per finished task was many multiples of the number on the slide. They'd been optimising the cheapest thing in the system.
That's the whole story of LLM cost. The token price on the pricing page is easy to see and the least worth obsessing over. The bill that surprises people is assembled from everything around the token: the retries, the evals, the human review, the context you stuffed in because it was simpler than building retrieval. Treat LLM spend like any other unit-economics problem and most of these stop being surprises.
Cost per request is the wrong metric
It's trivial to compute and it tells you almost nothing. It optimises the cheap part and ignores the expensive parts. The metric that matters is cost per resolved task — the fully loaded cost of getting one unit of actual work done, successfully, to a standard you'd accept.
This isn't academic. A request costing a fraction of a cent that fails a quarter of the time, triggers a retry, escalates to a larger model, then lands on a human has a resolved-task cost many times the headline number. Optimise per-request and you'll happily shave the cheap input while the real cost sits in the failure path you aren't measuring.
Where the spend actually leaks
Roughly in order of how often I see teams underestimate it:
- Retries and fallbacks. Every failed call you retry, every escalation to a bigger model when the small one fell short, is real spend that never shows in your per-call estimate. A double-digit retry rate doubles your cost without ever announcing itself.
- Context stuffing. When retrieval is hard, teams paste the whole document in. It works. It's also expensive on every single call, forever. The most common leak I find, because it's invisible until you look at average input token counts.
- Prompt bloat. System prompts that grow by accretion — every incident adds a clause, nothing's ever removed. You pay for those tokens on every request, multiplied by volume.
- Evaluation. Running evals is non-negotiable for quality, but eval traffic is real model traffic. Teams budget production and forget that a serious eval suite can be a meaningful fraction of the production bill.
- Human review. The most expensive token in the system is a minute of a person's time. If your design routes a real share of outputs to human review, that — not the API — is your dominant cost, and it scales with volume the worst.
None of these show up if you only watch the pricing page. All of them show up in cost per resolved task.
Two levers that actually move the number
Caching. A surprising fraction of production traffic is repetitive or near-repetitive. Caching exact and semantically similar responses, plus prompt caching for stable context, takes load off the model directly. Cheapest win available, and the first thing to instrument.
Model routing. Send the cheap, small model first; escalate to the big one only when the task demands it. Most workloads have a long tail of easy cases that don't need your most capable model. The discipline is measuring the escalation rate and the quality at each tier — so routing cuts cost without silently degrading the resolved-task rate. Done blind, routing just relocates the leak.
Self-hosting: when it wins and when it doesn't
The instinct, once the bill grows, is to self-host an open-weight model and escape API pricing. Sometimes that's right. Often it isn't, and the deciding factor is utilisation.
Self-hosting swaps a variable per-token cost for a largely fixed one: GPUs, the engineering to run them, the on-call to keep them up. That maths works when you have high, steady volume keeping expensive hardware busy. It works badly when your traffic is spiky or modest — you pay for idle GPUs around the clock to serve a few busy hours, and you've taken on operational burden the API was absorbing for you.
So the straight version. Self-hosting beats API pricing when you have sustained high utilisation, a privacy or latency requirement the APIs can't meet, and the team to operate inference properly. It loses when volume is low or bursty, when you'd run below capacity, or when "we save on tokens" conveniently ignores the salaries of the people now keeping the cluster alive.
The order of operations
When the AI bill needs managing, work it in this order:
- Instrument cost per resolved task, broken down by feature. You can't manage what you haven't attributed.
- Find the dominant term. Tokens, retries, eval, or human review — usually not what people assume.
- Attack that term specifically. Trim prompts and context if it's input tokens. Fix quality to cut retries if it's the failure path. Reduce review load if it's people.
- Add caching and routing as standing infrastructure, escalation rate monitored.
- Only then model self-hosting, and only if utilisation justifies it.
AI spend is unit economics, not a technology curiosity. Define the unit, attribute the cost, find the dominant term, fix it. The teams that stay solvent here aren't the ones who negotiated the best token price. They're the ones who know their cost per resolved task to the cent and can tell you exactly which term is the leak.