Here is a claim I'll defend: for a large class of products, putting the AI model in the cloud is the wrong default, and teams reach for it out of habit rather than analysis.
"Edge AI" sounds like a frontier. It isn't. It just means the model runs on the device collecting the data instead of on someone else's server. Your phone does it when it blurs your background on a call. Your doorbell does it when it decides a person is at the door and not a cat. Your car does it when it holds the lane. None of those ship video to a datacentre and wait for an answer. They decide locally, on a chip the size of a stamp.
There are four reasons to do that, and they're worth being precise about because they decide your architecture.
Latency. A cloud round trip typically takes 50 to 500 milliseconds. Local inference on a small model can take single-digit milliseconds. For anything reacting in real time (an emergency brake, a defect kicked off a line, noise cancellation in a hearing aid) the cloud is simply too slow, and no amount of CDN tuning fixes the speed of light.
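To put those numbers in physical terms, here is a back-of-the-envelope sketch. The 50 and 500 ms values are the cloud range quoted above; the vehicle speed (30 m/s, roughly motorway pace) and the 5 ms local figure are illustrative assumptions, not measurements.

```python
def distance_travelled_m(speed_m_per_s: float, latency_ms: float) -> float:
    """Distance covered while the system is still waiting for a decision."""
    return speed_m_per_s * (latency_ms / 1000.0)


SPEED_M_PER_S = 30.0  # assumed vehicle speed, for illustration only

scenarios = [
    ("local (assumed)", 5),
    ("cloud, best case", 50),
    ("cloud, worst case", 500),
]

for label, latency_ms in scenarios:
    metres = distance_travelled_m(SPEED_M_PER_S, latency_ms)
    print(f"{label:>18}: {latency_ms:>3} ms -> {metres:.2f} m before any reaction")
```

Under these assumptions the worst-case cloud round trip costs about fifteen metres of travel before the brake even gets the message; the local decision costs centimetres.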
Privacy. Data that never leaves the device cannot leak from a server it was never sent to. Face unlock holds up better than most cloud auth schemes for one boring reason: your face never goes anywhere, so there's no central database to breach.
Resilience. Edge systems keep running when the network dies. If your patient monitor needs 4G to decide whether to alarm, you have built a worse monitor and called it smart.
Cost. Streaming video to the cloud to be analysed is a recurring bill that scales with usage. A local chip doing the same work costs pennies of electricity and nothing per inference. At fleet scale that difference is the whole business case.
This used to be impractical. Three things changed it, none of them headline-grabbing. First, distillation, which trains a small network to mimic a large one and can shrink it ten- to a hundred-fold while keeping most of its accuracy: the classifier on your phone is a small cousin of the one that was trained on a rack of GPUs. Second, dedicated silicon: phones and microcontrollers now ship neural accelerators that are far more power-efficient than a general-purpose CPU at this work. Third, tooling that finally works: TensorFlow Lite, ONNX Runtime, Core ML, PyTorch Mobile. Porting a trained model to a device can now be a day's work, not a month's.
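For the curious, here is a minimal sketch of the training objective usually meant by distillation: a small "student" is penalised both for missing the true label and for disagreeing with the softened outputs of a large "teacher". Plain Python stands in for a real training framework, and the temperature, the alpha weighting and the example logits are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft (teacher-matching) term and a hard (label) term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    # Assumes moderate logits, so no student probability underflows to zero.
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    soft_term = (temperature ** 2) * kl  # T^2 keeps gradient scale comparable
    # Ordinary cross-entropy against the true label, at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_term + (1 - alpha) * hard_term


teacher = [5.0, 2.0, -1.0]          # hypothetical teacher logits
close_student = [4.0, 1.5, -0.5]    # roughly agrees with the teacher
far_student = [-1.0, 2.0, 5.0]      # disagrees with the teacher

print("close student loss:", distillation_loss(close_student, teacher, true_label=0))
print("far student loss:  ", distillation_loss(far_student, teacher, true_label=0))
```

The student that tracks the teacher gets the lower loss, which is the whole trick: the big model's output distribution carries more training signal than the bare label does.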
The results show up in places that never make the news. A handheld ultrasound probe with an on-device classifier flagging high-risk pregnancies in a clinic with no reliable internet — the device doesn't care that there's no signal. Solar audio sensors in the rainforest classifying chainsaw and gunshot sounds locally, only firing up the satellite uplink when there's something worth sending. Camera-on-a-chip units over forklifts warning when someone walks into the danger zone, in real time, with no server in the loop to fail.
Now the part the brochures skip. If you build at the edge, three things will bite you.
The edge is not one platform. Your model runs across chips with wildly different memory, instruction sets and accelerators. Plan for fragmentation from the start or you'll be maintaining a dozen builds you didn't budget for.
Updates are a real engineering problem. Pushing a new model to a million field devices — safely, reversibly, without bricking anything — is harder than training the model was. Design the update path before you ship the first unit, not after.
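A rough sketch of what "safely, reversibly" can mean in code: verify the download, smoke-test the candidate before promoting it, and keep the last known-good model around for rollback. The class and method names here are hypothetical, the model is just a byte string, and a real fleet would add signing, staged rollout and persistent storage on top.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelSlot:
    version: str
    blob: bytes  # stand-in for the serialized model file


class ModelUpdater:
    """Keeps the last known-good model so a bad update can be undone."""

    def __init__(self, active: ModelSlot):
        self.active = active
        self.previous: Optional[ModelSlot] = None

    def apply(self, candidate: ModelSlot, expected_sha256: str,
              health_check: Callable[[ModelSlot], bool]) -> bool:
        # 1. Reject corrupted or tampered downloads before touching anything.
        if hashlib.sha256(candidate.blob).hexdigest() != expected_sha256:
            return False
        # 2. Smoke-test the candidate while the old model is still active.
        if not health_check(candidate):
            return False
        # 3. Promote, keeping the old model as the rollback target.
        self.previous, self.active = self.active, candidate
        return True

    def rollback(self) -> bool:
        """Return to the previous model, if one was kept."""
        if self.previous is None:
            return False
        self.active, self.previous = self.previous, None
        return True


old = ModelSlot("1.0", b"old-weights")
new = ModelSlot("1.1", b"new-weights")
updater = ModelUpdater(old)
digest = hashlib.sha256(new.blob).hexdigest()

print(updater.apply(new, digest, health_check=lambda slot: len(slot.blob) > 0))
print(updater.active.version)
```

The point of the design is that every failure path leaves the device running the model it had before. The happy path is the easy part.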
Observability inverts. You can't log everything; bandwidth and battery are scarce. You have to decide up front the minimum telemetry that tells you the model is still working, and resist the urge to phone home about everything.
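One way to picture "minimum telemetry" is a device that aggregates locally and sends only a tiny summary. This is a sketch under assumptions: the choice of signals (prediction count, label mix, share of low-confidence predictions), the 0.6 threshold and all names are hypothetical, picked because a rising low-confidence rate is a cheap hint that the model is drifting.

```python
import json
from collections import Counter


class TelemetrySummary:
    """Aggregates predictions on-device; only the summary is ever sent."""

    def __init__(self, low_confidence_threshold: float = 0.6):
        self.threshold = low_confidence_threshold  # assumed cut-off
        self._reset()

    def _reset(self) -> None:
        self.count = 0
        self.low_confidence = 0
        self.labels = Counter()

    def record(self, label: str, confidence: float) -> None:
        """Called once per inference; stores counters, never the raw input."""
        self.count += 1
        self.labels[label] += 1
        if confidence < self.threshold:
            self.low_confidence += 1

    def flush(self) -> str:
        """Return a compact JSON payload and start a fresh window."""
        rate = self.low_confidence / self.count if self.count else 0.0
        payload = {
            "n": self.count,
            "low_conf_rate": round(rate, 3),
            "labels": dict(self.labels),
        }
        self._reset()
        return json.dumps(payload, separators=(",", ":"))


summary = TelemetrySummary()
for label, confidence in [("person", 0.93), ("cat", 0.41),
                          ("person", 0.88), ("person", 0.97)]:
    summary.record(label, confidence)

print(summary.flush())
```

Four inferences collapse into a payload of a few dozen bytes, and no image or audio ever leaves the device.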
Get those three right and edge AI mostly disappears into the product — the monitor that just works offline, the camera that decides on its own. That invisibility is the goal, not a consolation. The best version of this technology is the version nobody notices.