Most of the AI making headlines learns from examples. Show it a trillion sentences and it learns to finish yours. That's supervised learning (strictly, the self-supervised kind, because the text supplies its own labels), and it's well understood.
Reinforcement learning is the other thing, and it's the one that gets people excited in the wrong way. It doesn't learn from examples. It learns by doing something, seeing what happens, and adjusting. Throw a ball for a dog often enough, reward it when it gets closer to bringing the ball back, and it converges on the behaviour you wanted without you ever explaining the rules. That's RL. It's an elegant idea, it's sixty years old, and it has produced some genuinely astonishing results — and it has also separated a lot of companies from a lot of money for no return.
The whole framework, four parts
You need an environment the agent can act in, an agent that observes and acts, a reward signal — a number that says how well it just did — and a very large number of attempts. The agent tries things, keeps what scores well, discards what doesn't, and over many iterations learns a policy. Q-learning, policy gradients, actor-critic, deep RL: all of it is just different machinery for doing this efficiently. The framework itself is that simple.
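To make the four parts concrete, here is a minimal sketch of tabular Q-learning on a made-up problem. Everything in it is an assumption for illustration: the five-cell corridor, the reward of 1 at the far end, and the learning rate, discount and exploration settings. It shows the shape of the loop and is not how any production system is built.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells.
# The agent starts in cell 0 and the episode ends when it reaches cell 4.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (-1, +1)  # step left, step right


def step(state, action):
    """Environment: apply the action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), GOAL)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0  # the reward signal: a single number
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Agent: learn a table of action values by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):  # the "very large number of attempts"
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # try something at random
            else:
                best = max(q[state])  # otherwise take the best-scoring action
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: nudge the estimate towards
            # reward + discounted value of the best next action.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def greedy_policy(q):
    """The learned policy: the highest-valued action in each non-goal cell."""
    return [ACTIONS[row.index(max(row))] for row in q[:GOAL]]


q_table = train()
print(greedy_policy(q_table))
```

With these settings the printed policy should step right (+1) in every cell. Nobody told the agent that "right" was the answer; it only ever saw the score.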
Where it has genuinely earned its keep
The famous wins are real. AlphaGo, AlphaStar, the Dota agents: all RL at their core, all sharpened by self-play, playing and keeping score until they could beat top human professionals. AlphaGo and AlphaStar were first warmed up on records of human games; AlphaGo Zero and the Dota agents learned from self-play alone, with no human teachers and no labelled data. Most of the jaw-dropping robotics demos (a hand solving a Rubik's cube, a quadruped recovering from a shove) are RL too, almost always trained first in simulation.
The less glamorous wins matter more commercially. Data-centre cooling, where Google reported a cut of up to 40 per cent in cooling energy. Warehouse scheduling. Recommendation systems that optimise the whole session rather than the next click. And the big one for anyone reading this in the era of ChatGPT: RLHF (reinforcement learning from human feedback) is the step that turns a raw language model that will happily continue any text into something that behaves like a useful assistant. That's probably the most consequential deployment of RL in any consumer product to date.
Why it bites
Now the part the demos skip.
You don't define the behaviour. You define the reward, and the agent maximises that, which is not always what you meant. The canonical example is CoastRunners, a boat-racing game in which an OpenAI agent learned to spin in circles collecting respawning bonus items forever, scoring higher than it would by finishing the race. It did exactly what you asked. You just asked for the wrong thing. Every RL project I've watched go wrong went wrong here.
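The arithmetic behind that failure is worth seeing once. The sketch below uses invented numbers (they are not taken from the actual game) to show how a scoring rule that pays per item, with items that respawn, makes circling strictly better than finishing.

```python
# Hypothetical numbers, chosen only to illustrate the shape of the problem.
FINISH_BONUS = 100     # points awarded once for crossing the finish line
ITEM_POINTS = 10       # points for each respawning bonus item
STEPS_TO_FINISH = 50   # time steps needed to complete the race
STEPS_PER_ITEM = 4     # time steps per loop that picks up one item
EPISODE_LENGTH = 200   # time steps before the episode is cut off


def score_finish():
    """Score for racing straight to the finish line."""
    return FINISH_BONUS if STEPS_TO_FINISH <= EPISODE_LENGTH else 0


def score_loop():
    """Score for circling the bonus items until time runs out."""
    return (EPISODE_LENGTH // STEPS_PER_ITEM) * ITEM_POINTS


print("finish the race:", score_finish())  # 100
print("spin in circles:", score_loop())    # 500
```

An agent that maximises this score will spin, and it will be right to. The fix is in the reward, for example paying for progress along the track, not in the learning algorithm.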
Then there's sample efficiency. RL is hungry. In a simulator you can run a billion episodes for the price of electricity. In the real world — a physical robot, a live factory line, a recommender touching real customers — each attempt costs time, money, or trust, and you can't afford a billion of them. That single constraint kills most of the RL pitches I see.
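A back-of-envelope calculation makes the gap plain. The timings here are assumptions picked for round numbers (a millisecond per simulated episode across a thousand parallel copies, thirty seconds per attempt on one physical robot); substitute your own.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def wall_clock_seconds(episodes, seconds_per_episode, parallel_copies=1):
    """Total elapsed time if the episodes are spread evenly over the copies."""
    return episodes * seconds_per_episode / parallel_copies


EPISODES = 1_000_000_000

# Assumed simulator: 1 ms per episode, 1,000 copies running in parallel.
sim_seconds = wall_clock_seconds(EPISODES, 0.001, parallel_copies=1000)

# Assumed physical robot: 30 s per episode, one machine.
real_seconds = wall_clock_seconds(EPISODES, 30, parallel_copies=1)

print(f"simulator: {sim_seconds / 60:.1f} minutes")              # 16.7 minutes
print(f"real robot: {real_seconds / SECONDS_PER_YEAR:.0f} years")  # 951 years
```

Buying more robots helps less than you'd hope: even a hundred of them running around the clock still need more than nine years under these assumptions.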
Exploration is dangerous in production. A learning agent tries bad actions on purpose, to find out what happens. Fine in simulation. Not fine on equipment that breaks or customers who leave. And the sim-to-real gap is real: a policy that's flawless in simulation can fall over the moment it meets friction, sensor noise, and the unmodelled mess of the physical world.
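"Tries bad actions on purpose" is literal. A common exploration rule, epsilon-greedy, picks a random action some fixed fraction of the time regardless of what the agent already believes. The sketch below assumes four actions with made-up value estimates, one of which (action 3) stands in for "the one that breaks the equipment", and shows one simple mitigation: restricting choices to an allow-list. The `safe_actions` parameter is my own illustration, not a standard API.

```python
import random


def choose_action(q_values, epsilon, rng, safe_actions=None):
    """Epsilon-greedy selection, optionally restricted to an allow-list."""
    if safe_actions is None:
        candidates = list(range(len(q_values)))
    else:
        candidates = sorted(safe_actions)
    if rng.random() < epsilon:
        return rng.choice(candidates)  # explore: ignore the estimates entirely
    return max(candidates, key=lambda a: q_values[a])  # exploit the best estimate


rng = random.Random(0)
q_values = [0.2, 0.9, 0.1, 0.0]  # hypothetical estimates; action 3 is the dangerous one

unrestricted = [choose_action(q_values, 0.1, rng) for _ in range(10_000)]
restricted = [
    choose_action(q_values, 0.1, rng, safe_actions={0, 1, 2}) for _ in range(10_000)
]

print("dangerous action taken, unrestricted:", unrestricted.count(3))
print("dangerous action taken, allow-listed:", restricted.count(3))  # 0
```

With a 10 per cent exploration rate and four actions, the unrestricted agent takes the dangerous action roughly one step in forty, even though it already rates that action worst. An allow-list only helps if you know in advance which actions are dangerous, which in the physical world is often the hard part.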
When not to reach for it
Three quick rules, learned the expensive way.
If you have labelled data, use supervised learning — it's cheaper and it converges. If a mistake during learning is catastrophic, either get a genuinely excellent simulator or pick a different approach entirely. And if you can't write down, in one clear sentence, what "better" means as a number, do not use RL — vague rewards produce vague, often perverse, behaviour, and you will spend months discovering this.
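For teams that prefer their rules executable, the three checks can be written down as a few lines of code. This is only a restatement of the paragraph above; the function name, argument names and returned strings are mine.

```python
def recommend(has_labelled_data, mistakes_catastrophic,
              has_excellent_simulator, better_is_one_clear_number):
    """Apply the three rules in order and return a one-line recommendation."""
    if has_labelled_data:
        return "use supervised learning"
    if mistakes_catastrophic and not has_excellent_simulator:
        return "build an excellent simulator first, or pick a different approach"
    if not better_is_one_clear_number:
        return "do not use RL"
    return "RL is worth a closer look"


# Example: no labels, mistakes are costly, and there is no simulator to hide behind.
print(recommend(
    has_labelled_data=False,
    mistakes_catastrophic=True,
    has_excellent_simulator=False,
    better_is_one_clear_number=True,
))
```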
The straight assessment
RL is the right tool for a narrow set of problems: a clear, measurable objective; a large action space; a cheap or simulated environment; tolerance for error during learning. When all four hold, it's close to magic. When they don't — and for most business problems they don't — it's an expensive way to learn what a simpler method would have told you in a fortnight.
Before anyone on your team proposes RL, make them answer one question: where do the billion safe attempts come from? If there isn't a good answer, you have your answer.