I've now reviewed enough disappointing retrieval-augmented generation projects to spot the diagnosis before anyone opens a laptop. The team is brilliant. The model is the latest and best. The vector database is the fashionable one. And the answers are still wrong, vague, or confidently citing the wrong document. Then someone suggests trying a bigger model, and that's the moment I know exactly what went wrong.
They built an AI project. They needed a search project.
RAG is, mechanically, very simple. You find the relevant bits of your knowledge base, you stuff them into the prompt, the model answers using them. That's it. And in that pipeline the model is almost never the bottleneck. The bottleneck is the "find the relevant bits" step — the retrieval — which is a search and information-retrieval problem that the field spent thirty years working on before transformers were a twinkle in anyone's eye. The teams that nail RAG are the ones who remember that. The teams that struggle are the ones who think the embedding model absolves them of doing search properly.
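To make the "that's it" concrete, here is a minimal sketch of the whole pipeline. The names `build_prompt`, `answer`, `retrieve`, and `generate` are my own placeholders, not any library's API; `retrieve` and `generate` stand in for whatever search backend and model client you actually use.

```python
from typing import Callable, List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Stuff the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the numbered context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],  # hypothetical: your search step
    generate: Callable[[str], str],        # hypothetical: your model call
) -> str:
    """The whole pipeline: retrieve, stuff, generate."""
    chunks = retrieve(question)            # the search problem lives here
    prompt = build_prompt(question, chunks)
    return generate(prompt)                # the model only sees what retrieval found
```

Notice how little of this is the model. Everything the model can possibly know about your corpus arrives through `retrieve`.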
If the right chunk never makes it into the prompt, the cleverest model on earth will answer from thin air. Garbage retrieval in, confident nonsense out — and now it's nonsense with a citation, which is worse, because it looks trustworthy.
So let's talk about the unglamorous things that actually decide whether your RAG works.
Chunking, first, because it's the most underrated decision in the whole stack. How you split your documents determines what can ever be retrieved. Split too small and you shred the context — half a sentence retrieves cleanly and means nothing. Split too big and each chunk is a muddle of five topics, so the embedding represents none of them well and the model drowns in noise. There's no universal right answer; it depends on your documents. But splitting on real structure — sections, headings, semantic boundaries — beats blindly cutting every 500 tokens almost every time. I've seen a RAG system go from useless to genuinely good with no model change at all, purely by fixing how documents were chopped up.
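As a rough illustration of splitting on structure, here is a sketch that assumes your documents use Markdown-style `#` headings, and that counts words as a crude stand-in for tokens. Both are assumptions for the example; `chunk_by_heading` and `max_words` are hypothetical names, and your documents may need a different notion of "section" entirely.

```python
import re
from typing import Dict, List, Tuple

# Assumption: sections are marked with Markdown-style headings ("# Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(text: str, max_words: int = 300) -> List[Dict[str, str]]:
    """Split on headings; split oversized sections on paragraph breaks."""
    sections: List[Tuple[str, List[str]]] = [("", [])]  # (heading, body lines)
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)

    chunks: List[Dict[str, str]] = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue
        # Pack whole paragraphs together until the word budget is reached,
        # so a chunk never ends in the middle of a paragraph.
        current: List[str] = []
        count = 0
        for para in re.split(r"\n\s*\n", body):
            words = len(para.split())
            if current and count + words > max_words:
                chunks.append({"heading": heading, "text": "\n\n".join(current)})
                current, count = [], 0
            current.append(para.strip())
            count += words
        if current:
            chunks.append({"heading": heading, "text": "\n\n".join(current)})
    return chunks
```

Keeping the heading alongside each chunk is a deliberate choice: it gives you something to prepend at embedding time or to filter on later, and it costs nothing to capture while you are splitting anyway.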
Metadata, next, and this is where most teams leave the biggest win on the floor. Your chunks have properties: source, date, author, document type, department, version. If you're not capturing and filtering on those, you're asking a similarity search to do a librarian's job. "What's our current refund policy" should never surface the 2019 version, and no amount of semantic similarity reliably prevents that. A filter on version or effective date does, instantly. Half the retrieval failures I see are really metadata failures wearing a trench coat.
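A sketch of what that filter might look like, applied before any similarity search runs. The `Chunk` fields and the `filter_chunks` function are hypothetical; most vector stores offer their own filter syntax, and the point here is only the shape of the idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Chunk:
    # Hypothetical schema: capture whatever properties your documents have.
    text: str
    doc_type: str
    effective: date
    superseded: bool = False


def filter_chunks(
    chunks: List[Chunk],
    doc_type: Optional[str] = None,
    effective_on_or_after: Optional[date] = None,
    current_only: bool = True,
) -> List[Chunk]:
    """Narrow the candidate set on metadata before ranking by similarity."""
    kept = []
    for chunk in chunks:
        if doc_type is not None and chunk.doc_type != doc_type:
            continue
        if effective_on_or_after is not None and chunk.effective < effective_on_or_after:
            continue
        if current_only and chunk.superseded:
            continue
        kept.append(chunk)
    return kept
```

With `current_only` left on, a superseded 2019 policy never reaches the ranking step at all, however similar its wording is to the question.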
Now the one people argue with me about: hybrid beats pure vector far more often than the vector-database marketing admits. Pure semantic search is genuinely good at meaning and genuinely bad at exact matches: product codes, error numbers, proper nouns, acronyms, that one specific term of art your industry uses. Someone searches for error code "E-4471" and a vector search helpfully returns chunks about errors in general. Old-fashioned keyword search (BM25, the ranking function that has anchored keyword search engines for decades) finds the exact match reliably, provided your tokenizer doesn't split the code into pieces. Run both, combine the rankings, and you get the meaning and the precision. It's more engineering than flipping on a vector store, which is precisely why people skip it and then wonder why exact-match queries fail.
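One common way to combine the rankings is reciprocal rank fusion, which needs only the rank positions from each retriever and not their incomparable raw scores. This is a sketch of that technique, not the only option; the document ids are made up, and `k = 60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the scores are summed across lists.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "E-4471":
keyword_hits = ["e4471-guide"]                                      # exact match
vector_hits = ["errors-overview", "logging-basics", "e4471-guide"]  # by meaning
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here the guide that actually mentions the code ranks first in `fused`, because it is the only document both retrievers agree on, even though the vector search alone had it in third place.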
Which brings me to the discipline that separates the projects that improve from the ones that plateau: evaluate retrieval separately from generation. These are two different systems and they fail differently. If you only judge the final answer, you can never tell whether a bad answer came from bad retrieval or bad generation, and so you can't fix it — you just keep swapping models and changing nothing that matters. Build a set of real questions with the documents that should answer them. Measure whether retrieval surfaces the right chunks — recall, precision, where the right chunk ranks. Get that solid first. Only then worry about how the model phrases things. Fix the search before you touch the prose.
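A bare-bones version of that retrieval evaluation might look like the following. The function names and the shape of the gold set (a question mapped to the ids of the chunks that should answer it) are assumptions for illustration, and `retrieve` is again a placeholder for your own search step.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant chunks (assumes k >= 1)."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    retrieve: Callable[[str], List[str]],  # hypothetical: returns ranked chunk ids
    gold: Dict[str, Set[str]],             # question -> ids that should answer it
    k: int = 5,
) -> Dict[str, float]:
    """Average the three metrics over a non-empty gold set of real questions."""
    n = len(gold)
    totals = {"recall": 0.0, "precision": 0.0, "mrr": 0.0}
    for question, relevant in gold.items():
        retrieved = retrieve(question)
        totals["recall"] += recall_at_k(retrieved, relevant, k)
        totals["precision"] += precision_at_k(retrieved, relevant, k)
        totals["mrr"] += reciprocal_rank(retrieved, relevant)
    return {name: total / n for name, total in totals.items()}
```

No model is involved anywhere in this loop, which is the point: you can rerun it after every chunking, metadata, or ranking change and see whether the search actually got better.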
And under all of it sits the work nobody wants: data hygiene. Duplicate documents, three contradictory versions of the same policy, scanned PDFs that OCR'd into gibberish, tables that flattened into word soup, stale content nobody flagged as dead. RAG over a messy knowledge base produces messy answers with total confidence. The least glamorous and most valuable thing you can do for a RAG project is clean the underlying data — deduplicate, version, fix the extraction, prune the dead weight. It's tedious and it's most of the actual job, and the teams that skip it are the teams that ship something embarrassing.
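Deduplication is the easiest of those chores to start on. The sketch below catches only documents that are identical after whitespace and case are normalized; near-duplicates and contradictory versions need more than a hash, and I'm not pretending otherwise. The `deduplicate` function and the `"text"` key are assumptions about how your documents are stored.

```python
import hashlib
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first copy of each document whose normalized text is identical."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Run it before chunking, not after, so the same paragraph doesn't get embedded several times and crowd better results out of the top of the ranking.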
Last thing, and I mean it: sometimes you don't need RAG at all, and reaching for it is just complexity cosplay. If the relevant information fits comfortably in the context window, put it in the prompt — context is cheap now and a retrieval pipeline you don't need is a liability you have to maintain. If the knowledge is stable and bounded, fine-tuning may serve better than retrieving the same things forever. If the answer lives in a database or an API, give the model a tool call and let it fetch the precise answer rather than fuzzy-matching against embedded text. RAG is the right tool for a large, changing corpus of unstructured text you need to ground answers in. It is not a default, and treating it as one is how you end up maintaining a vector database to solve a problem a WHERE clause would've handled.
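For the database case, the "tool" can be as dull as a parameterized query. This sketch uses Python's built-in `sqlite3` with an invented `orders` table; the table, its columns, and both function names are hypothetical, and how you expose the function to the model depends on your model provider's tool-calling interface.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database so the example runs on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("A-1001", "shipped"), ("A-1002", "processing")],
    )
    conn.commit()
    return conn


def lookup_order_status(conn: sqlite3.Connection, order_id: str) -> Optional[str]:
    """The tool the model calls: an exact lookup, no embeddings involved."""
    row = conn.execute(
        "SELECT status FROM orders WHERE order_id = ?",  # the WHERE clause
        (order_id,),
    ).fetchone()
    return row[0] if row else None
```

The lookup either returns the exact status or returns `None`. There is no "closest match" to be confidently wrong about.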
Get the search right and the AI part mostly takes care of itself. Get the search wrong and no model will save you. That's the whole lesson, and it predates the hype by about three decades.