RAG Done Right in 2026: Hybrid Search, Reranking, Evals

Production-grade RAG has six layers now. Most teams ship the first two and skip the last four. That's why your users complain about hallucinations and the founder who only saw the demo thinks the model is the problem. The model is rarely the problem.
The six layers
1. Ingestion + chunking: turning source docs into searchable pieces
2. Embeddings + vector index: the part everyone gets right
3. Hybrid search: vector + lexical (BM25, full-text) together
4. Reranking: a cross-encoder step on the top-N candidates
5. Citation-required generation: the LLM must cite which chunk supports each claim
6. Evals as a CI gate: a held-out test set that fails the build on regression
Skipping layers 3–6 is what makes "demo works, users complain" so common.
Chunking — the underrated part
Most retrieval failures we audit aren't retrieval failures. They're chunking failures. The right chunk doesn't exist in the index because chunking split the answer across two chunks.
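A quick way to see this failure is to check whether any single chunk still contains the full answer span. The sketch below is illustrative only: the fixed-size character chunker, the sample document, and the function names are assumptions made for the example, not a recommended chunking strategy.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunker: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def answer_in_single_chunk(chunks: list[str], answer: str) -> bool:
    """True if at least one chunk contains the whole answer span."""
    return any(answer in chunk for chunk in chunks)


# Hypothetical document and answer span, used only to show the effect.
doc = "Refunds are issued within 14 days. Contact support to start one."
answer = "within 14 days"

small = fixed_size_chunks(doc, 24)  # a boundary lands inside the answer
large = fixed_size_chunks(doc, 40)  # the answer fits inside the first chunk

split_case = answer_in_single_chunk(small, answer)
intact_case = answer_in_single_chunk(large, answer)
```

With the 24-character chunks no chunk holds the answer, so no retriever can return it, however good the embeddings are. Running a check like this over an eval set's ground-truth spans is one way to separate chunking failures from retrieval failures.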
What works:
Embeddings — pick once, then move on
Embedding-model selection in 2026 matters less than people think. Differences between top providers run under 5% on retrieval quality for typical use cases. What matters:
Store embeddings in pgvector on Postgres unless you have a reason not to. Pinecone, Weaviate, and Qdrant all work, but pgvector is "just Postgres" with HNSW indexes, which makes it the right default for SaaS products.
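For reference, a minimal pgvector setup could look roughly like this. The table name, column names, and the 1536 dimension are assumptions for illustration, and the `%(name)s` placeholders assume a psycopg-style driver. The SQL is kept in strings so the sketch runs without a database.

```python
# Illustrative schema: `chunks`, `content`, `embedding`, and the 1536
# dimension are assumptions; match the dimension to your embedding model.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine-distance operator; smaller means closer.
SEARCH_SQL = """
SELECT id, content
FROM chunks
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(limit)s;
"""


def to_pgvector_literal(embedding: list[float]) -> str:
    """Format a Python list in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# Parameters you would pass alongside SEARCH_SQL to the driver.
params = {"query_vec": to_pgvector_literal([1, 0.5]), "limit": 20}
```

The operator class on the index (`vector_cosine_ops`) has to match the distance operator in the query, otherwise Postgres will not use the HNSW index.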
Hybrid search — stop doing pure vector
Vector search is great at semantic similarity. "How do I cancel my subscription" matches "cancellation policy". It's bad at exact-match — product names, error codes, numbers. "Error E-4407" or "the X12 standard" — pure vector misses these.
Pure BM25 (lexical) misses semantic queries. You want both.
The pattern:
1. Run the user query through both vector retrieval AND lexical retrieval (Postgres full-text if you're on pgvector; OpenSearch or Elasticsearch if you have one)
2. Take the top N from each (we use N=20)
3. Combine with Reciprocal Rank Fusion (a 5-line algorithm) into a single ranked list of ~30 candidates
4. Hand the merged list to the reranker
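The fusion step can be sketched as follows. Each document scores the sum of 1 / (k + rank) over every list it appears in. The constant k=60 is the value commonly used with RRF and is an assumption here, as are the function name, the tie-break rule, and the sample chunk IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, limit: int = 30
) -> list[str]:
    """Merge several ranked ID lists into one list using RRF scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:limit]


# Hypothetical top hits from each retriever, best first.
vector_hits = ["c7", "c2", "c9"]
lexical_hits = ["c2", "c4", "c7"]

merged = reciprocal_rank_fusion([vector_hits, lexical_hits])
```

Here `merged` comes out as c2, c7, c4, c9: chunks found by both retrievers rise above chunks found by only one. RRF uses ranks only, so it needs no score normalization between BM25 and cosine distance.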
This is the single biggest quality jump from naive RAG. If you do nothing else from this post, do hybrid search.
Reranking — the cheapest win nobody bothers with
Top-K from hybrid search is good, not great. A cross-encoder reranker scores each (query, chunk) pair properly and reorders.
Why nobody does it: adds 200–500ms of latency and ~$0.0001 per query in inference cost. Why everyone should: improves top-3 precision by 20–40% in our measurements. That's the difference between users trusting the system and users complaining.
Production-ready rerankers in 2026:
Hand top 30 from hybrid search to the reranker. Get back top 5. Those go in the prompt.
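In code, the handoff is a score-sort-truncate step. The sketch below takes the scorer as a parameter, because the choice of reranker is up to you. `toy_overlap_score` is a hypothetical stand-in so the example runs offline; in production `score_fn` would call a real cross-encoder.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair and keep the top_k chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words in the chunk."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(chunk.lower().split())) / len(query_words)


# Hypothetical candidates from the hybrid search step.
candidates = [
    "Billing overview",
    "Subscription tiers explained",
    "How to cancel your subscription",
]
top = rerank("cancel subscription", candidates, toy_overlap_score, top_k=2)
```

Keeping the scorer behind a plain function signature also makes it easy to swap rerankers and compare them on the same eval set.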
Citation-required generation
The LLM must cite which chunk supports each claim. This is what stops hallucination from looking like fact.
The prompt pattern:
You are answering using only the provided context. For each factual claim in your answer, cite the chunk_id in square brackets like [chunk_3]. If the context doesn't support an answer, say "I don't know based on the provided documentation."
Plus a strict output schema. We use JSON output with a claims array, each claim having a citations field. UI renders the answer with linked footnotes back to source.
Two purposes: user can verify, and evals can check whether claims are grounded.
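A grounding check over that JSON output might look like the sketch below. The `claims` array and `citations` field come from the schema described above; the `text` field name, the function name, and the sample response are assumptions for illustration.

```python
import json


def ungrounded_claims(raw_response: str, context_ids: set[str]) -> list[str]:
    """Return the text of claims with no citation or a citation to an unknown chunk."""
    payload = json.loads(raw_response)
    flagged = []
    for claim in payload["claims"]:
        citations = claim.get("citations", [])
        # A claim is grounded only if it cites at least one chunk
        # and every cited chunk was actually in the prompt context.
        if not citations or any(c not in context_ids for c in citations):
            flagged.append(claim["text"])
    return flagged


# Hypothetical model output; only chunk_1 and chunk_3 were in the prompt.
response = json.dumps({
    "claims": [
        {"text": "Refunds take 14 days.", "citations": ["chunk_3"]},
        {"text": "Refunds are free.", "citations": []},
        {"text": "Support is 24/7.", "citations": ["chunk_9"]},
    ]
})
flagged = ungrounded_claims(response, {"chunk_1", "chunk_3"})
```

This only verifies that citations exist and point at real context. Whether the cited chunk actually supports the claim is a separate check, and that one belongs in the evals.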
Evals — where most teams skip the work
Without evals you have no way to know when a prompt change, a model upgrade, or a corpus update broke retrieval. You ship a "better" model and quietly lose 20% accuracy.
Minimum viable eval setup:
Tools that help: Braintrust, Langfuse, OpenAI Evals, or a homegrown setup (under 300 lines of code). The tool matters less than the discipline to run the evals on every change.
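The core of a homegrown setup is small. The sketch below computes mean recall@k over a test set and compares it to a stored baseline. The test-case shape (`question`, `relevant_ids`), the 0.02 tolerance, and the fake retriever are assumptions made for the example.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    cases: list[dict], retrieve: Callable[[str], list[str]], k: int = 5
) -> float:
    """Average recall@k across the eval set, using the given retriever."""
    scores = [
        recall_at_k(retrieve(case["question"]), set(case["relevant_ids"]), k)
        for case in cases
    ]
    return sum(scores) / len(scores)


def passes_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the current score has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance


# Hypothetical eval cases and a fake retriever standing in for the real pipeline.
cases = [
    {"question": "q1", "relevant_ids": ["a"]},
    {"question": "q2", "relevant_ids": ["b", "c"]},
]
fake_index = {"q1": ["a", "x"], "q2": ["b", "y"]}
score = mean_recall_at_k(cases, lambda q: fake_index[q])
```

In CI, the job would exit non-zero when `passes_gate(score, baseline)` is false, which is what turns the eval set into a gate rather than a dashboard.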
Cost-aware production
A working RAG at moderate scale (10K queries/day, 5K docs corpus) runs $200–$600/month all-in:
Dominant cost is LLM inference. Levers:
The 2026 stack we ship
For a typical SaaS RAG feature:
Not the only good stack. The one we'd argue for without a specific reason to deviate.
What we won't do
ReAct-style agents doing retrieval. Agents that decide when to retrieve are slower, more expensive, and rarely better than a planned pipeline. Use them only when the user task genuinely needs multi-step reasoning.
Fine-tuning before RAG. Almost never the right move in 2026. RAG handles "the model needs to know our docs" much better than fine-tuning does.
Custom embedding models. You don't need them.
Skipping evals "for now". "Later" never happens. Build them on day one.
What this looks like for your build
1. Start with the eval set: 20–50 real questions, real ground-truth answers
2. Build the simplest working version (vector search + LLM with citation prompt + the evals)
3. Add hybrid search. Re-run evals. Recall@5 jumps 10–20 points.
4. Add reranker. Re-run evals. Another 10–20 point jump.
5. Add caching and model routing for cost.
Total build time for production-ready RAG: 1–3 weeks for a typical knowledge-base use case.
Broader hire-vs-engage math: Hire vs Engage 2026.
Workflow that makes building this fast: How we use Claude Code in production.