Most RAG in 2026 is still built like it's 2023. Chunk the docs, embed them, run cosine similarity, stuff the top-K into the prompt. Works for the demo. Falls apart for real users. Wrong chunks retrieved, no citations, no evals to catch the regression.

Production-grade RAG has six layers now. Most teams ship the first two and skip the last four. That's why your users complain about hallucinations while the founder who ran the demo blames the model. The model is rarely the problem.

The six layers

1. Ingestion + chunking: turning source docs into searchable pieces
2. Embeddings + vector index: the part everyone gets right
3. Hybrid search: vector + lexical (BM25, full-text) together
4. Reranking: a cross-encoder step on the top-N candidates
5. Citation-required generation: the LLM must cite which chunk supports each claim
6. Evals as a CI gate: a held-out test set that fails the build on regression

Skipping layers 3–6 is what makes "demo works, users complain" so common.

Chunking — the underrated part

Most retrieval failures we audit aren't retrieval failures. They're chunking failures. The right chunk doesn't exist in the index because chunking split the answer across two chunks.

What works:

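Structure-aware chunking, not fixed character-count chunking. Use a chunker that respects structure: headings, paragraphs, list items. LangChain's RecursiveCharacterTextSplitter and LlamaIndex's SentenceSplitter both try natural boundaries (paragraphs, sentences) before falling back to hard cuts, but only if you set chunk size and overlap deliberately instead of accepting the defaults.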

Chunks in the 300–800 token range for general text. Below 300 loses context. Above 800 hurts retrieval precision.

10–15% overlap. Catches answers that straddle chunk boundaries.

Metadata on every chunk. Source URL, section heading, page number, last-updated date. Used by retrieval (filtering) and by citation (showing the user where the answer came from).

For tables and code blocks: don't chunk them. Keep them whole even if they exceed your token budget. A half-chunked table is unusable.
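The chunking rules above can be sketched in a few lines. This is a minimal illustration, not LangChain's or LlamaIndex's implementation: it assumes paragraphs are separated by blank lines, approximates tokens with whitespace-separated words, and uses hypothetical names (`Chunk`, `chunk_document`, `overlap_ratio`). Overlap is taken at paragraph level, so a trailing paragraph is carried over only when it fits the overlap budget.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def _make_chunk(paragraphs, source_url, index):
    # Metadata travels with every chunk for filtering and citation.
    return Chunk("\n\n".join(paragraphs),
                 {"source_url": source_url, "chunk_index": index})


def _overlap_tail(paragraphs, budget):
    """Return the trailing paragraphs that fit within the overlap budget."""
    tail = []
    for para in reversed(paragraphs):
        cost = count_tokens(para)
        if cost > budget:
            break
        tail.insert(0, para)
        budget -= cost
    return tail


def chunk_document(text: str, source_url: str,
                   max_tokens: int = 800,
                   overlap_ratio: float = 0.12) -> list[Chunk]:
    """Pack whole paragraphs into chunks; a paragraph is never split."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []

    for para in paragraphs:
        used = sum(count_tokens(p) for p in current)
        if current and used + count_tokens(para) > max_tokens:
            chunks.append(_make_chunk(current, source_url, len(chunks)))
            current = _overlap_tail(current, int(max_tokens * overlap_ratio))
        # An oversized paragraph (a table, a code block) is kept whole.
        current.append(para)

    if current:
        chunks.append(_make_chunk(current, source_url, len(chunks)))
    return chunks
```

A real pipeline would swap `count_tokens` for the embedding model's tokenizer and add section heading, page number, and last-updated date to the metadata dict in the same way.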

Embeddings — pick once, then move on

Embedding-model selection in 2026 matters less than people think. Differences between top providers run under 5% on retrieval quality for typical use cases. What matters:

Use a provider you can route via Vercel AI Gateway or similar. Don't lock yourself in: prices and quality shift.

Pick one model and stay there. Re-embedding a corpus costs real money. Flip-flopping kills velocity.

OpenAI text-embedding-3-large or Voyage AI voyage-3-large for most cases. Cohere embed-v3 also fine.

Don't use the cheapest model. Cost difference between cheap and good is rounding error compared to LLM inference cost. Quality difference is real.

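Store embeddings in pgvector on Postgres unless you have a reason not to. Pinecone, Weaviate, Qdrant all work. pgvector is "just Postgres" with HNSW indexes, and that's the right default for SaaS products. One caveat: pgvector's HNSW index on the vector type is limited to 2,000 dimensions, so a 3,072-dimension model such as text-embedding-3-large needs either the halfvec type or shortened embeddings.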

Hybrid search — stop doing pure vector

Vector search is great at semantic similarity. "How do I cancel my subscription" matches "cancellation policy". It's bad at exact-match — product names, error codes, numbers. "Error E-4407" or "the X12 standard" — pure vector misses these.

Pure BM25 (lexical) misses semantic queries. You want both.

The pattern:

1. Run the user query through both vector retrieval AND lexical retrieval (Postgres full-text if you're on pgvector; OpenSearch or Elasticsearch if you have one)
2. Take the top N from each (we use N=20)
3. Combine with Reciprocal Rank Fusion, a 5-line algorithm, into a single ranked list of ~30 candidates
4. Hand the merged list to the reranker
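The fusion step in the pattern above can be sketched as follows. This is a minimal version of Reciprocal Rank Fusion under a few assumptions: each retriever returns chunk ids ordered best first, the smoothing constant `k=60` is the conventional default rather than something tuned for your corpus, and the function name and `limit` parameter are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, limit=30):
    """Merge several best-first id lists into one ranked list.

    Each id scores 1 / (k + rank) per list it appears in; ids found
    by both retrievers accumulate score and rise to the top.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return fused[:limit]


vector_hits = ["c7", "c2", "c9"]   # hypothetical vector-search results
lexical_hits = ["c2", "c4", "c7"]  # hypothetical full-text results
print(reciprocal_rank_fusion([vector_hits, lexical_hits]))
# ['c2', 'c7', 'c4', 'c9']
```

RRF works on ranks only, so there is no need to normalize cosine distances against BM25 scores, which live on different scales.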

This is the single biggest quality jump from naive RAG. If you do nothing else from this post, do hybrid search.

Reranking — the cheapest win nobody bothers with

Top-K from hybrid search is good, not great. A cross-encoder reranker scores each (query, chunk) pair properly and reorders.

Why nobody does it: adds 200–500ms of latency and a per-query inference cost (roughly $0.0001 to $0.002, depending on the reranker). Why everyone should: improves top-3 precision by 20–40% in our measurements. That's the difference between users trusting the system and users complaining.

Production-ready rerankers in 2026:

Cohere rerank-v3 — best quality we've measured. ~$0.002/query at typical batches.

BAAI bge-reranker-v2-m3 — open source. Self-hostable. Slower. No per-query fee at scale, though you pay for the compute to host it.

Voyage rerank-2 — close second to Cohere.

Hand top 30 from hybrid search to the reranker. Get back top 5. Those go in the prompt.
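The "top 30 in, top 5 out" step is a sort over (query, chunk) scores. The sketch below keeps the scorer pluggable: `score_pair` is a hypothetical callable that in production would wrap a call to your reranker of choice, and `overlap_score` is a deliberately naive stand-in used only so the example runs without a model or API key.

```python
from typing import Callable, Sequence


def rerank(query: str,
           candidates: Sequence[tuple[str, str]],
           score_pair: Callable[[str, str], float],
           top_k: int = 5) -> list[str]:
    """Score every (query, chunk_text) pair and return the best chunk ids.

    candidates is a sequence of (chunk_id, chunk_text) tuples, typically
    the ~30 results coming out of hybrid search.
    """
    scored = [(score_pair(query, text), chunk_id)
              for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:top_k]]


def overlap_score(query: str, text: str) -> float:
    # Toy stand-in for a cross-encoder: share of query words in the chunk.
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    ("a", "how to reset your password"),
    ("b", "cancel your subscription in settings"),
    ("c", "subscription pricing tiers"),
]
print(rerank("cancel subscription", candidates, overlap_score, top_k=2))
# ['b', 'c']
```

Keeping the scorer behind a function boundary makes it cheap to compare a hosted reranker against a self-hosted one on the same eval set.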

Citation-required generation

The LLM must cite which chunk supports each claim. This is what stops hallucination from looking like fact.

The prompt pattern:

You are answering using only the provided context. For each factual claim in your answer, cite the chunk_id in square brackets like [chunk_3]. If the context doesn't support an answer, say "I don't know based on the provided documentation."

Plus a strict output schema. We use JSON output with a claims array, each claim having a citations field. UI renders the answer with linked footnotes back to source.

Two purposes: user can verify, and evals can check whether claims are grounded.
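A schema like this is only useful if something checks it. Here is a minimal validation sketch, assuming the model returns a JSON object with a `claims` array whose items carry a `citations` list of chunk ids; the `text` field and the function name are assumptions made for illustration, not a fixed format.

```python
import json


def validate_citations(raw_json: str, provided_chunk_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means every claim is cited
    and every citation points at a chunk that was actually in the prompt."""
    data = json.loads(raw_json)
    problems = []
    for index, claim in enumerate(data.get("claims", [])):
        cited = claim.get("citations", [])
        if not cited:
            problems.append(f"claim {index} has no citation")
        for chunk_id in cited:
            if chunk_id not in provided_chunk_ids:
                problems.append(f"claim {index} cites unknown chunk {chunk_id}")
    return problems


response = json.dumps({
    "claims": [
        {"text": "Plans can be cancelled any time.", "citations": ["chunk_3"]},
        {"text": "Refunds take 5 days.", "citations": ["chunk_9"]},
        {"text": "Support is 24/7.", "citations": []},
    ]
})
print(validate_citations(response, {"chunk_1", "chunk_2", "chunk_3"}))
# ['claim 1 cites unknown chunk chunk_9', 'claim 2 has no citation']
```

This check only confirms that citations exist and resolve. Whether the cited chunk actually supports the claim still needs an eval or a human.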

Evals — where most teams skip the work

Without evals you have no way to know when a prompt change, a model upgrade, or a corpus update broke retrieval. You ship a "better" model and quietly lose 20% accuracy.

Minimum viable eval setup:

50–200 held-out question-answer pairs. Real questions from your users. Ground-truth answers from a domain expert.

Per-question metrics: did the right chunk get retrieved in top-5 (recall@5); did the generated answer match ground truth (semantic similarity + human-approved variants); were the citations valid (citation precision).

CI gate that fails the build if any metric drops by more than 5%.

Tools that help: Braintrust, Langfuse, OpenAI Evals, or a homegrown setup (under 300 lines of code). Tool matters less than the discipline to run on every change.
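A homegrown version of the metrics and the gate can start very small. The sketch below makes two assumptions worth stating: the 5% threshold is read as a relative drop against a stored baseline, and citation precision is measured against an expert-labelled set of supporting chunks. All function names are illustrative.

```python
def recall_at_k(retrieved_ids, expected_id, k=5):
    """1.0 if the known-correct chunk appears in the top k, else 0.0."""
    return 1.0 if expected_id in retrieved_ids[:k] else 0.0


def citation_precision(cited_ids, supporting_ids):
    """Share of cited chunks that are in the labelled supporting set."""
    if not cited_ids:
        return 0.0
    return sum(1 for c in cited_ids if c in supporting_ids) / len(cited_ids)


def average(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def regressions(baseline, current, max_drop=0.05):
    """Names of metrics that fell more than max_drop relative to baseline."""
    failed = []
    for name, base in baseline.items():
        now = current.get(name, 0.0)
        if base > 0 and (base - now) / base > max_drop:
            failed.append(name)
    return failed


# Two toy eval cases: (retrieved ids, id of the chunk holding the answer)
runs = [(["c1", "c2", "c3"], "c2"), (["c4", "c5"], "c9")]
print(average(recall_at_k(ids, want) for ids, want in runs))  # 0.5

baseline = {"recall@5": 0.80, "citation_precision": 0.90}
current = {"recall@5": 0.74, "citation_precision": 0.89}
print(regressions(baseline, current))  # ['recall@5']
```

In CI, the script would end with something like `sys.exit(1 if regressions(baseline, current) else 0)` so a regression fails the build.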

Cost-aware production

A working RAG at moderate scale (10K queries/day, 5K-document corpus) runs $200–$600/month all-in:

Embeddings (one-time + updates): $5–30/month after the initial backfill

Vector storage (pgvector): rounding error on managed Postgres

LLM inference: $80–300/month at Claude Sonnet or GPT-4o prices, depending on input length

Reranker inference: $20–100/month

Eval runs: $5–20/month

Dominant cost is LLM inference. Levers:

Cache the LLM response for repeated queries (Vercel runtime cache with query+top-3-chunk-ids as key) — kills 30–50% of cost in most apps

Route queries by complexity (Haiku for simple lookups, Sonnet for synthesis) via AI Gateway

Token budgets per route — never let a single query exceed the budget
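The caching lever depends on a stable key. Here is a sketch of the "query + top-3 chunk ids" key, with a plain dict standing in for whatever cache you deploy. The normalization (lowercasing, collapsing whitespace) and the names `cache_key` and `answer_with_cache` are assumptions made for illustration.

```python
import hashlib


def cache_key(query: str, chunk_ids: list[str], top_n: int = 3) -> str:
    """Hash the normalized query together with the top-N retrieved chunk ids.

    Including chunk ids means a corpus update that changes retrieval
    also changes the key, so stale answers are not served.
    """
    normalized = " ".join(query.lower().split())
    payload = normalized + "|" + ",".join(chunk_ids[:top_n])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def answer_with_cache(query, chunk_ids, cache, generate):
    """Return a cached answer, calling generate(query, chunk_ids) on a miss."""
    key = cache_key(query, chunk_ids)
    if key not in cache:
        cache[key] = generate(query, chunk_ids)
    return cache[key]


calls = []


def fake_llm(query, chunk_ids):
    # Stand-in for the real LLM call; records how often it is invoked.
    calls.append(query)
    return f"answer based on {chunk_ids[0]}"


cache = {}
answer_with_cache("How do I cancel?", ["c2", "c7", "c4"], cache, fake_llm)
answer_with_cache("how do  I cancel?", ["c2", "c7", "c4", "c9"], cache, fake_llm)
print(len(calls))  # 1
```

The second call hits the cache because the query normalizes to the same string and the top three chunk ids are identical.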

The 2026 stack we ship

For a typical SaaS RAG feature:

Vercel AI SDK for orchestration (streaming, tool calls, structured outputs)

AI Gateway for model routing, fallback, budget control

Anthropic Claude primary (Sonnet 4.6 default, Haiku for cost-sensitive paths)

OpenAI text-embedding-3-large for embeddings

Postgres + pgvector for vector store

Postgres FTS for the lexical side

Cohere rerank-v3 for the reranker

Braintrust or homegrown evals in CI

Not the only good stack. It's the one we'd argue for unless you have a specific reason to deviate.

What we won't do

ReAct-style agents doing retrieval. Agents that decide when to retrieve are slower, more expensive, and rarely better than a planned pipeline. Use them only when the user task genuinely needs multi-step reasoning.

Fine-tuning before RAG. Almost never the right move in 2026. RAG handles "the model needs to know our docs" much better than fine-tuning does.

Custom embedding models. You don't need them.

Skipping evals "for now". "Later" never happens. Build them on day one.

What this looks like for your build

1. Start with the eval set: 20–50 real questions, real ground-truth answers
2. Build the simplest working version (vector search + LLM with citation prompt + the evals)
3. Add hybrid search. Re-run evals. Recall@5 jumps 10–20 points.
4. Add reranker. Re-run evals. Another 10–20 point jump.
5. Add caching and model routing for cost.

Total build time for production-ready RAG: 1–3 weeks for a typical knowledge-base use case.