
# RAG & Semantic Search: Build "Chat With Your Data" Features That Don't Hallucinate


RAG Strategy for Your New SaaS

Goal: Ship retrieval-augmented generation (RAG) features — "ask your data," documentation chat, semantic search over user content — that return accurate, citation-backed answers instead of confidently-wrong hallucinations. Use a managed embedding model + vector store, design the retrieval pipeline carefully, score quality continuously, and treat hallucinations as bugs to be eliminated, not features to apologize for. Avoid the failure modes where founders ship "chat with your docs" using OpenAI alone (no retrieval = pure hallucination), pick a vector database without thinking about scale (surprise Pinecone bills; pgvector "just works" for most), or skip evaluation (every prompt change can quietly regress quality without anyone noticing).

Process: Follow this chat pattern with your AI coding tool, such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: Basic RAG with pgvector + OpenAI embeddings shipped in 3-5 days. Hybrid search + reranking + citations in week 2. Eval pipeline + production observability in week 3. Quarterly review baked in.


## Why Most Founder RAG Implementations Are Broken

Three failure modes hit founders the same way:

  • No retrieval — just stuff data into the prompt. Founder ships "chat with your docs" by concatenating recent docs into the prompt. Two problems: (1) context window fills fast; can't scale past 50 docs; (2) the LLM still hallucinates because the relevant doc may not be in context. Customers see "I read 47 of your documents and conclude X" — but X is wrong because the right doc wasn't in context.
  • Naive embedding-only retrieval. Founder ships text-embedding + similarity search. Works for "what does our product do?" but fails on specific queries ("what was the pricing in Q2?") because semantic similarity catches related but not exact content. No reranking, no keyword fallback. Quality looks great in demos; breaks on real queries.
  • No citations / no source-of-truth surfacing. The AI returns a plausible-sounding answer. Customer asks "where did you get this?" Founder has no answer; LLM made it up. Trust evaporates the first time customer catches a hallucination.

The version that works is structured: chunk content carefully, embed with a quality model, store in a vector DB you can operate, retrieve with hybrid (semantic + keyword), rerank, generate with strict citation requirements, evaluate continuously, and refuse to answer when confidence is low.

This guide assumes you have already done AI Features Implementation (the broader pattern), have considered Search (keyword search foundation), have shipped Multi-Tenant Data Isolation (RAG must respect tenant boundaries), and have shipped LLM observability for production tracing.


## 1. Decide What "RAG" Means for Your Product

The first decision is product, not technical. Different RAG shapes need different pipelines.

Help me decide which RAG shape my product needs.

The shapes:

**Shape 1: Documentation chat**
- "Ask the docs" interface
- Index: your public documentation
- Users: prospects evaluating, customers troubleshooting
- Stakes: medium (wrong answer = lost trust, missed onboarding)
- Update frequency: weekly (docs change)

**Shape 2: Customer-data chat**
- "Ask your data" interface (per-tenant)
- Index: each customer's own content (notes, projects, etc.)
- Users: the customer, querying their own data
- Stakes: high (wrong answer about user's own data = real problem)
- Update frequency: real-time (new content keeps appearing)
- Crucial: tenant isolation

**Shape 3: Knowledge-base chat (internal tools)**
- For internal teams (support agents, sales reps)
- Index: company knowledge (Notion, Confluence, Slack)
- Users: employees
- Stakes: medium-high (wrong answer = bad customer interaction)

**Shape 4: Product-feature semantic search**
- Power a search bar with semantic understanding
- Index: product entities (tasks, contacts, documents)
- Users: customers searching their own workspace
- Stakes: medium (poor results = bad search UX)

**Shape 5: Code-aware Q&A**
- Specialized: ask questions about a codebase
- Index: repos / files
- Users: devs
- Stakes: high if generating code; lower if explaining

**The decision criteria**:

- Public vs per-tenant data
- Update frequency (static / weekly / real-time)
- Stakes of wrong answers
- Volume of indexed content (1K docs vs 100M)

For my product:
- Which shape?
- What's in the index?
- Who queries?
- What's the cost of a wrong answer?

Output:
1. The chosen shape
2. The data source(s) being indexed
3. The user audience
4. The hallucination tolerance (low / medium / high)

The biggest unforced error: shipping "chat with your data" without thinking about update frequency. A documentation chat with weekly re-indexing is fine. A customer-data chat where new content appears every minute and the index lags by 6 hours is broken. Match the pipeline to the use case.


## 2. Choose Your Vector Database

Vector DB choice is high-leverage. Most indie SaaS overpay or over-engineer.

Help me pick the vector store.

The options (per [vector databases](https://www.vibereference.com/backend-and-data/vector-databases)):

**pgvector** (Postgres extension)
- Use the database you already have
- Works to ~1M vectors at 1536 dimensions on standard hardware
- HNSW indexes ship in modern Postgres
- Free (your existing DB)
- No new vendor

**Convex** (with vector search)
- Built into Convex if you''re already using it
- Real-time subscription support
- Limits: 1M vectors per table; works well for SaaS scale

**Pinecone** (managed vector DB)
- Mature, well-documented
- Scales high; costs scale faster
- $70+/mo serverless minimum
- Closed source

**Qdrant** (OSS)
- OSS Apache 2.0; self-host or cloud
- Strong at scale
- Free OSS; cloud from $0/mo (limited free tier)

**Weaviate** (OSS)
- OSS BSD; self-host or cloud
- Strong hybrid-search built in
- Good GraphQL API

**Chroma** (lightweight OSS)
- Apache 2.0
- Good for prototyping; not production-grade at scale

**Vespa, Milvus, Vald** (enterprise OSS)
- More complex; better for high scale

**The decision criteria**:

- Index size: <1M vectors → pgvector. >10M → managed (Pinecone, Qdrant cloud).
- Operational simplicity: pgvector wins (you already operate Postgres)
- Real-time freshness: managed services usually update fastest
- Hybrid search needs: Weaviate / Qdrant strongest; pgvector with extensions OK

**For most indie SaaS in 2026: pgvector.**

Reasons:
- You already have Postgres
- HNSW indexes ship in pgvector
- Performance is fine to ~1M vectors
- One less vendor
- Same backup / recovery / observability stack
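
If pgvector is the pick, here is a minimal sketch of the chunk table and indexes, in the same `db.query` style as the retrieval code later in this guide. Table and column names are illustrative, not prescriptive; the `search_tsv` column is what the keyword query later in this guide assumes.

```ts
// Assumes the pgvector extension is installed and `db` is your Postgres client.
async function createChunkStore(db: { query: (sql: string) => Promise<unknown> }) {
  await db.query(`CREATE EXTENSION IF NOT EXISTS vector`)

  await db.query(`
    CREATE TABLE IF NOT EXISTS chunks (
      id            bigserial PRIMARY KEY,
      source_doc_id text NOT NULL,
      workspace_id  text NOT NULL,          -- tenant scoping on every row
      chunk_index   int  NOT NULL,
      content       text NOT NULL,
      metadata      jsonb NOT NULL DEFAULT '{}',
      embedding     vector(1536),           -- matches text-embedding-3-small
      search_tsv    tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
    )
  `)

  // HNSW index for approximate nearest-neighbor search (cosine distance)
  await db.query(`
    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
      ON chunks USING hnsw (embedding vector_cosine_ops)
  `)

  // Keyword search + tenant filter support
  await db.query(`CREATE INDEX IF NOT EXISTS chunks_tsv_idx ON chunks USING gin (search_tsv)`)
  await db.query(`CREATE INDEX IF NOT EXISTS chunks_workspace_idx ON chunks (workspace_id)`)
}
```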

Migrate to managed only when:
- You're hitting >5M vectors and querying at high QPS
- You need specific features (graph + vector hybrid)
- Your DB is already overloaded with other work

For my product:
- Estimated vector count today and in 12 months
- Operational team capacity
- Whether I'm on Postgres / Convex / something else

Output:
1. The chosen vector store
2. The reasoning
3. The migration trigger (when to switch)

The biggest miscalculation: picking Pinecone for indie scale. Most indie SaaS have <500K vectors; pgvector is enough. Pinecone is great at scale; the bill at indie scale is hard to justify when Postgres is already in your stack.


## 3. Pick the Right Embedding Model

The embedding model determines retrieval quality. Pick deliberately.

Help me pick the embedding model.

The options (2026):

**OpenAI text-embedding-3-small / text-embedding-3-large**
- Mature; widely used
- $0.02 / 1M tokens (small) — very cheap
- 1536 dim (small) / 3072 dim (large)
- Strong general-purpose; weaker on technical / domain-specific

**Cohere Embed v3 / Cohere Embed Multilingual**
- Strong general; especially good multilingual
- $0.10 / 1M tokens
- Similar dimensions to OpenAI

**Voyage AI (Voyage-3, Voyage-Code, Voyage-Multilingual)**
- Specialized models per domain (code, multilingual, general)
- Often higher quality than OpenAI for specific use cases
- $0.05-0.18 / 1M tokens

**Google text-embedding-005 / Gemini Embedding**
- Strong general-purpose
- $0.025-0.15 / 1M tokens
- Good if on GCP

**Open-source models** (BGE, E5, Nomic, Sentence-Transformers)
- Self-host (HuggingFace, Ollama)
- Free; you pay compute
- Specific models can match closed-source quality
- Operational overhead

**The decision criteria**:

- General SaaS content: OpenAI text-embedding-3-small (cheap, decent)
- Code-heavy: Voyage Code or specialized models
- Multilingual-heavy: Cohere Multilingual or Voyage Multilingual
- Cost-extreme: self-host OSS
- Quality-extreme: Voyage-3 or Cohere Embed v3

**For most indie SaaS in 2026**: OpenAI text-embedding-3-small. Cheap; decent quality; works.
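
A minimal sketch of the embedding calls using the AI SDK's `embed` / `embedMany` helpers. This is one way to back the `embed` / `batchEmbed` helpers used in the code later in this guide; the wrapper function names are assumptions.

```ts
import { embed, embedMany } from 'ai'
import { openai } from '@ai-sdk/openai'

// One embedding model everywhere; changing it later means re-embedding the whole index.
const embeddingModel = openai.embedding('text-embedding-3-small') // 1536 dimensions

// Embed a single query string.
export async function embedQuery(query: string): Promise<number[]> {
  const { embedding } = await embed({ model: embeddingModel, value: query })
  return embedding
}

// Embed many chunks in one call (cheaper and faster than one call per chunk).
export async function batchEmbed(texts: string[]): Promise<number[][]> {
  const { embeddings } = await embedMany({ model: embeddingModel, values: texts })
  return embeddings
}
```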

**Critical implementation rules**:

1. **Don't mix embedding models** without re-indexing everything. Different models = different vector spaces.
2. **Plan for migration**. If you switch models later, re-embed all content.
3. **Pick dimensions wisely**. 1536 dim (default) is fine for most. 3072 doubles storage and query latency.

**Don't**:
- Use the wrong model for domain (general model on legal docs underperforms)
- Skip the migration plan when switching models
- Trust benchmarks over your own evals

Output:
1. The chosen model with reasoning
2. The dimensions
3. The cost estimate per indexed item
4. The model-switching plan if needed

The biggest cost gotcha: embedding millions of tokens once is cheap; re-embedding when you switch models is the real cost. Pick a model you can stay on for 12+ months, or budget for re-indexing.


## 4. Design the Chunking Strategy

How you split content for embedding determines retrieval quality. Get this right.

Help me design chunking.

The patterns:

**Pattern 1: Fixed-size chunks**
- Split text every N tokens (typically 256-512)
- Simple; works for unstructured text
- Loses semantic boundaries

**Pattern 2: Semantic chunks**
- Split on natural boundaries (paragraphs, sections)
- Preserves coherence
- Variable sizes

**Pattern 3: Recursive chunking**
- Split at multiple levels (section → paragraph → sentence)
- Overlap small windows
- Most-common modern approach (LangChain's RecursiveCharacterTextSplitter)

**Pattern 4: Document-structure-aware**
- For Markdown / docs: chunk by H2/H3 sections
- Preserves logical structure
- Best for documentation chat

**Chunk size guidance**:

- 256-512 tokens per chunk: typical sweet spot
- Below 100 tokens: too small; loses context
- Above 1024 tokens: too large; dilutes relevance signal

**Overlap**:

- 10-20% overlap between consecutive chunks
- Captures sentences that span boundaries
- Doesn't double-count too much

**Per content type**:

- **Documentation**: section-aware chunking; 400-800 tokens with 100-token overlap
- **Notes / unstructured**: recursive 512-token chunks with 64-token overlap
- **Code**: function-level chunks; preserve syntax structure
- **Long PDFs**: section + sub-paragraph chunking
- **Short content (tweets, emails)**: 1 chunk per item; no splitting

**Metadata to attach to each chunk**:

- Source document ID
- Section / heading path
- Created / updated timestamp
- Author (if relevant)
- Tags / categories
- workspace_id (for tenant scoping)
- chunk_index (position within source)

This metadata lets you filter at query time and attribute citations.

**Critical implementation rules**:

1. **Chunk metadata > chunk content alone**. Without source IDs, you can't cite.
2. **Re-chunk on schema change**. New chunking strategy = re-index everything.
3. **Test chunking on your specific content**. Generic settings don't always work.

**Don't**:
- Use one-size-fits-all chunking across different content types
- Skip metadata (no metadata = no citations)
- Chunk too aggressively (lose context) or too coarsely (dilute signal)

Output:
1. The chunking strategy per content type
2. The chunk size + overlap config
3. The metadata schema
4. The re-chunking plan when content updates

The biggest quality lift: document-structure-aware chunking instead of fixed-size. A markdown doc chunked by H2 sections retrieves better than the same doc chunked at arbitrary 512-token windows. The structure preserves meaning; arbitrary splits destroy it.
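
A minimal sketch of heading-aware Markdown chunking with a small overlap. The regexes, the token heuristic, and the trimmed metadata are illustrative; production chunks would also carry the source, workspace, and timestamp metadata listed above.

```ts
interface Chunk {
  content: string
  metadata: { headingPath: string; chunkIndex: number }
}

// Split Markdown on H2/H3 headings, then cap each section at a rough token budget
// with a small overlap so sentences spanning a boundary are not lost.
export function chunkMarkdown(markdown: string, maxTokens = 512, overlapTokens = 64): Chunk[] {
  const sections = markdown.split(/\n(?=#{2,3}\s)/) // each section keeps its own heading
  const chunks: Chunk[] = []

  // Rough heuristic: one token is about 0.75 words.
  const maxWords = Math.floor(maxTokens * 0.75)
  const overlapWords = Math.floor(overlapTokens * 0.75)
  const step = Math.max(1, maxWords - overlapWords)

  for (const section of sections) {
    if (!section.trim()) continue
    const heading = section.match(/^#{2,3}\s+(.+)/)?.[1] ?? 'intro'
    const words = section.split(/\s+/)

    for (let start = 0; start < words.length; start += step) {
      chunks.push({
        content: words.slice(start, start + maxWords).join(' '),
        metadata: { headingPath: heading, chunkIndex: chunks.length },
      })
    }
  }
  return chunks
}
```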


## 5. Hybrid Retrieval (Vector + Keyword)

Pure semantic search misses exact-match queries. Combine.

Design the retrieval pipeline.

The pattern:

**Step 1: Hybrid retrieval**

For each query:
1. Compute embedding (semantic search) → top 20 candidates
2. Run keyword/BM25 search → top 20 candidates
3. Combine with reciprocal rank fusion (RRF) or weighted scoring
4. Top 10-30 candidates pass to next step

**Why hybrid**:

- Semantic catches "ways to delete an account" → matches "user removal flow"
- Keyword catches "ACME-2024-INVOICE" → exact-string match
- Together: you get both

**Step 2: Reranking**

Pass top 20 candidates to a reranker:
- Cohere Rerank
- Voyage Rerank
- Or LLM-as-reranker (slower, more flexible)

Reranker scores each candidate against the query → top 5-10 most relevant.

**Why reranking**:

- Pure embedding similarity has noise
- Rerankers see query + candidate together; better signal
- 30-50% quality improvement on most use cases

**Step 3: Filter by metadata**

Before generating, filter candidates by:
- workspace_id (tenant scoping; mandatory)
- Recency (recent docs may be preferred)
- Document type (if user is asking about pricing, prefer pricing docs)
- User permissions (per [RBAC](roles-permissions-chat.md))

**Step 4: Generate with citations**

Pass top 5-10 chunks to the LLM with strict instructions:
- "Answer using ONLY the provided context"
- "Cite each claim with [source: doc_id]"
- "If the answer isn''t in the context, say ''I don''t know'' "

**Implementation example (TypeScript / pgvector)**:

```ts
import { generateText } from 'ai'

// `embed`, `db`, and `cohere` are app-level helpers (embedding call, Postgres
// client, Cohere SDK client) assumed to be wired up elsewhere.
async function rag(query: string, workspaceId: string) {
  // 1. Hybrid retrieval: semantic (pgvector) + keyword (Postgres full-text)
  const queryEmbedding = await embed(query)
  const semanticHits = await db.query(
    `SELECT id, content, metadata,
      1 - (embedding <=> $1::vector) AS similarity
     FROM chunks
     WHERE workspace_id = $2
     ORDER BY embedding <=> $1::vector
     LIMIT 20`,
    // depending on your driver, the embedding array may need to be serialized to a vector literal
    [queryEmbedding, workspaceId]
  )
  const keywordHits = await db.query(
    `SELECT id, content, metadata,
      ts_rank(search_tsv, query) AS rank
     FROM chunks, plainto_tsquery($1) AS query
     WHERE workspace_id = $2 AND search_tsv @@ query
     ORDER BY rank DESC
     LIMIT 20`,
    [query, workspaceId]
  )
  const merged = reciprocalRankFusion(semanticHits, keywordHits)

  // 2. Rerank the top candidates, then map reranker results back to our rows
  const candidates = merged.slice(0, 20)
  const reranked = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: candidates.map(c => c.content),
    topN: 8,
  })
  const top = reranked.results.map(r => candidates[r.index])

  // 3. Generate with citations
  const context = top.map(r => `[${r.metadata.source_id}] ${r.content}`).join('\n\n')
  return await generateText({
    model: 'anthropic/claude-sonnet-4-6', // resolved via your provider/gateway setup
    system:
      'Answer using only the provided context. Cite each claim with [source: doc_id]. ' +
      'If the answer isn\'t in context, say "I don\'t know."',
    prompt: `Context:\n${context}\n\nQuestion: ${query}`,
  })
}
```
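
The `reciprocalRankFusion` helper referenced above isn't shown in the guide; here is a minimal sketch. The k = 60 constant is the conventional RRF default, and the row shape assumes the id/content/metadata columns selected above.

```ts
interface Hit { id: string; content: string; metadata: Record<string, unknown> }

// Reciprocal rank fusion: score each document by the sum of 1 / (k + rank) across
// both result lists, so items ranked well by either signal float to the top.
function reciprocalRankFusion(semanticHits: Hit[], keywordHits: Hit[], k = 60): Hit[] {
  const scores = new Map<string, { hit: Hit; score: number }>()

  for (const list of [semanticHits, keywordHits]) {
    list.forEach((hit, rank) => {
      const entry = scores.get(hit.id) ?? { hit, score: 0 }
      entry.score += 1 / (k + rank + 1)
      scores.set(hit.id, entry)
    })
  }

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(e => e.hit)
}
```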

Critical implementation rules:

  1. Tenant scoping in EVERY query. Filter by workspace_id at the DB layer; never trust application code alone.
  2. Always rerank. The 30-50% quality lift is too valuable to skip.
  3. Cite or refuse. The LLM must say "I don't know" when the context doesn't contain the answer.

Don't:

  • Skip keyword search ("everything is semantic now")
  • Skip reranking (huge quality regression)
  • Allow the LLM to answer without context (full hallucination mode)

Output:

  1. The hybrid-retrieval code
  2. The reranker choice
  3. The prompt template enforcing citations
  4. The "I don''t know" fallback

The single biggest quality win: **reranking after retrieval**. Going from "top 20 by similarity" to "top 5 by reranker" produces dramatically better answers. Skip it and your RAG feels like a clever guess; include it and it feels like understanding.

---

## 6. Force Citations (and "I Don't Know")

A RAG system that hallucinates is worse than no RAG. Build to refuse.

Design the citation + refusal mechanic.

The pattern:

The system prompt:

You are a helpful assistant for [Product]. Answer questions using ONLY the provided context.

Rules:
1. Cite each factual claim with [source: doc_id]
2. If the answer isn't in the context, say "I don't have information about that in my knowledge base."
3. Don't guess. Don't embellish.
4. If multiple chunks contradict, prefer the most recent.
5. Keep answers concise — no filler.

The validation step:

After generation, validate:

  • Are there citations in the response?
  • Do citations match actual document IDs?
  • Are there claims without citations? (Optional: enforce via second LLM pass)

If validation fails:

  • Retry with a stricter prompt
  • Or: return "I don't know" plus the sources it would have used
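
One way the validation pass could look, as a minimal sketch. The `[source: ...]` format and refusal phrasing follow the prompt above; the function name and thresholds are assumptions.

```ts
interface ValidationResult { ok: boolean; reason?: string }

// Check that the answer cites at least one source and that every cited ID
// actually belongs to a chunk retrieved for this query.
function validateCitations(answer: string, retrievedDocIds: Set<string>): ValidationResult {
  const refused = /i don't (know|have information)/i.test(answer)
  if (refused) return { ok: true } // refusals don't need citations

  const cited = [...answer.matchAll(/\[source:\s*([^\]]+)\]/g)].map(m => m[1].trim())
  if (cited.length === 0) return { ok: false, reason: 'no citations' }

  const invented = cited.filter(id => !retrievedDocIds.has(id))
  if (invented.length > 0) {
    return { ok: false, reason: `unknown sources: ${invented.join(', ')}` }
  }

  return { ok: true }
}
```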

The "I don''t know" UX:

When the model says "I don''t know," show:

  • The "I don''t know" text
  • Links to related sources (so user can read themselves)
  • A "rephrase your question" suggestion
  • A "submit feedback" button (these queries are gold for product improvement)

Citation rendering in UI:

In your frontend:

  • Replace [source: doc-123] with a clickable link
  • Show source title + section on hover
  • Numbered citations (Wikipedia-style: [1], [2])
  • "View all sources" panel showing the full list

Critical rules:

  1. Refuse > guess. Hallucinations damage trust permanently.
  2. Citations link to actual sources, not just IDs.
  3. Track citation density. Responses with 0 citations are suspect.

Anti-patterns:

  • Letting the LLM answer "based on general knowledge" when context is sparse
  • Hiding sources from users
  • Allowing fictional citations (model invents a doc_id)

Don't:

  • Trust the model to refuse without prompt enforcement
  • Skip the validation pass
  • Render citations as plaintext (clickable matters)

Output:

  1. The system prompt with citation requirements
  2. The validation logic
  3. The citation rendering in UI
  4. The "I don''t know" UX

The single biggest trust-builder: **a "I don''t have information about that" response when the answer truly isn''t in the index.** Customers respect "I don''t know" more than confident-but-wrong. Train your RAG to refuse.

---

## 7. Real-Time Indexing

Static RAG is fine for documentation. Per-tenant chat with user content needs real-time updates.

Design real-time indexing.

The pattern:

On content create / update:

When a user creates or edits a document:

  1. Generate embedding for the new/changed content
  2. Update vector store
  3. Index is now current

Implementation:

```ts
async function onDocumentSaved(doc: Document) {
  // Re-chunk
  const chunks = chunkDocument(doc)

  // Re-embed
  const embeddings = await batchEmbed(chunks.map(c => c.content))

  // Replace existing chunks for this document
  await db.transaction(async tx => {
    await tx.query('DELETE FROM chunks WHERE source_doc_id = $1', [doc.id])
    for (let i = 0; i < chunks.length; i++) {
      await tx.query(
        `INSERT INTO chunks
         (source_doc_id, workspace_id, content, embedding, metadata, chunk_index)
         VALUES ($1, $2, $3, $4, $5, $6)`,
        [doc.id, doc.workspaceId, chunks[i].content, embeddings[i], chunks[i].metadata, i]
      )
    }
  })
}
```

Async vs sync:

  • Sync (in the request): user immediately sees their content searchable
  • Async (background job): faster save UX; brief lag before searchable
  • For most products: async is fine; lag of <30s is acceptable

Batching:

  • Embed multiple chunks in one API call (cheaper, faster)
  • OpenAI supports up to 2048 inputs per call
  • Batch by document or by time window

Cost management:

  • Track embedding cost per workspace
  • Cap free-tier indexing volume
  • Don't re-embed unchanged content (compare hash before re-embed)

Cleanup on deletion:

When a document is deleted:

  • Remove its chunks from vector store
  • Also remove embeddings (paid storage)
  • Per account-deletion requirements: purge all of a user's embeddings when their account is deleted

Critical implementation rules:

  1. Tenant scoping mandatory on every chunk
  2. Re-embed only changed content (cost matters)
  3. Soft-delete or hard-delete consistently (don't leave orphan chunks)
  4. Reconciliation job weekly: compare DB to chunk index, catch drift

Don't:

  • Embed everything inline in the HTTP request (slow saves)
  • Re-embed full document when only one paragraph changed
  • Forget to clean up on document delete

Output:

  1. The on-save hook
  2. The async indexing worker
  3. The batching strategy
  4. The reconciliation job
  5. The cost tracking

The biggest invisible cost: **re-embedding unchanged content.** A user edits a typo in a 10-page doc; naive code re-embeds all 10 pages instead of the one paragraph. Hash chunks; re-embed only on actual change.
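
One way the hash check could look, layered onto the save hook above, as a minimal sketch. The `content_hash` column and function names are assumptions.

```ts
import { createHash } from 'node:crypto'

const hashOf = (text: string) => createHash('sha256').update(text).digest('hex')

// Only re-embed chunks whose content actually changed since the last index run.
// Assumes the chunks table stores a content_hash alongside each chunk.
async function chunksNeedingReembedding(docId: string, chunks: { content: string }[]) {
  const existing = await db.query(
    `SELECT chunk_index, content_hash FROM chunks WHERE source_doc_id = $1`,
    [docId]
  )
  const previousHashes = new Map(existing.map(r => [r.chunk_index, r.content_hash]))

  return chunks
    .map((chunk, index) => ({ chunk, index, hash: hashOf(chunk.content) }))
    .filter(({ index, hash }) => previousHashes.get(index) !== hash)
}
```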

---

## 8. Evaluate Quality Continuously

RAG quality drifts. Test every change.

Design RAG evaluation.

The pattern:

Build an eval set:

  • 30-100 question/answer pairs
  • Mix: fact lookup, summary, comparison, edge cases
  • For each: expected answer (or scoring criteria)

Metrics to track:

  1. Retrieval@k: was the right chunk in top-k retrieved?

    • Manual: human rates "yes/no, the chunk that contains the answer is in top 10"
    • Automated: if you have ground-truth chunk IDs, compute precision
  2. Answer correctness: is the generated answer correct?

    • Manual: human rates 1-5
    • LLM-judge: another LLM rates given the question, expected, and actual
  3. Citation accuracy: do citations point to chunks that support the claim?

    • Programmatic: parse citations; verify they exist; verify chunks contain claim text
  4. Refusal accuracy: does the model refuse when it should?

    • Eval set includes questions whose answers are NOT in the index
    • Model should say "I don't know" — measure refusal rate

Eval cadence:

  • Run on every prompt change
  • Run on every retrieval-pipeline change (chunking, model, reranker)
  • Run weekly in production-like conditions
  • Track scores over time

Tools:

  • Langfuse with eval runners (per LLM observability)
  • Braintrust for eval workflows
  • Custom scripts for simple cases
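
For the custom-script route, a minimal sketch of an eval runner for two of the metrics above, retrieval@k and refusal accuracy; something like this could sit behind the `npm run rag:eval` command used in the CI job below. The EvalCase shape and the injected functions are assumptions about your own pipeline.

```ts
interface EvalCase {
  question: string
  expectedChunkIds: string[] // empty means the answer is NOT in the index; the model should refuse
}

// Scores retrieval@k for answerable cases and refusal accuracy for unanswerable ones.
async function runEvals(
  cases: EvalCase[],
  retrieve: (q: string, k: number) => Promise<{ id: string }[]>, // your retrieval pipeline
  answer: (q: string) => Promise<string>,                        // your full RAG endpoint
  k = 10,
) {
  let retrievalHits = 0
  let correctRefusals = 0
  const answerable = cases.filter(c => c.expectedChunkIds.length > 0)
  const unanswerable = cases.filter(c => c.expectedChunkIds.length === 0)

  for (const c of answerable) {
    const retrievedIds = new Set((await retrieve(c.question, k)).map(r => r.id))
    // Retrieval@k: was any ground-truth chunk in the top k?
    if (c.expectedChunkIds.some(id => retrievedIds.has(id))) retrievalHits++
  }

  for (const c of unanswerable) {
    // Refusal accuracy: the model should admit it doesn't know.
    if (/don't (know|have information)/i.test(await answer(c.question))) correctRefusals++
  }

  console.log(`retrieval@${k}: ${(retrievalHits / Math.max(1, answerable.length)).toFixed(2)}`)
  console.log(`refusal accuracy: ${(correctRefusals / Math.max(1, unanswerable.length)).toFixed(2)}`)
}
```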

CI integration:

```yaml
on:
  pull_request:
    paths:
      - 'rag/**'   # only run when retrieval code changes

jobs:
  rag-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run rag:eval   # fail if metrics drop below threshold
```

Production observability:

  • Log every RAG query: query, retrieved chunks, generated answer, citations
  • Log user feedback: 👍 / 👎 / "this answer was wrong"
  • Use feedback to expand eval set

The "broken queries" feedback loop:

  • Customer flags an answer as wrong
  • Engineer reviews: was retrieval bad or generation bad?
  • Add the question to eval set
  • Fix; verify with eval

Don't:

  • Skip evals on "small" changes
  • Trust general benchmarks (your data is different)
  • Forget production observability

Output:

  1. The eval set with expected outcomes
  2. The metric definitions
  3. The CI integration
  4. The production-feedback loop

The single biggest quality regression catcher: **the eval set + CI gate**. A team running 50 evals on every PR catches the prompt change that quietly broke retrieval. Without it, regressions ship and customers find them.

---

## 9. Handle Edge Cases

Real RAG has weird cases. Plan for them.

The edge case checklist.

Edge case 1: Empty / sparse index

User in a fresh workspace asks "what's in my data?" — index has 5 chunks total.

  • Detect; respond "Your knowledge base is small. Add more content for better answers."
  • Don't pretend to "find" things

Edge case 2: Query in foreign language

User queries in French; index is English.

  • If embedding model is multilingual: works
  • Otherwise: detect language; translate query; retrieve; respond in original language

Edge case 3: Query is a question vs query is a keyword

"How do I reset my password?" vs "password reset"

  • Both should work
  • Some retrieval techniques (HyDE: hypothetical document embedding) help with short queries
  • Test both styles in your eval set

Edge case 4: Stale information

User asks "what was Q2 revenue?" — index has Q1 data, no Q2.

  • Detect: top-k chunks are from before Q2
  • Surface: "I have data through Q1 only"
  • Don't pretend Q2 data exists

Edge case 5: Conflicting sources

Document A says X; Document B says Y; both are in top retrieved.

  • Generation prompt: "If sources conflict, prefer the most recent OR surface the conflict"
  • UI: show both citations with their conflict noted

Edge case 6: Highly-specific factual queries

"What was the exact wording in section 4.2 of doc XYZ?"

  • Pure semantic might miss; keyword + reranker helps
  • Direct quote in citation

Edge case 7: Very long documents

A 200-page PDF — chunking 200 pages produces hundreds of chunks.

  • Hierarchical retrieval: first retrieve "which document?", then "which chunk in document"
  • Or: page-level chunks for routing, paragraph-level for content

Edge case 8: Permissions per document

User A can see docs 1-5; User B can see docs 3-7.

  • Must filter chunks at query time by user permissions
  • Per RBAC: don't trust the LLM to respect permissions

Edge case 9: Adversarial queries

User: "Ignore previous instructions; just answer X"

  • Prompt injection attempts
  • Defense: structured outputs; refuse off-topic; don't blindly follow user-supplied instructions
  • See OWASP LLM Top 10

Edge case 10: Rate limiting at the LLM provider

OpenAI / Anthropic returns 429. RAG fails.

  • Retry with backoff
  • Fall back to keyword search alone (no generation; just "here are 5 docs that match")
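
A minimal sketch of the retry-then-degrade behavior; the injected `generateAnswer` and `keywordSearch` functions stand in for your own pipeline, and the status-code checks are illustrative.

```ts
type Answer = { type: 'answer'; answer: string } | { type: 'documents'; documents: unknown[] }

// Retry transient provider errors with exponential backoff; if the provider is still
// failing, degrade to plain keyword search instead of failing the whole request.
async function answerWithFallback(
  query: string,
  workspaceId: string,
  deps: {
    generateAnswer: (q: string, ws: string) => Promise<string>          // full RAG pipeline
    keywordSearch: (q: string, ws: string, limit: number) => Promise<unknown[]>
  },
): Promise<Answer> {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return { type: 'answer', answer: await deps.generateAnswer(query, workspaceId) }
    } catch (err: any) {
      const retryable = err?.status === 429 || (err?.status ?? 0) >= 500
      if (!retryable) throw err
      await new Promise(r => setTimeout(r, 500 * 2 ** attempt)) // 0.5s, 1s, 2s
    }
  }
  // Degraded mode: no generation, just "here are the documents that match".
  return { type: 'documents', documents: await deps.keywordSearch(query, workspaceId, 5) }
}
```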

Output:

  1. Handling per edge case
  2. The detection logic
  3. The fallback patterns

---

## 10. Quarterly Review

RAG rots. Quarterly review keeps it sharp.

Quarterly RAG review.

Quality:

  • Eval scores over time
  • 👍 / 👎 ratio in production
  • "This answer was wrong" report rate
  • Refusal rate trend

Cost:

  • Embedding cost per period
  • Vector storage cost
  • LLM generation cost
  • Per-feature breakdown

Infrastructure:

  • Vector index size and growth rate
  • Retrieval latency (p50, p95, p99)
  • Reranker latency
  • End-to-end response time

Coverage:

  • What % of questions get answered (vs "I don't know")?
  • Are there topics consistently failing?
  • Do "broken query" patterns suggest content gaps?

Model updates:

  • New embedding models worth evaluating?
  • New rerankers worth testing?
  • New base LLMs that might generate better answers?

Output:

  • Quality snapshot
  • 1-2 retrieval improvements
  • 1 cost optimization
  • 1 model update to test

---

## What "Done" Looks Like

A working RAG system in 2026 has:

- A defined product shape (docs chat / customer-data chat / etc.)
- A vector store appropriate to scale (pgvector for most indie SaaS)
- A quality embedding model (OpenAI text-embedding-3-small as default)
- Document-structure-aware chunking
- Hybrid retrieval (semantic + keyword)
- Reranking before generation
- Strict citation requirements with refusal on uncertainty
- Tenant scoping enforced at the DB layer
- Real-time indexing with deduplication
- Continuous evaluation in CI + production
- Edge-case handling for sparse indexes / foreign languages / conflicts / permissions / prompt injection
- Quarterly review baked into the team rhythm

The hidden cost in RAG isn't the embedding tokens — it's **the trust damage from a single hallucination customers catch**. A team without retrieval discipline ships answers that look good in demos and break under real queries. The discipline of "retrieve-rerank-cite-or-refuse" turns RAG from a magic trick into a reliable feature. Build it right; evaluate it constantly; refuse when uncertain.

---

## See Also

- [AI Features Implementation](ai-features-implementation-chat.md) — broader AI integration pattern
- [Search](search-chat.md) — keyword search foundation
- [Multi-Tenant Data Isolation](multi-tenancy-chat.md) — tenant scoping is mandatory
- [Roles & Permissions (RBAC)](roles-permissions-chat.md) — per-document permissions
- [Account Deletion & Data Export](account-deletion-data-export-chat.md) — purge embeddings on user deletion
- [LLM Cost Optimization](llm-cost-optimization-chat.md) — companion topic
- [LLM Quality Monitoring](llm-quality-monitoring-chat.md) — production quality tracking
- [Vector Databases](https://www.vibereference.com/backend-and-data/vector-databases) — pgvector / Pinecone / Qdrant / Weaviate / Chroma
- [LLM Observability Providers](https://www.vibereference.com/ai-development/llm-observability-providers) — Langfuse / LangSmith / Braintrust
- [AI SDK](https://www.vibereference.com/ai-development/ai-sdk) — TS / Node SDK with embedding utilities
- [AI SDK Core](https://www.vibereference.com/ai-development/ai-sdk-core) — embed / generateText / streamText
- [Convex AI Memory Tutorial](https://www.vibereference.com/ai-development/convex-ai-memory-tutorial) — Convex-specific RAG
- [DIY AI Memory pgvector + Convex](https://www.vibereference.com/ai-development/diy-ai-memory-pgvector-convex) — implementation deep-dive

[⬅️ Growth Overview](README.md)