In-Product AI Search & Q&A on Customer Data — Chat Prompts

If your B2B SaaS has any customer data — projects, documents, conversations, tickets, contacts, files — by 2026 customers will expect AI-powered search and Q&A across THEIR data. Linear's AI search across issues, Notion's AI Q&A across the workspace, Slack's AI summarization across channels — every modern SaaS ships some version. The competitive frontier has moved from "we have search" to "I can ask my product anything in natural language and it answers from my data."

This is distinct from In-Product Help Center & Knowledge Base (AI on YOUR docs) and from Workplace AI Search Tools (Reference) (Glean / Dust — search across the customer's many tools). This is AI search + Q&A within YOUR product on the customer's data inside YOUR product.

The naive shape: bolt OpenAI on top of full-text search. That works for a v0 demo; at scale it hallucinates, it's slow and costly, and it raises privacy concerns. The right shape: hybrid retrieval (lexical + semantic) + ranking + LLM synthesis with citations + permission enforcement + cost controls + evaluation.

This chat walks through implementing real in-product AI search + Q&A: data model, indexing pipeline, retrieval architecture, LLM synthesis, permissions, citations, cost controls, and the operational realities.

What you're building

  • Search index across customer's data (projects, docs, tickets, etc.)
  • Hybrid retrieval (lexical + semantic embeddings)
  • Permission-aware retrieval (don't surface data the user can't see)
  • LLM synthesis with citations
  • Streaming UI (token-by-token answer)
  • "Ask anything" UI surface
  • Cost controls (per-customer, per-query budget)
  • Evaluation harness (regression-test answer quality)
  • Operational concerns (indexing lag, permission changes, deletion)

1. Decide the scope BEFORE building

Help me decide what shape of in-product AI search to ship.

Three increasingly-deep shapes:

LEVEL 1: AI-AUGMENTED SEARCH (the simplest start)
- Keep existing search; add AI summary + answer at top
- "Smart answer: here's what I found..."
- Falls back to search results if AI uncertain
- Pros: 4-6 weeks to ship; clear UX
- Cons: limited; doesn't reason across documents deeply

LEVEL 2: NATURAL-LANGUAGE Q&A
- Customer types question in natural language
- AI retrieves relevant documents + synthesizes answer
- Cites specific documents
- "Talk to your data" experience
- Pros: high customer value
- Cons: 8-16 weeks; more engineering; eval discipline; cost

LEVEL 3: AGENTIC IN-PRODUCT ASSISTANT
- Beyond Q&A: takes ACTIONS in product based on questions
- "Reschedule all meetings with Acme to next week"
- "Summarize this project + create follow-up tasks"
- Pros: deeply integrated; productivity multiplier
- Cons: 24+ weeks; safety + permissions complex; see [In-Product AI Agent Implementation](./in-product-ai-agent-implementation-chat)

DEFAULT FOR MOST B2B SaaS:
- Year 1 of shipping AI features: Level 1 (AI summary on top of search)
- Year 1+: Level 2 (full Q&A) once the scope is clear
- Year 2+: Level 3 (agentic) when Level 2 has earned trust
- Don't pre-build Level 3

Output: an explicit scope statement covering what's in v1 and what's not.

2. Design the indexing pipeline

For AI Q&A, you need fast retrieval across customer data. That means an index.

Architecture:

Source data:
- Live in primary DB (Postgres / your stack)
- Includes: documents, projects, tasks, comments, etc. (whatever is searchable)

Index data:
- Lexical: Postgres FTS, Algolia, Typesense, or OpenSearch
- Semantic: vector embeddings stored in Postgres (pgvector), Pinecone, Weaviate, or your vector DB
- Hybrid retrieval: combine both

Indexing pipeline:

1. Source change event (e.g., document_updated)
2. Push to index queue
3. Worker:
   - Fetches updated document
   - Chunks document if large (e.g., 500-1000 tokens per chunk)
   - Generates lexical index entry (insert or update)
   - Generates embedding (per chunk)
   - Stores in vector DB
   - Updates indexed_at timestamp on source
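
A sketch of that worker, assuming a queue-driven handler and hypothetical db / chunkDocument / embedBatch / removeFromIndex helpers:

async function handleDocumentUpdated(event: { workspaceId: string; resourceId: string }) {
  // Fetch the latest version of the source document
  const doc = await db.documents.findById(event.resourceId)
  if (!doc) return removeFromIndex(event.resourceId)  // deleted since the event fired

  // Lexical index: one upserted row per document
  await db.searchIndexLexical.upsert({
    workspaceId: event.workspaceId,
    resourceType: 'document',
    resourceId: doc.id,
    title: doc.title,
    content: doc.plainText,
    permissions: doc.permissionTags,  // tagged at indexing time
  })

  // Semantic index: chunk, embed, replace this resource's chunks
  const chunks = chunkDocument(doc.plainText)  // ~500 tokens per chunk, 50-token overlap
  const embeddings = await embedBatch(chunks.map(c => c.text))
  await db.searchIndexChunks.replaceForResource(
    doc.id,
    chunks.map((c, i) => ({ chunkIndex: i, text: c.text, tokenCount: c.tokenCount, embedding: embeddings[i] }))
  )

  await db.documents.update(doc.id, { indexedAt: new Date() })
}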

Schema:

search_index_lexical (
  id              uuid pk
  workspace_id    uuid
  resource_type   text
  resource_id     uuid
  title           text
  content         text  -- full text searchable
  metadata        jsonb -- structured filters
  permissions     jsonb -- who can see this
  indexed_at      timestamptz
)

GIN INDEX on to_tsvector('english', title || ' ' || content)

search_index_chunks (
  id              uuid pk
  resource_id     uuid
  chunk_index     int
  text            text
  token_count     int
  embedding       vector(1536)  -- pgvector; dimension must match your embedding model (or store in a dedicated vector DB)
  metadata        jsonb
  indexed_at      timestamptz
)

INDEX ON search_index_chunks USING hnsw (embedding vector_cosine_ops)

Chunking strategy:
- Documents over 500 tokens: chunk into 500-token segments with 50-token overlap
- Each chunk gets its own embedding
- Store chunk-level + document-level metadata
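
A rough chunker along those lines, using whitespace-split words as a stand-in for real token counting (swap in a proper tokenizer in practice):

function chunkDocument(text: string, chunkTokens = 500, overlapTokens = 50) {
  // Word count approximates token count; a real tokenizer (e.g., tiktoken) is more accurate
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: { text: string; tokenCount: number }[] = []
  let start = 0
  while (start < words.length) {
    const slice = words.slice(start, start + chunkTokens)
    chunks.push({ text: slice.join(' '), tokenCount: slice.length })
    if (start + chunkTokens >= words.length) break
    start += chunkTokens - overlapTokens  // step forward, keeping the overlap
  }
  return chunks
}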

Embedding model:
- text-embedding-3-large (OpenAI) — default (3,072-dim output; reducible via the dimensions parameter)
- voyage-3 (Voyage) — popular alternative
- cohere-embed-v3 — alternative
- Cost: ~$0.13 per million tokens (cheap)

Generation strategy:
- On document update: regenerate ALL chunks
- Cost: small (most documents change rarely)
- Don't generate embeddings at read time (adds latency)

Indexing lag:
- Document saved → index ready: target <30 seconds
- For high-write tenants: 1-2 minute lag acceptable
- Monitor + alert on lag

Permissions in index:
- Each indexed document tagged with workspace_id + permission tags
- Filter at query time
- NEVER expose document index entries across tenants

Implement:
1. Migration for search_index tables
2. Indexing worker (queue-driven; Inngest / Trigger.dev / your queue)
3. Embedding generation
4. Permission tagging at indexing time
5. Update-on-change webhook from source DB

Output: indexing pipeline.

3. Implement hybrid retrieval

Retrieval = combine lexical + semantic.

Why hybrid:
- Lexical: catches exact-keyword matches, named entities, specific terms
- Semantic: catches paraphrasing, related concepts, intent
- Together: 30%+ better recall than either alone

Retrieval flow:

async function searchAndRetrieve(query: string, workspaceId: string, userId: string) {
  // Step 1: Embed the query
  const queryEmbedding = await embed(query)
  
  // Step 2: Lexical search (Top-K)
  const lexicalResults = await db.searchIndexLexical.search({
    query,
    workspaceId,
    limit: 30,
    permissionsCheck: { userId },
  })
  
  // Step 3: Semantic search (Top-K)
  const semanticResults = await db.searchIndexChunks.semanticSearch({
    embedding: queryEmbedding,
    workspaceId,
    limit: 30,
    permissionsCheck: { userId },
  })
  
  // Step 4: Combine via Reciprocal Rank Fusion (RRF)
  const merged = reciprocalRankFusion([lexicalResults, semanticResults], { k: 60 })
  
  // Step 5: Rerank with cross-encoder (optional, expensive)
  // Use Cohere Rerank or similar
  const reranked = await cohereRerank(query, merged.slice(0, 20))
  
  // Step 6: Return top-N for LLM synthesis
  return reranked.slice(0, 5)
}

Reciprocal Rank Fusion (RRF):
- Combines rankings from multiple retrievers
- Score = sum(1 / (k + rank)) across retrievers
- k = 60 typical
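
A straightforward RRF implementation over the two ranked lists (a sketch; resourceId stands in for whatever unique key your results carry):

type Ranked = { resourceId: string }

function reciprocalRankFusion<T extends Ranked>(resultLists: T[][], { k = 60 } = {}) {
  const fused = new Map<string, { item: T; score: number }>()
  for (const list of resultLists) {
    list.forEach((item, i) => {
      const entry = fused.get(item.resourceId) ?? { item, score: 0 }
      entry.score += 1 / (k + i + 1)  // i is 0-based; the formula uses 1-based rank
      fused.set(item.resourceId, entry)
    })
  }
  // Highest fused score first
  return [...fused.values()].sort((a, b) => b.score - a.score).map(e => e.item)
}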

Reranking:
- Cross-encoder model: takes (query, document) pairs; scores relevance
- Slower than retrieval but more accurate
- Cohere Rerank, Voyage Rerank, or open-source (BGE-Reranker)
- Apply to top 20 candidates; return top 5

Permission filtering:
- Apply BEFORE retrieval (don't fetch then filter)
- Index permission tags; query with WHERE user_can_see = true
- Defense-in-depth: also filter at LLM-context-building step
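
With pgvector, the permission filter can live directly in the similarity query. A sketch assuming permission tags are stored as a jsonb array of group ids and a tagged-template SQL client (userGroupId is illustrative; vector serialization depends on your client):

// Serialize the query embedding in pgvector's '[x,y,...]' literal form
const vec = `[${queryEmbedding.join(',')}]`

const semanticResults = await sql`
  SELECT c.resource_id, c.text,
         c.embedding <=> ${vec}::vector AS distance
  FROM search_index_chunks c
  JOIN search_index_lexical l ON l.resource_id = c.resource_id
  WHERE l.workspace_id = ${workspaceId}
    AND l.permissions @> ${JSON.stringify([userGroupId])}::jsonb  -- user's group must be tagged on the doc
  ORDER BY c.embedding <=> ${vec}::vector
  LIMIT 30
`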

Performance budgets:
- Total retrieval (lexical + semantic + rerank): target <500ms p95
- Without rerank: <200ms p95
- LLM synthesis (next step): target <3s for streaming first token

Implement:
1. Lexical search function
2. Semantic search function (pgvector or external)
3. RRF combination
4. Reranking (optional; recommended)
5. Permission filtering at all layers
6. Caching for hot queries (60s TTL)

Output: retrieval that surfaces right answers.

4. Build the LLM synthesis layer

LLM synthesizes the answer. This is where quality is made or lost.

Synthesis flow:

// Assumes the Vercel AI SDK ('ai' package) for streamText
import { streamText } from 'ai'

async function synthesizeAnswer(query: string, retrievedDocs: Document[]) {
  const context = retrievedDocs.map((d, i) => 
    `[${i + 1}] ${d.title}\n${d.content}\n(source: ${d.url})`
  ).join('\n\n---\n\n')
  
  const systemPrompt = `
You are an assistant for [Product]. Answer the user's question based ONLY on the provided documents.
Cite documents using [1], [2] etc. matching the document numbers.
If the documents don't fully answer, say so clearly.
Be concise (under 200 words).
Don't make up features or facts.
`
  
  const userPrompt = `
Documents:
${context}

Question: ${query}
`
  
  return streamText({
    model: 'anthropic/claude-sonnet-4-7',
    system: systemPrompt,
    prompt: userPrompt,
    maxOutputTokens: 1000,
  })
}

Streaming UI:
- Token-by-token to UI (better UX)
- User sees agent "thinking"
- Show citations as they appear
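
If you're on the Vercel AI SDK (as the synthesis sketch above assumes) and a framework with Web-standard Request/Response, wiring the stream through a route handler is roughly:

// e.g. app/api/ask/route.ts in Next.js (path and helpers are illustrative)
export async function POST(req: Request) {
  const { query, workspaceId, userId } = await req.json()
  const docs = await searchAndRetrieve(query, workspaceId, userId)
  const result = await synthesizeAnswer(query, docs)
  // Stream tokens to the client as they are generated
  return result.toTextStreamResponse()
}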

Citation rendering:
- LLM outputs [1], [2] inline
- UI replaces with link to document
- Hover preview of cited content

Confidence handling:
- LLM might hallucinate
- Post-generation check: verify all claims have citations
- If LLM says "based on these documents..." but document doesn't actually contain that: flag
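
A cheap post-generation check, sketched below: pull the bracketed citations out of the answer and flag anything suspicious (no citations at all, or numbers that don't map to a retrieved document). Deeper claim-vs-source verification needs an LLM-as-judge pass.

function checkCitations(answer: string, retrievedCount: number) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map(m => Number(m[1]))
  const invalid = cited.filter(n => n < 1 || n > retrievedCount)
  return {
    hasCitations: cited.length > 0,
    invalidCitations: invalid,
    // Route to fallback / review if the model asserted an answer with no grounding
    needsReview: cited.length === 0 || invalid.length > 0,
  }
}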

Quality controls:
- Eval harness with golden test set
- Run nightly via [Promptfoo / Braintrust]
- Score answers against expected
- Regression alerts

Cost controls:
- Per-query token budget (input + output)
- Per-customer monthly budget
- Per-tier rate limits

Cost estimate:
- Embedding query: ~$0.001
- Retrieval: $0
- Reranking: ~$0.001
- LLM synthesis: ~$0.01-0.05 per answer (Claude Sonnet 4.7; Haiku for cheaper)
- Total: ~$0.02-0.05 per Q&A

For high-volume: cache common questions; dedupe via Vercel AI Gateway.

Implement:
1. Synthesis function with streaming
2. Citation rendering
3. Confidence check
4. Cost tracking
5. Eval harness integration

Output: synthesis that doesn't hallucinate.

5. Build the customer-facing UI

"Ask anything" UI patterns:

Pattern A: Search bar with AI mode
- Existing search input; toggle "Ask AI" mode
- AI mode: natural-language input + AI answer with citations
- Falls back to search results

Pattern B: Dedicated "Ask" button
- Separate UI surface (e.g., "Ask AI" floating button or sidebar)
- Always-on AI conversation interface
- Conversation history (per session or persistent)

Pattern C: Inline AI suggestions
- Surface AI as suggestion cards in dashboards
- "Did you know? AI noticed..."
- Lower friction than asking

Default for B2B SaaS in 2026: Pattern A (search + AI toggle) is least disruptive; Pattern B for power users.

Required UI elements:

Input:
- Multi-line text input
- Suggested questions (3-5 starter prompts)
- Voice input optional

Loading state:
- "AI is thinking..." with skeleton
- Streaming tokens as they arrive
- Indicator that retrieval is happening

Response:
- AI-generated answer with citations
- Cited documents linked + previewable
- "Was this helpful?" feedback
- "Open in [Product]" links to specific document

Conversation history:
- Stack of past questions (within session)
- Previous answers re-renderable
- Optional: persist conversation across sessions

Empty / error states:
- "I'm not sure. Here are search results instead" → fallback search
- "I can't find that information"
- Don't make up answers

Privacy + transparency:
- "AI is processing your data" indicator (some users want to know)
- "Sources: [document titles]" visible
- Option to disable AI for sensitive workspaces

Implement:
1. Search/Ask UI component
2. Streaming response renderer
3. Citation hover preview
4. Feedback capture (thumbs up/down + comment)
5. Conversation history (session-level)
6. Empty / error states
7. Privacy indicators

Output: UI customers actually use.

6. Implement evaluations + iteration

Without evals, every prompt change is regression risk.

Eval components:

Golden test set:
- 50-100 representative queries customers ask
- Each with expected answer / expected citations
- Curated by you + customers

Categories:
- Specific lookup ("What's the deadline for project X?")
- Summarization ("Summarize what we did this quarter")
- Comparative ("Which customer is largest?")
- Reasoning across documents ("Are deadlines on track?")

For each:
- Expected: answer correctness, citation quality, no hallucinations
- Grade: 0-10 score from LLM-as-judge OR human review
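
A golden case can be as small as this; the grading runs through Promptfoo / Braintrust, or a simple runner of your own, sketched here with hypothetical helpers (generateAnswerText, llmJudge, TEST_WORKSPACE, TEST_USER) and illustrative test data:

type GoldenCase = {
  query: string
  expectedFacts: string[]        // claims the answer must contain
  expectedCitations: string[]    // resource ids that should be cited
  category: 'lookup' | 'summarization' | 'comparative' | 'reasoning'
}

const example: GoldenCase = {
  query: "What's the deadline for project X?",
  expectedFacts: ['the March 15 deadline'],      // illustrative
  expectedCitations: ['doc_project_x_plan'],     // illustrative
  category: 'lookup',
}

async function runEval(testCase: GoldenCase) {
  const docs = await searchAndRetrieve(testCase.query, TEST_WORKSPACE, TEST_USER)
  const answer = await generateAnswerText(testCase.query, docs)   // non-streaming variant of synthesis
  const retrievalHit = docs.some(d => testCase.expectedCitations.includes(d.resourceId))
  const score = await llmJudge(answer, testCase.expectedFacts)    // 0-10 via LLM-as-judge
  return { retrievalHit, score }
}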

Cadence:
- Nightly: full golden test suite via Promptfoo / Braintrust
- Per-PR: subset (5-min run)
- Weekly: human review of 10 random production queries

Quality dashboards:
- Score distribution by category
- Hallucination rate (sampled audit)
- Customer feedback (thumbs ratio)
- Retrieval recall (top-5 contains expected citation?)

Failure clustering:
- Capture every failed query (low score or thumbs-down)
- Cluster by failure type
- Track: "wrong document retrieved", "hallucination", "incomplete"
- Feed back to retrieval / prompt iteration

A/B testing:
- New prompt vs old on subset of traffic
- Measure: customer satisfaction, latency, cost
- Promote winner

Implement:
1. Eval test set
2. Eval runner (Promptfoo / Braintrust)
3. CI integration
4. Quality dashboards
5. Weekly failure-review process
6. A/B testing framework

Output: confidence in changes.

7. Cost controls + operational realities

Per-query cost:
- Embedding: ~$0.001
- LLM synthesis: $0.01-0.05
- Reranking: ~$0.001
- Total: ~$0.02-0.05

At 1000 queries/customer/month: $20-50/customer
At 10000 queries/customer/month: $200-500/customer

Budget controls:

Per-tier limits:
- Free: 10 queries/day
- Pro: 100 queries/day
- Enterprise: 1000 queries/day OR custom

Per-query token cap:
- Max input + output tokens (e.g., 4000 in + 1000 out)
- Reject overlong queries

Per-customer monthly budget:
- Soft limit at 80%; alert customer
- Hard limit at 100%; require upgrade or wait
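
Enforcement can be a single check before synthesis. A sketch, with a hypothetical usage store (db.aiUsage, notifyCustomer) and per-tier config:

type Tier = { dailyQueryLimit: number; monthlyBudgetUsd: number }

async function checkAiBudget(workspaceId: string, tier: Tier) {
  // Hypothetical helper: today's query count + month-to-date estimated cost
  const usage = await db.aiUsage.getCurrent(workspaceId)

  if (usage.queriesToday >= tier.dailyQueryLimit) {
    return { allowed: false, reason: 'daily_limit' }
  }
  if (usage.monthlyCostUsd >= tier.monthlyBudgetUsd) {
    return { allowed: false, reason: 'monthly_budget' }  // hard limit: upgrade or wait
  }
  if (usage.monthlyCostUsd >= tier.monthlyBudgetUsd * 0.8 && !usage.softAlertSent) {
    await notifyCustomer(workspaceId, 'ai_budget_80_percent')  // soft limit: alert at 80%
  }
  return { allowed: true }
}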

Caching:
- Common queries (e.g., "summarize this project") cache 5-15 min
- Per-tenant; respect permissions
- Save 30-60% of cost for popular queries
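
The cache key is the important part: scope it by tenant and permission context so one user's cached answer never serves a user who can't see the same documents. A sketch:

import { createHash } from 'node:crypto'

function answerCacheKey(workspaceId: string, userPermissionGroups: string[], query: string) {
  // Normalize so trivial variations of the same question hit the same entry
  const normalized = query.trim().toLowerCase().replace(/\s+/g, ' ')
  // Never key by query alone: tenant + permission groups must be part of the key
  const raw = [workspaceId, [...userPermissionGroups].sort().join(','), normalized].join('|')
  return 'ai-answer:' + createHash('sha256').update(raw).digest('hex')
}

// e.g. cache.set(answerCacheKey(ws, groups, q), answer, { ttlSeconds: 600 })  // 5-15 min TTL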

Cheaper models:
- Use Haiku for simple lookups (~5x cheaper)
- Sonnet for complex reasoning
- Gateway routes based on complexity

Edge cases:

1. Workspace with massive data
- Index size limits (e.g., 1M documents per workspace; soft limit)
- For larger: pagination + deeper retrieval

2. Documents updated continuously
- Indexing lag visible
- Show "indexed N seconds ago" in UI

3. Customer deletes document
- Remove from index immediately
- AI doesn't surface stale data

4. Permission changes
- User permission revoked → re-filter retrieval
- Cached AI answers may show old data
- Invalidate cache on permission change

5. Sensitive data (PII; financial)
- Customer admin: "exclude these docs from AI"
- Per-document AI-eligibility flag

6. AI hallucinates
- Detected via confidence + citation check
- Customer flag → revise the prompt / retrieval
- Eventually: regression-test against documented hallucinations

7. Cross-language queries
- User asks in Japanese; documents in English
- The LLM handles it; retrieval may not (cross-lingual embeddings do better here than lexical search)
- Multi-lingual embedding models help

8. AI loops in conversation
- "What about that? Tell me more."
- Use conversation context (previous Q&A)
- Cap conversation history (last 10 turns)

9. Customer's data is too large for context window
- Even with retrieval, top-5 docs may exceed 100K tokens
- Aggressive chunking
- Map-reduce summarization (see the sketch after this list)

10. Latency spikes
- LLM provider slow; first-token >5s
- Fallback to cached / pre-computed
- User-facing "AI is slow today" message

For each: code change + UX impact + ops alert.
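
For edge case 9, the map-reduce pass looks roughly like this (assuming the AI SDK's generateText; the model id is illustrative): compress each retrieved document to a query-focused summary first, then synthesize from the summaries.

import { generateText } from 'ai'

const MAP_MODEL = 'anthropic/claude-haiku-4-5'  // cheap Haiku-class model; exact id depends on your gateway

async function mapReduceAnswer(query: string, docs: { title: string; content: string; url: string }[]) {
  // Map: shrink each document to only what's relevant to the question
  const summaries = await Promise.all(docs.map(async (d) => {
    const { text } = await generateText({
      model: MAP_MODEL,
      prompt: `Summarize the parts of this document relevant to: "${query}"\n\n${d.title}\n${d.content}`,
      maxOutputTokens: 300,
    })
    return text
  }))

  // Reduce: answer from the summaries, which now fit comfortably in the context window
  return synthesizeAnswer(
    query,
    docs.map((d, i) => ({ title: d.title, content: summaries[i], url: d.url }))
  )
}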

Output: a system that doesn't blow up the bill.

8. Recap

What you've built:

  • Indexing pipeline (lexical + semantic embeddings)
  • Hybrid retrieval with permission filtering
  • LLM synthesis with citations
  • Streaming UI
  • Feedback capture
  • Eval harness + nightly tests
  • Cost controls + budgets
  • Cache + dedupe optimization
  • Operational alerts + monitoring

What you're explicitly NOT shipping in v1:

  • Agentic actions (defer to Level 3; see In-Product AI Agent Implementation)
  • Cross-customer AI (anti-pattern; multi-tenant boundary)
  • Voice input (defer; nice-to-have)
  • Persistent multi-session conversations (defer; complexity)
  • Multi-modal (image / audio inputs) (defer)
  • Customer-tunable prompts (defer)

Ship Level 1 in 4-8 weeks. Add Level 2 (full Q&A) when retrieval quality is solid. Defer Level 3 until Q&A has earned trust.

The biggest mistake teams make: shipping AI Q&A before retrieval is good. Garbage in (wrong docs retrieved) → garbage out (wrong answer). Get retrieval to 70%+ recall first.

The second mistake: skipping permission filtering. Easy bug: AI surfaces a document the user shouldn't see. Career-ending.

The third mistake: skipping evals. Every prompt change is a regression risk. Even 20 test cases beats nothing.

See Also