In-Product AI Search & Q&A on Customer Data — Chat Prompts
If your B2B SaaS holds any customer data — projects, documents, conversations, tickets, contacts, files — by 2026 customers will expect AI-powered search + Q&A across THEIR data. Linear's AI search across issues, Notion's AI Q&A across a workspace, Slack's AI summarization across channels — every modern SaaS ships some version. The competitive frontier moved from "we have search" to "I can ask my product anything in natural language and it answers from my data."
This is distinct from In-Product Help Center & Knowledge Base (AI on YOUR docs) and from Workplace AI Search Tools (Reference) (Glean / Dust — search across the customer's many tools). This is AI search + Q&A within YOUR product on the customer's data inside YOUR product.
The naive shape: bolt OpenAI on top of full-text search. That works for a v0 demo; at scale it hallucinates, responds slowly, costs too much, and raises privacy concerns. The right shape: hybrid retrieval (lexical + semantic) + ranking + LLM synthesis with citations + permission enforcement + cost controls + evaluation.
This chat walks through implementing real in-product AI search + Q&A: data model, indexing pipeline, retrieval architecture, LLM synthesis, permissions, citations, cost controls, and the operational realities.
What you're building
- Search index across customer's data (projects, docs, tickets, etc.)
- Hybrid retrieval (lexical + semantic embeddings)
- Permission-aware retrieval (don't surface data the user can't see)
- LLM synthesis with citations
- Streaming UI (token-by-token answer)
- "Ask anything" UI surface
- Cost controls (per-customer, per-query budget)
- Evaluation harness (regression-test answer quality)
- Operational concerns (indexing lag, permission changes, deletion)
1. Decide the scope BEFORE building
Help me decide what shape of in-product AI search to ship.
Three increasingly-deep shapes:
LEVEL 1: AI-AUGMENTED SEARCH (the simplest start)
- Keep existing search; add AI summary + answer at top
- "Smart answer: here's what I found..."
- Falls back to search results if AI uncertain
- Pros: 4-6 weeks; clear UX
- Cons: limited; doesn't reason across documents deeply
LEVEL 2: NATURAL-LANGUAGE Q&A
- Customer types question in natural language
- AI retrieves relevant documents + synthesizes answer
- Cites specific documents
- "Talk to your data" experience
- Pros: 8-16 weeks; high customer value
- Cons: more engineering; eval discipline; cost
LEVEL 3: AGENTIC IN-PRODUCT ASSISTANT
- Beyond Q&A: takes ACTIONS in product based on questions
- "Reschedule all meetings with Acme to next week"
- "Summarize this project + create follow-up tasks"
- Pros: deeply integrated; productivity multiplier
- Cons: 24+ weeks; safety + permissions complex; see [In-Product AI Agent Implementation](./in-product-ai-agent-implementation-chat)
DEFAULT FOR MOST B2B SaaS:
- Year 1 with AI features: Level 1 (AI summary on top of search)
- Year 1+: Level 2 (full Q&A) when scope is clear
- Year 2+: Level 3 (agentic) when Level 2 has earned trust
- Don't pre-build Level 3
Output: explicit scope statement — what's in v1 and what's not.
2. Design the indexing pipeline
For AI Q&A, you need fast retrieval across customer data. That means an index.
Architecture:
Source data:
- Live in primary DB (Postgres / your stack)
- Includes: documents, projects, tasks, comments, etc. (whatever is searchable)
Index data:
- Lexical: Postgres FTS, Algolia, Typesense, or OpenSearch
- Semantic: vector embeddings stored in Postgres (pgvector), Pinecone, Weaviate, or your vector DB
- Hybrid retrieval: combine both
Indexing pipeline:
1. Source change event (e.g., document_updated)
2. Push to index queue
3. Worker:
- Fetches updated document
- Chunks document if large (e.g., 500-1000 tokens per chunk)
- Generates lexical index entry (insert or update)
- Generates embedding (per chunk)
- Stores in vector DB
- Updates indexed_at timestamp on source
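The worker steps above can be sketched with injected dependencies so the pipeline shape is visible and testable. All of the names here (fetchDocument, upsertLexical, etc.) are hypothetical stand-ins for your own data layer, not a real API:

```typescript
// Hypothetical shapes for the source document and the worker's dependencies.
interface SourceDoc { id: string; title: string; body: string }

interface IndexDeps {
  fetchDocument(id: string): Promise<SourceDoc>;
  chunk(text: string): string[];
  embed(texts: string[]): Promise<number[][]>;
  upsertLexical(doc: SourceDoc): Promise<void>;
  upsertChunks(docId: string, chunks: { text: string; embedding: number[] }[]): Promise<void>;
  markIndexed(docId: string): Promise<void>;
}

async function handleIndexJob(docId: string, deps: IndexDeps): Promise<void> {
  const doc = await deps.fetchDocument(docId);   // fetch the updated document
  await deps.upsertLexical(doc);                 // lexical index entry (insert or update)
  const chunks = deps.chunk(doc.body);           // split large documents into chunks
  const embeddings = await deps.embed(chunks);   // one embedding per chunk
  await deps.upsertChunks(
    doc.id,
    chunks.map((text, i) => ({ text, embedding: embeddings[i] })),
  );                                             // store vectors
  await deps.markIndexed(doc.id);                // bump indexed_at on the source
}

// Exercise the pipeline with in-memory fakes that record the call order.
const calls: string[] = [];
await handleIndexJob("doc-1", {
  fetchDocument: async (id) => (calls.push("fetch"), { id, title: "T", body: "a b c" }),
  chunk: (text) => (calls.push("chunk"), [text]),
  embed: async (texts) => (calls.push("embed"), texts.map(() => [0, 0])),
  upsertLexical: async () => { calls.push("lexical"); },
  upsertChunks: async () => { calls.push("chunks"); },
  markIndexed: async () => { calls.push("mark"); },
});
```

In production the fakes become real calls into your queue worker's data layer; the injected-dependency shape keeps the pipeline unit-testable without a database.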
Schema:
search_index_lexical (
id uuid pk
workspace_id uuid
resource_type text
resource_id uuid
title text
content text -- full text searchable
metadata jsonb -- structured filters
permissions jsonb -- who can see this
indexed_at timestamptz
)
GIN INDEX on to_tsvector('english', title || ' ' || content)
search_index_chunks (
id uuid pk
resource_id uuid
chunk_index int
text text
token_count int
embedding vector(1536) -- pgvector; or store in dedicated vector DB
metadata jsonb
indexed_at timestamptz
)
HNSW INDEX on embedding (vector_cosine_ops)
Chunking strategy:
- Documents over 500 tokens: chunk into 500-token segments with 50-token overlap
- Each chunk gets its own embedding
- Store chunk-level + document-level metadata
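The chunking strategy above, as a minimal sketch. Token counts are approximated here by whitespace-separated words; a real pipeline would use the embedding model's tokenizer:

```typescript
interface Chunk {
  chunkIndex: number;
  text: string;
  tokenCount: number;
}

// Split a document into maxTokens-sized chunks, each overlapping the previous
// chunk by `overlap` tokens so context isn't lost at chunk boundaries.
function chunkDocument(text: string, maxTokens = 500, overlap = 50): Chunk[] {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxTokens) {
    return [{ chunkIndex: 0, text: words.join(" "), tokenCount: words.length }];
  }
  const chunks: Chunk[] = [];
  const step = maxTokens - overlap; // each chunk starts 450 tokens after the last
  for (let start = 0, i = 0; start < words.length; start += step, i++) {
    const slice = words.slice(start, start + maxTokens);
    chunks.push({ chunkIndex: i, text: slice.join(" "), tokenCount: slice.length });
    if (start + maxTokens >= words.length) break; // last chunk reached the end
  }
  return chunks;
}

// Example: a 1000-word document yields three overlapping chunks.
const doc = Array.from({ length: 1000 }, (_, i) => `w${i}`).join(" ");
const chunks = chunkDocument(doc);
```

Each chunk then gets its own embedding and a row in search_index_chunks.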
Embedding model:
- text-embedding-3-large (OpenAI) — default
- voyage-3 (Voyage) — popular alternative
- cohere-embed-v3 — alternative
- Cost: ~$0.13 per million tokens (cheap)
Generation strategy:
- On document update: regenerate ALL chunks
- Cost: small (most documents change rarely)
- Don't generate on read (latency)
Indexing lag:
- Document saved → index ready: target <30 seconds
- For high-write tenants: 1-2 minute lag acceptable
- Monitor + alert on lag
Permissions in index:
- Each indexed document tagged with workspace_id + permission tags
- Filter at query time
- NEVER expose document index entries across tenants
Implement:
1. Migration for search_index tables
2. Indexing worker (queue-driven; Inngest / Trigger.dev / your queue)
3. Embedding generation
4. Permission tagging at indexing time
5. Update-on-change webhook from source DB
Output: indexing pipeline.
3. Implement hybrid retrieval
Retrieval = combine lexical + semantic.
Why hybrid:
- Lexical: catches exact-keyword matches, named entities, specific terms
- Semantic: catches paraphrasing, related concepts, intent
- Together: 30%+ better recall than either alone
Retrieval flow:
// The db.* methods below are illustrative stand-ins for your data layer.
async function searchAndRetrieve(query: string, workspaceId: string, userId: string) {
// Step 1: Embed the query
const queryEmbedding = await embed(query)
// Step 2: Lexical search (Top-K)
const lexicalResults = await db.searchIndexLexical.search({
query,
workspaceId,
limit: 30,
permissionsCheck: { userId },
})
// Step 3: Semantic search (Top-K)
const semanticResults = await db.searchIndexChunks.semanticSearch({
embedding: queryEmbedding,
workspaceId,
limit: 30,
permissionsCheck: { userId },
})
// Step 4: Combine via Reciprocal Rank Fusion (RRF)
const merged = reciprocalRankFusion([lexicalResults, semanticResults], { k: 60 })
// Step 5: Rerank with cross-encoder (optional, expensive)
// Use Cohere Rerank or similar
const reranked = await cohereRerank(query, merged.slice(0, 20))
// Step 6: Return top-N for LLM synthesis
return reranked.slice(0, 5)
}
Reciprocal Rank Fusion (RRF):
- Combines rankings from multiple retrievers
- Score = sum(1 / (k + rank)) across retrievers
- k = 60 typical
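The RRF formula above is small enough to implement directly. This sketch treats each ranking as an ordered list of document ids:

```typescript
// Reciprocal Rank Fusion: each retriever contributes 1 / (k + rank) for every
// document it returns (ranks are 1-based); scores are summed across retrievers
// and the merged list is sorted by total score.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// "b" places highly in both lists, so it outranks "a", which tops only one.
const merged = reciprocalRankFusion([
  ["a", "b", "c"], // lexical ranking
  ["b", "d", "a"], // semantic ranking
]);
```

Because RRF only uses ranks, not raw scores, it needs no score normalization between the lexical and semantic retrievers — that's why it's the default fusion method.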
Reranking:
- Cross-encoder model: takes (query, document) pairs; scores relevance
- Slower than retrieval but more accurate
- Cohere Rerank, Voyage Rerank, or open-source (BGE-Reranker)
- Apply to top 20 candidates; return top 5
Permission filtering:
- Apply BEFORE retrieval (don't fetch then filter)
- Index permission tags; query with WHERE user_can_see = true
- Defense-in-depth: also filter at LLM-context-building step
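The defense-in-depth step can be sketched as a re-check just before context building. The `allowedUserIds` tag is a hypothetical shape for the permission metadata your index stores:

```typescript
interface Candidate {
  id: string;
  workspaceId: string;
  allowedUserIds: string[]; // hypothetical permission tag from the index
}

// Even after the index query filtered by permission tags, re-check every
// candidate before it enters the LLM context. Cheap insurance against a
// stale index or a bug in the query-time filter.
function filterForUser(
  candidates: Candidate[],
  workspaceId: string,
  userId: string,
): Candidate[] {
  return candidates.filter(
    (c) => c.workspaceId === workspaceId && c.allowedUserIds.includes(userId),
  );
}

// A doc from another workspace and a doc the user can't see are both dropped.
const visible = filterForUser(
  [
    { id: "d1", workspaceId: "w1", allowedUserIds: ["u1", "u2"] },
    { id: "d2", workspaceId: "w1", allowedUserIds: ["u2"] },
    { id: "d3", workspaceId: "w2", allowedUserIds: ["u1"] },
  ],
  "w1",
  "u1",
);
```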
Performance budgets:
- Total retrieval (lexical + semantic + rerank): target <500ms p95
- Without rerank: <200ms p95
- LLM synthesis (next step): target <3s for streaming first token
Implement:
1. Lexical search function
2. Semantic search function (pgvector or external)
3. RRF combination
4. Reranking (optional; recommended)
5. Permission filtering at all layers
6. Caching for hot queries (60s TTL)
Output: retrieval that surfaces right answers.
4. Build the LLM synthesis layer
LLM synthesizes the answer. This is where quality is made or lost.
Synthesis flow:
// Uses streamText from the Vercel AI SDK ('ai' package); model id illustrative.
async function synthesizeAnswer(query: string, retrievedDocs: Document[]) {
const context = retrievedDocs.map((d, i) =>
`[${i + 1}] ${d.title}\n${d.content}\n(source: ${d.url})`
).join('\n\n---\n\n')
const systemPrompt = `
You are an assistant for [Product]. Answer the user's question based ONLY on the provided documents.
Cite documents using [1], [2] etc. matching the document numbers.
If the documents don't fully answer, say so clearly.
Be concise (under 200 words).
Don't make up features or facts.
`
const userPrompt = `
Documents:
${context}
Question: ${query}
`
return streamText({
model: 'anthropic/claude-sonnet-4-7',
system: systemPrompt,
prompt: userPrompt,
maxOutputTokens: 1000,
})
}
Streaming UI:
- Token-by-token to UI (better UX)
- User sees agent "thinking"
- Show citations as they appear
Citation rendering:
- LLM outputs [1], [2] inline
- UI replaces with link to document
- Hover preview of cited content
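The marker-to-link replacement can be sketched as a regex pass over the answer. Markers are 1-based and map to the order documents were passed to the LLM; markers with no matching document are left intact rather than guessed at:

```typescript
interface Source { title: string; url: string }

// Replace inline [n] citation markers with links to the cited documents.
function renderCitations(answer: string, sources: Source[]): string {
  return answer.replace(/\[(\d+)\]/g, (match, n) => {
    const source = sources[Number(n) - 1];
    return source ? `[${source.title}](${source.url})` : match;
  });
}

const rendered = renderCitations(
  "The deadline is Friday [1], confirmed in the kickoff notes [2].",
  [
    { title: "Project plan", url: "/docs/plan" },
    { title: "Kickoff notes", url: "/docs/kickoff" },
  ],
);
```

In the real UI you'd render a component with hover preview instead of a markdown link, but the marker-to-source mapping is the same.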
Confidence handling:
- LLM might hallucinate
- Post-generation check: verify all claims have citations
- If LLM says "based on these documents..." but document doesn't actually contain that: flag
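A minimal post-generation check for the two failure modes above: citation numbers that don't map to a retrieved document, and sentences carrying no citation at all. The sentence split here is naive; a production check would be more careful:

```typescript
// Returns the cited document numbers, any that are out of range, and a count
// of sentences with no [n] marker (candidates for a hallucination flag).
function checkCitations(answer: string, numSources: number) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map((m) => Number(m[1]));
  const invalidCitations = cited.filter((n) => n < 1 || n > numSources);
  const sentences = answer
    .split(/(?<=[.!?])\s+/)
    .filter((s) => s.trim().length > 0);
  const uncitedSentences = sentences.filter((s) => !/\[\d+\]/.test(s)).length;
  return { cited, invalidCitations, uncitedSentences };
}

// [3] is invalid (only 2 sources); "We are on track." carries no citation.
const report = checkCitations(
  "The deadline is Friday [1]. We are on track. See the roadmap [3].",
  2,
);
```

Answers that fail the check can be regenerated, downgraded to a search-results fallback, or shown with a warning.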
Quality controls:
- Eval harness with golden test set
- Run nightly via Promptfoo or Braintrust
- Score answers against expected
- Regression alerts
Cost controls:
- Per-query token budget (input + output)
- Per-customer monthly budget
- Per-tier rate limits
Cost estimate:
- Embedding query: ~$0.001
- Retrieval: $0
- Reranking: ~$0.001
- LLM synthesis: ~$0.01-0.05 per answer (Claude Sonnet 4.7; Haiku for cheaper)
- Total: ~$0.02-0.05 per Q&A
For high-volume: cache common questions; dedupe via Vercel AI Gateway.
Implement:
1. Synthesis function with streaming
2. Citation rendering
3. Confidence check
4. Cost tracking
5. Eval harness integration
Output: synthesis that doesn't hallucinate.
5. Build the customer-facing UI
"Ask anything" UI patterns:
Pattern A: Search bar with AI mode
- Existing search input; toggle "Ask AI" mode
- AI mode: natural-language input + AI answer with citations
- Falls back to search results
Pattern B: Dedicated "Ask" button
- Separate UI surface (e.g., "Ask AI" floating button or sidebar)
- Always-on AI conversation interface
- Conversation history (per session or persistent)
Pattern C: Inline AI suggestions
- Surface AI as suggestion cards in dashboards
- "Did you know? AI noticed..."
- Lower friction than asking
Default for B2B SaaS in 2026: Pattern A (search + AI toggle) is least disruptive; Pattern B for power users.
Required UI elements:
Input:
- Multi-line text input
- Suggested questions (3-5 starter prompts)
- Voice input optional
Loading state:
- "AI is thinking..." with skeleton
- Streaming tokens as they arrive
- Indicator that retrieval is happening
Response:
- AI-generated answer with citations
- Cited documents linked + previewable
- "Was this helpful?" feedback
- "Open in [Product]" links to specific document
Conversation history:
- Stack of past questions (within session)
- Previous answers re-renderable
- Optional: persist conversation across sessions
Empty / error states:
- "I'm not sure. Here are search results instead" → fallback search
- "I can't find that information"
- Don't make up answers
Privacy + transparency:
- "AI is processing your data" indicator (some users want to know)
- "Sources: [document titles]" visible
- Option to disable AI for sensitive workspaces
Implement:
1. Search/Ask UI component
2. Streaming response renderer
3. Citation hover preview
4. Feedback capture (thumbs up/down + comment)
5. Conversation history (session-level)
6. Empty / error states
7. Privacy indicators
Output: UI customers actually use.
6. Implement evaluations + iteration
Without evals, every prompt change is regression risk.
Eval components:
Golden test set:
- 50-100 representative queries customers ask
- Each with expected answer / expected citations
- Curated by you + customers
Categories:
- Specific lookup ("What's the deadline for project X?")
- Summarization ("Summarize what we did this quarter")
- Comparative ("Which customer is largest?")
- Reasoning across documents ("Are deadlines on track?")
For each:
- Expected: answer correctness, citation quality, no hallucinations
- Grade: 0-10 score from LLM-as-judge OR human review
Cadence:
- Nightly: full golden test suite via Promptfoo / Braintrust
- Per-PR: subset (5-min run)
- Weekly: human review of 10 random production queries
Quality dashboards:
- Score distribution by category
- Hallucination rate (sampled audit)
- Customer feedback (thumbs ratio)
- Retrieval recall (top-5 contains expected citation?)
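The retrieval-recall metric from the dashboard above is the fraction of eval queries whose expected citation appears in the top-K retrieved documents. A minimal sketch:

```typescript
interface EvalCase {
  expectedDocId: string;
  retrievedIds: string[]; // ranked ids returned by retrieval
}

// recall@K: share of eval cases where the expected doc is in the top K.
function recallAtK(cases: EvalCase[], k = 5): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrievedIds.slice(0, k).includes(c.expectedDocId),
  ).length;
  return hits / cases.length;
}

// One of two cases retrieves the expected document within the top 5.
const recall = recallAtK([
  { expectedDocId: "a", retrievedIds: ["x", "a", "y"] },
  { expectedDocId: "b", retrievedIds: ["x", "y", "z", "q", "r", "b"] },
]);
```

Tracking this number per query category tells you whether a bad answer is a retrieval problem or a synthesis problem.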
Failure clustering:
- Capture every failed query (low score or thumbs-down)
- Cluster by failure type
- Track: "wrong document retrieved", "hallucination", "incomplete"
- Feed back to retrieval / prompt iteration
A/B testing:
- New prompt vs old on subset of traffic
- Measure: customer satisfaction, latency, cost
- Promote winner
Implement:
1. Eval test set
2. Eval runner (Promptfoo / Braintrust)
3. CI integration
4. Quality dashboards
5. Weekly failure-review process
6. A/B testing framework
Output: confidence in changes.
7. Cost controls + operational realities
Per-query cost:
- Embedding: ~$0.001
- LLM synthesis: $0.01-0.05
- Reranking: ~$0.001
- Total: ~$0.02-0.05
At 1000 queries/customer/month: $20-50/customer
At 10000 queries/customer/month: $200-500/customer
Budget controls:
Per-tier limits:
- Free: 10 queries/day
- Pro: 100 queries/day
- Enterprise: 1000 queries/day OR custom
Per-query token cap:
- Max input + output tokens (e.g., 4000 in + 1000 out)
- Reject overlong queries
Per-customer monthly budget:
- Soft limit at 80%; alert customer
- Hard limit at 100%; require upgrade or wait
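The soft/hard budget rules above reduce to a small check run before each query. The 80% threshold and the return shape are illustrative:

```typescript
type BudgetStatus = "ok" | "soft_limit" | "hard_limit";

// Evaluate a customer's month-to-date spend against their budget.
function checkBudget(spentUsd: number, monthlyBudgetUsd: number): BudgetStatus {
  if (spentUsd >= monthlyBudgetUsd) return "hard_limit";       // block; upgrade or wait
  if (spentUsd >= 0.8 * monthlyBudgetUsd) return "soft_limit"; // serve, but alert customer
  return "ok";
}
```

`soft_limit` should trigger a customer-facing notification; `hard_limit` should degrade gracefully (fall back to plain search) rather than error.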
Caching:
- Common queries (e.g., "summarize this project") cache 5-15 min
- Per-tenant; respect permissions
- Save 30-60% of cost for popular queries
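A sketch of a permission-safe cache under the rules above: the key includes tenant and user so cached answers never cross a permission boundary, entries expire after a TTL, and a whole tenant can be invalidated when permissions change. Production would use Redis rather than an in-process Map:

```typescript
class AnswerCache {
  private store = new Map<string, { value: string; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  // Tenant + user in the key means no cross-user or cross-tenant reuse.
  private key(tenantId: string, userId: string, query: string): string {
    return `${tenantId}\u0000${userId}\u0000${query.trim().toLowerCase()}`;
  }

  get(tenantId: string, userId: string, query: string): string | undefined {
    const entry = this.store.get(this.key(tenantId, userId, query));
    if (!entry || entry.expiresAt < Date.now()) return undefined;
    return entry.value;
  }

  set(tenantId: string, userId: string, query: string, value: string): void {
    this.store.set(this.key(tenantId, userId, query), {
      value,
      expiresAt: Date.now() + this.ttlMs,
    });
  }

  // Drop everything for a tenant, e.g. after a permission change.
  invalidateTenant(tenantId: string): void {
    for (const k of this.store.keys()) {
      if (k.startsWith(`${tenantId}\u0000`)) this.store.delete(k);
    }
  }
}

const cache = new AnswerCache(10 * 60 * 1000); // 10-minute TTL
cache.set("t1", "u1", "Summarize this project", "…answer…");
```

Scoping the cache per user trades some hit rate for safety; a per-tenant cache is only acceptable for queries whose results every member can see.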
Cheaper models:
- Use Haiku for simple lookups (~5x cheaper)
- Sonnet for complex reasoning
- Gateway routes based on complexity
Edge cases:
1. Workspace with massive data
- Index size limits (e.g., 1M documents per workspace; soft limit)
- For larger: pagination + deeper retrieval
2. Documents updated continuously
- Indexing lag visible
- Show "indexed N seconds ago" in UI
3. Customer deletes document
- Remove from index immediately
- AI doesn't surface stale data
4. Permission changes
- User permission revoked → re-filter retrieval
- Cached AI answers may show old data
- Invalidate cache on permission change
5. Sensitive data (PII; financial)
- Customer admin: "exclude these docs from AI"
- Per-document AI-eligibility flag
6. AI hallucinates
- Detected via confidence + citation check
- Customer flags answer → revise prompt / retrieval
- Eventually: add documented hallucinations to the regression test set
7. Cross-language queries
- User asks in Japanese; documents in English
- The LLM handles this; retrieval may not (cross-lingual embeddings fare better than lexical search)
- Multi-lingual embedding models help
8. AI loops in conversation
- "What about that? Tell me more."
- Use conversation context (previous Q&A)
- Cap conversation history (last 10 turns)
9. Customer's data is too large for context window
- Even with retrieval, top-5 docs may exceed 100K tokens
- Aggressive chunking
- Map-reduce summarization
10. Latency spikes
- LLM provider slow; first-token >5s
- Fallback to cached / pre-computed
- User-facing "AI is slow today" message
For each: code change + UX impact + ops alert.
Output: a system that doesn't blow up the bill.
8. Recap
What you've built:
- Indexing pipeline (lexical + semantic embeddings)
- Hybrid retrieval with permission filtering
- LLM synthesis with citations
- Streaming UI
- Feedback capture
- Eval harness + nightly tests
- Cost controls + budgets
- Cache + dedupe optimization
- Operational alerts + monitoring
What you're explicitly NOT shipping in v1:
- Agentic actions (defer to Level 3; see In-Product AI Agent Implementation)
- Cross-customer AI (anti-pattern; multi-tenant boundary)
- Voice input (defer; nice-to-have)
- Persistent multi-session conversations (defer; complexity)
- Multi-modal (image / audio inputs) (defer)
- Customer-tunable prompts (defer)
Ship Level 1 in 4-8 weeks. Add Level 2 (full Q&A) when retrieval quality is solid. Defer Level 3 until Q&A has earned trust.
The biggest mistake teams make: shipping AI Q&A before retrieval is good. Garbage in (wrong docs retrieved) → garbage out (wrong answer). Get retrieval to 70%+ recall first.
The second mistake: skipping permission filtering. Easy bug: AI surfaces a document the user shouldn't see. Career-ending.
The third mistake: skipping evals. Every prompt change is a regression risk. Even 20 test cases beats nothing.
See Also
- In-Product Help Center & Knowledge Base — sister category (AI on YOUR docs)
- In-Product AI Agent Implementation — adjacent (agentic; takes actions)
- Search — depended-upon
- Search Autocomplete & Typeahead — adjacent
- RAG Implementation — depended-upon technique
- LLM Cost Optimization — depended-upon
- LLM Quality Monitoring — depended-upon
- AI Streaming Chat UI — depended-upon UI pattern
- AI Features Implementation — adjacent
- Roles & Permissions — depended-upon
- Audit Logs — pairs for AI query audit
- Quotas, Limits & Plan Enforcement — pairs for budget controls
- Schema Validation Zod — depended-upon
- Background Jobs & Queue Management — pairs for indexing
- Multi-Tenancy — depended-upon
- Vector Database Providers (Reference) — tooling
- LLM Evaluation & Prompt Testing Platforms (Reference) — tooling
- Workplace AI Search Tools (Reference) — adjacent (different scope: customer's many tools)