
# AI Features in Your SaaS: Ship LLM Capabilities Without Burning Margins or Trust

[⬅️ Growth Overview](README.md)

## AI Feature Strategy for Your New SaaS

Goal: Ship LLM-powered features (chat, summarization, generation, classification) that customers actually use, without burning your unit economics, feeding users hallucinated data, or letting prompts drift from working to broken without notice. Use a gateway, manage prompts as code, stream responses, set quotas per tier, evaluate quality continuously, and observe production traffic. Avoid the failure modes where founders ship raw OpenAI calls inline (no observability, no failover, no cost control), put system prompts in code with no versioning ("we changed the prompt three weeks ago and now it's bad"), or skip evaluation (you find out about quality regressions from customer support tickets).

Process: Follow this chat pattern with your AI coding tool such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: First AI feature behind gateway with streaming + per-tier limits in 2-3 days. Prompt management + observability in week 1. Evaluation + cost dashboards in week 2. Quarterly review baked in.


## Why Most Founder AI Features Are Broken

Three failure modes hit founders the same way:

  • Direct API calls without abstraction. Founder writes openai.chat.completions.create(...) inline in a route handler. Six months later, switching providers requires touching 47 files; observability is non-existent; cost-by-feature is unknown; prompt changes are deploys.
  • Prompts in code with no versioning. System prompts live as string literals in the codebase. Someone "fixes" the prompt; quality regresses; nobody notices for two weeks; customer trust drops; reverting requires git archaeology.
  • No quotas per tier. AI calls are unmetered. A free user runs the AI 1,000 times in one weekend; the OpenAI bill triples; founder discovers the next month. Or worse: a single customer scripts the feature into a loop that costs more than their annual subscription overnight.

The version that works is structured: route through an AI gateway, manage prompts as versioned configuration, stream responses for UX, enforce per-tier quotas (per [rate-limiting-abuse](rate-limiting-abuse-chat.md)), evaluate quality before deploys, and observe production traffic with an LLM observability tool.

This guide assumes you have already done Authentication (AI calls are user-scoped), have shipped Multi-Tenant Data Isolation (workspace context for AI calls), have considered LLM Cost Optimization and LLM Quality Monitoring, and have shipped Rate Limiting & Abuse Prevention (AI endpoints are the highest-cost abuse vector).


## 1. Decide What AI Should Do Before Writing Prompts

The first question is product, not technical. Don't ship AI as a generic "chatbot" — pick specific value-creating features.

Help me decide which AI features fit [my product].

The high-value patterns:

**Pattern 1: Replace tedium**
- Auto-categorize support tickets / leads / data rows
- Generate first-draft replies / summaries / titles
- Extract structured data from messy input (emails, PDFs)
- Time saved per use is concrete and measurable

**Pattern 2: Augment expertise**
- Suggest improvements to user-written content (writing assistant)
- Surface non-obvious connections in user data
- Recommend next actions
- Each use feels intelligent if done well

**Pattern 3: Conversational search / Q&A**
- "Ask your data" interface over user content
- Documentation chat
- Per [search-chat](search-chat.md): often paired with hybrid retrieval

**Pattern 4: Structured output / classification**
- Sentiment analysis, intent classification, tagging
- Lower stakes than open-ended generation
- Most cost-effective AI use

**Pattern 5: Generation**
- Image generation, copy generation, code generation
- High value if the output replaces a manual process
- Most expensive per call

**Anti-patterns**:

- **Chatbot for the sake of chatbot** — users don't want to chat with your tool; they want to do work
- **AI features that exist because "AI is in the press"** — if you can't name the value, skip
- **Vague "smart" features** — specificity beats novelty

For my product, ask:
- What's the most-tedious task my users do?
- Where do they currently use ChatGPT / Claude as a separate tool?
- What classifications / extractions / summaries would feel like magic?

Output:
1. The top 1-3 AI features with clear user value
2. The "why now" justification per feature
3. The cost-per-use ballpark per feature
4. The metric you'll track (time saved, conversion lift, retention)

The biggest unforced error: shipping a "chatbot" because it's easy. Most users don't want to type to a chatbot; they want a button that does the work. The button + LLM-under-the-hood is more valuable than the chat for most product use cases.


## 2. Route Through a Gateway, Not Direct API Calls

A gateway gives you observability, failover, cost tracking, and provider portability. Don't skip it.

Help me design the gateway abstraction.

The pattern:

**Don't**:

```ts
const completion = await openai.chat.completions.create({
  model: 'gpt-5',
  messages: [...]
})
```

**Do**:

```ts
import { generateText } from 'ai'

const { text } = await generateText({
  model: 'anthropic/claude-sonnet-4-6',  // routed through Vercel AI Gateway
  system: getPrompt('summarize.system'),
  prompt: userInput,
})
```

Gateway options (per AI Gateways):

  • Vercel AI Gateway — bundled with Vercel; provider/model strings
  • OpenRouter — multi-provider; model marketplace
  • Cloudflare AI Gateway — Cloudflare-stack
  • Portkey — full-featured; fallbacks; budgeting
  • DIY proxy — own everything; more work

For most indie SaaS in 2026 on Vercel: Vercel AI Gateway with the AI SDK is the default. Use plain "provider/model" strings.

What the gateway gives you:

  • Provider failover (OpenAI down → Anthropic)
  • Per-feature cost tracking
  • Rate limiting at the gateway layer
  • Observability (per LLM observability)
  • Caching of duplicate prompts
  • Easier model swaps

Critical implementation rules:

  1. Never call provider SDKs directly in product code. Always go through gateway.
  2. Provider/model strings are the abstraction (e.g., "anthropic/claude-sonnet-4-6"). Code doesn't know which provider; product config decides.
  3. Default to AI SDK (per ai-sdk) for TypeScript / Node.
  4. Centralize the model-selection logic. A function pickModel(featureName, complexity) that returns the right model string.

Cost-aware routing:

  • Cheap models for simple tasks (classification, short summarization): GPT-5-mini, Claude Haiku 4.5, Gemini 2.5 Flash
  • Mid-tier for most tasks: GPT-5, Claude Sonnet 4.6
  • Top-tier for complex reasoning: Claude Opus 4.7, GPT-5 Pro
  • Never default to top-tier for everything — burns money

Don't:

  • Hardcode provider names in product code
  • Skip the gateway "for now"
  • Pick top-tier models for tasks where mid-tier is fine

Output:

  1. The gateway choice
  2. The provider/model strings used per feature
  3. The model-selection logic
  4. The migration path if currently calling APIs directly

The single biggest engineering lever: **the gateway abstraction.** Once provider/model is a string in config, switching providers is a config change. Without it, switching is a multi-week migration. Pay the small upfront cost.
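
A minimal sketch of what that config can look like, assuming a simple per-feature map with env overrides (the file name and feature names are illustrative; section 9 covers the selection function itself):

```ts
// config/models.ts: the only place provider/model strings live
export const MODEL_BY_FEATURE: Record<string, string> = {
  classify:  process.env.MODEL_CLASSIFY  ?? 'anthropic/claude-haiku-4-5',
  summarize: process.env.MODEL_SUMMARIZE ?? 'anthropic/claude-sonnet-4-6',
  chat:      process.env.MODEL_CHAT      ?? 'anthropic/claude-sonnet-4-6',
}
```

Swapping the model behind one feature is then an env var or a one-line config change, not a code migration.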

---

## 3. Manage Prompts as Code (or Configuration)

Prompts in raw string literals scattered across files = unmaintainable. Centralize.

Design prompt management.

The patterns:

**Pattern A: Prompts in code (versioned)**

```ts
// prompts/summarize.ts
export const SUMMARIZE_SYSTEM_PROMPT = `
You are an assistant that summarizes [content type].
- Output 2-3 bullets
- Each bullet under 15 words
- Plain text, no markdown
`.trim()

// usage
const { text } = await generateText({
  model: pickModel('summarize'),
  system: SUMMARIZE_SYSTEM_PROMPT,
  prompt: userInput,
})
```

Pros:

  • Version-controlled with code
  • Type-safe
  • Easy to test

Cons:

  • Changes require deploys
  • Non-engineers can't edit

**Pattern B: Prompts in observability tool (Langfuse, Braintrust, LangSmith)**

```ts
const prompt = await langfuse.getPrompt('summarize-system')
const { text } = await generateText({
  model: pickModel('summarize'),
  system: prompt.compile({ contentType: 'meeting notes' }),
  prompt: userInput,
})
```

Pros:

  • Non-engineers can edit prompts (PMs, content writers)
  • Versioning with rollback
  • A/B testing prompts in production
  • Prompt history visible in observability tool

Cons:

  • Network call to fetch prompt (cache aggressively)
  • Coupling to observability tool

**Pattern C: Prompts in YAML/JSON config**

```yaml
# prompts.yaml
summarize:
  system: |
    You are an assistant that summarizes...
  model: anthropic/claude-sonnet-4-6
  temperature: 0.3
```

Pros:

  • Version-controlled
  • Easier for non-engineers to edit (still requires PR)

Cons:

  • No live editing
  • Less rich than full prompt-management tools

For most indie SaaS in 2026:

  • Start with Pattern A (code)
  • Move to Pattern B (Langfuse) once prompts are stable and non-engineers want to iterate

Critical implementation rules:

  1. Never inline prompts in route handlers mixed with business logic
  2. Version prompts explicitly (semver or date-based)
  3. Test every prompt — at minimum a smoke test that asserts a known input produces an expected shape of output
  4. Document the contract — what the prompt expects as input, what it produces as output

Prompt-engineering basics worth following:

  • System prompt sets behavior ("You are X. You do Y. You output Z format.")
  • User prompt is the data (the variable input)
  • Examples in system prompt (1-3 few-shot examples improve consistency)
  • Output format clarity ("Output JSON with keys A, B, C")
  • Constraints help ("Do not include URLs.")
  • Length specifications ("Each summary under 100 words.")
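
For structured output, a schema-validated call makes the "Output JSON with keys A, B, C" rule enforceable rather than aspirational. A minimal sketch using the AI SDK's generateObject with a zod schema (the feature name, fields, and getPrompt/pickModel helpers are illustrative):

```ts
import { generateObject } from 'ai'
import { z } from 'zod'

// The schema is the contract: the SDK validates the model's JSON against it
const TicketTriage = z.object({
  category: z.enum(['billing', 'bug', 'feature_request', 'other']),
  sentiment: z.enum(['positive', 'neutral', 'negative']),
  summary: z.string().max(200),
})

const { object } = await generateObject({
  model: pickModel('classify'),
  system: getPrompt('triage.system'),
  prompt: ticketText,
  schema: TicketTriage,
})
// `object` is typed and validated; malformed output throws instead of reaching users
```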

Don't:

  • Mix prompt building with business logic
  • Skip prompt versioning (every prompt change is a deployment risk)
  • Trust prompts to "just work" — test them

Output:

  1. The prompt-management approach (A / B / C)
  2. The prompt catalog (5-10 prompts with names, system prompts, expected outputs)
  3. The prompt-test suite (assertions per prompt)
  4. The prompt-versioning convention

The single biggest reliability win: **a snapshot test for each prompt.** Run input X, assert output matches shape Y. When someone changes a prompt and the test fails, they see the regression before customers do. Without it, prompt drift is invisible until support tickets accumulate.
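
A minimal sketch of such a test, assuming vitest (file layout and the sample input are illustrative, and the assertions are deliberately loose because LLM output varies run to run):

```ts
// prompts/summarize.test.ts: smoke test, a known input must produce the expected shape
import { describe, expect, it } from 'vitest'
import { generateText } from 'ai'
import { SUMMARIZE_SYSTEM_PROMPT } from './summarize'

describe('summarize prompt', () => {
  it(
    'returns 2-3 short plain-text bullets',
    async () => {
      const { text } = await generateText({
        model: 'anthropic/claude-haiku-4-5', // a cheap model keeps the test affordable
        system: SUMMARIZE_SYSTEM_PROMPT,
        prompt: 'Weekly sync notes: shipped the billing page, hiring is paused, launch slips to March.',
      })
      const bullets = text.split('\n').filter(Boolean)
      expect(bullets.length).toBeGreaterThanOrEqual(2)
      expect(bullets.length).toBeLessThanOrEqual(3)
      expect(text).not.toMatch(/[#*]/) // the prompt demands plain text, no markdown
    },
    30_000 // LLM calls are slow; raise the default test timeout
  )
})
```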

---

## 4. Stream Responses for UX

LLM responses are slow. Streaming makes them feel fast. Use it everywhere user-facing.

Design streaming.

The pattern (with Vercel AI SDK):

```ts
// Server route (app/api/chat/route.ts)
import { streamText, convertToModelMessages } from 'ai'

export async function POST(req: Request) {
  const { messages } = await req.json()
  const result = streamText({
    model: 'anthropic/claude-sonnet-4-6',
    system: getPrompt('chat.system'),
    messages: convertToModelMessages(messages), // UI messages from useChat → model messages
  })
  return result.toUIMessageStreamResponse()
}
```

```tsx
// Client component (AI SDK 5 shape: useChat posts to /api/chat by default)
'use client'
import { useState } from 'react'
import { useChat } from '@ai-sdk/react'

function Chat() {
  const { messages, sendMessage } = useChat()
  const [input, setInput] = useState('')
  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          {m.parts.map((part, i) =>
            part.type === 'text' ? <span key={i}>{part.text}</span> : null
          )}
        </div>
      ))}
      <form
        onSubmit={e => {
          e.preventDefault()
          sendMessage({ text: input })
          setInput('')
        }}
      >
        <input value={input} onChange={e => setInput(e.target.value)} />
      </form>
    </div>
  )
}
```

Benefits:

  • Time-to-first-token is what feels fast (often <500ms)
  • Total latency is unchanged, but the perceived wait drops sharply
  • Users see the AI "thinking"
  • Can cancel mid-generation

When NOT to stream:

  • Structured output where partial JSON is unparseable
  • Background jobs where the output goes to DB, not UI
  • Very short responses (overhead exceeds benefit)
  • Classification calls (small response; not a conversation)

For non-chat features (one-shot generation):

```ts
// Stream a single generation result
const { textStream } = streamText({
  model: 'anthropic/claude-sonnet-4-6',
  prompt: 'Summarize this meeting',
})
for await (const delta of textStream) {
  // Append to UI
}
```

Critical implementation rules:

  1. Handle stream cancellation (user closes tab; clean up server resources)
  2. Show a stop button so users can interrupt
  3. Persist final result on completion (don't lose the generation if the connection drops mid-stream)
  4. Handle errors gracefully (provider down → fallback to error message; don''t hang forever)
  5. Set timeouts (max 60-120s for chat; abort and surface error)
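
A sketch of the timeout and cancellation wiring, assuming the AI SDK's abortSignal option and an onFinish hook for persistence (the 120s ceiling and the saveGeneration helper are illustrative):

```ts
import { streamText, convertToModelMessages } from 'ai'

export async function POST(req: Request) {
  const { messages } = await req.json()
  const result = streamText({
    model: 'anthropic/claude-sonnet-4-6',
    messages: convertToModelMessages(messages),
    // Abort when the client disconnects OR after a hard 120s ceiling, whichever comes first
    abortSignal: AbortSignal.any([req.signal, AbortSignal.timeout(120_000)]),
    // Persist the finished text server-side so a dropped connection doesn't lose it
    onFinish: async ({ text }) => {
      await saveGeneration(text)
    },
  })
  return result.toUIMessageStreamResponse()
}
```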

Cost implications of streaming:

  • Streaming uses the same token count as non-streaming
  • BUT: you can detect bad responses early and abort (saves tokens)
  • And: users can interrupt off-topic responses (saves tokens)

Don't:

  • Skip streaming for chat / generation UX (will feel slow)
  • Stream when it doesn''t help (background jobs)
  • Forget cancellation handling

Output:

  1. The streaming endpoints
  2. The client integration
  3. The cancellation logic
  4. The error-handling

The biggest perceived-performance win: **streaming.** A 5-second non-streamed response feels broken; the same 5-second streamed response feels engaging. Streaming is required UX for any user-facing AI feature.

---

## 5. Enforce Per-Tier Quotas

AI calls are the most-expensive endpoint class. Quota them per tier (per [rate-limiting-abuse](rate-limiting-abuse-chat.md)).

Design AI quotas.

The pattern:

For each tier, define:

| Limit | Free | Pro | Business | Enterprise |
| --- | --- | --- | --- | --- |
| AI generations / day | 10 | 500 | 5,000 | custom |
| Tokens / day | 50K | 5M | 50M | custom |
| AI cost cap / day | $0.10 | $5 | $50 | custom |
| Concurrent AI requests | 1 | 5 | 20 | custom |

Calculate from unit economics (worked example below):

  • Per-request cost: tokens × per-token price (varies by model)
  • Per-customer monthly cost: per-request × monthly limit
  • Subtract from tier revenue: must be positive margin
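
A worked example with illustrative prices (check your provider's current rates): a summarization call using roughly 1,500 input and 300 output tokens on a mid-tier model priced around $3 per million input tokens and $15 per million output tokens costs about $0.009 per call. At the Pro cap of 500 generations/day that is a worst case of roughly $4.50/day, which is why the Pro row above also carries a $5/day cost cap rather than relying on the call count alone.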

Implementation:

```ts
import { generateText } from 'ai'

async function generateWithQuota(workspaceId: string, prompt: string) {
  const usage = await getDailyAIUsage(workspaceId)
  const limit = await getAILimit(workspaceId)

  if (usage.cost >= limit.dailyCostCap) {
    throw new Error('quota_exceeded')
  }

  const result = await generateText({...})

  // Track usage (tokens + cost) so the next call sees updated totals
  await recordAIUsage(workspaceId, result.usage)

  return result
}
```

Quota dimensions worth tracking:

  • Per-day call count (simple)
  • Per-day token count (more accurate)
  • Per-day cost (best aligned to your bill)
  • Concurrent in-flight (prevents loops)

Friendly UX when quota hits:

  • 80% used: subtle banner ("You've used 80% of your daily AI quota")
  • 100%: blocking message ("AI quota reached for today. Upgrade or wait until [time]")
  • Don't show internal cost numbers; show "AI requests remaining"

Per-feature quotas:

Some features cost more than others:

  • Image generation: 1 image = ~10x text cost; lower per-day limit
  • Long generation (full report): higher token cost per call
  • Vision (image understanding): higher input token cost

Different features can have different quotas; track per feature.

Kill switch for individual users:

If a single user racks up unusual cost (10x normal in 1 hour):

  • Auto-pause AI for that user
  • Notify support
  • Manual review

Per [rate-limiting-abuse](rate-limiting-abuse-chat.md): the kill switch protects against runaway costs.
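
A minimal sketch of the check, assuming AI cost is already recorded per workspace (all helper names here are illustrative):

```ts
// Run after each recorded AI call (or on a short cron)
async function checkRunawaySpend(workspaceId: string) {
  const lastHour = await getAICostForWindow(workspaceId, { minutes: 60 })
  const typicalHour = await getTypicalHourlyAICost(workspaceId) // e.g. trailing 30-day median

  // 10x the normal hourly spend, with a $1 floor so tiny workspaces don't false-positive
  if (lastHour > Math.max(typicalHour * 10, 1)) {
    await pauseAIForWorkspace(workspaceId) // flips a flag the quota check reads
    await notifySupport('ai_kill_switch', { workspaceId, lastHour, typicalHour })
  }
}
```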

Don't:

  • Skip quotas (you'll find out when the bill arrives)
  • Use a single global quota (per-user matters)
  • Hide quota info from customers (transparency builds trust)

Output:

  1. The per-tier quota table
  2. The unit-economic calculation
  3. The quota-enforcement code
  4. The UX for "approaching" / "exceeded" states
  5. The kill-switch logic

The single biggest cost-protection: **per-user daily cost cap.** A user looping the AI accidentally racks up $200 in inference; your cap catches at $5. Without it, the bill is yours; with it, the user gets a polite "limit reached" message.

---

## 6. Evaluate Quality Before Deploying Prompt Changes

Prompts can regress invisibly. Run evals.

Design the eval workflow.

The pattern:

Build an eval dataset:

For each AI feature, collect:

  • 20-50 example inputs
  • Expected outputs (or scoring criteria)
  • Edge cases that previously failed

Eval per prompt change:

When prompt is updated:

  1. Run new prompt against the dataset
  2. Score each output (per criteria)
  3. Compare against baseline (current production)
  4. Block deploy if score regresses

Scoring methods:

  • Exact match: works for classification ("category X" expected; "category X" got)
  • Semantic similarity: works for summaries (cosine similarity to expected; or LLM-judge)
  • LLM-as-judge: another LLM scores the output 1-10 on criteria
  • Hand-graded: small datasets where humans score
  • Functional tests: "output must be valid JSON with keys A/B/C"

Tools (per LLM observability):

  • Braintrust — eval-first
  • Langfuse — evals included
  • LangSmith — evals strong
  • Custom — script that runs prompts against dataset
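
If you go the custom route, the runner can be very small. A sketch for a classification prompt with exact-match scoring, reusing the getPrompt helper from earlier (the file layout and threshold are illustrative):

```ts
// scripts/eval-classify.ts: run via `npm run eval`; exits non-zero on regression
import { readFileSync } from 'node:fs'
import { generateText } from 'ai'

const THRESHOLD = 0.9 // block the deploy below 90% exact-match accuracy

const cases: { input: string; expected: string }[] = JSON.parse(
  readFileSync('evals/classify.json', 'utf8')
)

async function main() {
  let correct = 0
  for (const c of cases) {
    const { text } = await generateText({
      model: 'anthropic/claude-haiku-4-5',
      system: getPrompt('classify.system'),
      prompt: c.input,
    })
    if (text.trim().toLowerCase() === c.expected.toLowerCase()) correct++
  }
  const score = correct / cases.length
  console.log(`classify: ${correct}/${cases.length} correct (${(score * 100).toFixed(1)}%)`)
  if (score < THRESHOLD) process.exit(1)
}

main()
```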

CI integration:

```yaml
# .github/workflows/eval.yml: run evals only when prompt files change
on:
  pull_request:
    paths:
      - 'prompts/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run eval  # exits non-zero (failing the PR) if the eval score drops below threshold
```

Critical implementation rules:

  1. Test before every prompt change. Don't skip "small fixes."
  2. Maintain the dataset. When customers report bad outputs, add them as eval cases.
  3. Set quality threshold per feature. Better to fail PRs that regress than to ship.
  4. Track quality over time. Plot eval scores; spot drift even when individual changes pass.

Don't:

  • Skip evals on "minor" prompt changes
  • Trust that "it worked in testing" — production data is different
  • Use the same eval cases that the prompt was written against (overfitting)

Output:

  1. The eval dataset structure
  2. The scoring methods per feature
  3. The CI workflow
  4. The quality threshold per feature
  5. The dataset-update process

The single biggest source of "the AI got worse" complaints: **prompt changes that regressed quality without anyone noticing.** Evals catch these before deploy. Without them, you find out via customer complaints — by then you've damaged trust.

---

## 7. Observe Production Traffic

Per [LLM observability providers](../../../VibeReference/ai-development/llm-observability-providers.md): instrument every AI call.

Design AI observability.

What to log per call:

  • Feature name (which AI feature was used)
  • User ID + workspace ID (subject)
  • Model used
  • System prompt name + version
  • User prompt (consider PII redaction)
  • Response text
  • Tokens (input + output)
  • Cost
  • Latency
  • Status (success / error)
  • Error message if failed
  • User feedback if collected (👍 / 👎)
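
A sketch of the record this produces, as a TypeScript type (field names are illustrative; an off-the-shelf tool like Langfuse captures most of this automatically):

```ts
// One row per AI call; every dashboard and alert below is built from this
interface AICallLog {
  feature: string            // 'summarize', 'chat', ...
  userId: string
  workspaceId: string
  model: string              // 'anthropic/claude-sonnet-4-6'
  promptName: string
  promptVersion: string
  input: string              // redact PII before persisting
  output: string
  inputTokens: number
  outputTokens: number
  costUsd: number
  latencyMs: number
  status: 'success' | 'error'
  errorMessage?: string
  feedback?: 'up' | 'down'   // attached later via the call ID if the user votes
}
```

Wrap every gateway call so this record is written in one place rather than in each route handler.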

Tools:

  • Langfuse / LangSmith / Helicone (per the comparison)
  • Or custom OTel pipeline

Dashboards to build:

  • Per-feature volume over time
  • Per-feature cost over time
  • Per-user top consumers
  • Per-prompt quality scores (from production user feedback)
  • Latency distribution (p50, p95, p99)
  • Error rate per feature

Alerts:

  • Cost spike (single user or feature uses 10x normal)
  • Error rate spike (provider issue or bug)
  • Latency spike (something slow)
  • Quality drop (if you have automated quality scoring)

The user-feedback layer:

Add 👍 / 👎 to AI outputs:

  • Click thumbs-up: log positive feedback with the call ID
  • Click thumbs-down: prompt for optional reason; log
  • Aggregate over time → quality metric per prompt
  • Failed cases → eval dataset addition
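
A sketch of the feedback endpoint, assuming each AI response in the UI carries the call ID from the log record above (the route path and attachFeedbackToAICall helper are illustrative):

```ts
// app/api/ai-feedback/route.ts
export async function POST(req: Request) {
  const { callId, vote, reason } = (await req.json()) as {
    callId: string
    vote: 'up' | 'down'
    reason?: string // only prompted for on a thumbs-down
  }
  await attachFeedbackToAICall(callId, { vote, reason })
  return Response.json({ ok: true })
}
```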

Privacy considerations: prompts and outputs routinely contain customer data, so treat AI logs with the same access controls and retention rules as the rest of your production data.

Don't:

  • Skip logging "for performance" (the cost is tiny)
  • Log to a file system you don't monitor
  • Forget to redact PII from logs (privacy compliance)

Output:

  1. The logging schema
  2. The observability tool integration
  3. The dashboard layout
  4. The user-feedback UI
  5. The privacy policy update

The single most-actionable production signal: **the 👍 / 👎 ratio per prompt over time.** A new prompt that drops from 85% positive to 65% positive over a week is regressing; investigate. Without user feedback, you're flying blind on quality.

---

## 8. Handle Failures Gracefully

LLM providers go down. Models return junk. Plan for it.

Design failure handling.

The patterns:

Provider outages:

  • Primary: Anthropic Claude
  • Fallback 1: OpenAI GPT-5
  • Fallback 2: Google Gemini

Gateway-managed (Vercel AI Gateway, OpenRouter, Portkey) handles failover automatically.

Quality failures (output is malformed):

  • Validate output structure before showing to user
  • Retry once with same prompt (variance often gives better result)
  • Retry with a "be careful about format" instruction
  • Fall back to a non-AI default

```ts
async function summarizeWithFallback(input: string) {
  try {
    const result = await generateText({
      model: 'anthropic/claude-sonnet-4-6',
      system: getPrompt('summarize.system'),
      prompt: input,
    })

    if (!isValidSummary(result.text)) {
      // Retry once
      const retry = await generateText({...})
      if (!isValidSummary(retry.text)) {
        // Fall back to non-AI summary
        return truncate(input, 200)
      }
      return retry.text
    }
    return result.text
  } catch (error) {
    // Provider down; fall back
    return truncate(input, 200)
  }
}
```

Latency failures:

  • Set timeouts (60s for chat; 30s for one-shot generation)
  • Show progress UI during long generations
  • Allow cancellation
  • After timeout: gracefully fall back

Hallucination handling:

  • Some features can detect hallucinations (e.g., extracting data from a doc — verify against source)
  • Add citation requirements ("output must include source") when accuracy is critical
  • Use retrieval-augmented generation (RAG) when factuality matters
  • Display confidence levels when available

The "model can''t" cases:

Sometimes the model genuinely can''t do what you''re asking:

  • Model returns "I cannot help with that" → trap and fall back
  • Model refuses (safety filter) → log; consider prompt change

Don't:

  • Trust LLM output without validation
  • Show malformed output to users
  • Skip the fallback path
  • Write off failures as "the AI just doesn't work today"

Output:

  1. The validation logic per feature
  2. The fallback hierarchy
  3. The timeout policy
  4. The hallucination-detection approach

The biggest user-trust signal: **graceful degradation when the AI fails.** A user who sees "AI is taking longer than usual; here's a non-AI version while we retry" trusts the product. A user who sees a hung spinner and eventually a 500 error doesn't.

---

## 9. Pick the Right Model for the Job

Top-tier models for everything = burns money. Tier the model selection.

Design model selection.

The pattern:

Cheap / fast (most tasks):

  • Claude Haiku 4.5 — extremely fast and cheap; fine for classification, short summaries, extraction
  • GPT-5-mini — competitive with Haiku
  • Gemini 2.5 Flash — Google's cheap-fast option

Use for: 70-80% of AI tasks in indie SaaS. Most "smart features" don't need top-tier reasoning.

Mid-tier (default):

  • Claude Sonnet 4.6 — strong default; good at most tasks
  • GPT-5 — equivalent class
  • Gemini 2.5 Pro — equivalent

Use for: chat interfaces, longer summaries, content generation, moderately complex reasoning.

Top-tier (specific needs only):

  • Claude Opus 4.7 — best reasoning; expensive
  • GPT-5 Pro / o1 — equivalent
  • Gemini 2.5 Ultra — equivalent

Use for: complex multi-step reasoning, code generation that's actually hard, research-grade tasks.

Specialized:

  • Embeddings (text-embedding-3-small / cohere-embed-v3 / voyage-3) for vector search
  • Vision (Claude Sonnet 4.6 vision / GPT-5 vision) for image understanding
  • Audio (Whisper / Gemini audio) for transcription
  • Image gen (Recraft / Flux / DALL-E 3) for image creation

Selection logic in code:

```ts
function pickModel(
  feature: string,
  complexity: 'simple' | 'medium' | 'complex' = 'medium'
) {
  const tiers: Record<string, string> = {
    classify: 'anthropic/claude-haiku-4-5',
    summarize: 'anthropic/claude-sonnet-4-6',
    chat: 'anthropic/claude-sonnet-4-6',
    research: 'anthropic/claude-opus-4-7',
    extract: 'anthropic/claude-haiku-4-5',
  }
  // Genuinely hard requests get bumped to the top tier regardless of the feature default
  if (complexity === 'complex') return 'anthropic/claude-opus-4-7'
  return tiers[feature] ?? 'anthropic/claude-sonnet-4-6'
}
```

A/B test models:

  • For each feature, periodically test cheaper model
  • If quality matches, switch and save cost
  • If quality regresses, stay
  • Use evals (per step 6) to verify

Don't:

  • Default to top-tier for everything (burns money)
  • Use cheap model for tasks requiring complex reasoning (poor quality)
  • Hardcode model in product code (use the gateway abstraction)

Output:

  1. The model-selection function
  2. The tier-to-feature mapping
  3. The A/B test plan
  4. The cost-vs-quality target per feature

The single biggest cost optimization: **using cheap models for simple tasks.** A team using top-tier for classification might spend 10x more than necessary. Run cheaper models against your evals; switch where quality is equivalent. Most teams overspend by 3-5x on model selection alone.

---

## 10. Quarterly Review

AI features rot. Quarterly review keeps them sharp.

Quarterly AI feature review.

Cost:

  • Per-feature cost trend
  • Per-tier cost vs revenue (margin per AI feature)
  • Top users by cost (anomalies?)
  • Provider mix (failovers triggered? cost-shifted?)

Quality:

  • 👍 / 👎 ratio per prompt over time
  • Eval scores per prompt
  • Customer-reported AI quality issues
  • Prompts that need updates

Performance:

  • Latency per feature (p50, p95, p99)
  • Streaming reliability (cancellations, errors)
  • Provider error rates

Adoption:

  • Per-feature usage rate
  • Features that nobody uses (kill them)
  • Features users want that don''t exist (build them)

Model updates:

  • New model releases that could replace current
  • Cheaper models for tasks where quality is sufficient
  • Specialized models worth piloting

Output:

  • Snapshot per feature
  • 1-2 prompt improvements
  • 1 model change (if cost / quality justifies)
  • 1 feature to deprecate or improve

---

## What "Done" Looks Like

A working AI-feature implementation in 2026 has:

- Clear product value per feature (no chatbot-for-the-sake-of-it)
- Gateway abstraction (provider/model strings, not direct SDK calls)
- Versioned prompt management with snapshot tests
- Streaming responses for user-facing features
- Per-tier quotas with kill-switch protection
- Eval workflow blocking regressing PRs
- Production observability with user-feedback signals
- Graceful failure handling with fallbacks
- Tiered model selection (cheap for simple; top-tier only when needed)
- Quarterly review baked into the team rhythm

The hidden cost in AI features isn't the model bill — it's **the trust damage from bad outputs that nobody noticed before customers did**. A team without observability and evals ships prompt regressions without warning. The discipline of "test every prompt; observe every call; fail fast on quality drops" turns AI from a liability into an asset. The infrastructure is the easy part; the discipline is what makes it work.

---

## See Also

- [LLM Cost Optimization](llm-cost-optimization-chat.md) — companion topic
- [LLM Quality Monitoring](llm-quality-monitoring-chat.md) — companion topic
- [Rate Limiting & Abuse](rate-limiting-abuse-chat.md) — AI endpoints are highest-cost abuse vector
- [Multi-Tenant Data Isolation](multi-tenancy-chat.md) — workspace context for AI
- [API Keys & PATs](api-keys-chat.md) — programmatic AI access
- [Audit Logs](audit-logs-chat.md) — high-cost AI events logged
- [PostHog Setup](posthog-setup-chat.md) — track AI feature usage
- [Activation Funnel](activation-funnel-chat.md) — AI features drive activation
- [LLM Observability Providers](https://www.vibereference.com/ai-development/llm-observability-providers) — Langfuse / LangSmith / Helicone
- [AI Gateways](https://www.vibereference.com/cloud-and-hosting/ai-gateways) — gateway choice
- [Vercel AI Gateway](https://www.vibereference.com/cloud-and-hosting/vercel-ai-gateway) — Vercel's offering
- [AI SDK](https://www.vibereference.com/ai-development/ai-sdk) — TS / Node SDK
- [AI SDK Core](https://www.vibereference.com/ai-development/ai-sdk-core) — generateText / streamText
- [Claude](https://www.vibereference.com/ai-models/claude.md) — Claude model details
- [Vector Databases](https://www.vibereference.com/backend-and-data/vector-databases) — for RAG
- [AI Memory Architecture Decision Framework](https://www.vibereference.com/ai-development/ai-memory-architecture-decision-framework) — for memory features

[⬅️ Growth Overview](README.md)