
Monitor LLM Quality in Production


LLM Quality Monitoring for Your AI Product

Goal: Catch AI-output quality regressions before customers do. Set up the eval suite, the production-monitoring layer, and the alerting that surfaces "the model is now producing wrong outputs" within hours of it happening — not weeks. Without this, AI SaaS products quietly degrade until enough customers churn that you finally notice.

Process: Follow this chat pattern with your AI coding tool, such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: First eval suite running in 1 day. Production monitoring wired in week 2 of launch. Quality dashboard reviewed weekly from launch onward.


Why Most AI SaaS Products Quietly Degrade

LLM-powered products have a property most software doesn't: the same code with the same inputs can produce different outputs as the underlying models, prompts, or context change. Three failure modes hit AI SaaS founders hardest:

  • Model providers ship silent updates. Anthropic, OpenAI, and Google quietly tune their models. Your prompt that worked perfectly on Tuesday produces subtly worse output on Wednesday — same model name, same parameters, different behavior. You learn about it from a customer email a month later.
  • Edge cases drift in real traffic. The 95% of inputs that work fine mask the 5% that produce hallucinations, refusals, or off-topic responses. Without sampling and reviewing real outputs, you never see the bad ones.
  • Prompt changes go unmeasured. A founder tweaks a prompt to fix one customer complaint. The change improves that case but breaks three others. Without a regression suite, the breaks aren't caught until they accumulate.

The fix is structural: a small suite of "golden" test cases run on every model / prompt change, plus a continuous sample of real production outputs scored for quality, plus alerting when scores drift.

This guide pairs with Activation Funnel Diagnosis (one common cause of activation drop is silent quality regression), Customer Support (support tickets are leading indicators of quality issues), and Reduce Churn (silent quality drift is one of the strongest unmeasured causes of B2B AI churn).


1. Define What "Quality" Means for Your Product

LLM quality is product-specific. A summary tool's quality is different from a code generator's, which is different from a customer support agent's. Define your dimensions before you build anything.

I'm building [your product] at [your-domain.com]. The product uses [Claude / GPT / Gemini / mixed] to [specific use case — e.g., generate marketing emails, answer customer questions, write code].

Help me define 3-5 quality dimensions that matter for my product. Each dimension should be:

1. **Specific to my use case** (not generic "is this good?")
2. **Measurable in some way** — either a deterministic check, a rubric for human review, or an LLM-as-judge prompt
3. **Tied to a customer outcome** — bad scores on this dimension mean bad customer experience

Common dimensions to consider for my product type:

For text generation (emails, content, summaries):
- **Factual accuracy** — does the output state things that are true?
- **Tone match** — does the output match the requested voice / brand / formality?
- **Format compliance** — does the output respect the structural constraints (length, headings, format)?
- **Relevance** — is the output on-topic for what was asked?
- **Hallucination rate** — what % of outputs contain made-up facts, sources, or references?

For code generation:
- **Compilability / syntactic correctness** — does the code compile / parse / type-check?
- **Functional correctness** — does the code do what was asked?
- **Style adherence** — does it match the project's existing style?
- **Hallucinated APIs / packages** — does it reference packages or methods that don't exist?

For agent / tool-use products:
- **Tool selection accuracy** — does the agent call the right tool for the request?
- **Tool input correctness** — are the arguments to the tool correct?
- **Final-answer faithfulness** — does the answer actually use what the tools returned?
- **Cost discipline** — does the agent finish in a reasonable number of steps?

For customer-facing chatbots:
- **Answer accuracy** — verified against my docs / KB
- **Refusal appropriateness** — refuses what it should, doesn't refuse what it shouldn't
- **Tone safety** — no offensive, off-brand, or inappropriate output
- **Response specificity** — refers to the user's actual situation, not generic advice

For my product, output:
- 3-5 specific quality dimensions
- For each, the measurement approach (deterministic / human / LLM-as-judge)
- For each, the threshold I should alert on (e.g., "factual accuracy below 90% over 24h")

Output: a 1-page quality definition I can pin and reference.

The single biggest mistake here is using only one dimension. "Is the output good?" averages over different failure modes that need different fixes. Three to five specific dimensions surface specific regressions; one generic score hides them.
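
To make the one-page quality definition concrete enough for the later sections to consume, here is a minimal sketch of how it might be captured as data. The dimension names, thresholds, and field names are illustrative assumptions, not prescriptions:

```typescript
// A sketch of a machine-readable quality definition. Every name and number
// here is an illustrative placeholder -- substitute your own dimensions.
type MeasurementMethod = "deterministic" | "human-rubric" | "llm-as-judge";

interface QualityDimension {
  id: string;                // stable key used by evals, sampling, and alerts
  description: string;       // what "good" means for this dimension
  method: MeasurementMethod; // how it gets scored
  alertThreshold: number;    // score (0-1) below which to alert
  alertWindowHours: number;  // rolling window the threshold applies to
}

// Example for a hypothetical marketing-email generator:
const dimensions: QualityDimension[] = [
  { id: "factual-accuracy", description: "No invented facts, stats, or sources",
    method: "llm-as-judge", alertThreshold: 0.9, alertWindowHours: 24 },
  { id: "format-compliance", description: "Respects length and structure constraints",
    method: "deterministic", alertThreshold: 0.98, alertWindowHours: 24 },
  { id: "tone-match", description: "Matches the requested brand voice",
    method: "llm-as-judge", alertThreshold: 0.85, alertWindowHours: 72 },
];
```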


2. Build a Golden Eval Set

The foundation of LLM quality work is a small, hand-curated set of test cases — your "golden set" — that you run on every change. Without this, you're flying blind on every prompt edit.

Build a golden eval set for [my product] using my quality dimensions from Section 1.

Composition:
- 20-50 test cases for the first version (more is better, but 20 is enough to start)
- Spread across the major use-case categories my product handles
- Include 5-10 known hard cases — inputs that have produced bad outputs in the past, customer-reported failures, or edge cases I've spotted in production

For each test case, define:

1. **Input** — the actual user input or context my product would receive
2. **Expected output** — either:
   - **Exact match**: for deterministic outputs (e.g., "the answer should include the phrase X")
   - **Rubric**: for non-deterministic outputs (e.g., "the answer should be polite, factual, and under 200 words")
   - **Reference output**: a high-quality answer I've reviewed myself — used as the comparison for LLM-as-judge scoring
3. **Quality dimension(s)** this test case covers
4. **Tags** — category, hardness, source (real customer / synthetic / edge case)
5. **Owner** — me, until the team grows

Storage:
- One YAML / JSON file per test case in an `evals/` directory in my repo
- Or a simple Postgres / Supabase table if I want to manage them in a UI
- Either way, version-controlled with the rest of my code

Sources for the first 20-50 cases:
- 5-10 from my customer interviews (real questions / inputs they would actually send)
- 5-10 from real production traffic (sampled — see Section 4)
- 5-10 hand-crafted edge cases (very long inputs, unusual formats, tricky language)
- 5-10 "happy path" basics — the most common requests, to detect breaking regressions

Output: the eval-set schema, a YAML template for each test case, and 5 sample test cases drafted from my product's actual use cases.

The golden set should grow organically: every time a customer reports a bad output, every time I find a regression, every time I identify a new edge case — that case becomes a permanent member of the golden set. Within 6-12 months you'll have 200+ cases that catch real regressions.
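
As a sketch of what one golden test case might look like (assuming the JSON-file-per-case option from the storage list; the field names and example values are illustrative):

```typescript
// evals/cases/refund-policy-question.json -- a hypothetical example case,
// shown here as the TypeScript type it would be parsed into.
type Expectation =
  | { kind: "exact"; mustInclude: string[] }        // deterministic check
  | { kind: "rubric"; rubric: string }              // human or LLM-as-judge rubric
  | { kind: "reference"; referenceOutput: string }; // reviewed reference answer

interface GoldenCase {
  id: string;
  input: string;       // what the product would actually receive
  expectation: Expectation;
  dimensions: string[]; // quality-dimension ids from Section 1
  tags: string[];       // category, hardness, source
  owner: string;
}

const example: GoldenCase = {
  id: "refund-policy-question",
  input: "Can I get a refund if I cancel after 20 days?",
  expectation: {
    kind: "rubric",
    rubric: "States the 14-day refund window from the docs; polite; under 120 words.",
  },
  dimensions: ["answer-accuracy", "response-specificity"],
  tags: ["billing", "hard", "real-customer"],
  owner: "founder",
};
```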


3. Run Evals on Every Prompt or Model Change

The discipline that separates teams who ship reliable AI from teams who don't: never push a prompt change or model swap without running the eval suite first.

Wire the eval suite into my development workflow.

CI integration:

1. **Local script** — `npm run evals` or `bun evals` runs the full golden set against my current prompt + model
2. **PR check** — every PR that touches `prompts/` or model-config files automatically runs the suite. CI fails if any test case regresses
3. **Pre-deploy gate** — before any production deploy that changes AI behavior, the suite must pass

Eval-suite implementation:

For each test case:
- Run the actual production code path (or a faithful replica) with the test input
- Capture the output
- Score it against the expected/rubric/reference using:
  - **Deterministic checks**: regex / substring / format-validation (cheap, fast, reliable for what they catch)
  - **LLM-as-judge**: a separate Claude / GPT call asks "does this output meet the rubric?" with the rubric and the actual output (good for nuanced quality but adds cost and noise)
  - **Embedding similarity**: for "did the output stay on topic?" or "is this output close to the reference?"

Tools:

- **OpenAI Evals** — open-source, fairly framework-y, good if I'm OpenAI-only
- **Promptfoo** — most popular open-source eval tool in 2026, works across providers, declarative YAML, CI-friendly
- **Braintrust** — managed platform, integrates evals + production monitoring + dashboards. Worth it past 100 customers
- **LangSmith** — if I'm on LangChain
- **Custom** — for solo founders, a 100-line eval runner is fine and avoids tool lock-in

For my stack, recommend ONE eval tool with rationale. Default if no strong reason: Promptfoo for solo / small team; Braintrust if I want managed dashboards.

CI configuration (assuming Promptfoo):
- `.github/workflows/evals.yml` runs `promptfoo eval` on every PR
- Fails CI if pass rate drops below [threshold, typically 95%]
- Posts the eval results as a PR comment so the change author sees what regressed

Output: the eval-runner code, the CI workflow file, the per-dimension scoring functions.

The non-obvious cost: LLM-as-judge evals run Claude / GPT against every test case, which costs real money. Budget a few cents per eval-suite run; running on every PR can add up. Use cheaper models (Claude Haiku, GPT-4o-mini) for the judging step where possible.
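
If you go the custom-runner route from the tool list above, a minimal sketch might look like this. It assumes the `GoldenCase` shape from Section 2, plus `callProduct` and `judgeWithRubric` hooks into your own code; all of these are stand-ins, not a specific library's API:

```typescript
// A bare-bones custom eval runner in the spirit of the "100-line runner" option:
// load cases, run the production path, score each case, report a pass rate.
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";
import type { GoldenCase } from "./golden-case"; // hypothetical module holding the Section 2 type

declare function callProduct(input: string): Promise<string>;
declare function judgeWithRubric(output: string, rubric: string): Promise<boolean>;

async function scoreCase(c: GoldenCase): Promise<boolean> {
  const output = await callProduct(c.input);
  switch (c.expectation.kind) {
    case "exact": // deterministic: every required phrase must appear
      return c.expectation.mustInclude.every((s) => output.includes(s));
    case "rubric": // LLM-as-judge against the rubric
      return judgeWithRubric(output, c.expectation.rubric);
    case "reference": // judge against the reviewed reference answer
      return judgeWithRubric(
        output,
        `Comparable in quality and content to this reference: ${c.expectation.referenceOutput}`,
      );
  }
}

async function main() {
  const dir = "evals/cases";
  const files = (await readdir(dir)).filter((f) => f.endsWith(".json"));
  let passed = 0;
  const failures: string[] = [];
  for (const file of files) {
    const c: GoldenCase = JSON.parse(await readFile(path.join(dir, file), "utf8"));
    if (await scoreCase(c)) passed += 1;
    else failures.push(c.id);
  }
  const passRate = passed / files.length;
  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}% (${passed}/${files.length})`);
  if (failures.length > 0) console.log("Regressed cases:", failures.join(", "));
  if (passRate < 0.95) process.exit(1); // fail CI below the Section 3 threshold
}

main();
```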


4. Sample Production Traffic

The golden set catches known regressions. Production traffic catches the unknown ones. Set up a continuous sample of real outputs scored for quality.

Build production-traffic sampling for [my product].

Sampling strategy:

1. **Random sample** — sample [1-5%] of all production AI outputs into a `production_samples` table
   - For each: input, output, model used, latency, customer ID (anonymized), timestamp
   - 1-5% is enough volume to detect drift without exploding storage / review costs

2. **Always-sample triggers** — sample 100% of:
   - Outputs that the user thumbs-down / reports / regenerates (they tell me "this was bad")
   - Outputs from new customers in their first 14 days (high-stakes activation moment)
   - Outputs from high-value customers (Enterprise, top 10% by usage)
   - Outputs after a model or prompt change (first 24-48h post-deploy)

3. **Score each sample** automatically using LLM-as-judge:
   - Per quality dimension from Section 1
   - Score 1-5 with brief reasoning
   - Cost: ~$0.001-0.01 per sample depending on input/output length and judging model

Storage and review:

- `production_samples` table with sampled inputs + outputs + scores + metadata
- An admin dashboard at /admin/quality showing:
  - 7-day trend per quality dimension
  - Score distribution (bell curve / histogram)
  - The lowest-scoring 20 samples in the last 7 days, sorted by score ascending — this is the review queue
  - Filters by model, prompt version, customer segment, time

Weekly review (45 minutes):

- Pull the lowest-scoring 20 samples from the past week
- For each: was the score correct? If yes, what's the failure pattern? If no, recalibrate the judge prompt
- Identify any failure that recurs in 3+ samples — that's a pattern worth investigating
- Add representative cases to the golden set (Section 2)

Output: the sampling code, the admin dashboard component, the weekly-review template.

The "always sample triggers" matter more than the random rate. Random sampling at 1% catches you on aggregate trends; the always-sample triggers catch high-stakes individual failures (a prominent customer's bad output) before they become incidents.


5. Wire Real-Time Alerting

Sampled scores are useful in retrospect; real-time alerts are useful when the model has just degraded.

Set up alerting for AI quality regressions.

Alert rules:

1. **Pass-rate drops on the golden set** — if the nightly eval-suite run produces a pass rate below my threshold, alert me. I want to know within a day of a model provider shipping a silent update, not 30 days later from a customer email.

2. **Production sample score drops** — if the rolling 24-hour mean score on any quality dimension drops below my threshold, alert. Use a simple statistical test — compare last 24h to trailing 7-day baseline; if 24h is more than 2 standard deviations below, page.

3. **Specific high-value pattern** — if any sample from a high-value customer scores below [3 / 5], alert immediately. Reach out personally before they churn quietly.

4. **Cost / latency anomaly** — if cost per output or median latency jumps significantly, alert. Often correlated with quality issues (the model is now using more tokens to compensate for confusion, or the gateway is failing over to a slower provider).

Where alerts go:

- Slack channel `#ai-quality-alerts` or equivalent — single channel, all team members watch
- For solo founders: SMS / push notification on critical alerts only
- Don't use email — emailed alerts get ignored within a week

Alert hygiene:

- Each alert includes: the metric, the threshold, the actual value, a link to the relevant dashboard or sample, and a "mute for 1h" / "snooze for 24h" option (alerts that can't be muted get muted by ignoring)
- Track false-positive rate. If more than 20% of alerts are false alarms, tighten thresholds or improve the metric. Alert fatigue is the killer of monitoring systems.
- Run a "drill" once a quarter: deliberately introduce a quality regression in staging, confirm the alert fires within the expected window. If it doesn't, the system is broken.

Output: the alert-rule definitions, the Slack / SMS integration code, the false-positive tracking spreadsheet template.

The quarterly drill is the rule most teams skip. Monitoring systems decay silently — thresholds drift, integrations break, alerts go to channels nobody reads anymore. The drill is the only way to know your monitoring still works.
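
For rule 2 above, the 24-hour-versus-7-day comparison can be as simple as a z-score check. A minimal sketch, where the minimum sample counts and the pager hook are assumptions:

```typescript
// Compare the last 24h of scores for one dimension against the trailing
// 7-day baseline; alert if the recent mean sits more than 2 standard
// deviations below the baseline mean.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length);
}

function scoreDriftAlert(last24h: number[], trailing7d: number[]): boolean {
  if (last24h.length < 20 || trailing7d.length < 100) return false; // too little data to judge
  const baselineMean = mean(trailing7d);
  const baselineStd = stddev(trailing7d);
  if (baselineStd === 0) return false;
  const z = (mean(last24h) - baselineMean) / baselineStd;
  return z < -2; // more than 2 standard deviations below the baseline
}

// Example: run this per quality dimension on a cron; when it returns true,
// post to #ai-quality-alerts with the metric, threshold, actual value, and a
// link to the relevant dashboard.
```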


6. Build the Quality Improvement Loop

Detection without action is theater. Set up the operating rhythm that turns detected regressions into shipped fixes.

Build a weekly quality improvement loop.

Monday (45 minutes):

1. **Review the past week's quality dashboard** — score trends per dimension, alert volume, false-positive rate
2. **Triage the bottom 20 samples** from production sampling — what's failing, why, which dimension
3. **Pick the highest-leverage improvement** — the failure pattern that affects the most customers or highest-value cases
4. **Open a ticket** with: the failure description, 3 example sample IDs, the proposed fix (prompt edit, model change, RAG context tweak), the success metric

Wednesday or Thursday:

1. **Implement the fix** — usually a prompt change or context-retrieval change
2. **Run the full golden set** — verify no regressions on existing cases
3. **Add the failure pattern as new test cases** to the golden set so it can never silently recur
4. **Deploy** — to staging first, then production. Rollout pattern: 10% traffic first, monitor for 24h, then 100%

Friday:

1. **Verify the fix worked in production** — this week's new samples should score higher on the dimension that was failing
2. **Document the change** in the team's knowledge base — what was failing, why, what fixed it. Builds institutional memory across changes.

Quarterly (90 minutes):

1. **Audit the golden set** — are there test cases that haven't failed in 6+ months and don't represent edge cases anymore? Trim them to keep the suite fast.
2. **Audit alert thresholds** — should they be tighter (fewer false negatives) or looser (fewer false positives)? Adjust based on the previous quarter's accuracy.
3. **Audit dimension definitions** — has my product changed enough that some quality dimensions should be added, removed, or redefined?

Output: the recurring weekly + quarterly templates I can fill in.

The "add the failure pattern as new test cases" rule is the compounding move. Six months in, your golden set has 100+ test cases that each represent a real regression you caught, and the suite catches every type of error you've ever seen.


7. Handle the Model-Provider Update Problem

Anthropic, OpenAI, and Google ship silent model updates. Defending against this is its own discipline.

Build a model-provider-change defense layer for [my product].

Three protections:

1. **Pin model versions, not aliases**
   - Don't use `claude-sonnet-4` (alias to latest)
   - Use `claude-sonnet-4-5-20250929` (specific dated version)
   - Same for OpenAI: pin to specific version stamps, not the floating model name
   - Update intentionally, with eval-suite verification, not automatically

2. **Run a daily "model health check"**
   - Subset of 10-20 golden tests run nightly against your pinned model
   - If the pass rate drops, investigate immediately — it usually means the provider has changed behavior even on a pinned version (rare, but it happens, especially with safety updates)
   - Your eval-suite from Section 3 is the input; this is just running it nightly on a cron

3. **Plan for provider migrations**
   - When a new model version (e.g., Claude Sonnet 5) ships:
     - Run the full eval suite against the new model
     - Compare scores per dimension to the current pinned model
     - If new model scores higher on the dimensions I care about, plan a migration
     - Migrate behind a feature flag, 10% traffic for 48 hours, monitor production samples, expand to 100%
   - If migration improves quality, lock in. If it doesn't, defer.
   - Use [Vercel AI Gateway](../../../VibeReference/cloud-and-hosting/vercel-ai-gateway.md) or similar for fast model swaps with consistent observability

For my product:
- Document my pinned model version
- Set up the nightly health-check cron
- Write the model-migration runbook so future-me (or a teammate) knows exactly how to evaluate a new model

Output: the version-pinning code, the cron-job spec, the migration runbook template.

The "pin specific dated versions" rule is non-obvious but important. The "claude-sonnet-4" alias points to whatever Anthropic considers the latest 4-tier model — on the day they ship 4.5, your code silently uses the new model with possibly different behavior. Pinning the dated version ensures changes are intentional.


Common Failure Modes

"We have no eval suite." Highest priority before anything else. Even 10 hand-crafted test cases beats nothing. Section 2 in 1 day.

"We have an eval suite but never run it on prompt changes." Wire it into CI per Section 3. The discipline of "PRs fail if evals regress" is what gives the suite value.

"Customers report bad outputs but we can't reproduce them." Production sampling is missing. Section 4. Without sampling, you only see the outputs angry customers email about — usually 1% of bad outputs.

"Alerts fire constantly and we ignore them." Alert fatigue. Either the thresholds are too tight or the metrics are too noisy. Tune thresholds based on actual false-positive rates; mute the noisy alerts entirely if they're not actionable.

"We changed prompts and it improved one case but broke three others." No regression test. The golden set + CI gate from Sections 2-3 is exactly the fix.

"Our model provider made an update and quality dropped — we found out from a customer." No model-provider-change defense. Section 7's pinning + nightly health checks would have caught this in 24 hours instead of 14 days.

"Our quality scores look fine but customers churn anyway." Quality dimensions don't match what customers care about. Re-derive Section 1's dimensions from real customer feedback and support tickets — what dimension would that complaint have failed on?


Related Reading

  • Activation Funnel Diagnosis — silent quality regression is one of the most common causes of activation drop
  • Customer Support — support tickets are leading indicators; mining them per Section 7 of that guide surfaces quality issues
  • Reduce Churn — quality issues are a major silent cause of churn; the production sampling here detects what cancellation surveys cannot
  • Vercel AI Gateway — running multiple model providers behind a gateway makes A/B testing and migrating easier
  • Feature Flags — model migrations and prompt changes ship behind feature flags

⬅️ Growth Overview