Cut Your AI SaaS LLM Costs Without Losing Quality
LLM Cost Optimization for Your New SaaS
Goal: Reduce your AI inference cost-per-customer by 40–70% without measurable quality loss. Move from "the LLM bill ate our gross margin" to "we have positive unit economics with room to spare." Without this, AI SaaS founders watch costs scale linearly with users and discover at $30k MRR that their cost-of-goods is 80% of revenue.
Process: Follow this chat pattern with your AI coding tool such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.
Timeframe: First-pass cost audit in 1 day. Major optimizations shipped over 30 days. Quarterly cost reviews baked in from launch.
Why Most AI SaaS Have Bad Margins
Three patterns hit AI founders the same way:
- Default-to-frontier-model laziness. "Just use Claude Sonnet for everything" works when you have 10 customers. At 1,000 customers, 80% of those calls could run on Haiku at 1/10th the cost without quality loss — but nobody audited which calls actually need the frontier model.
- No caching of deterministic responses. A user opening the dashboard might trigger 5 LLM calls per page load, 4 of which produce the same output every time. Cached, that's near-zero cost; uncached, it's 5× the bill.
- Prompt bloat. Prompts grow from 200 tokens at launch to 2,000 tokens by month 6 as features get added. Every one of those tokens is billed on every single call, and bloated prompts tend to produce longer responses. The bill scales with prompt size as much as with request volume.
The fix is structural: instrument cost per call, audit which calls need which model, cache aggressively, optimize prompts, and watch the metric monthly.
This guide pairs with LLM Quality Monitoring (cost optimization without quality measurement is risky), PostHog Setup (PostHog's LLM observability captures the cost data), Usage-Based Billing (cost-of-goods directly determines pricing strategy), and Activation Funnel (where you see the user-side impact of latency optimizations).
1. Instrument Cost-per-Call First
You cannot optimize what you do not measure. The first move is wiring per-call cost tracking into every LLM invocation.
I'm building [your product] at [your-domain.com] using [Anthropic / OpenAI / mixed providers / Vercel AI Gateway]. Help me instrument cost tracking on every LLM call.
What to capture per call:
1. **Provider** — which model provider (Anthropic, OpenAI, Google, Replicate)
2. **Model** — exact model name + version (e.g., `claude-sonnet-4-5-20250929`)
3. **Feature** — which product feature triggered the call (report_generation, chat, summarization, etc.)
4. **User ID** — who made the call (NOT used for personalization here, used for per-user cost tracking)
5. **Plan tier** — free / pro / team — so you can analyze cost-by-plan
6. **Input tokens, output tokens** — provider returns these in every response
7. **Cost in USD** — calculated from the per-token rates table
8. **Latency** — how long the call took
9. **Cache hit?** — boolean for prompt caching, semantic caching, or response caching
10. **Outcome** — success / error / refusal
Wire this into an `llm_calls` table in [Postgres / Convex / chosen DB]. Do NOT log full prompts in production: they bloat storage and create privacy issues. Log:
- Hash of the prompt (for dedup analysis)
- First 200 chars of prompt (for debugging)
- Hash of the output
- First 200 chars of output
Alternative: use [PostHog's LLM observability](posthog-setup-chat.md) — same data, hosted dashboard, less custom work. Faster to ship for early stage.
Output:
- The schema for the `llm_calls` table (or PostHog event property structure)
- The cost-calculation table (per-million-tokens for each model I use)
- The per-feature wrapper code that captures all 10 fields
- The dashboard queries for the four most useful views: cost-per-customer-per-month, cost-per-feature, cost-per-model, cache-hit-rate
Cost-calculation reference (2026 directional):
- Claude Haiku 4.5: ~$0.80/1M input, ~$4/1M output
- Claude Sonnet 4.5: ~$3/1M input, ~$15/1M output
- Claude Opus 4.7: ~$15/1M input, ~$75/1M output
- GPT-4o-mini: ~$0.15/1M input, ~$0.60/1M output
- GPT-4o: ~$2.50/1M input, ~$10/1M output
- Update with actual rates from each provider before committing pricing.
Don't ship without item 7 (cost in USD). Founders who don't compute the per-call dollar amount can't optimize the right thing.
The dollar amount is the discipline. "1.2M tokens spent on this feature" is abstract. "$340/month spent on this feature" is actionable. Always carry through to actual dollars.
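To keep item 7 honest in code, here is a minimal sketch of the USD calculation using the directional rates above; the model IDs and helper name are illustrative assumptions, not a fixed API.

\`\`\`ts
// Directional per-million-token rates; replace with actual provider pricing.
const RATES_PER_MTOK: Record<string, { input: number; output: number }> = {
  "claude-haiku-4-5": { input: 0.8, output: 4 },
  "claude-sonnet-4-5": { input: 3, output: 15 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const rate = RATES_PER_MTOK[model];
  if (!rate) throw new Error("No rate configured for " + model);
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}

// Example: 1,500 input + 400 output tokens on Sonnet
// => (1500 * 3 + 400 * 15) / 1e6 = $0.0105 per call
\`\`\`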
2. Audit Which Calls Need the Frontier Model
The single biggest cost win is usually moving 60–80% of calls from a frontier model (Sonnet, Opus, GPT-4o) to a smaller model (Haiku, GPT-4o-mini) where quality holds. Most teams discover they paid for capability they didn't need.
Audit which LLM calls in [my product] genuinely need the frontier model.
For each feature/call type in my codebase:
1. **What's the task?** Classification? Generation? Reasoning? Tool use? Code generation?
2. **What's the current model?**
3. **Could a smaller / cheaper model handle it?** Run the eval suite (per [LLM Quality Monitoring](llm-quality-monitoring-chat.md)) against the smaller model and compare.
The decision matrix for AI SaaS:
| Task type | Frontier needed? | Cheaper alternative |
|-----------|------------------|---------------------|
| Open-ended creative generation | Yes for high-value outputs | Smaller for drafts |
| Classification (5-50 categories) | No | Haiku / GPT-4o-mini handles 90%+ |
| Summarization (factual) | No | Haiku / GPT-4o-mini |
| Code generation (single file) | Maybe | Haiku for simple code; Sonnet for complex |
| Code review / refactoring | Yes | Frontier worth it |
| Tool selection (which tool to call) | No | Haiku / GPT-4o-mini |
| Routing / triage | No | Haiku / GPT-4o-mini |
| Extraction (structured output) | No | Haiku / GPT-4o-mini |
| Reasoning / multi-step planning | Yes | Frontier or specialized reasoning model |
| Translation | No | Haiku / GPT-4o-mini |
The pattern: **routing, classification, extraction, summarization → small model. Generation, reasoning, complex tool use → large model.**
For each feature in my product, output:
- Current model
- Recommended model based on the matrix
- Expected cost savings (compute the dollar delta)
- Eval test plan: which test cases must pass on the new model before switching
The migration pattern:
- Run eval suite on smaller model first
- If the pass rate is within 2-3 points of the frontier model's, the smaller model wins (decision logic sketched at the end of this section)
- Migrate behind a [feature flag](feature-flags-chat.md): 10% traffic to smaller model, monitor for 48-72h, expand if quality holds
Output: my prioritized model-migration list with estimated dollar impact per feature.
The migration discipline is what saves you from the wrong call. "Try Haiku for tool selection" without an eval suite produces "agents using the wrong tools 12% more often" — which costs more in customer trust than it saves in API spend. Always evaluate before migrating.
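To make the decision step concrete, a hedged sketch of the eval gate; `runEvalSuite` is a stand-in for the harness from [LLM Quality Monitoring](llm-quality-monitoring-chat.md), assumed to return a pass rate between 0 and 1.

\`\`\`ts
// Stand-in for your eval harness; the signature is an assumption.
declare function runEvalSuite(cases: string[], model: string): Promise<number>;

async function shouldMigrate(
  evalCases: string[],
  frontierModel: string,
  smallerModel: string,
): Promise<boolean> {
  const frontierPassRate = await runEvalSuite(evalCases, frontierModel);
  const smallerPassRate = await runEvalSuite(evalCases, smallerModel);
  // The 2-3 point rule: migrate only if the smaller model stays close
  return frontierPassRate - smallerPassRate <= 0.03;
}
\`\`\`

A `true` here green-lights the 10% feature-flag rollout, not a full cutover; the 48-72h monitoring window still applies.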
3. Use a Two-Tier "Router" Pattern
Even within a single feature, individual calls vary in complexity. The router pattern routes simple cases to a cheap model and complex ones to a frontier model.
Implement the router pattern for [my high-volume feature].
The architecture:
1. **Cheap classifier** — a small fast model decides which "track" each request goes on
- Input: the user request
- Output: simple/complex/uncertain
- Cost: tiny (~$0.0001 per classification using Haiku)
2. **Cheap-track model** — Haiku / GPT-4o-mini for "simple" classification
- 80%+ of requests should land here
3. **Expensive-track model** — Sonnet / GPT-4o / Opus for "complex" or "uncertain"
- 15-20% of requests; the cost-per-call is higher but you're using the capability you're paying for
Routing logic example for an AI content tool:
\`\`\`ts
// `classify` and `complete` are thin app-level wrappers around your LLM
// client (a sketch of `classify` appears at the end of this section).
async function generateContent(userRequest: string) {
  // Cheap classification decides the track
  const route = await classify(userRequest, { model: "claude-haiku-4-5" });
  // Route enum: "simple_summary" | "complex_synthesis" | "creative_long_form"
  if (route === "simple_summary") {
    return await complete(userRequest, { model: "claude-haiku-4-5" });
  }
  if (route === "complex_synthesis") {
    return await complete(userRequest, { model: "claude-sonnet-4-5" });
  }
  // Creative long-form gets the heavy model
  return await complete(userRequest, { model: "claude-opus-4-7" });
}
\`\`\`
Implementation:
- Build the classifier prompt — keep under 50 input tokens, ask for JSON: \`{"route": "simple_summary"}\`
- Test the classifier on 50-100 historical requests. If the classifier mis-routes >10%, refine the prompt or use a slightly larger classifier model.
- Monitor per-route cost in the dashboard from Section 1.
- Adjust thresholds based on actual usage — if you're routing 95% to "simple" but quality is suffering, the classifier is too aggressive.
Real-world impact: routing typically cuts cost-per-call by 50-70% on workloads where simple/complex is bimodal. Doesn't help much if all requests are uniformly hard.
Output: the router code + the classifier prompt + the eval plan for verifying routing accuracy.
The router pattern is the highest-leverage architectural change for AI SaaS economics. The cheap classifier costs almost nothing; the routing decision saves real money on every downstream call.
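For completeness, a hedged sketch of the `classify` helper the router assumes; the prompt wording, route names, and fallback behavior are illustrative assumptions, not a fixed recipe.

\`\`\`ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env

type Route = "simple_summary" | "complex_synthesis" | "creative_long_form";

async function classify(userRequest: string, opts: { model: string }): Promise<Route> {
  const res = await anthropic.messages.create({
    model: opts.model,
    max_tokens: 20,
    system:
      'Classify the request. Reply with JSON only, e.g. {"route": "simple_summary"}. ' +
      "Routes: simple_summary, complex_synthesis, creative_long_form.",
    messages: [{ role: "user", content: userRequest }],
  });
  const block = res.content[0];
  const text = block.type === "text" ? block.text : "{}";
  const parsed = (() => {
    try { return JSON.parse(text).route as string; } catch { return undefined; }
  })();
  const routes: Route[] = ["simple_summary", "complex_synthesis", "creative_long_form"];
  // Default anything unparseable or unexpected to the expensive track:
  // per the routing logic above, "uncertain" trades a little cost for quality.
  return routes.includes(parsed as Route) ? (parsed as Route) : "complex_synthesis";
}
\`\`\`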
4. Cache Aggressively (3 Layers)
Caching is the savings most founders never model. Three layers compound:
Build three layers of caching for [my product].
**Layer 1: Prompt caching (provider-side)**
Anthropic's prompt caching and OpenAI's automatic input caching reuse the long static prefix of repeated prompts (system prompt, tool definitions, retrieved context). Massive savings when you reuse the same long prefix across many calls.
For Anthropic's prompt caching:
\`\`\`ts
const result = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024, // required by the Messages API
  system: [
    {
      type: "text",
      text: longSystemPrompt,
      cache_control: { type: "ephemeral" }, // mark this block for caching
    },
  ],
  messages: [{ role: "user", content: userQuery }],
});
\`\`\`
Cache hit savings: ~90% off input tokens for the cached portion. Worth wiring on any system prompt > 1024 tokens.
For OpenAI: automatic for repeated prefixes; no flag needed but make sure your prefix is byte-identical across calls (no timestamps, no user IDs in the system prompt).
**Layer 2: Semantic caching (your-side)**
Cache responses by semantic similarity, not exact match. Two slightly different user requests with the same intent get the same cached answer.
Tools:
- Helicone (managed, plug-and-play)
- LangCache / Redis with embedding similarity
- Custom: embed the request, search a vector index of cached requests, return cache if cosine similarity > 0.95
Best for: chatbot FAQ-style products, support agents, anywhere users ask similar questions repeatedly. Saves 30-60% on call volume in those use cases.
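The custom option is simpler than it sounds. A minimal sketch, assuming OpenAI's embeddings API and a process-local store; production would persist to Redis or pgvector rather than an in-memory array.

\`\`\`ts
import OpenAI from "openai";

const openai = new OpenAI();
const store: { embedding: number[]; response: string }[] = [];

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticCacheGet(request: string): Promise<string | null> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: request,
  });
  const embedding = data[0].embedding;
  for (const entry of store) {
    if (cosineSimilarity(embedding, entry.embedding) > 0.95) {
      return entry.response; // same intent, reuse the cached answer
    }
  }
  return null;
}
\`\`\`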
**Layer 3: Application-level result cache**
For deterministic, idempotent operations: hash the (input, model, params) tuple and cache the response in Redis or Vercel Runtime Cache. TTL based on use case.
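A hedged sketch of this layer, assuming ioredis; `complete` is the same assumed LLM wrapper used in the Section 3 router.

\`\`\`ts
import { createHash } from "node:crypto";
import Redis from "ioredis";

// Stand-in for your LLM wrapper; the signature is an assumption.
declare function complete(input: string, opts: Record<string, unknown>): Promise<string>;

const redis = new Redis();

async function cachedComplete(
  input: string,
  model: string,
  params: Record<string, unknown>,
  ttlSeconds = 3600, // TTL is use-case dependent
): Promise<string> {
  // Hash the (input, model, params) tuple into a stable cache key
  const key = "llm:" + createHash("sha256")
    .update(JSON.stringify({ input, model, params }))
    .digest("hex");
  const hit = await redis.get(key);
  if (hit !== null) return hit; // near-zero cost; do NOT bill the customer
  const result = await complete(input, { model, ...params });
  await redis.setex(key, ttlSeconds, result);
  return result;
}
\`\`\`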
Critical exclusions:
- Cache hits should NOT be billed to the customer per [Usage-Based Billing](usage-based-billing-chat.md). A hit costs you nothing in API spend; passing that cost through to the customer is unethical.
- Personalized outputs (containing user names, account-specific data) usually shouldn't be globally cached; per-user-per-day cache is fine.
Output:
- The system-prompt structure to enable provider-side caching
- The semantic-cache wrapper or tool integration
- The application-cache for deterministic operations
- The cache-billing-exclusion logic that prevents billing customers for hits
For my product, identify which features are best candidates for each cache layer.
The provider-side prompt cache is the lowest-effort win. If your system prompt is 2,000 tokens and the same across all calls, enabling cache control saves 90% of the input cost on every cached call. Paste-and-ship in an hour.
5. Optimize Prompts for Cost (Without Hurting Quality)
Prompt bloat is the silent cost. Most prompts grow over time as features get added; few get audited.
Audit my top 5 prompts by call volume and cost.
For each prompt:
1. **Total tokens** (system + user template + injected context)
2. **What's actually doing work** vs filler
3. **Specific words to cut**:
- Politeness words ("please," "could you")
- Repeated instructions (some teams say the same thing 3 different ways)
- Verbose explanations of what NOT to do (often longer than what TO do)
- Examples that are no longer relevant
4. **What can be moved to system messages with cache_control**: anything static across calls
Optimization techniques:
1. **Replace examples with constraints**: instead of 3 examples (~200 tokens each), state the constraint in 1 sentence (~30 tokens). Often quality holds.
2. **Use bullet points over prose**: "do X. do Y. do Z." in 3 lines beats prose paragraphs. LLMs respond well to structure.
3. **Trim retrieved context**: if you're injecting RAG results, are you injecting top-10 when top-3 would do? Each unused chunk is paid input tokens.
4. **Move from prompt-engineering to fine-tuning**: for high-volume, narrow tasks, fine-tuning a smaller model may beat prompting a larger one on cost (and often quality). Consider it when a feature is doing 100k+ calls/month with a stable pattern.
5. **Output truncation**: ask for shorter output explicitly. "Respond in under 100 words" vs vague "respond concisely" cuts output tokens (which are 3-5× more expensive than input).
Eval discipline:
- Every prompt change runs the eval suite (per [LLM Quality Monitoring](llm-quality-monitoring-chat.md))
- Document the before/after token count and quality scores
- Reject changes that save tokens but drop quality scores meaningfully
Output: per-prompt audit + the optimized prompts + the savings calculation per prompt.
The "shorter output explicitly" rule has outsized impact. Output tokens cost 3–5× input tokens for most providers. A prompt that asks for 200-word output instead of 800-word output saves more on output cost than the entire input prompt costs.
6. Use Provider Routing for Failover and Cost
Running multiple providers behind a gateway lets you optimize cost in addition to availability.
Set up multi-provider routing for cost optimization.
Use [Vercel AI Gateway](../../../VibeReference/cloud-and-hosting/vercel-ai-gateway.md), OpenRouter, or Portkey to route across providers programmatically.
Three routing strategies:
1. **Cost-first routing** — for tasks where multiple providers can handle the same job at the same quality:
- Define a "model class" (e.g., "small-fast-classifier")
- Route to the cheapest available among Haiku, GPT-4o-mini, Gemini Flash
- Saves 10-30% just from arbitrage between providers
2. **Quality-first with cost-tiebreak**:
- Run eval suite on the same task across providers
- Pick the highest-quality; if tied, the cheapest
- Re-evaluate quarterly as providers update
3. **Failover for availability**:
- Primary: provider A
- Failover: provider B (slightly different but acceptable)
- During provider A outages, costs increase modestly but uptime stays at 99.9%+
Critical caveats:
- **Provider migration is not free.** Switching from Anthropic to OpenAI mid-product requires re-evaluating prompt behavior. Most prompts that work on Claude need adjustment for GPT.
- **Cost arbitrage often shifts.** What's cheapest today won't be cheapest in 6 months as providers re-price. Don't hard-code provider choices; route through the gateway abstraction.
- **Some features are provider-locked**. Tool use, vision, structured outputs, and reasoning modes have different APIs/capabilities per provider — routing through them transparently is harder than just text generation.
Output:
- Which features can be cross-provider (text generation, classification, summarization)
- Which features should stay locked to a provider (anything using provider-specific features like Claude's computer use, OpenAI's reasoning models)
- The gateway configuration with the routing rules
- The cost monitoring per provider over time
The arbitrage is real but smaller than people expect. The bigger win from gateway routing is operational: faster experimentation with new models, easier A/B testing, less work to migrate when a provider ships breaking changes.
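As an illustration of strategy 1, a minimal sketch of a hand-rolled model-class table; the prices are illustrative, and in practice the gateway owns this table and you just name the class.

\`\`\`ts
// Directional prices; a gateway (Vercel AI Gateway, OpenRouter, Portkey)
// replaces this hand-rolled table in production.
const MODEL_CLASSES: Record<string, { model: string; inputPerMtok: number }[]> = {
  "small-fast-classifier": [
    { model: "gemini-flash", inputPerMtok: 0.1 },
    { model: "gpt-4o-mini", inputPerMtok: 0.15 },
    { model: "claude-haiku-4-5", inputPerMtok: 0.8 },
  ],
};

function cheapestInClass(cls: string, unavailable = new Set<string>()): string {
  const candidates = (MODEL_CLASSES[cls] ?? [])
    .filter((m) => !unavailable.has(m.model))
    .sort((a, b) => a.inputPerMtok - b.inputPerMtok);
  if (candidates.length === 0) throw new Error("No model available for class " + cls);
  return candidates[0].model; // cheapest provider that is currently up
}
\`\`\`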
7. Set Up Cost Alerts and Circuit Breakers
The disaster scenario every AI SaaS founder fears: a bug, abuse, or runaway agent generating $10,000 in inference cost overnight. Build the protection layer first.
Build cost-protection alerting and circuit breakers.
**Three layers of protection** (per [Usage-Based Billing](usage-based-billing-chat.md) Section 5):
1. **Per-customer caps**:
- Soft cap: notify at 80% of expected daily usage
- Hard cap: refuse further billable actions at 110% of expected daily usage (or paid quota)
- Most-overlooked: a free-tier customer hitting frontier models 1000 times in an hour
2. **Per-feature anomaly detection**:
- Cost-per-feature baseline (rolling 30-day average)
- Alert if today's cost on any feature is >2x baseline
- Catches feature-level bugs ("the new prompt accidentally generates 10× longer responses")
3. **Platform-level circuit breaker** (the runaway-agent prevention):
- Trigger: any single user crosses [10×] their typical hourly LLM spend
- Action: freeze the user's account, send Slack/email alert to me
- Don't auto-resume; human review required
- Saves you from the nightmare of "agent infinite-looped through a customer's workspace overnight, $14,000 in inference"
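A hedged sketch of that breaker; the four helpers are stand-ins for queries against the `llm_calls` table from Section 1 and your ops tooling.

\`\`\`ts
// Stand-ins; signatures are assumptions.
declare function hourlySpendUsd(userId: string): Promise<number>; // sum(cost_usd), last hour
declare function typicalHourlySpendUsd(userId: string): Promise<number>; // rolling 30-day median
declare function freezeAccount(userId: string): Promise<void>;
declare function alertFounder(message: string): Promise<void>;

async function checkCircuitBreaker(userId: string): Promise<void> {
  const current = await hourlySpendUsd(userId);
  const typical = await typicalHourlySpendUsd(userId);
  if (typical > 0 && current > 10 * typical) {
    await freezeAccount(userId); // block further billable LLM calls
    await alertFounder("Circuit breaker: " + userId + " at $" + current.toFixed(2) + "/hr");
    // No auto-resume: human review required before unfreezing
  }
}
\`\`\`

Run it on a schedule (every few minutes) or after every N calls per user; either way the blast radius is capped at roughly one check interval.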
**Cost dashboards**:
Daily review (5 minutes):
- Yesterday's total LLM cost
- Top 10 customers by yesterday's cost (anomalies?)
- Top 5 features by cost (any feature spiking?)
Weekly review (15 minutes):
- 7-day cost trend
- Cost per paid user (your COGS-per-customer)
- Margin by plan tier (cost / price)
Monthly review (45 minutes):
- Audit all 8 sections of this guide: is each optimization still in place?
- Run a "what if we routed everything to Sonnet?" cost projection — by how much would that increase costs? Quantifies what you're saving from the optimizations.
- Quarterly rate update from providers — re-check pricing assumptions in cost calculator from Section 1.
For each layer, output:
- The trigger conditions
- The notification mechanism
- The action taken (alert / freeze / refuse)
- The escalation path (founder review vs auto-handled)
Test the circuit breakers in staging once per quarter. Without testing, you don't know if they actually fire.
The platform-level circuit breaker (item 3) is non-negotiable. Two real incidents from 2025–2026: an agent that infinite-looped through a customer's workspace ran up $14k in a weekend; a misconfigured prompt template generated 200M tokens before anyone noticed. Both would have been stopped within an hour by a 10× hourly-usage circuit breaker.
8. The Honest Cost-Per-Customer Math
After the optimizations, run the cost-per-customer math honestly. Most AI SaaS underprice because they didn't model this rigorously.
Compute my honest cost-per-customer.
For each pricing tier:
1. **Average LLM cost per customer per month** (rolling 30-day average from cost-tracking data)
2. **Other COGS** (Stripe fees ~3%, hosting, support, tooling)
3. **Total COGS per customer**
4. **Price per customer**
5. **Gross margin per customer** = (price - COGS) / price
Healthy SaaS gross margin: 70-85%
Healthy AI SaaS gross margin: 50-70% (model costs are a real drag; if you're at 80%, you might be over-pricing or over-restricting the product)
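Illustrative arithmetic: a $49/month Pro tier with $12/month average LLM cost and $4/month other COGS gives (49 - 16) / 49 ≈ 67% gross margin, squarely in the healthy AI SaaS band.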
If gross margin is below 50%:
- Pricing is too low for the model usage being consumed (raise prices per [Raise Prices](https://www.launchweek.ai/convert/raise-prices), or implement [Usage-Based Billing](usage-based-billing-chat.md))
- Free tier is too generous (tighten quotas)
- Product is over-using the frontier model where smaller would do (Section 2)
- Caching is not implemented (Section 4)
If gross margin is above 80%:
- Probably under-using LLM capability — consider richer features that pay for themselves
- Or your pricing might be too high for what you're delivering — risk of competitors undercutting
For my product, output:
- The actual gross margin calculation per tier
- Where the margin is high vs low
- The 3 highest-leverage optimizations to ship next
- A 90-day projection of margin if those optimizations land
Review the actual numbers against the projection quarterly. Margin should improve quarter-over-quarter as optimizations compound; if not, audit which optimization decayed.
The single most important number for an AI SaaS founder: cost-per-customer-per-month. Knowing this enables every other decision. Founders who can't quote it are flying blind on unit economics.
Common Failure Modes
"We're using Sonnet for everything." Section 2 audit. Most calls don't need it. Move classification, routing, extraction to Haiku-tier; keep frontier for genuinely hard tasks.
"Our system prompt is 2,500 tokens and not cached." Lowest-effort win in the category. Section 4 layer 1 — flip on prompt caching, save 90% of input cost.
"Same user re-asks the same question 50 times in a session." No semantic caching. Section 4 layer 2.
"A customer ran up $4,000 in inference overnight." No platform-level circuit breaker. Section 7. Apply credit to the customer; ship circuit breaker the same day.
"Our cost-per-customer was $8/month at 100 customers; it's $14/month at 1,000." Cost is scaling worse than linear, which means power-user behavior is hitting your margins. Tighten free-tier limits, add usage-based overages, audit what those expensive customers are doing.
"We routed to a cheaper provider and quality dropped." Skipped the eval step. Always run the eval suite (per LLM Quality Monitoring) before migrating prompts to a different provider.
"We don't track cost per call." Section 1. Until you can answer "what does it cost when a customer presses [Generate]?" you can't optimize anything.
Related Reading
- LLM Quality Monitoring — cost optimization without quality measurement is risky
- PostHog Setup — PostHog's LLM observability dashboard captures the per-call cost data
- Usage-Based Billing — cost-of-goods determines pricing; cache hits should NOT bill customers
- Activation Funnel — latency optimizations from caching show up as faster activation times
- Reduce Churn — quality regressions from over-aggressive optimization can drive silent churn
- Vercel AI Gateway — provider routing infrastructure
- Raise Prices — if margins are still too thin after optimization, prices may need to move