
Set Up Your Status Page and Incident Response

⬅️ Growth Overview

Status Page and Incident Response for Your New SaaS

Goal: Stand up a public status page that builds trust during incidents instead of damaging it. Wire the alerting, the response runbooks, and the customer-communication discipline that turn a 30-minute outage from a PR disaster into a non-event. Without this, every minor outage in your first year of operation feels existential, and customers find out about issues before you do.

Process: Follow this chat pattern with your AI coding tool such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: V1 status page live in 1 day. First runbook in week 2. Quarterly incident-response drills baked into the calendar from launch onward.


Why Solo Founders Get This Wrong

Three failure modes hit indie SaaS founders consistently:

  • No status page until the first big outage. Customers report "your product is down" via Twitter, Slack, and email. The founder runs around triaging individual reports while also fixing the actual bug. Communication is fragmented; trust is destroyed even after the fix.
  • Status page exists but is never updated. Page lives at status.your-domain.com, gets visited once per quarter (during launches), and stays green during outages because the founder is heads-down debugging. Worse than no page — confirms to customers that the page is theatre.
  • Incident response is panic-driven. Each outage is a fresh fire-drill: which logs do I check first? Who do I tell? What do I post? The work becomes 4× harder because there's no playbook.

The version that works is structural: the status page lives at a known URL, gets updated by automation when possible, and gets manually updated within 5 minutes during incidents. Each incident type has a runbook. Customers find out from the status page before they find out from each other.

This guide pairs with Customer Support (incident comms route through the same channels), Data Trust (incident response is the operational arm of the security posture you publish there), and Reduce Churn (a well-handled incident costs far less trust than a poorly handled one).


1. Pick a Status Page Tool

Five real choices in 2026, all with significantly different cost/feature curves.

I'm building [your product] at [your-domain.com]. My current customer count is [N], expected to grow to [N] in 12 months. My ICP is [B2B / B2C / developer / enterprise].

Help me pick a status page tool from these five:

1. **Statuspage by Atlassian** — the established default. $29-$1,500/month depending on subscribers and features. Mature, integrates with PagerDuty / Opsgenie / many tools. Best for B2B with mid-market+ buyers who recognize the name.

2. **Instatus** — modern, less expensive than Statuspage. $20-$300/month. Better default UI, faster setup, good integration story.

3. **Better Stack (formerly BetterUptime)** — uptime monitoring + status page in one. $24-$200/month. Good for solo founders who want monitoring + status under one bill.

4. **Self-hosted (Cachet / cState / your own)** — open source. $0/month + ops time. Best when budget is the constraint OR when status page itself needs to be on a separate cloud / region from the main product (so it stays up when prod is down).

5. **Vercel + custom Next.js page** — roll your own. Full control, low cost, more dev time. Some teams build a `/status` route in their main app; this is fine for early stage but FAILS during full outages because the same infrastructure that took down prod might take down the status page too.

For my specific case:
- Recommend ONE tool with rationale
- The realistic monthly cost at my current and 12-month scale
- The setup time (1-2 hours for tools 1-3; 4-8 hours for self-hosted; 1-2 days for custom)
- Critical principle: the status page must NOT live on the same infrastructure as the product. If both go down together, you cannot communicate with customers.

Default if no strong reason: Better Stack at the indie tier. Combines uptime monitoring + status page; the integration is the value-add.

The non-negotiable principle: status page lives on different infrastructure than the product. The most embarrassing failure mode is "we had an outage and our status page was also down because it was hosted on the same region of the same cloud." Tools 1, 2, 3, and 5 are all fine — they're hosted on third-party infra. Self-hosted and rolling your own require explicit attention to this constraint.


2. Define What the Status Page Tracks

Most status pages are vague — "API: Operational" with no detail. Useful status pages decompose into the components customers actually care about.

Define the components my status page tracks. These should map to customer-perceivable surfaces, not internal architecture.

For my product, identify 4-8 components. Each should be:
- Something a customer would notice when it breaks
- Distinct enough that I can mark one degraded while others stay green
- Mapped to specific monitoring checks (so updates can be automated)

For an AI SaaS product, typical decomposition:

- **Web Application** — the user-facing dashboard / app
- **Public API** — programmatic access endpoints
- **AI Generation** — the core LLM-powered feature(s), tracked separately from the web app because the model provider can have outages
- **Authentication** — login, signup, password reset
- **Billing & Payments** — Stripe integration, invoice generation
- **Webhooks Outbound** — events we send to customer integrations
- **Admin Dashboard** — internal tooling (only show this if customers depend on it)

An additional dimension: STATUS LEVELS

Each component has a state, refreshed in real time:
- **Operational** (green) — everything fine
- **Degraded Performance** (yellow) — slow but working
- **Partial Outage** (orange) — some users / regions affected
- **Major Outage** (red) — broken for most users

The honest discipline: if response times are 3× normal but requests are still succeeding, mark "Degraded" — even though everything technically "works." Customers can tell something's wrong; pretending it's green when they're experiencing slowness costs trust.

For my product, output:
- The 4-8 components I'll track
- The status levels and what triggers each
- The monitoring checks that will auto-update each component (per Section 3)

The "Degraded for slow but working" rule is the discipline most teams skip and the reason their status pages get distrusted. Customers feeling slowness while the page is green tells them the page is theatre. Yellow when slow → green when fast preserves the trust contract.


3. Wire Automated Monitoring

Manual status updates are unreliable — at 3am, when you're asleep, an outage could be in progress for two hours before you even know to update the page. Automated monitoring catches this.

Wire monitoring that auto-updates the status page when checks fail.

Synthetic monitoring (3-5 checks per component):

1. **Public uptime check**: simple HTTP GET on a healthcheck endpoint, every 1-2 minutes from multiple regions.
   - Use: Better Stack, Pingdom, UptimeRobot, or your status-page tool's built-in monitoring
   - Endpoint to expose: `GET /api/health` returning `200 OK` if the service is up; `503` if a hard dependency (DB, AI provider) is down (a minimal sketch follows this list)
   - Triggers status update: Web App → Operational/Degraded/Major automatically

2. **Functional / synthetic check**: simulates a real user action (login, generate a small AI output, fetch a record). More expensive but catches issues that simple uptime checks miss (auth broken, AI provider returning bad outputs, DB returning empty results).
   - Use: Checkly, Better Stack synthetic checks, your own scheduled cron via [Vercel Functions](../../../VibeReference/cloud-and-hosting/vercel-functions.md)
   - Run every 5-10 minutes from at least 2 regions (avoids region-specific false positives)

3. **Error rate threshold**: if your error tracking (Sentry, Honeybadger) shows >5% error rate for >2 minutes, auto-update the relevant component to Degraded.
   - Use: Sentry alert webhooks → status page API to update the component

4. **Provider dependency check**: if your AI provider (Anthropic, OpenAI) is down, you're down. Subscribe to their status pages and either auto-update yours or get notified to update manually.
   - Critical: to your customers it's still your outage even when Anthropic is the cause — be transparent rather than going silent or deflecting blame

5. **Database / queue health**: if your DB is failing connections or queue depth is climbing, the Web App component degrades.
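
For check 1, a minimal sketch of the healthcheck endpoint as a Next.js App Router route handler. The `checkDatabase` and `checkAiProvider` helpers are hypothetical placeholders for whatever your hard dependencies actually are:

```typescript
// app/api/health/route.ts
// Hypothetical dependency probes; replace with your own (e.g. a SELECT 1 against the DB).
async function checkDatabase(): Promise<boolean> { /* ... */ return true; }
async function checkAiProvider(): Promise<boolean> { /* ... */ return true; }

export async function GET() {
  const [db, ai] = await Promise.all([
    checkDatabase().catch(() => false),
    checkAiProvider().catch(() => false),
  ]);

  const healthy = db && ai;
  return Response.json(
    { status: healthy ? "ok" : "unavailable", checks: { db, ai } },
    { status: healthy ? 200 : 503 } // 503 tells the uptime checker a hard dependency is down
  );
}
```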

Implementation notes:
- Use webhooks from monitoring tool → status page API to auto-create incidents and toggle component status (see the sketch after these notes)
- Set thresholds high enough to avoid false positives — if you're notified for every 30-second blip, the alerts get ignored
- 5+ consecutive failures over 2-5 minutes is a reasonable starting threshold
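
A sketch of the webhook glue mentioned in the notes above: a route that receives alerts from the monitoring or error-tracking tool and toggles the matching component on the status page. The inbound payload fields and the status page API URL are illustrative, not any specific vendor's schema:

```typescript
// app/api/monitoring-webhook/route.ts
// Maps an inbound alert to a component status update on the status page.
const STATUS_PAGE_API = "https://api.your-status-tool.example/v1/components"; // hypothetical

export async function POST(req: Request) {
  const alert = (await req.json()) as {
    checkId: string;                 // which monitoring check fired
    state: "triggered" | "resolved"; // alert opened or cleared
    consecutiveFailures?: number;
  };

  // Threshold rule from the notes above: ignore blips, act on sustained failure.
  if (alert.state === "triggered" && (alert.consecutiveFailures ?? 0) < 5) {
    return Response.json({ ignored: true });
  }

  const componentId = lookupComponent(alert.checkId);
  await fetch(`${STATUS_PAGE_API}/${componentId}`, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${process.env.STATUS_PAGE_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      status: alert.state === "resolved" ? "operational" : "major_outage",
    }),
  });

  return Response.json({ ok: true });
}

function lookupComponent(checkId: string): string {
  // e.g. "uptime-web" -> "web-app"; keep this mapping next to the component config from Section 2.
  return checkId.startsWith("uptime-web") ? "web-app" : "api";
}
```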

Output: the monitoring config + the webhook integration + the threshold rules. For each component from Section 2, name the specific check that drives its status.

The synthetic check is the differentiator. Uptime checks (HTTP 200 OK) miss the failure modes where the API responds successfully but with bad data or wrong-shaped responses. A synthetic that "logs in, generates a small AI output, verifies the output is not empty" catches what simple checks don't.
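
A sketch of that synthetic check as a scheduled script (a Vercel cron, Checkly, or any plain scheduled job works). The `/api/auth/login` and `/api/generate` endpoints, the response shapes, and the test-account credentials are assumptions about your own app; the shape is what matters: log in, generate, verify the output is non-trivial, fail loudly otherwise.

```typescript
// synthetic-check.ts — run every 5-10 minutes from a scheduler outside your main infra.
const BASE_URL = "https://app.your-domain.com"; // hypothetical

async function syntheticCheck(): Promise<void> {
  // 1. Log in with a dedicated test account (never a real customer account).
  const login = await fetch(`${BASE_URL}/api/auth/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      email: process.env.SYNTHETIC_USER_EMAIL,
      password: process.env.SYNTHETIC_USER_PASSWORD,
    }),
  });
  if (!login.ok) throw new Error(`login failed: ${login.status}`);
  const { token } = await login.json();

  // 2. Generate a small AI output — exercises the provider path, not just uptime.
  const gen = await fetch(`${BASE_URL}/api/generate`, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: "Say OK." }),
  });
  if (!gen.ok) throw new Error(`generation failed: ${gen.status}`);

  // 3. Verify the output is actually usable, not just a 200 with an empty body.
  const { output } = await gen.json();
  if (!output || output.trim().length === 0) {
    throw new Error("generation returned an empty output");
  }
}

syntheticCheck().catch((err) => {
  console.error("SYNTHETIC CHECK FAILED:", err);
  process.exit(1); // non-zero exit lets the scheduler mark the check as failed
});
```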


4. Write Incident Runbooks (Before You Need Them)

The 3am incident is the worst time to figure out what to do. Write the runbooks now, while you're calm.

Build an incident runbook for [my product]. Cover the 4-6 most likely incident types.

For each, the runbook structure:

1. **Detection** — what triggers this incident? Which alert / which check?
2. **Initial assessment** — what to verify in the first 5 minutes:
   - What's the actual scope (1 user, 10%, all)?
   - Are dependencies up (Anthropic? Stripe? DB?)?
   - Did anything change recently (deployment in last 4h)?
3. **Communication** — what to post to the status page, when:
   - Within 10 minutes: "Investigating [issue]" — even if you don't know the root cause yet
   - As you learn: update with "Identified [cause]" / "Implementing fix" / "Monitoring"
   - Resolution: "Resolved" with brief postmortem-style note
4. **Mitigation** — the immediate "stop the bleeding" actions, in order
5. **Resolution** — the path to actual fix
6. **Post-incident** — what to do once resolved (write postmortem, update runbook based on learnings, notify customers individually if their data was affected)

The 4-6 runbooks I should have ready (for an AI SaaS):

1. **AI Provider Down** (Anthropic / OpenAI / Replicate has an outage)
   - Detection: provider status page or our errors spiking with provider-specific error codes
   - Mitigation: switch to backup provider via [AI Gateway](../../../VibeReference/cloud-and-hosting/vercel-ai-gateway.md), or queue requests for retry, or show graceful degradation in UI ("Service slow, please retry shortly") (a failover sketch appears at the end of this section)
   - Communication: "We're seeing slowness from our AI provider. We're routing around it. ETA: per [provider]'s status page"

2. **Database Down or Slow**
   - Detection: connection failures or query latency >10× normal
   - Mitigation: connection pool reset, read replica failover, query-level circuit breaker
   - Communication: "Database connectivity issues affecting [components]"

3. **Deployment-Caused Outage**
   - Detection: errors spiked immediately after a deploy
   - Mitigation: rollback to the prior deploy first, then investigate the bug. NEVER try to fix forward in the middle of an incident.
   - Communication: "We deployed an update that introduced [issue]. Rolling back."

4. **Payment / Billing System Down**
   - Detection: Stripe webhook failures or signups/upgrades returning errors
   - Mitigation: pause new signups gracefully (better than cards being charged with no account created); check the Stripe status page
   - Communication: "Sign-ups and upgrades are temporarily unavailable. Existing customers unaffected."

5. **Security Incident** (per [Data Trust](data-trust-chat.md) — separate from regular ops)
   - Detection: unusual access patterns, customer reports of unauthorized access, security tool alerts
   - Mitigation: rotate keys, force-logout affected users, snapshot state for forensics
   - Communication: under-promise / over-deliver; specific facts only, no speculation; lawyer involvement if data was exfiltrated. Don't post to status page reflexively — security incidents need careful handling.

6. **Mass Quality Regression** (silent model degradation per [LLM Quality Monitoring](llm-quality-monitoring-chat.md))
   - Detection: production sample scores drop, customer complaints spike
   - Mitigation: revert to last-known-good prompt or model version
   - Communication: "We're investigating reports of degraded AI output quality"

For each runbook, output:
- The trigger
- 5-step response checklist
- The status page communication template
- The post-incident checklist
- Owner (founder for early stage; later: rotates among team)

Save these as a single runbooks.md in your repo. Update after every incident.

The "rollback first, investigate later" rule is the most-violated. Founders try to fix-forward during outages because they think they understand the bug. Rolling back first restores service immediately and reduces customer impact while you investigate calmly.


5. Communicate During Incidents

The communication during an incident matters as much as the actual fix. Done well, customers walk away with more trust than before; done badly, they churn.

Build my incident communication playbook.

Three communication surfaces, each with different cadence and audience:

1. **Status page** — the canonical record. Public, indexed, customers re-check it.
   - First post within 10 minutes of detection — even if you don't know the cause: "Investigating [component]" is fine
   - Update every 30-60 minutes minimum during active incidents (more often for high-severity)
   - Status updates: Investigating → Identified → Monitoring → Resolved (these are the canonical phase names; use them consistently)
   - After resolution: post a brief postmortem-style note in the same incident thread; don't open a new incident just to announce the closure

2. **Direct customer notification** (email or in-app banner) — for incidents lasting >30 minutes or affecting >50% of users
   - Email subject: "[Product] service update — [brief summary]"
   - Body: what's happening, what we're doing, ETA if known, where to track (link to status page)
   - Sender: from your real address, not noreply@
   - For shorter / less-impactful incidents, the status page alone is fine

3. **Twitter / public** — for major outages with broad impact
   - Single thread, updated alongside the status page
   - Honest, calm tone — never defensive, never over-promising
   - Link to the status page where customers can subscribe to updates

Communication principles that build trust during incidents:

- **Be specific about what's broken**: "Login is failing for ~30% of users in EU region" beats "We're experiencing issues"
- **Be specific about what's NOT broken**: "Payments processing normally; existing sessions unaffected" prevents customers from worrying about things that are fine
- **Acknowledge before you understand**: "We see what's happening, we're investigating" is fine; silence is not. Even if you have nothing yet, post that you're aware.
- **Avoid technical jargon publicly**: "DB connection pool exhausted" reads as obscure; "Database connectivity issue" is clear. Save the technical details for the postmortem.
- **Don't apologize early**: "We're investigating" is honest; "We deeply apologize for this critical failure" sounds dramatic and overcommits to a narrative you might revise. Apologize after resolution, in the postmortem.
- **End with what next**: "We'll publish a full postmortem within [X days]" sets expectations.

For my product, output:
- 5 incident communication templates (status page first post, mid-incident update, resolution announcement, email to customers, postmortem opening)
- The exact tone profile (calm, specific, no jargon)
- The escalation criteria (when to add direct email; when to post to Twitter; when to call individual high-value customers)

Output ready-to-use copy.

The "investigate publicly" approach is counter-intuitive but compounds trust. A status page that shows "Investigating → Identified the cause: bad deploy → Rolling back → Service restored" reads as a competent team handling a problem honestly. A page that goes silent for 2 hours then says "Resolved" reads as a team trying to hide the issue.


6. Run Postmortems That Actually Improve Things

After every meaningful incident, write a postmortem. Done well, postmortems are the strongest single source of operational improvement; done as performance theatre, they're a waste of time.

Build my postmortem template + cadence.

Trigger: any incident lasting >15 minutes OR affecting >5% of users OR that resulted in any customer-visible data issue.

Within 48 hours of resolution, write:

**Title**: [Date] - [Component] - [One-line summary]

**Sections**:

1. **Summary**: 2-3 sentences. What broke, who was affected, when, and what was the resolution.

2. **Timeline**: with timestamps:
   - Detection (when did we know?)
   - Initial communication (when did the status page first reflect it?)
   - Mitigation start (when did we begin acting?)
   - Resolution (when was service restored?)
   - Notification (when did we tell customers it was over?)

3. **Impact**: quantified — how many users, for how long, what specific functionality, any data implications. NEVER hide impact; honest impact reports build long-term trust.

4. **Root cause**: technical explanation. Not blame — root cause. "We deployed change X, which interacted with state Y in unforeseen way Z" is honest. "Engineer P forgot to add a null check" is unproductive blame.

5. **What went well**: detection, response, communication. There's almost always something to highlight.

6. **What went poorly**: also without blame. "Took 18 minutes to detect because monitoring didn't cover this case" is honest.

7. **Action items**: each one specific, owned, timeboxed.
   - "Add monitoring for [specific case]" — owner, due date
   - "Update runbook for [scenario]" — owner, due date
   - "Add automated rollback when [condition]" — owner, due date

8. **Customer-facing version**: simplified version posted publicly (status page or blog). Acknowledges the incident, the impact, the fix, and ideally an action item that prevents recurrence.

For each postmortem:
- Save in /docs/postmortems/ in the repo (or equivalent)
- Reference it from the runbooks (so the next incident benefits from this one's learnings)
- Quarterly: review all the postmortems together. Are there patterns? Are action items being completed?

The "blameless" principle isn't soft — it's structural. Postmortems that name individual mistakes get hidden, distorted, or never written. Postmortems that name systemic causes get fully written and lead to systemic improvements.

Output: the postmortem template + the customer-facing summary template.

The most underrated outcome of postmortems: they double as customer-trust artifacts. A blog post or detailed status update sharing a postmortem reads as "this team learns from problems and tells customers honestly." Companies that publicize postmortems thoughtfully (Cloudflare, Stripe, GitLab) earn meaningful enterprise trust from the practice.


7. Run Quarterly Drills

Incident runbooks decay. Monitoring breaks silently. Status page integrations stop working. The only way to know your incident-response system actually works is to test it.

Set up quarterly incident-response drills.

Once per quarter (90 minutes total):

1. **Simulated incident**: pick one of my runbooks (rotate through them). Pretend the trigger fires. Walk through the runbook in real time:
   - Does the alert actually arrive on my phone / Slack?
   - Does the monitoring data needed to assess severity show up correctly?
   - Can I actually roll back / fail over / mitigate as the runbook describes?
   - Does the status page update API work?

2. **Track**: for each step, what worked, what was broken, what's outdated.

3. **Fix**: every broken step gets a follow-up ticket with an owner.

4. **Repeat next quarter**: rotate through the runbooks so each gets tested annually.

Common things drills surface:
- Alert routing has changed (someone left the team, channel renamed, integration token expired)
- Monitoring threshold drifted (false positives mean alerts get ignored)
- Status page API token expired
- Runbook references tools / endpoints that no longer exist
- The "rollback" path doesn't actually work because deploy infrastructure changed

Game-day exercise variant (annual, ~2 hours):
- Schedule with the team in advance
- Have someone (rotate) "fail" a real production component in a deliberate way (e.g., revoke an API key for a specific time period)
- Team responds in real time, no advance notice of WHICH component
- Observer takes notes; debrief afterward

For solo founders: just run the simulated drill yourself. No game-day variant is needed at small scale, but the quarterly drill is non-negotiable.

Output: the recurring calendar block + a drill-template doc with checklists per runbook.

The drill is the rule that prevents incident-response from being theatre. Without it, monitoring rots invisibly. With it, problems surface in calm conditions when you can fix them properly.


8. Make Status Page History a Trust Asset

The status page archive is a public record. Curated well, it becomes a trust signal during sales conversations.

Use my status page archive as a trust artifact.

Public-facing patterns:

1. **Incident archive**: every resolved incident stays public. Don't delete past incidents; they prove the team handles things honestly.

2. **Uptime metrics on the page**: 30-day, 90-day, 365-day uptime per component. Calculated correctly:
   - Major outage hours count fully against uptime
   - Degraded hours count partially (e.g., 50%) — honest math (see the sketch after this list)
   - DON'T cherry-pick metrics — buyers compare them to competitors

3. **Annual postmortem summary**: once a year, a blog post that summarizes all incidents from the past year. What categories of issues, what we improved, what we're still working on. Rare and powerful trust signal.

4. **Reference during sales**: when a prospect asks "do you have SLAs?" or "tell me about your reliability," point to the status page archive. "Here's our 90-day uptime; here are our last 5 incidents and what we did about them." More credible than any verbal claim.
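
The "honest math" in point 2 is worth pinning down in code so the published numbers are reproducible. A sketch that weights degraded time at 50% as suggested above; the exact weights are a policy choice, and the important part is applying them consistently and saying what they are:

```typescript
// uptime.ts — uptime over a window, counting degraded time at partial weight.
interface DowntimeWindow {
  level: "degraded" | "partial_outage" | "major_outage";
  minutes: number;
}

// Policy from above: full outages count fully, degraded counts partially.
const DOWNTIME_WEIGHT: Record<DowntimeWindow["level"], number> = {
  degraded: 0.5,
  partial_outage: 0.75, // assumption — pick a weight and publish it
  major_outage: 1.0,
};

export function uptimePercent(windows: DowntimeWindow[], periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60;
  const weightedDown = windows.reduce(
    (sum, w) => sum + w.minutes * DOWNTIME_WEIGHT[w.level],
    0,
  );
  return 100 * (1 - weightedDown / totalMinutes);
}

// Example: over 30 days, a 45-minute major outage plus 3 hours degraded
// -> 45 + 180 * 0.5 = 135 weighted minutes down -> ~99.69% uptime.
// uptimePercent([{ level: "major_outage", minutes: 45 }, { level: "degraded", minutes: 180 }], 30)
```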

For B2B-heavy products, also publish:

- **Maintenance windows**: regular scheduled work, communicated in advance
- **SLA commitments**: if you commit to 99.9% uptime, the status page is the canonical proof you're hitting it
- **Incident-response time commitments**: "We acknowledge issues within 10 minutes" — and the page proves it

Anti-patterns to avoid:

- **Burying or deleting old incidents**. If a customer is digging through your history, they'll respect honesty far more than a clean (suspicious) record.
- **Vague status during incidents**. "Performance issues" with no detail looks like hiding. "Database connectivity issues affecting login" looks like operating in the open.
- **Not subscribing customers to updates**. The status page tool should let users subscribe by email or webhook. Promote this — customers who subscribe stay calmer during outages because they're informed.

Output: my approach to public archiving + the annual postmortem-summary template + the sales-conversation script for using uptime as a trust artifact.

The asymmetry on status pages is real: a clean, fake-looking page hurts. A transparent, honest page with normal incidents and crisp resolutions helps. Lean into the second.


Common Failure Modes

"Status page lives on the same infrastructure as the product." Catastrophic during full outages — both go down together, customers can't communicate. Move it to a third-party tool or a separate cloud.

"We never update the status page during incidents." Customers find out from each other; the status page becomes meaningless. Commit to updates within 10 minutes of detection, every 30-60 minutes during active incidents.

"All incidents look the same on our page." Lacking severity differentiation makes the page useless. Implement Operational / Degraded / Partial / Major and use them honestly.

"Our runbooks are outdated." Without quarterly drills, monitoring drifts and runbooks become fiction. Drill quarterly; update after every drill or real incident.

"We tried to fix-forward during an outage." Almost always extends the outage. The rule: rollback first, investigate after. Even if you're confident in the bug, rollback is the safe play.

"We didn't write postmortems for the small incidents." Small incidents are where most operational learning lives. Anything >15 minutes or affecting >5% of users gets one.

"Apologized too aggressively early in the incident." Reads as panic, overcommits to a narrative. Acknowledge calmly during; apologize formally only after resolution, in the postmortem.


Related Reading

  • Customer Support — incident comms route through your support inbox; pre-built saved replies for "is the service down?" save time
  • Data Trust — security incidents are a special case of incident response with their own playbook
  • LLM Quality Monitoring — silent quality regressions are incidents too; treat them as such
  • Reduce Churn — well-handled incidents reduce trust loss; poorly-handled ones accelerate churn
  • PostHog Setup — alerting and monitoring on key product events feed into incident detection
  • Vercel Functions — cron jobs and synthetic checks run here

⬅️ Growth Overview