Backups and Disaster Recovery: Survive the Day Things Go Wrong

Goal: Build a backup and disaster-recovery system that actually works when you need it — not the kind that exists on paper, has never been restored, and fails the first time it matters. Get the RPO (recovery point objective) and RTO (recovery time objective) numbers in writing, test the restore process quarterly, and document it for security reviews.

Process: Follow this chat pattern with your AI coding tool, such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: Backup architecture decision in 1 day. Provider-managed backups verified in week 1. First restore drill in week 2. Quarterly drill cadence baked into the calendar from launch onward.


Why Most Founder Backup Plans Fail When Tested

Three failure modes hit founders the same way:

  • The "my database provider does backups" assumption. A founder reads "Supabase / Neon / RDS includes daily backups" and concludes the backup problem is solved. Six months later, an engineer accidentally drops a production table. The founder tries to restore from the latest backup and discovers it reflects the drop too (it ran 10 minutes after the table was deleted). Discovers there's no point-in-time recovery configured. Discovers the customer blob storage has no backups at all. Spends 3 days writing apologies and reconstructing data from logs.
  • Backups exist but have never been restored. Backups run nightly; the team feels secure. The day a real restore is needed, it turns out the restore takes 6 hours, requires an IAM permission nobody on the team has, and corrupts foreign keys halfway through because nobody documented the table-creation order. The backup existed; the recovery didn't.
  • Single-provider, single-region, single-account backups. Everything lives in one AWS region, backed up to the same region, accessible only via the same root account. The day Amazon has a regional outage (yes, it happens), or a credential leak lets attackers delete the backups, the founder has nothing.

The version that works is structured: define explicit RPO/RTO targets, layer backups across application data + customer files + auth + secrets, test restores quarterly with documented procedures, and store at least one backup copy outside the primary blast radius.

This guide assumes you have already done Data Trust (backups are a trust artifact), have shipped Audit Logs (logs help with incident reconstruction), and have Multi-Tenant Data Isolation (per-tenant restore is sometimes needed).


1. Define RPO and RTO Explicitly

Before any technical decisions, decide what you're optimizing for. Most founders never write these down.

You're helping me define recovery objectives for [your product] at [your-domain.com]. The product is [one-sentence description] with [N paying customers / pre-launch].

Two key numbers:

**RPO (Recovery Point Objective)**: how much data are we willing to lose in a worst-case scenario?
- 0 (zero data loss): synchronous replication, very expensive, almost never required for SaaS
- 5 minutes: streaming replication, cheap with managed Postgres
- 1 hour: hourly snapshots, cheaper
- 24 hours: nightly backups only, cheapest
- 7 days: weekly only, almost never acceptable for paying customers

For most indie B2B SaaS in 2026: 5-minute RPO via point-in-time recovery is the default. Customers expect ≤1 hour of data loss in the worst case; 5 minutes is the safe operating margin.

**RTO (Recovery Time Objective)**: how fast must service be restored?
- 1 hour: dedicated DR runbook, automated failover
- 4 hours: documented restore process, manual execution
- 24 hours: managed-provider snapshot restore + reconfiguration
- 7 days: very rare; only acceptable for non-critical data

For most indie B2B SaaS in 2026: 4-hour RTO for the application database is the default. Below 4 hours requires automated DR; above 4 hours starts to break SLA expectations.

Output:
1. The RPO and RTO targets for my product, with rationale
2. The differential RPO/RTO if applicable (e.g., "core data: 5 min RPO; user-uploaded files: 24h RPO; analytics: 7-day RPO is fine")
3. The customer-facing commitment (what I'll publish on the trust page; never overpromise)
4. The internal commitment (what I actually engineer for; usually tighter than what I publish to leave margin)

Sanity check: if my customers are individuals using the product casually, the RPO/RTO can be looser. If my customers run business-critical workflows on my product, the targets must be tighter. Calibrate to actual customer impact, not engineering aspiration.

Three principles I've watched founders re-learn:

  • Pick differential RPO per data category. The application database needs 5-minute RPO; user-uploaded blobs are usually fine at 24h; analytics events can lose a day without anyone noticing. Engineering to a uniform 5-minute RPO across everything is expensive and unnecessary (a config sketch follows this list).
  • Engineer for tighter targets than you publish. Customer-facing trust page says "≤4 hour RTO"; internal target is 1 hour. The gap absorbs the day a restore takes longer than expected.
  • Numbers without testing are theater. A claimed 4-hour RTO that's never been tested is fiction. The first untested restore takes 12+ hours every time.
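
To make the differential targets concrete, here is a minimal sketch of what writing the numbers down might look like: a hypothetical recovery-objectives.ts that the trust page, runbooks, and drill scripts could all import, so the published and internal commitments can't silently drift apart. Every category, number, and rationale below is illustrative; substitute your own.

```typescript
// recovery-objectives.ts — hypothetical single source of truth for RPO/RTO.
// All categories and numbers are examples; calibrate to customer impact.

type RecoveryTarget = {
  rpoMinutes: number; // maximum acceptable data loss
  rtoMinutes: number; // maximum acceptable downtime
  rationale: string;
};

// Internal targets: what we actually engineer and drill for.
export const internalTargets: Record<string, RecoveryTarget> = {
  database: { rpoMinutes: 5, rtoMinutes: 60, rationale: "PITR + tested restore runbook" },
  files: { rpoMinutes: 1440, rtoMinutes: 240, rationale: "object versioning; per-customer impact only" },
  analytics: { rpoMinutes: 10080, rtoMinutes: 1440, rationale: "loss is an analysis gap, not an outage" },
};

// Published targets: looser than internal, so a drill that runs long
// still lands inside the customer-facing commitment.
export const publishedTargets: Record<string, RecoveryTarget> = {
  database: { rpoMinutes: 60, rtoMinutes: 240, rationale: "trust-page commitment with margin" },
  files: { rpoMinutes: 1440, rtoMinutes: 1440, rationale: "trust-page commitment with margin" },
  analytics: { rpoMinutes: 10080, rtoMinutes: 2880, rationale: "best effort" },
};
```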

2. Layer the Backup Strategy Across Data Categories

Different data has different backup needs. Don't apply one blanket policy.

Help me design the layered backup strategy. For each data category, determine the backup approach.

**Layer 1: Primary application database (Postgres / your database)**
- Source of truth for accounts, users, business records
- Loss = catastrophic
- Backup approach:
  - Provider-managed automated backups (daily, ideally with 30-day retention)
  - Point-in-time recovery (PITR) enabled — gives 5-minute RPO
  - Daily logical backup (pg_dump or equivalent) to separate storage region
  - Weekly cold copy to second cloud provider or off-cloud storage (defense against credential compromise)
- Provider examples:
  - Supabase: PITR available on Pro+; configure retention; export weekly to Vercel Blob / S3 / B2
  - Neon: branching = restore-to-point-in-time; export weekly to separate location
  - Amazon RDS: automated backups + cross-region snapshot + weekly logical export
  - PlanetScale: backups + Vitess-level PITR + weekly export

**Layer 2: User-uploaded files (object storage)**
- Customer-uploaded media, documents, AI-generated outputs
- Loss = customer-facing pain (file deletion is a per-customer incident, not a company-ending one)
- Backup approach:
  - Object versioning enabled (S3 / R2 / Vercel Blob support this)
  - Lifecycle rules: keep 30 days of versions; archive older to cold storage
  - Cross-region replication for production data
  - Audit-logged deletes per [Audit Logs](audit-logs-chat.md)
- For Vercel Blob, R2, S3: turn on object versioning + replication

**Layer 3: Authentication / identity layer**
- User credentials, OAuth tokens, session data
- Loss = customers locked out
- Backup approach:
  - If using a managed auth provider (Clerk, Supabase Auth, Better Auth): they handle backups
  - Verify with provider what their RPO/RTO is
  - Export user lists (email + ID) periodically for emergency reconstitution

**Layer 4: Secrets and configuration**
- API keys, environment variables, third-party tokens
- Loss = service interruption + reissue requirements
- Backup approach:
  - Store in a secrets manager (Doppler, 1Password, AWS Secrets Manager)
  - Export an encrypted backup periodically to separate storage
  - Document the rotate-on-compromise procedure

**Layer 5: Analytics / observability data**
- Event logs, metrics, traces
- Loss = analysis gaps but not service interruption
- Backup approach:
  - Provider retention is usually sufficient (PostHog: configurable; Axiom: per-plan)
  - Don't over-engineer; this layer can tolerate days of loss
  - Skip cross-region for this category unless contract requires

**Layer 6: Code and infrastructure-as-code**
- Source code, deployment configs, infrastructure definitions
- Loss = team productivity (rebuild from scratch)
- Backup approach:
  - Git is the backup; ensure GitHub / GitLab account has multiple owners
  - Periodic local clones to separate machine
  - Optionally: mirror to a secondary git host (e.g., Codeberg, self-hosted Gitea)

Output:
1. The full layered table with provider, RPO/RTO target, backup mechanism, retention period
2. The cross-cloud / cross-region copies (the "outside-blast-radius" copies)
3. The backup verification plan (covered in step 4)
4. The cost estimate for the full backup architecture

The most-skipped layer: secrets and configuration. Founders back up the database religiously and forget that all their API keys live in one place — and the day that place is compromised, there is no recovery path that doesn't involve rotating 30 third-party tokens.
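
To make Layer 1's daily logical backup concrete, here is a minimal sketch of a nightly export job, assuming Node 18+, the pg_dump CLI on the path, and the AWS SDK v3 S3 client (which also talks to S3-compatible stores such as Backblaze B2 via its endpoint option). The bucket name, endpoint, and credential variable names are placeholders.

```typescript
// nightly-pg-export.ts — sketch of a daily logical backup shipped off-site.
// Assumes: pg_dump installed, DATABASE_URL set, and narrowly-scoped
// write-only credentials for the backup bucket (see step 5).
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const stamp = new Date().toISOString().slice(0, 10); // e.g. 2026-01-31
const file = `/tmp/db-${stamp}.dump`;

// 1. Custom-format dump: compressed, restorable table-by-table via pg_restore.
execSync(`pg_dump "${process.env.DATABASE_URL}" --format=custom --file=${file}`);

// 2. Ship it to a bucket in a DIFFERENT region/account than the database.
const s3 = new S3Client({
  region: "us-west-2",
  endpoint: process.env.BACKUP_ENDPOINT, // e.g. a B2 S3-compatible endpoint
  credentials: {
    accessKeyId: process.env.BACKUP_WRITER_KEY!, // write-only identity
    secretAccessKey: process.env.BACKUP_WRITER_SECRET!,
  },
});

await s3.send(new PutObjectCommand({
  Bucket: "example-offsite-backups", // placeholder bucket
  Key: `postgres/db-${stamp}.dump`,
  Body: readFileSync(file),
}));

console.log(`backup ${stamp} uploaded`);
```

Schedule it with cron, GitHub Actions, or your platform's scheduled jobs, and point the credentials at the narrowly-scoped backup-writer identity described in step 5.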


3. Configure Point-in-Time Recovery (PITR) for the Primary Database

PITR is the single most consequential backup feature. Most providers support it; turn it on.

Help me configure PITR for [your database — Supabase / Neon / RDS / your provider].

PITR allows restoring the database to any specific moment within a retention window — useful for recovering from accidental drops, bad migrations, or data corruption noticed within that window.

Configuration by provider:

**Supabase**:
- Pro plan or above
- Settings → Database → Point-in-Time Recovery → enable
- Retention: 7 days default; up to 35 days on Team plan
- Restore: contact support (not self-serve in 2026); ~2-4 hour restore time

**Neon**:
- Branching feature acts as PITR — create a branch from a specific timestamp
- Retention is tied to plan tier (7-30 days typical)
- Restore: instant via branch creation; can promote branch to primary

**Amazon RDS Postgres**:
- Automated backups enabled by default; PITR retention 1-35 days
- Restore: AWS Console → restore-from-time → new instance created (don't restore over existing)
- ~30 minutes to several hours depending on database size

**PlanetScale**:
- Backups + Vitess-level recovery
- Branching for staging-from-production
- Retention varies by plan

**Self-hosted Postgres**:
- pg_basebackup + WAL archiving
- Point-in-time restore via restore_command + recovery.signal (recovery.conf was removed in Postgres 12)
- Operationally complex; only worth it if the managed-database cost is genuinely prohibitive

Verification:
1. PITR is configured (verified in console)
2. Retention period is acceptable for my RTO/RPO
3. The provider's documented restore time is within my RTO
4. The team knows the restore procedure (documented runbook)

Output:
1. The PITR configuration steps for my provider
2. The verification checklist
3. The runbook for "we need to restore now" — copy-pasteable for the on-call engineer

The single most useful configuration: set retention longer than you think you need. Many production-affecting issues aren't noticed for 3-7 days (a slow data corruption, a subtle bug in a migration). 30-day PITR retention costs marginally more than 7-day and saves you the day you need it.
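
If you are on RDS, the "we need to restore now" runbook can be executable rather than aspirational. Here is a minimal sketch using the AWS SDK v3 RDS client; the instance identifiers are placeholders, and note that a PITR restore always creates a new instance rather than overwriting the existing one.

```typescript
// pitr-restore.ts — sketch: restore an RDS Postgres instance to a timestamp.
// Creates a NEW instance; never restores over the existing one.
import {
  RDSClient,
  RestoreDBInstanceToPointInTimeCommand,
} from "@aws-sdk/client-rds";

const rds = new RDSClient({ region: "us-east-1" });

// Restore to just before the bad migration / accidental drop.
await rds.send(new RestoreDBInstanceToPointInTimeCommand({
  SourceDBInstanceIdentifier: "prod-db",                 // placeholder
  TargetDBInstanceIdentifier: `prod-db-restore-${Date.now()}`,
  RestoreTime: new Date("2026-01-31T14:55:00Z"),         // moment to rewind to
  // Alternatively: UseLatestRestorableTime: true
}));

console.log("restore started; poll DescribeDBInstances until 'available'");
```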


4. Test Restores Quarterly — From Backups, Not Aspirations

A backup that hasn't been restored is a hypothesis. Test it.

Design the quarterly restore drill.

The drill, every 90 days:

**Step 1: Pick the scenario**
Rotate scenarios so you cover them all over a year:
- Q1: Restore primary database to PITR point 24 hours ago
- Q2: Restore individual table from logical backup (single-table corruption scenario)
- Q3: Restore object-storage file from older version (accidental delete scenario)
- Q4: Reconstitute the auth provider from the exported user list (catastrophic auth-provider failure scenario)

**Step 2: Run the drill on a non-production environment**
- Spin up a staging database
- Apply the restore procedure as if it were prod
- Time the entire process from "we need to restore" → "service is back up"
- Document what worked, what was confusing, what took longer than expected

**Step 3: Validate the restored data**
- Read a sample of records from the restored DB
- Verify referential integrity (foreign keys still resolve)
- Verify no PII leaked across tenant boundaries (per [Multi-Tenant Data Isolation](multi-tenancy-chat.md))
- Verify audit logs are intact

**Step 4: Update the runbook**
- Whatever was confusing or slow during the drill gets documented
- The next drill should be faster because the runbook is better

**Step 5: Calculate actual RTO**
- Compare drill RTO to claimed RTO
- If the claim drifted (it usually does), update the trust page and internal commitments

Output:
1. The 4-quarter scenario rotation
2. The drill checklist for each scenario
3. The validation queries to confirm restored data integrity
4. The runbook update template
5. The schedule: calendar block 4 hours, once per quarter, owner = founder or engineering lead

Critical: never run a drill that touches production. The drill is on staging; the production restore happens only when production is actually broken.

Three principles:

  • Untested backups are claims, not capabilities. The first time you test, the drill reveals 3-5 things that don't work. The second time, those are fixed. By the third drill, restoration is routine.
  • Time the drill end-to-end. "Restore from backup" is not the metric. "Service is back up and validated" is the metric. The first usually takes 30 minutes; the second takes 4 hours the first time.
  • Document the surprises. The first drill reveals what the team didn't know. The runbook captures it for the day a real incident happens at 2am.
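
For the validation queries in Step 3, here is a minimal sketch run against the restored staging database, assuming the pg client and a hypothetical schema (customers and invoices tables, each carrying a tenant_id column); adapt the checks to your own tables.

```typescript
// validate-restore.ts — sketch of post-restore integrity checks (staging only).
// Table and column names are hypothetical; adapt to your schema.
import { Client } from "pg";

const db = new Client({ connectionString: process.env.RESTORED_DATABASE_URL });
await db.connect();

// 1. Row counts in the right ballpark (compare against production metrics).
const { rows: [counts] } = await db.query(
  "SELECT (SELECT count(*) FROM customers) AS customers, " +
  "       (SELECT count(*) FROM invoices)  AS invoices"
);
console.log("row counts:", counts);

// 2. Referential integrity: invoices whose customer no longer exists.
const { rows: orphans } = await db.query(
  `SELECT i.id FROM invoices i
   LEFT JOIN customers c ON c.id = i.customer_id
   WHERE c.id IS NULL LIMIT 10`
);
if (orphans.length > 0) throw new Error(`orphaned invoices found: ${orphans.length}`);

// 3. Tenant isolation: no invoice may point at a customer in another tenant.
const { rows: leaks } = await db.query(
  `SELECT i.id FROM invoices i
   JOIN customers c ON c.id = i.customer_id
   WHERE c.tenant_id <> i.tenant_id LIMIT 10`
);
if (leaks.length > 0) throw new Error(`cross-tenant rows found: ${leaks.length}`);

await db.end();
console.log("restore validated");
```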

5. Backup the Backups: Outside-Blast-Radius Copies

Single-cloud, single-region, single-account backups have one critical flaw: a credential compromise or provider outage takes them all out.

Design the outside-blast-radius backup strategy.

The principle: at least one backup copy must exist outside the primary cloud, account, and region, so that no single credential leak or provider outage can erase every copy.

Three approaches:

**Approach A: Cross-region within the same cloud**
- Daily snapshot replicated from us-east-1 → us-west-2 (or eu-central-1)
- Cheaper than cross-cloud
- Defense against regional outage
- NOT defense against credential compromise (same AWS account)

**Approach B: Cross-cloud replication**
- Primary backup in AWS S3
- Weekly mirror to GCP Cloud Storage or Backblaze B2
- More expensive
- Defense against single-cloud outage AND single-cloud credential compromise

**Approach C: Off-cloud cold storage**
- Encrypted weekly export uploaded to off-cloud cold storage (Backblaze B2 with credentials kept offline, or even physical media for the truly paranoid)
- Cheapest defense against full credential compromise
- Slower restore from cold storage

For most indie SaaS in 2026:
- Pre-revenue → first 100 customers: Approach A is sufficient
- $1K-$50K MRR: Approach A + Approach C (B2 weekly cold copy with separate credentials)
- $50K+ MRR: Approach B + Approach C, automated and audited

Output:
1. The chosen approach for my stage
2. The implementation: which jobs run, on what schedule, to what destination
3. The credential isolation: the backup-write credential should NOT have read access to anything else
4. The encryption: backups encrypted at rest with a key not stored in the same account
5. The cost estimate (cross-cloud egress is the main expense)

The most overlooked detail: credential isolation for backup writers. If your backup writer has the same IAM permissions as your application, a compromised app token can read AND delete your backups. The backup writer should be a separate, narrowly-scoped identity that can WRITE to the backup destination but cannot DELETE or READ existing backups.
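
To make the credential isolation concrete, here is a hedged sketch of an S3-style policy document for the backup-writer identity, expressed as a TypeScript constant; the bucket name is a placeholder. It allows writing new objects and denies reads and deletes. Pair it with bucket versioning so that even an overwrite of an existing key preserves the prior version.

```typescript
// backup-writer-policy.ts — sketch of a write-only policy document for the
// backup identity. Enable bucket versioning so overwrites keep old versions.
export const backupWriterPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Sid: "WriteBackupsOnly",
      Effect: "Allow",
      Action: ["s3:PutObject"],                           // write, nothing else
      Resource: "arn:aws:s3:::example-offsite-backups/*", // placeholder bucket
    },
    {
      Sid: "NeverReadOrDelete",
      Effect: "Deny",
      Action: ["s3:GetObject", "s3:DeleteObject", "s3:DeleteObjectVersion"],
      Resource: "arn:aws:s3:::example-offsite-backups/*",
    },
  ],
};
```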


6. Plan for Common Disaster Scenarios

Not all disasters are equal. Plan for the realistic ones; don't waste time on the implausible ones.

For each likely disaster scenario, document the response.

**Scenario 1: Accidental data deletion (one customer, one table)**
- Frequency: 1-3x per year for active SaaS
- Detection: customer reports, anomaly alert, support ticket
- Response: PITR restore to a staging instance; extract the deleted records; replay into production
- RTO target: 4 hours
- Communication: customer-specific apology email; usually no broader disclosure needed

**Scenario 2: Bad migration corrupts data**
- Frequency: 1-2x per year for a fast-shipping team
- Detection: tests fail in production; error rate spikes; customer reports
- Response: roll back the migration code immediately; PITR restore the affected tables; replay any user-driven writes that happened post-migration
- RTO target: 1-2 hours
- Communication: status page incident per [Status Page](status-page-chat.md); post-mortem within 3 days

**Scenario 3: Database service outage (provider down)**
- Frequency: ~1x per year for managed providers (rare but happens)
- Detection: monitoring alerts (per [Observability Providers](https://www.vibereference.com/devops-and-tools/observability-providers))
- Response: provider-driven recovery; you wait + communicate; in extreme cases, restore to alternate region
- RTO target: depends on provider; usually 1-4 hours
- Communication: status page; transparent updates per [Status Page](status-page-chat.md)

**Scenario 4: Credential / account compromise**
- Frequency: hopefully zero, but plan for it
- Detection: anomalous access logs, alerts from auth/audit
- Response: rotate ALL credentials; lock affected accounts; restore from backup if data was tampered; legal/regulatory notification per [Data Trust](data-trust-chat.md)
- RTO target: depends; can be days for credential rotation across all third parties
- Communication: customer notification within regulatory timeframes (72h for GDPR)

**Scenario 5: Regional cloud outage**
- Frequency: ~1-2x per year for major providers
- Detection: provider status page; your monitoring fires
- Response: usually no action — wait for recovery; document the impact for customers
- RTO target: provider-driven (often hours)
- Communication: status page; blame the provider transparently if relevant

**Scenario 6: Ransomware / data destruction attack**
- Frequency: rare but increasing; it disproportionately hits unprepared SaaS teams
- Detection: encrypted/wiped data alerts
- Response: restore from outside-blast-radius backup; do NOT pay ransom (legal exposure, and it sets a precedent); rebuild from clean backups
- RTO target: 24-48 hours for full restoration
- Communication: customer notification; law enforcement; regulatory disclosure

**Scenario 7: Human error (rm -rf / DROP DATABASE)**
- Frequency: 1-2x per year for any team
- Detection: immediate
- Response: PITR restore; cooler heads investigate "why was that command possible to run"
- RTO target: 1-4 hours
- Communication: status page if customer-affecting; post-mortem

For each scenario, output:
1. The detection mechanism (alerting + manual)
2. The named owner (who runs the response)
3. The runbook step-by-step
4. The communication template
5. The post-mortem trigger criteria

The two scenarios most teams skip planning for: credential compromise and human error. The first has serious legal/regulatory dimensions; the second is the most likely incident. Plan both.
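
The extract-and-replay step in Scenario 1 is worth scripting before you need it. A minimal sketch, assuming Postgres 12+ (for pg_dump's --on-conflict-do-nothing flag), the pg_dump and psql CLIs, and a hypothetical projects table; both connection strings are placeholders.

```typescript
// replay-deleted-rows.ts — sketch: copy one table's rows from a PITR clone
// back into production, skipping rows that already exist there.
import { execSync } from "node:child_process";

const RESTORED = process.env.RESTORED_DATABASE_URL!; // PITR clone (read from)
const PROD = process.env.DATABASE_URL!;              // live production (write to)

// 1. Dump only the affected table as INSERTs that skip existing rows.
//    --on-conflict-do-nothing requires Postgres 12+ and --inserts.
execSync(
  `pg_dump "${RESTORED}" --data-only --table=projects ` +
  `--inserts --on-conflict-do-nothing --file=/tmp/projects-recovered.sql`
);

// 2. Review /tmp/projects-recovered.sql by hand before replaying. Then:
execSync(`psql "${PROD}" --single-transaction --file=/tmp/projects-recovered.sql`);
```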


7. Document for Buyers and Auditors

Enterprise buyers and SOC 2 auditors expect documentation. Provide it.

Generate the backup and disaster-recovery documentation. It lives at /trust/backups (linked from [Data Trust](data-trust-chat.md)).

Sections:

**1. Recovery objectives**
- Published RPO and RTO per data category
- Differential commitments (database tighter than blob storage, blob storage tighter than analytics)

**2. Backup approach**
- Layered strategy: primary database with PITR + daily logical exports + weekly cross-region/cross-cloud copies
- Object storage: versioning + lifecycle + cross-region replication
- Auth, secrets, code: separate strategies summarized

**3. Retention**
- PITR window: [N] days
- Logical backups: [N] days local + [N] days cross-region + [N] days cold storage
- Audit logs (per [Audit Logs](audit-logs-chat.md)): retention per tier

**4. Testing**
- Quarterly restore drills
- Documented in /docs/dr-drill-Q[N]-[year].md
- Latest drill date and outcome publicly visible (or available on request)

**5. Encryption**
- Backups encrypted at rest
- Encryption keys stored separately from the backups themselves
- KMS / equivalent for key management

**6. Access controls**
- Backup-writer credentials separate from application credentials
- Backup deletion requires multi-person authorization
- All backup access audit-logged

**7. Compliance mappings**
- SOC 2 CC9.x (risk mitigation, business continuity)
- ISO 27001 A.17 (business continuity)
- Specific obligations for HIPAA / FedRAMP if applicable

**8. Incident scenarios with response timelines**
- The 7 scenarios from step 6, each with response template

**9. Customer-initiated recovery**
- Can customers self-serve restore (for their own account)?
- If yes, the API/UI for it
- If no, the SLA for support-driven restore

**10. Limitations and caveats**
- What we don't yet do (be honest)
- Roadmap for upgrades

Output the documentation in the same voice as the rest of /trust.

The single highest-leverage section: published RPO/RTO commitments. Enterprise buyers' security review forms ask for these directly. If your trust page says "RPO: 5 minutes; RTO: 4 hours" with backing detail, the auditor checks the box and moves on.


What Done Looks Like

By the end of week 2 of building a real backup system:

  1. RPO and RTO documented per data category
  2. PITR enabled on the primary database with appropriate retention
  3. Object versioning + replication enabled on file storage
  4. Outside-blast-radius copy running weekly to a separate cloud/region
  5. First restore drill completed with timing measured
  6. Credential isolation for backup-writer identity
  7. Documentation on the trust page

Within 90 days:

  • 1 quarterly drill completed; runbook updated based on findings
  • Cross-cloud / cross-region backup verified working
  • Incident-response runbooks drafted for each of the 7 scenarios
  • Compliance documentation drafted for SOC 2 / ISO 27001 mappings

Within 12 months:

  • 4 quarterly drills, each faster than the last
  • Customer-facing trust commitments matched by operational reality
  • Zero "backup didn't work when needed" incidents
  • Enterprise security review passes citing the backup architecture

Common Pitfalls

  • Trusting "provider does backups" without verification. Verify retention, restore time, and what's actually backed up. Read the provider's docs, not their marketing.
  • Skipping the restore drill. Untested backups are theoretical. The first drill reveals 3-5 surprise problems.
  • Single-region single-cloud backups. A regional outage or credential compromise takes everything.
  • No outside-blast-radius copy. Defense against credential compromise requires backups the attacker can't reach.
  • Backup writer with the same IAM as the app. Separate identity with narrowly-scoped permissions.
  • Forgetting object storage. "Backups" usually means database. Files matter too.
  • Forgetting secrets / configuration. API keys in one place that gets compromised = service interruption.
  • Confusing RPO and RTO publicly. Customers don't know the terminology; explain in plain language on the trust page.
  • Aspirational RTO claims. Don't publish "1 hour RTO" when you've never tested below 8 hours. Match claims to operational reality.
  • No human-error recovery plan. The most common incident type. Plan for it.


What's Next

Backups are like insurance: you pay every month, hate paying every month, and feel grateful exactly once when something goes wrong. The team that builds the backup architecture deliberately in week 1 of launch ships through every disaster scenario without losing a customer; the team that defers it loses customers (and sometimes companies) the day they hit their first major incident.

Build the discipline now. The architecture decision is small; the quarterly drill is 4 hours; the documentation is a half-day. The compounding payoff: no 2am crisis without a runbook, no enterprise procurement blocked on "show us your DR plan," no nightmare emails to customers explaining why their data is gone.

