Offline-First & Sync Engine: Build Apps That Work Without Internet (and Across Devices)
Offline-First Strategy for Your New SaaS
Goal: Ship a product that keeps working when the network drops, syncs cleanly when it comes back, and stays consistent across a user's laptop, phone, and tablet — without writing a custom CRDT engine from scratch and without the dreaded "your changes were lost" toast. Pick the right sync architecture for your use case (server-authoritative vs. local-first vs. CRDT-merge), choose a sync engine that matches your stack, design conflict-resolution rules explicitly, and treat offline mode as a first-class feature with its own tests, error states, and user-visible signals — not a "nice to have" you'll add later. Avoid the founder traps of building CRUD-over-REST and bolting offline on top (always painful), shipping a "save draft" toggle and calling it offline (real users hit airplane mode, captive portals, and tunnels), or adopting a sync engine without understanding its data-model constraints (then needing a rewrite when you outgrow it).
Process: Follow this chat pattern with your AI coding tool, such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.
Timeframe: Decision + architecture sketch in week 1. Single-device offline cache + queue in week 2. Multi-device sync via a sync engine in weeks 3-4. Conflict UI + telemetry + offline-mode QA harness in week 5.
Why "Add Offline Later" Almost Always Fails
The same pattern shows up across founders:
- Built REST + React first, tried to bolt on offline. Every form is a network call; every list is `fetch()` on mount. Adding offline means rewriting every flow to read from a local store, queue mutations, and reconcile. You're effectively rebuilding the data layer. Six months later you've shipped a fragile retry mechanism and called it "offline support."
- Used `localStorage` as the offline cache, no sync engine. Works for single-device, single-user. Falls apart the moment a user switches devices (data missing on the new device), gets logged out (data wiped), or another user edits the same record (last-write-wins silently destroys data).
- Used Service Worker caching + manual queue. Reads work offline (cached responses). Writes queue locally. But: no real conflict resolution, no multi-device sync, no real-time updates when online users edit. It's offline-shaped, not offline-correct.
- Adopted a real sync engine for the wrong shape of data. Picked Y.js (great for collaborative documents) for an inventory app (which wants server-authoritative pricing). Now the model fights the engine; reconciliation is wrong; bug reports stack up.
The version that works: decide your sync model FIRST (before any code), pick a sync engine matched to that model, design the conflict resolution rules and surface them to users, and build the offline state machine into the data layer — not on top of it.
This guide assumes you have already considered API Pagination Patterns, Optimistic UI Updates (which is a precondition for good offline UX), Idempotency Patterns (offline replays must be idempotent), and have shipped Database Indexing Strategy. Cross-reference Real-Time Collaboration (which often shares infrastructure with offline sync), WebSocket / SSE Implementation, and Multi-Tenancy (sync MUST respect tenant isolation).
1. Pick Your Sync Model FIRST
Three sync models. Different engines, different tradeoffs.
Help me decide which sync model fits my product. The three models:
**Model A: Server-authoritative with offline cache + queue**
- Server is source of truth
- Client reads from a local cache that mirrors server data (last-pulled snapshot)
- Writes happen optimistically locally AND are queued for replay when online
- On reconnect: queued writes replay; server may reject (validation error, conflict)
- Conflict resolution: last-write-wins on the server, OR server returns a 409 and the
client surfaces a "your change conflicted" UI
- Examples: classic mobile apps, most field-service apps, line-of-business tools
**Model B: Local-first with two-way sync engine**
- Client and server both have the data; sync engine reconciles
- Writes apply locally immediately; sync engine pushes to server in background
- Sync engine handles partial connectivity, retries, ordering
- Conflict resolution: rules per-table OR per-field; usually last-write-wins with timestamps
- Real-time updates from server pushed via WebSocket/SSE
- Examples: Linear, Superhuman, Things, most modern productivity tools
- Engines: Replicache, PowerSync, ElectricSQL, Triplit, RxDB, Zero (Rocicorp)
**Model C: CRDT-based (Conflict-free Replicated Data Types)**
- Data structures designed so concurrent edits merge deterministically
- No conflicts in the traditional sense — every edit is an operation that commutes
- Multiple people can edit the same paragraph simultaneously and the result is a
coherent merge
- Examples: Figma, Notion blocks, Google Docs, collaborative whiteboards (some of these use OT or custom server-mediated merge rather than strict CRDTs, but the product shape is the same)
- Engines: Y.js, Automerge, Loro, Liveblocks (built on CRDTs under the hood)
The match-up:
- **Form-based, single-author records** (a customer, an invoice, a ticket): Model A
- **Multi-device single-user with rich offline** (a productivity app, a notes app): Model B
- **Real-time multi-user collaborative editing** (a shared canvas, a doc, a board): Model C
Most products are NOT C. Don't pick C if your data is records, not documents.
My product:
- What does the user edit? [records / documents / mix]
- Is there real-time multi-user collab on the same record/doc? [yes / no]
- How often is the user offline? [rarely / sometimes / frequently — field workers, mobile,
international travel, captive WiFi]
- Single-device or multi-device per user? [...]
- Roughly how much data does each user have? [10 records / 10K records / 1M records]
- Do you have hard requirements for "always-correct counts" (e.g., banking)? [yes / no]
(If yes, Model B/C may be wrong — you need server-authoritative counters)
Tell me:
1. Which model fits my product (be opinionated)
2. Why the other two don't
3. The minimum viable v1 sync I can ship in [2-4] weeks
4. The thing I should explicitly NOT build for v1
Picking heuristic: Start with Model A unless you have a clear reason to need B or C. Model B is great once it's working, but the engineering investment is real (4-12 engineering weeks for a custom integration). Model C is right only for true collaborative documents and is its own deep specialty.
2. Build Model A — Cache + Mutation Queue
If you picked Model A, the architecture is: local cache (read), mutation queue (write), reconciliation on reconnect.
I want a server-authoritative offline architecture with a local cache and a
mutation queue. The pattern:
**Client local cache**
- IndexedDB (via Dexie / idb / RxDB / Replicache local storage layer) for
durable client-side storage
- Schema mirrors the server schema for the entities the user can edit offline
- TTL: data older than [X days] gets refetched on reconnect
- Wiped on logout; per-user keyed by user_id
**Read path**
- UI reads from the local cache (synchronous-feeling)
- A background `syncRefresh()` runs on app start, on reconnect, and every
N minutes to pull deltas from server
- Server delta API: `GET /sync?since=<lastSyncTimestamp>` returns updated
records and tombstones for deleted ones
- Last-sync timestamp persisted in IndexedDB
**Write path**
- User submits a mutation; UI applies it optimistically to the local cache
- The mutation is appended to a `mutation_queue` (id, mutation_type, payload,
client_id, created_at, status, retry_count, last_error)
- A background worker processes the queue:
- Sends mutations in order (FIFO per entity, parallel across entities)
- On 2xx: removes from queue; updates the cache with the server response
- On 4xx (validation): marks as failed; surfaces error to user; doesn't retry
- On 5xx / network: retries with exponential backoff; up to N attempts; then
marks as failed
- On 409 (conflict): see conflict path below
**Conflict path (server-side validation)**
- Server uses optimistic concurrency: every mutable record has a version (or
updated_at timestamp); client sends the version it read
- On mismatch: server returns 409 with the current server state
- Client UI shows "conflict" with both versions; user picks one or merges
- Merged result re-submitted as a fresh mutation
**Idempotency**
- Every mutation has a client-generated UUID (idempotency key)
- Server stores processed UUIDs for [24h]; replays of the same UUID return
the original response (NOT a duplicate write)
- See [Idempotency Patterns](idempotency-patterns-chat.md) for the server side
**Observable state**
- "Sync status" indicator in the UI: synced / syncing / offline / error
- Per-record sync state: optimistic / synced / failed
- Failed mutations surface in a "needs your attention" tray; user can retry
or discard
Build me:
1. The IndexedDB schema (tables, indexes)
2. The mutation queue table + worker logic
3. The `/sync?since=...` endpoint
4. The conflict UI (side-by-side, with "keep mine" / "use theirs" / "merge")
5. The sync-status UI component (icon + tooltip + tray)
Critical: the worker MUST be resilient to the user closing the tab mid-sync.
Never delete a mutation from the queue until the server has confirmed.
Trap to flag: many founders skip the idempotency keys and discover that on flaky networks the server processed the request but the response was lost — so the client retries, creating duplicates. Idempotency keys are not optional.
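A minimal sketch of that queue, assuming Dexie over IndexedDB (the table name, field names, and the `/api/sync/mutations` endpoint are illustrative, not prescribed); note how the client-generated UUID doubles as the idempotency key:

```ts
import Dexie, { type Table } from "dexie";

interface QueuedMutation {
  id: string;            // client-generated UUID, doubles as the idempotency key
  mutationType: string;  // e.g. "task.create", "task.update"
  payload: unknown;
  createdAt: number;
  status: "pending" | "failed";
  retryCount: number;
  lastError?: string;
}

class OfflineDB extends Dexie {
  mutationQueue!: Table<QueuedMutation, string>;
  constructor() {
    super("offline-db");
    this.version(1).stores({
      // primary key plus the indexes the queue worker queries on
      mutationQueue: "id, status, createdAt",
    });
  }
}
export const db = new OfflineDB();

export async function enqueue(mutationType: string, payload: unknown) {
  await db.mutationQueue.add({
    id: crypto.randomUUID(), // server must treat replays of this UUID as no-ops
    mutationType,
    payload,
    createdAt: Date.now(),
    status: "pending",
    retryCount: 0,
  });
}

export async function drainQueue() {
  const pending = await db.mutationQueue
    .where("status").equals("pending")
    .sortBy("createdAt"); // FIFO
  for (const m of pending) {
    let res: Response;
    try {
      res = await fetch("/api/sync/mutations", {
        method: "POST",
        headers: { "Content-Type": "application/json", "Idempotency-Key": m.id },
        body: JSON.stringify(m),
      });
    } catch {
      return; // network error: leave the mutation queued; retry later with backoff
    }
    if (res.ok) {
      await db.mutationQueue.delete(m.id); // only after the server confirms
    } else if (res.status === 409) {
      // Conflict: keep the mutation around and hand off to the conflict UI (section 4)
      await db.mutationQueue.update(m.id, { status: "failed", lastError: "conflict" });
    } else if (res.status >= 400 && res.status < 500) {
      await db.mutationQueue.update(m.id, { status: "failed", lastError: `HTTP ${res.status}` });
    } else {
      return; // 5xx: stop; the next drain retries
    }
  }
}
```

The worker can run on app start, on the `online` event, and on a timer; backoff bookkeeping (retry_count, next_attempt_at) layers onto the same table.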
3. Build Model B — Adopt a Sync Engine
Don't build a sync engine. Adopt one. The major options:
Help me pick a sync engine. The leading options for Model B (local-first
two-way sync):
**Replicache (Rocicorp)**
- Rocicorp's original product; mature; mutator-based architecture
- You write a "mutator" (a function that mutates client state); the engine
runs it locally and replays it on server
- Server-side: you implement push and pull endpoints with their semantics
- Strengths: most mature of these options; excellent docs; tight local-first latency
- Weaknesses: requires you to build push/pull on the server (some work)
- Pricing: license-based for production
- Languages: TypeScript first
**Zero (Rocicorp's successor to Replicache)**
- Rocicorp's newer query-sync engine (replaces Replicache for new builds)
- You write SQL-like queries on the client; engine syncs the relevant data
automatically
- Strengths: dramatically less server-side code; query-driven sync; solid
for relational apps
- Weaknesses: newer (less battle-tested at scale)
- Pricing: TBD; check current docs
**ElectricSQL**
- OSS Postgres-to-SQLite sync (replicate Postgres directly to clients)
- Strengths: works directly with your existing Postgres schema; OSS;
particularly nice if you already use Postgres
- Weaknesses: schema migrations require care; sync layer is opinionated
- Pricing: OSS + paid hosted
**PowerSync**
- Hosted sync between Postgres / MongoDB and SQLite/IndexedDB on client
- Strengths: server-side rules engine for what each user syncs; mature
permissions; cross-platform (web, iOS, Android, Flutter)
- Weaknesses: hosted-leaning (self-host requires more setup)
- Pricing: free tier; usage-based for production
**Triplit**
- Local-first DB + sync; relational + reactive
- Strengths: developer-friendly DSL; built-in permissions; reactive queries
- Weaknesses: smaller community; newer
- Pricing: OSS + paid hosted
**RxDB**
- Open-source local-first DB with multiple sync replication options (pull
+ push, GraphQL, CouchDB-protocol, custom)
- Strengths: huge OSS ecosystem; flexible adapters
- Weaknesses: more DIY than the hosted options; you build more glue
- Pricing: OSS; paid Premium tier
**Convex**
- Real-time backend with built-in reactivity; not strictly a sync engine
but provides offline-tolerant reactivity at the data-access layer
- Strengths: full-stack solution; ease of use
- Weaknesses: ties you to Convex's runtime; less of a pure sync layer
My constraints:
- Existing backend: [Postgres / Supabase / Convex / MongoDB / Firebase / DIY API]
- Target platforms: [web only / web + mobile native / Electron desktop]
- Estimated row count per user (synced): [<1K / 1K-100K / 100K+]
- Permissions complexity: [simple per-user / shared workspaces / row-level
multi-tenant rules]
- Team experience with sync engines: [zero / some / extensive]
- Time budget for v1: [2 weeks / 1-2 months / longer]
Tell me:
1. Top 2 picks ranked, with one-line rationale each
2. The wrong pick I should explicitly avoid given my constraints
3. The integration steps for the top pick
4. The query patterns I need to learn (e.g., reactive queries, server-side
filters)
Decision heuristic:
- Already on Postgres + want least vendor lock-in: ElectricSQL or PowerSync
- Want lightest server-side burden + relational: Zero
- Want most-mature option: Replicache (still excellent if you're OK with the older mutator pattern)
- Cross-platform mobile + web: PowerSync
- Already on Convex: stay there; their reactive layer covers a lot
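To get a feel for the mutator pattern Replicache uses, here is a rough client-side sketch. The exact API surface varies by version (older releases use `tx.put` where newer ones use `tx.set`), and the key scheme, mutator, and endpoint paths below are illustrative:

```ts
import { Replicache, type WriteTransaction } from "replicache";

type Task = { id: string; title: string; status: string };

const rep = new Replicache({
  name: "user-42",                  // one local cache per user
  licenseKey: "YOUR_LICENSE_KEY",
  pushURL: "/api/replicache/push",  // you implement these two endpoints server-side
  pullURL: "/api/replicache/pull",
  mutators: {
    // Runs optimistically against the local cache, then replays on the server;
    // the engine handles queuing, retries, and rebasing for you.
    async updateTaskStatus(tx: WriteTransaction, args: { id: string; status: string }) {
      const task = (await tx.get(`task/${args.id}`)) as Task | undefined;
      if (task) await tx.set(`task/${args.id}`, { ...task, status: args.status });
    },
  },
});

// Later, in a click handler: applies locally right away; syncing is background.
await rep.mutate.updateTaskStatus({ id: "t_123", status: "done" });
```

The server-side work for Replicache is the push/pull pair; Zero trades most of that for query-driven sync, which is why it needs dramatically less server code.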
4. Conflict Resolution Rules (Make Them Explicit)
Even with a sync engine, you must define what happens when two devices edit the same record before either syncs.
Help me design conflict resolution rules for my data model.
For each entity type, decide:
**Last-write-wins (LWW) by timestamp**
- Default for most user-data fields
- Each field carries a timestamp; sync engine takes the newest
- Risk: silent data loss if two devices edit at the same time
- Acceptable for: titles, descriptions, settings, where the user generally
doesn't edit the same field on two devices simultaneously
**Last-write-wins by version**
- Server increments a version on write; client must send the version it read
- On mismatch: 409 conflict; client must reconcile
- Use for: any field where silent loss would be a bug (financial amounts,
status fields, anything user-visible-and-stateful)
**Field-level merge (jsonb / set / counter)**
- Tags / labels: union of both sets
- Counters: increment-based CRDTs
- Use for: collections / counters where "merge both" is the right semantics
**First-write-wins / immutable**
- Once created, can't be changed (created_at, payment_id, etc.)
- No conflict possible
**Manual resolution required**
- Two users edit the same long-form text; can't merge automatically
- Surface a "this conflicted" UI; user picks
- Use for: rich-text fields, files, anything where human judgment matters
For my entities:
[list each entity and decide per-field — for example:
- task.title: LWW timestamp
- task.description: LWW timestamp (or manual if you want safety)
- task.status: LWW version (status changes are stateful)
- task.assigned_to: LWW timestamp
- task.tags: field-level set merge
- task.due_date: LWW version
- task.created_at: immutable
- task.completed_count: counter CRDT (if applicable)
]
Build me:
1. A per-field rules document
2. The server-side reconciliation function
3. The 409 response shape and a client-side handler
4. The conflict-resolution UI (side-by-side, with auto-merge where possible)
Discipline: write the conflict rules document BEFORE you ship offline. The default of "the engine just figures it out" silently destroys data. You will not notice until a customer ragequits.
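For the "LWW by version" fields, the server-side check can be a single conditional UPDATE. A sketch, assuming Express + Postgres (the `tasks` table, its columns, and the request body shape are illustrative):

```ts
import { Router } from "express";
import { Pool } from "pg";

const db = new Pool();
export const tasks = Router();

tasks.put("/tasks/:id", async (req, res) => {
  const { expectedVersion, patch } = req.body;

  // Apply the patch only if the client edited the version it originally read.
  const updated = await db.query(
    `UPDATE tasks
        SET status = COALESCE($2, status),
            due_date = COALESCE($3, due_date),
            version = version + 1,
            updated_at = now()
      WHERE id = $1 AND version = $4
      RETURNING *`,
    [req.params.id, patch.status, patch.due_date, expectedVersion]
  );

  if (updated.rowCount === 0) {
    // Version mismatch: another device won the race. Return the current server
    // state so the client can render "keep mine / use theirs / merge".
    const current = await db.query("SELECT * FROM tasks WHERE id = $1", [req.params.id]);
    return res.status(409).json({ conflict: true, server: current.rows[0] });
  }
  res.json(updated.rows[0]);
});
```

Fields tagged "set merge" (tags) or "counter CRDT" take a different write path: send the delta (add/remove tag, +N), not the whole value, so the server can merge instead of overwrite.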
5. The Offline UX — Make Sync State Visible
Users distrust offline apps. Make the state legible.
Build the offline-mode UX. The components:
**1. Connection status indicator**
- Top-right corner badge: green ("synced") / yellow ("syncing N changes") /
gray ("offline") / red ("sync error")
- Click → tooltip with details: last sync time, queue length, last error
**2. Per-record sync state**
- Subtle indicator on each record (small dot, italic style, or "…" suffix)
showing: "saved locally, syncing", "synced", "failed to sync"
- On hover: timestamp of last sync attempt + error if any
**3. Pending changes tray**
- Drawer or modal listing all queued mutations not yet synced
- Each mutation: type (create/update/delete), record name, queued time,
status (pending/retrying/failed)
- Failed mutations: "retry" button, "discard" button (destructive, with
confirm)
**4. Conflict resolution UI**
- When a server returns 409 / sync engine flags a conflict
- Side-by-side: "your version" vs. "server version" with a diff
- Buttons: "keep mine", "use theirs", "merge manually" (opens the editor
with both versions visible)
- After resolution: re-submit as a new write
**5. Network-state messaging**
- When the browser goes offline (`online` / `offline` events): subtle
banner "Working offline — changes will sync when you reconnect"
- When reconnect: "Syncing N changes…" banner that disappears when queue
drains
- Don't show modal / blocking UI for offline; the app should feel normal
**6. Stale-data warning**
- If the last sync was >N hours ago, show a soft warning: "data may be
stale; last synced [time ago]"
- Particularly important for shared/collaborative data
Avoid:
- Toasts on every save / sync — noise
- Modal blocking when offline — frustrating
- Generic "something went wrong" — useless; always include the failed
mutation context
Build me each of the components above as React components with the state
hooks they need.
Trust-building principle: users who can SEE the sync state trust the system. Users who can't, panic at the first hint of trouble.
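The connection-status piece can be as small as one hook. A sketch (note that `navigator.onLine` is only a hint, since captive portals can report "online" while requests still fail, so pair it with your queue's actual state; the `pendingCount` and `lastError` inputs are assumed to come from your mutation-queue store):

```ts
import { useEffect, useState } from "react";

export type SyncStatus = "synced" | "syncing" | "offline" | "error";

// Maps browser connectivity + queue state onto the four badge states.
export function useSyncStatus(pendingCount: number, lastError?: string): SyncStatus {
  const [online, setOnline] = useState(
    typeof navigator === "undefined" ? true : navigator.onLine
  );

  useEffect(() => {
    const goOnline = () => setOnline(true);
    const goOffline = () => setOnline(false);
    window.addEventListener("online", goOnline);
    window.addEventListener("offline", goOffline);
    return () => {
      window.removeEventListener("online", goOnline);
      window.removeEventListener("offline", goOffline);
    };
  }, []);

  if (!online) return "offline";        // gray badge
  if (lastError) return "error";        // red badge
  return pendingCount > 0 ? "syncing"   // yellow badge ("syncing N changes")
                          : "synced";   // green badge
}

// Usage in the badge component (illustrative):
//   const status = useSyncStatus(queue.pendingCount, queue.lastError);
//   <StatusBadge status={status} />
```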
6. Server-Side Concerns
The server side of offline sync is non-trivial.
Help me build the server side of an offline-sync architecture. The
requirements:
**1. Delta endpoint**
- `GET /sync?since=<timestamp>&entities=<list>` returns:
- Records updated since the timestamp
- Tombstones for records deleted since the timestamp (don't physically
delete — soft-delete with a `deleted_at` field; tombstones expire
after [30 days])
- The current server timestamp the client should pass next time
- Filtered by tenant + user permissions on every row
- Pagination if delta exceeds [1000 records]
**2. Push endpoint**
- `POST /sync/mutations` accepts an array of mutations (idempotency_key,
type, payload)
- Each mutation processed atomically (transaction per mutation)
- Returns: per-mutation result (success / 4xx error / 409 conflict with
current state)
- Idempotency: keyed by mutation UUID; replays return cached response
**3. Tombstones for deletes**
- Never hard-delete records that the client may have offline; soft-delete
- Sync deltas include tombstones so clients can drop the record from their
cache
- Tombstones eventually purged (>30 days; configurable)
**4. Permission enforcement**
- Sync MUST filter by tenant + user permissions; no leaking
- Test: spin up two users in the same workspace with different ACLs;
verify their sync deltas are correct
**5. Schema migrations**
- When you add/remove/rename columns, clients with old caches will mismatch
- Strategy:
- Bump a `schema_version` in client cache; on mismatch, force a full
re-sync (drop cache, pull fresh)
- Server tolerates old client schemas for [30 days] (graceful degradation)
- Force-update enforced via a min-supported-version check on requests
**6. Real-time push**
- WebSocket or SSE pushing deltas to online clients
- See [WebSocket / SSE Implementation](websocket-sse-implementation-chat.md)
- Important: push is a hint; clients must still pull authoritative state
on reconnect
**7. Rate limits**
- Per-user sync push: [N mutations/min]
- Per-user delta pulls: [M/min]
- Prevents runaway clients from DDoSing the sync layer
Build me:
- The delta endpoint with pagination
- The push endpoint with idempotency
- The tombstone strategy
- The schema-version handshake
- The integration test for permission enforcement across users
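A sketch of the delta endpoint, assuming Express + Postgres; the `tasks` table, the `tenant_id` column, and the auth middleware that populates `res.locals` are illustrative:

```ts
import { Router } from "express";
import { Pool } from "pg";

const db = new Pool();
export const sync = Router();

sync.get("/sync", async (req, res) => {
  const since = req.query.since ? new Date(String(req.query.since)) : new Date(0);
  const limit = 1000;                   // paginate large deltas
  const { tenantId } = res.locals;      // set by your auth middleware

  // Soft-deleted rows stay in the table so they can be shipped as tombstones.
  const { rows } = await db.query(
    `SELECT id, title, status, version, updated_at, deleted_at
       FROM tasks
      WHERE tenant_id = $1              -- never skip the tenant filter here
        AND updated_at > $2
      ORDER BY updated_at
      LIMIT $3`,
    [tenantId, since, limit]
  );

  res.json({
    changes: rows.filter((r) => !r.deleted_at),
    tombstones: rows.filter((r) => r.deleted_at).map((r) => r.id),
    // The client persists this and sends it back as ?since= on the next pull.
    nextSince: rows.length ? rows[rows.length - 1].updated_at : since,
    hasMore: rows.length === limit,
  });
});
```

One subtlety: an `updated_at` cursor can miss rows committed at the cursor boundary or by long-running transactions; a monotonic version/sequence column is the more robust cursor if you can add one.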
7. The Offline-Mode QA Harness
Offline bugs are uniquely terrible. They only surface in flaky networks, manifest as data loss, and are hard to reproduce. You need a test harness.
Build an offline-mode test harness for my app. The scenarios:
**Scenario 1: Pure offline write → reconnect**
- Browser online, user creates a record
- Network goes offline (simulate via DevTools or test framework)
- User edits the record
- Network comes back online
- Assert: the edit synced; server state matches client expected state
**Scenario 2: Concurrent edits across two devices**
- Two browser sessions for the same user
- Both go offline
- Each makes a different edit to the same record
- Both come back online
- Assert: conflict resolution applied per the rules; no data lost; user
sees the conflict UI on both clients (or auto-merge if rules allow)
**Scenario 3: Mid-sync crash**
- Browser online, user creates 50 records rapidly
- Force tab close mid-sync
- Reopen tab
- Assert: all 50 records sync; no duplicates; queue drained
**Scenario 4: Server returns 409**
- Manually craft a state where the server's record is newer than the
client's
- Client submits an update
- Assert: 409 received; conflict UI shown; resolution writes correct state
**Scenario 5: Permission boundary**
- User A has access to workspace W; User B does not
- User B's sync delta MUST NOT include any records from W
- Assert: validate at the API level + integration test
**Scenario 6: Quota / throttling**
- Submit 1000 mutations in 10 seconds
- Assert: rate limiter applied; client retries with backoff; eventually
drains
**Scenario 7: Schema mismatch**
- Old client (cached schema v3); server schema v4
- Open the old client
- Assert: client detects mismatch; triggers full re-sync; old data correctly
imported into new schema
**Tools**:
- Playwright with `route()` and `setOffline()` for browser-side scenarios
- Vitest / Jest for the queue + cache logic in isolation
- Integration tests against a test database for the server endpoints
CI: run scenarios 1-3 on every PR; full suite nightly.
Build me the harness skeleton + the scenario test files.
Discipline: every PR that touches the data layer runs the harness. Without this, offline regressions ship silently and surface only when a customer loses data.
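Scenario 1 in Playwright looks roughly like this (selectors, routes, and banner copy are placeholders for your app; assumes `baseURL` is configured):

```ts
import { test, expect } from "@playwright/test";

test("offline edit syncs on reconnect", async ({ page, context }) => {
  await page.goto("/tasks");
  await page.getByRole("button", { name: "New task" }).click();
  await page.getByLabel("Title").fill("Written while online");

  // Cut the network and keep editing; the app must stay usable.
  await context.setOffline(true);
  await page.getByLabel("Title").fill("Edited while offline");
  await expect(page.getByText("Working offline")).toBeVisible();

  // Restore the network and wait for the queue to drain.
  await context.setOffline(false);
  await expect(page.getByText("synced", { exact: false })).toBeVisible();

  // Assert against the server directly, not just the UI.
  const res = await page.request.get(
    "/api/tasks?title=" + encodeURIComponent("Edited while offline")
  );
  expect(res.ok()).toBeTruthy();
});
```

`context.setOffline()` only simulates a clean disconnect; for flaky-network coverage, add `page.route()` handlers that randomly delay or abort requests.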
8. What Done Looks Like
You have shipped real offline-first when:
- A user can put their laptop in airplane mode for 4 hours, work continuously, reconnect, and find every change synced.
- Two devices edit the same record offline; on reconnect, conflicts surface clearly and resolve without silent data loss.
- Sync state is visible at all times: connection status, per-record state, pending-changes tray.
- Network failure produces NO error toasts that say "something went wrong"; the app keeps working.
- Mid-sync crashes do not produce duplicate records or lost mutations.
- Per-tenant permission isolation is enforced at the server sync layer (tested).
- Schema migrations don't break existing clients (graceful re-sync).
- The QA harness runs in CI on every PR; offline scenarios pass.
- A new engineer can read this doc + the conflict rules + the sync engine config and understand the data flow end-to-end.
- Ops alerts fire when sync queue lag exceeds [N seconds] across the user base.
Mistakes to Avoid
- Adding offline as an afterthought. Architectural rewrite. Decide upfront.
- Picking the wrong sync model. CRDTs for record-style data, or a REST cache + queue for collaborative docs: either mismatch hurts for years.
- No idempotency keys. Duplicates everywhere on flaky networks.
- No conflict-resolution UI. Either you silently destroy data or you block the user mid-flow with a generic error.
- Hard-deleting records. Clients with offline caches keep the deleted record; on next sync, no signal that it's gone.
- No schema-version handshake. First migration that adds a NOT NULL column breaks every offline client.
- Bypassing tenant isolation in the sync layer. Cross-tenant data leakage in the delta endpoint is the worst kind of bug.
- Building a sync engine in-house. Y.js, Automerge, Replicache, ElectricSQL, PowerSync exist for a reason. Adopt; don't build.
- Blocking UI when offline. "You're offline — please reconnect" modals destroy the value of offline-first.
- Not testing flaky networks. Slow / lossy / partitioned networks reveal bugs that pure online/offline tests don't.
- Storing auth tokens in IndexedDB without encryption. Anyone with local access reads them. See Session Management Patterns.
- Not surfacing failed mutations. Failed-and-forgotten mutations = silent data loss. Surface them prominently.
- Treating WebSocket push as authoritative. Push is a hint; the client must reconcile against pulled state on reconnect.
See Also
- Optimistic UI Updates — precondition for good offline UX
- Idempotency Patterns — required for safe mutation replays
- Real-Time Collaboration — overlapping infrastructure (CRDTs)
- Realtime Presence & Collaborative Cursors — presence layer
- WebSocket / SSE Implementation — push channel for sync hints
- Multi-Tenancy — sync respects tenant boundaries
- Database Migrations — schema evolution discipline
- Session Management Patterns — auth + offline interaction
- HTTP Retry & Backoff — queue worker retry strategy
- API Pagination Patterns — pagination shape applies to sync deltas
- Backups & Disaster Recovery — server-side sync data is part of backup scope
- VibeReference: Realtime / WebSocket Platforms — the underlying transport
- VibeReference: Database Providers — sync engine compatibility considerations