# Caching Strategies: Layers, Invalidation, TTLs, and Shipping Caches Without Stale-Data Bugs
If you're running a SaaS in 2026, the cache decisions you make now will dictate how the product feels at 10x traffic. Most founders skip caching too long, then panic-add Redis after the first slow page complaint, then discover three months later that customers are seeing each other's data because the cache key didn't include tenant_id. The cache layer is one of the highest-leverage and highest-risk parts of your stack — fast, mostly invisible when it works, and capable of producing the worst class of bugs (stale data, leaked data) when it doesn't.
A working caching strategy answers: which layer caches what, what's the TTL per layer, what triggers invalidation, and how do we avoid leaking one tenant's data to another. Done well, the app feels instant and the database stays calm. Done badly, you're on a Sunday call with a Sev-1 tenant-leak incident wondering how a cache miss became a security incident.
This guide is the implementation playbook for caching that scales — the layered architecture, the invalidation patterns, the TTL math, and the rules that prevent stale-data and cross-tenant disasters.
## The Cache Pyramid: Five Layers, Each With a Job
Caching isn't one decision; it's five. Each layer has a different latency, a different scope, and a different invalidation pattern. Get the layer assignment right; everything else follows.
Help me design the cache layers.
The five layers (top to bottom):
**1. Browser cache (client-side)**
- Lives in the user's browser
- Set via Cache-Control / ETag headers
- TTL: minutes to days
- Scope: per-user (good); doesn't help cold visitors
Use for:
- Static assets (JS, CSS, images, fonts) — long TTL, immutable
- API responses that are user-specific and rarely change
Don't use for:
- Anything sensitive that shouldn''t persist after logout
- Data that changes frequently (defeats the purpose)
**2. CDN edge cache**
- Lives on CDN PoPs (Cloudflare, Vercel, CloudFront — per [cdn-providers](https://www.vibereference.com/cloud-and-hosting/cdn-providers))
- Set via Cache-Control / Surrogate-Key
- TTL: minutes to hours (for dynamic) or weeks (for assets)
- Scope: global (anyone in same region gets cached response)
Use for:
- Public marketing site
- Static assets
- Public API responses (be careful — see "tenant isolation" below)
- ISR pages (Next.js / etc.)
Don't use for:
- Authenticated, per-user content (without scoped keys)
- Tenant-private data without strong key isolation
**3. Application-level cache (in-process)**
- Lives in your Node / Python / Go process memory
- TTL: seconds to minutes (process restarts wipe it)
- Scope: per-process
Use for:
- Hot config / feature flags
- Computed values (expensive joins / aggregations)
- Rate-limit counters (when not distributed)
Don't use for:
- Anything that must be consistent across instances
- Large data (memory pressure)
**4. Distributed cache (Redis / Memcached)**
- Lives in shared cache server (per [database-providers](https://www.vibereference.com/backend-and-data/database-providers))
- TTL: seconds to days
- Scope: shared across all instances
- Atomicity (Redis): pipelines, transactions, Lua scripts
Use for:
- Session data
- API rate limiting
- Job queues / pub-sub
- Cross-instance shared cache
- Computed data with complex invalidation
Don't use for:
- Source of truth for revenue / inventory data (use database)
- Anything you can't reproduce from origin if the cache fails
**5. Database query cache / materialized views**
- Lives in your DB (Postgres materialized views) or DB-adjacent (read replicas)
- TTL: depends on refresh strategy
- Scope: query-level
Use for:
- Expensive aggregations refreshed periodically
- Read-heavy reports
- "Top N" rankings, dashboards
Don''t use for:
- Real-time data (refresh latency)
**The pyramid mapping**:
| Use Case | Best Layer | TTL | Why |
|---|---|---|---|
| Static asset (JS / CSS / image) | Browser + CDN | Days-weeks | Immutable; far edge wins |
| Marketing page | CDN | Hours-days | Public; rarely changes |
| Authenticated dashboard data | Distributed (Redis) | Seconds-minutes | Per-tenant isolation |
| Hot config / feature flag | Application | Seconds | Low write rate |
| Session data | Distributed | Hours-days | Shared across instances |
| Computed leaderboard | Database materialized view | Hourly refresh | Heavy compute |
| API response (public) | CDN | Minutes | Cacheable for many users |
| API response (authenticated) | Distributed | Minutes | Tenant-scoped |
| Rate-limit counter | Distributed | Window | Shared state |
For my app:
- The 5 layers and what each holds
- The TTL per layer
- The invalidation strategy per layer
Output:
1. The cache architecture diagram
2. The data classification (public / private / static / dynamic)
3. The layer assignment per data type
4. The TTL table
The biggest unforced error: caching authenticated tenant data at the CDN without scoping the cache key. A user signs in, hits /api/dashboard, the response gets cached at CDN, and the next user hitting the same URL gets the previous user's data. This is a tenant-leak incident. The fix: never cache authenticated responses at CDN unless the cache key explicitly includes user / tenant identity (Surrogate-Key, custom header) AND Cache-Control: private. When in doubt, mark Cache-Control: private, no-store and cache at distributed (Redis) instead.
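That rule is easier to enforce in code than to remember per route. A minimal sketch, assuming a framework-agnostic request shape (the `cacheControlFor` helper and its header logic are illustrative, not a specific framework's API):

```typescript
// Pick a Cache-Control header based on whether the request is authenticated.
// Authenticated responses must never land in a shared (CDN) cache.
type RequestLike = { headers: Record<string, string | undefined> };

function cacheControlFor(req: RequestLike, publicTtlSeconds: number): string {
  const authenticated =
    req.headers['authorization'] !== undefined ||
    req.headers['cookie'] !== undefined;
  // Authenticated: forbid shared caches AND browser persistence.
  if (authenticated) return 'private, no-store';
  // Public: safe for CDN with an explicit TTL.
  return `public, max-age=${publicTtlSeconds}`;
}

console.log(cacheControlFor({ headers: { authorization: 'Bearer x' } }, 60));
// private, no-store
console.log(cacheControlFor({ headers: {} }, 60));
// public, max-age=60
```

Wired into middleware, this makes the safe behavior the default: a route author has to opt *in* to CDN caching rather than remember to opt out.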
## TTL Math: Pick the Number, Justify It
Every cache TTL has a tradeoff: shorter = fresher data, more origin load; longer = staler data, less origin load. Pick deliberately.
Help me set TTLs systematically.
The TTL framework:
**Step 1: Categorize the data**
| Category | Update frequency | Example | TTL band |
|---|---|---|---|
| Immutable | Never | Versioned JS bundle | 1 year (max) |
| Static | Rarely | Marketing copy | 1 hour - 1 day |
| Slow-changing | Hourly | Pricing tier list | 5-15 minutes |
| Medium-changing | Per-minute | Dashboard metrics | 30-60 seconds |
| Fast-changing | Per-second | Active-user list | 5-10 seconds (or no cache) |
| Real-time | Sub-second | Trading prices | No cache; SSE / WebSocket |
**Step 2: Compute origin-load impact**
Without cache: every request hits origin.
With cache + TTL T: origin gets at most 1 request per TTL window per cache key.
Example:
- Endpoint gets 1000 req/sec across all users
- Cache key per user: 100 active users
- TTL = 60 seconds
- Origin load = 100 requests / 60 seconds = ~1.7 req/sec to origin
- Reduction: 99.8%
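The arithmetic in Step 2 generalizes to a small helper worth keeping next to your TTL table; a sketch (function and field names are mine):

```typescript
// Worst-case origin load behind a TTL cache: each distinct cache key
// hits origin at most once per TTL window.
function originLoad(reqPerSec: number, distinctKeys: number, ttlSeconds: number) {
  const originReqPerSec = distinctKeys / ttlSeconds;
  const reductionPct = 100 * (1 - originReqPerSec / reqPerSec);
  return { originReqPerSec, reductionPct };
}

// The example above: 1000 req/sec, 100 per-user keys, 60s TTL.
const { originReqPerSec, reductionPct } = originLoad(1000, 100, 60);
// originReqPerSec ≈ 1.67, reductionPct ≈ 99.83
```

Running it for a few candidate TTLs makes the fresher-vs-cheaper tradeoff concrete before you commit a number.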
**Step 3: Compute stale-data impact**
Worst case: data was updated at T=0; cache TTL is 60s; user hits at T=59.
How bad is 59 seconds of staleness?
- Marketing copy: fine
- Dashboard analytics: probably fine
- Inventory count: borderline
- Account balance: NOT fine
- Permission change: dangerous (security implication)
**Step 4: Pick TTL based on tolerance**
- Stale-tolerant: long TTL (minutes to hours)
- Stale-sensitive: short TTL (seconds) or invalidate on write
- Stale-intolerant: no cache; or invalidate-on-write strict
**The "p99 freshness" rule**:
With reads spread evenly across the TTL window, the average staleness a user sees is TTL/2 (worst case: the full TTL).
If you can't tolerate average staleness > X seconds, set TTL = 2X.
**Specific TTL guidelines**:
| Data | Suggested TTL |
|---|---|
| User profile | 60-300 sec (invalidate on write) |
| Settings / preferences | 60 sec (invalidate on write) |
| Permission / role data | 0 (no cache) or short with strict invalidation |
| Feature flags | 30-60 sec |
| Content (blog post, doc) | 5-15 min (invalidate on publish) |
| Marketing page | 1-24 hr (invalidate on deploy) |
| Search results | 1-5 min |
| Aggregated metrics | 1-5 min |
| Pricing tier definitions | 5-15 min |
| Static asset (versioned) | 1 year |
| Static asset (unversioned) | 1 hour |
**The "explain this TTL in 1 sentence" rule**:
For every cache TTL, you should be able to say:
> "We cache X for Y seconds because Z change at most every Y/2 seconds, and customers tolerate Y/2 seconds of staleness."
If you can't: rethink. Random TTLs (300, 600, 900) without justification compound into bugs.
For my app:
- TTL per cached data type
- Justification per TTL (the 1-sentence test)
- The explicit "no cache" list
Output:
1. The TTL table with justifications
2. The "no cache" explicit list
3. The TTL review cadence
The biggest TTL mistake: picking 1-hour TTLs because "they sound reasonable" without checking what data updates more often than that. A user-permission change that takes 60 minutes to propagate is a 60-minute security exposure. A pricing update that's stale for 1 hour is 1 hour of customers seeing wrong prices. TTLs need empirical grounding: how often does this data ACTUALLY change, and what's the cost of staleness?
## Invalidation: The Hard Part
"There are only two hard problems in computer science: cache invalidation and naming things." Invalidation is the part that bites. Plan for it.
Help me design cache invalidation patterns.
The four invalidation strategies:
**Strategy 1: TTL-based (lazy invalidation)**
- Cache expires after fixed time
- Reads after expiration miss → fetch from origin → re-cache
- No active invalidation; just wait
Pros: simple; no infrastructure
Cons: stale data within TTL window; not suitable for must-be-fresh data
Use for: most static / semi-static data
**Strategy 2: Write-through invalidation**
- Cache key gets invalidated when underlying data changes
- On write, application explicitly purges the cache key
- Next read repopulates
Pros: fresh after writes
Cons: requires app to know all cache keys; risk of missing invalidation paths
Use for: user profile, settings, anything must-be-fresh-after-update
**Strategy 3: Tag-based invalidation**
- Each cache entry has tags (e.g., `user:123`, `tenant:abc`, `feature:billing`)
- On data change, purge all entries with matching tag
- One write can invalidate many keys
Pros: handles complex invalidation cleanly
Cons: requires cache backend that supports tags (Cloudflare Cache Tags, Vercel cacheTag, Redis with tag mappings)
Use for: complex data with multiple read paths
**Strategy 4: Event-driven invalidation**
- Database triggers / change-data-capture (CDC) emits events
- Cache listener invalidates affected keys
- Scales to many readers
Pros: automatic; no app-level coupling
Cons: requires CDC infrastructure (Debezium, Postgres logical replication)
Use for: large-scale systems with many reader services
**The Vercel-native pattern (Next.js 15 / 16)**:
Vercel's `cacheTag` + `updateTag` (per [vercel-runtime-cache](https://www.vibereference.com/cloud-and-hosting/vercel-functions)) gives you tag-based invalidation natively:
```typescript
// In a server function
import { cacheTag, updateTag } from 'next/cache';

export async function getUser(userId: string) {
  'use cache';
  cacheTag(`user:${userId}`);
  return await db.users.findById(userId);
}

// On update
export async function updateUser(userId: string, data: any) {
  await db.users.update(userId, data);
  updateTag(`user:${userId}`); // invalidates cached user
}
```
Pattern: tag-on-read; invalidate-tag-on-write.
**The Redis-native pattern**:
```typescript
// Read (cache-aside with a 300s TTL)
const cached = await redis.get(`user:${userId}`);
if (cached) return JSON.parse(cached);
const user = await db.users.findById(userId);
await redis.set(`user:${userId}`, JSON.stringify(user), 'EX', 300);
return user;

// Write
async function updateUser(userId: string, data: any) {
  await db.users.update(userId, data);
  await redis.del(`user:${userId}`); // invalidate
}
```
For multi-key invalidation (tag-based on Redis):
```typescript
// Tag mapping: SADD "tag:tenant:123" "user:abc" "user:def" ...
// On invalidate:
const keys = await redis.smembers('tag:tenant:123');
if (keys.length > 0) {
  await redis.del(...keys);
  await redis.del('tag:tenant:123');
}
```
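The write side of that tag mapping has to register each key under its tags at set time. A sketch, assuming an ioredis-style client with `pipeline()` (the `setWithTags` helper is mine):

```typescript
// Write side of Redis tag-based invalidation: store the value AND register
// the key under each tag set, batched in one pipeline round trip.
async function setWithTags(
  redis: any, // assumed ioredis-style client
  key: string,
  value: string,
  ttlSeconds: number,
  tags: string[],
) {
  const pipeline = redis.pipeline();
  pipeline.set(key, value, 'EX', ttlSeconds);
  for (const tag of tags) {
    pipeline.sadd(`tag:${tag}`, key);
    // Let the tag set itself expire so orphaned members eventually age out.
    pipeline.expire(`tag:${tag}`, ttlSeconds * 2);
  }
  await pipeline.exec();
}

// Usage: await setWithTags(redis, 'user:abc', json, 300, ['tenant:123']);
```

Without this registration step, the `smembers`-then-`del` invalidation above has nothing to enumerate.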
**The "every write invalidates" rule**:
Every code path that writes to data X must invalidate the cache for X.
This is non-negotiable. Code review checklist:
- Direct DB write: invalidate cache?
- Background job updates: invalidate cache?
- Admin tool changes: invalidate cache?
- Webhook receives external change: invalidate cache?
- Migration backfill: invalidate cache?
Missed invalidations are the #1 cache bug.
Anti-patterns:
- "We'll just rely on TTL for must-be-fresh data" — staleness becomes a feature complaint
- Manual cache flushes via SSH ("just nuke Redis if anything's wrong") — not a strategy
- Different services invalidate cache differently — drift; bugs
- Cache keys without consistent naming — can''t invalidate what you can''t enumerate
For my app:
- The invalidation strategy per data type
- The "every write" audit
- The cache-key naming convention
Output:
- The invalidation strategy matrix
- The audit of write paths
- The cache-key naming convention
The biggest invalidation mistake: **forgetting one write path.** You wire up `/api/users/:id` PUT to invalidate `user:123` cache. Six months later, an admin tool writes directly to the database and the cache stays stale. A user calls support. It takes 3 hours to debug. The fix: list every write path on day one; ensure each invalidates; add an integration test that checks "after write, cache is fresh." Treat cache invalidation as part of the contract for any data-mutation function.
## Cache Keys: Naming, Scoping, and the Tenant-Leak Problem
Bad cache keys produce the worst bug class — leaking one user's data to another. Spend disproportionate effort on key design.
Help me design cache keys safely.
**The cache-key naming convention**:
Pattern: `{namespace}:{entity}:{id}:{variant?}`
Examples:
- `user:profile:123`
- `tenant:abc:dashboard:metrics`
- `tenant:abc:user:123:permissions`
- `feature-flag:billing-redesign`
- `pricing:tier-list:v2`
Required components for tenant data:
- Tenant identifier (always; never optional)
- Entity type (user / dashboard / report / etc.)
- Entity ID (specific record)
- Variant (locale, format, version) if applicable
**The tenant-isolation rule**:
ANY data that's tenant-private MUST include `tenant:X` in the cache key.
Examples:
- ❌ `dashboard:metrics` — leaks across tenants
- ✅ `tenant:abc:dashboard:metrics` — tenant-scoped
**The user-isolation rule**:
ANY data that's user-private (within a tenant) MUST include `user:Y` in the cache key.
Examples:
- ❌ `tenant:abc:notifications` — leaks across users in the same tenant
- ✅ `tenant:abc:user:123:notifications` — user-scoped
**The "negative cache key" anti-pattern**:
Some apps cache "no result" responses to avoid hammering the DB on missing-record lookups. This is fine, but the key must include enough context that inserting the real record later correctly invalidates the cached miss.
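One way to implement a safe negative cache is a sentinel value with a deliberately short TTL. A sketch (the sentinel, TTLs, and `findEmail` helper are illustrative; `redis` and `db` are assumed to be your existing clients):

```typescript
const NEGATIVE_SENTINEL = '__none__'; // marks a cached "no result"
const NEGATIVE_TTL = 30; // seconds; keep cached misses short-lived

async function getUserEmail(redis: any, db: any, userId: string): Promise<string | null> {
  const key = `user:email:${userId}`;
  const cached = await redis.get(key);
  if (cached === NEGATIVE_SENTINEL) return null; // known-missing, skip DB
  if (cached !== null) return cached;
  const email = await db.findEmail(userId);
  if (email === null) {
    // Cache the miss briefly so repeated lookups don't hammer the DB.
    await redis.set(key, NEGATIVE_SENTINEL, 'EX', NEGATIVE_TTL);
  } else {
    await redis.set(key, email, 'EX', 300);
  }
  return email;
}
// The write path that creates the record must still DEL this key, so the
// negative entry doesn't outlive the record's creation.
```

The short negative TTL bounds the damage if an invalidation is missed; the sentinel keeps "missing" distinguishable from "not cached".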
**The HTTP-cache-key trap**:
CDN cache keys default to URL only. If `/api/dashboard` returns different data per user, you must:
- Add a `Vary: Authorization` header (so the cache key includes the auth header)
- Or use `Cache-Control: private` (forbids CDN caching; only the browser caches)
- Or include user identity in the URL (`/api/users/123/dashboard`)
- Or use Surrogate-Key + per-user tags for explicit control
Cache keys for static / public data:
- `marketing:home:v3`
- `pricing:public:v2`
- `blog:post:slug-here`
These are intentionally not tenant-scoped (the data is public).
**Cache-key versioning (for safe migrations)**:
When you change the data shape:
- Increment the version: `user:profile:v2:123`
- Old keys age out via TTL
- No risk of mixing old + new data during deploy
The cache-key audit:
Quarterly: review all cache keys
- Are tenant-scoped where required?
- Are they consistent in naming?
- Any "unsafe" keys (missing tenant)?
Anti-patterns:
- Cache keys that don't include user / tenant for private data
- Inconsistent prefixes across services
- Composite keys without separators (`user123dashboard` vs `user:123:dashboard`)
- Hash-based keys without context (can't debug; can't enumerate)
Tooling: type your cache-key builder in TypeScript:
```typescript
type CacheKey =
  | { type: 'user-profile'; userId: string }
  | { type: 'tenant-dashboard'; tenantId: string }
  | { type: 'feature-flag'; flagId: string };

function buildKey(k: CacheKey): string {
  switch (k.type) {
    case 'user-profile': return `user:profile:${k.userId}`;
    case 'tenant-dashboard': return `tenant:${k.tenantId}:dashboard`;
    case 'feature-flag': return `flag:${k.flagId}`;
  }
}
```
This makes it impossible to construct an invalid key.
For my app:
- The cache-key naming convention
- The tenant-leak audit
- The type-safe builder
Output:
- The naming convention doc
- The tenant-leak audit list
- The type-safe key builder
The biggest cache-key mistake: **caching tenant data with a non-tenant-scoped key.** This is a security incident waiting to fire. The first time a customer reports "I see another company's data on my dashboard," your cache architecture is the problem. The fix is preventive: every cache key for private data MUST include tenant; lint / test for it; never let a `cache.set(key, data)` ship without a key that includes tenant.
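That "lint / test for it" guard can be a plain assertion run in CI over the keys your builder produces. A sketch (the pattern and helper name are mine; adapt the regex to your own naming convention):

```typescript
// Guard: every key for tenant-private data must contain a tenant segment.
// Run this over the output of your key builder in a unit test.
const TENANT_SEGMENT = /(^|:)tenant:[^:]+:/;

function assertTenantScoped(key: string): void {
  if (!TENANT_SEGMENT.test(key)) {
    throw new Error(`cache key missing tenant scope: ${key}`);
  }
}

assertTenantScoped('tenant:abc:dashboard:metrics'); // passes
// assertTenantScoped('dashboard:metrics');         // would throw
```

It won't catch every mistake (it can't know which data is private), but paired with a type-safe builder it turns the most common tenant-leak bug into a failing test instead of an incident.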
## Cache Stampede and the Thundering Herd
When a hot cache entry expires, all concurrent requests miss simultaneously, all hit origin, origin melts. Plan for this.
Help me handle cache stampede.
The problem:
- Cache key X has TTL 60 seconds
- 1000 req/sec for X
- Cache expires at T=0
- All 1000 requests at T=0 miss; all 1000 hit DB
- DB melts under 1000 concurrent reads
- Cache repopulates for one request; others wait or fail
This is "cache stampede" or "thundering herd."
Mitigations:
1. Probabilistic early refresh (XFetch algorithm)
- Some fraction of requests just before TTL expires, refresh cache
- Spreads load across the TTL window
2. Stale-while-revalidate
- Serve the stale cached value immediately
- Asynchronously refresh in the background
- Set via the `Cache-Control: stale-while-revalidate=N` directive, e.g. `Cache-Control: max-age=60, stale-while-revalidate=300`
- Means: cache for 60s; after that, serve stale for up to 300s while refreshing in the background
3. Single-flight (request coalescing)
- When multiple concurrent requests miss the cache
- Only ONE goes to origin; others wait for that result
- Implement via in-memory mutex or Redis lock
```typescript
async function getWithCoalescing(key: string) {
  // Try cache first
  const cached = await cache.get(key);
  if (cached) return cached;
  // Acquire lock (NX: only if absent; EX 5: auto-expire after 5s)
  const lockAcquired = await redis.set(`lock:${key}`, '1', 'NX', 'EX', 5);
  if (lockAcquired) {
    // Lock holder fetches from origin and repopulates the cache
    const value = await origin.fetch(key);
    await cache.set(key, value, 'EX', 60);
    await redis.del(`lock:${key}`);
    return value;
  } else {
    // Wait briefly; retry cache
    await sleep(100);
    return await cache.get(key);
  }
}
```
4. Background refresh
- Cron / worker proactively refreshes hot cache entries before expiration
- Origin load is predictable
5. Jittered TTLs
- Don't set TTL = exactly 60 seconds for all entries
- Set TTL = 60 ± random(10) seconds
- Spreads expiration across time
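Mitigation 5 is small enough to standardize as a helper so every `cache.set` call site gets it for free; a sketch (names are mine):

```typescript
// TTL with ±jitterFraction randomness, so hot keys written at the same
// moment don't all expire in the same instant.
function jitteredTtl(baseSeconds: number, jitterFraction = 0.1): number {
  const jitter = (Math.random() * 2 - 1) * jitterFraction * baseSeconds;
  return Math.max(1, Math.round(baseSeconds + jitter));
}

// jitteredTtl(60) returns an integer in roughly [54, 66]
```

Use it wherever a TTL is passed today, e.g. `redis.set(key, value, 'EX', jitteredTtl(60))`.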
The pragmatic recipe for indie SaaS:
- Use stale-while-revalidate at CDN
- Use Redis lock for hot keys at app level
- Add jitter to TTLs (10% randomness)
- Monitor cache hit rate (should be >90% for hot data)
When NOT to optimize for stampede:
- Low-traffic endpoints (no stampede possible)
- Endpoints with heterogeneous keys (each key a few req/sec)
Premature stampede protection adds complexity. Add when monitoring shows it.
For my app:
- The stampede risk per cached endpoint
- The mitigations to implement
- The monitoring to detect
Output:
- The hot-key list
- The stampede mitigations
- The cache hit-rate dashboard
The biggest stampede mistake: **assuming it won't happen until it does.** A homepage / dashboard endpoint serving 10K req/sec with a 60-second TTL is one cache miss away from a database meltdown. Stale-while-revalidate is a 1-line header change that prevents this; add it before you need it. Single-flight is harder; add it when monitoring shows hot-key contention.
## Cache Observability: Hit Rate, Latency, Memory
A cache without monitoring is a cache without correctness guarantees.
Help me observe my cache.
The metrics:
1. Hit rate
- (cache hits) / (cache hits + cache misses)
- Goal: >90% for hot data; >80% overall
- Drop in hit rate signals: TTL too short / invalidation bug / data shape change
2. Latency
- p50, p95, p99 cache read latency
- Redis: ~1ms p95 typical
- Distributed cache: <5ms p95
- Application cache: <0.1ms
3. Memory usage
- Cache size (bytes)
- Eviction rate (when full, LRU eviction)
- Goal: <80% of available memory; alert at 90%
4. Top keys
- Most-frequently-accessed keys
- Largest keys (memory consumers)
- Helps spot stampede risk + memory hogs
5. Stale-data incidents
- How often does production data not match cache?
- Hard to measure without explicit checks
- Useful: random 1% of reads also fetch from origin and compare
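That "random 1% of reads also fetch from origin and compare" check can be wired as a thin wrapper around any cached read; a sketch (the function shape and `recordStale` hook are mine):

```typescript
// Shadow-read sampler: for a small fraction of cache hits, also fetch from
// origin in the background and record whether the cached value matched.
// The user-facing read path never blocks on the comparison.
async function sampledGet<T>(
  key: string,
  cacheGet: (k: string) => Promise<T | null>,
  originGet: (k: string) => Promise<T>,
  recordStale: (k: string) => void, // e.g. increment a metric counter
  sampleRate = 0.01,
): Promise<T> {
  const cached = await cacheGet(key);
  if (cached === null) return originGet(key);
  if (Math.random() < sampleRate) {
    originGet(key)
      .then((fresh) => {
        if (JSON.stringify(fresh) !== JSON.stringify(cached)) recordStale(key);
      })
      .catch(() => {}); // sampling must never throw into the request path
  }
  return cached;
}
```

The stale counter this feeds is the only direct measurement of cache correctness; hit rate and latency can both look perfect while the cache serves wrong data.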
Tools:
- Redis Insight / Redis CLI (`INFO stats`, `MEMORY STATS`)
- Datadog / Grafana with Redis exporter
- App-level metrics (StatsD / OpenTelemetry)
- Custom dashboard in your observability stack (per [error-monitoring-providers](https://www.vibereference.com/devops-and-tools/error-monitoring-providers))
Alerts:
- Hit rate <50%: investigate (regression)
- Memory >90%: scale up Redis
- Stampede detected (single-flight contention high): review hot keys
- Stale-data report from customer: drop everything; debug
Cache "tests":
Write tests that verify:
- After write, cache returns new value (not old)
- Cache key includes tenant for tenant-scoped data
- TTL behaves as expected (test with mocked clock)
- Invalidation on write actually clears cache
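A minimal version of the first and last of those tests, written against an in-memory cache-aside pair so it runs anywhere (the `CachedUsers` shape is illustrative; swap in your real cache client and repository):

```typescript
// Invalidation test sketch: after a write, the next read must return the
// new value, not the cached old one. In-memory stand-ins keep it fast.
class CachedUsers {
  private cache = new Map<string, string>();
  constructor(private db: Map<string, string>) {}
  async get(id: string): Promise<string | undefined> {
    if (this.cache.has(id)) return this.cache.get(id);
    const v = this.db.get(id);
    if (v !== undefined) this.cache.set(id, v);
    return v;
  }
  async update(id: string, name: string): Promise<void> {
    this.db.set(id, name);
    this.cache.delete(id); // the contract under test: every write invalidates
  }
}

// The test: read (warms cache), write, read again must be fresh.
async function testWriteInvalidates() {
  const users = new CachedUsers(new Map([['123', 'old']]));
  await users.get('123'); // warm the cache
  await users.update('123', 'new');
  const after = await users.get('123');
  if (after !== 'new') throw new Error(`stale read after write: ${after}`);
}
```

Run the same pattern against every write path (API, admin tool, background job) and the "forgotten invalidation" bug class becomes a failing test instead of a support ticket.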
For my app:
- The metrics I track today
- The metrics I need to add
- The dashboard / alerting plan
Output:
- The cache observability plan
- The dashboard mockup
- The alert thresholds
The biggest observability mistake: **shipping cache without metrics.** A cache that "works most of the time" is invisible until it doesn't. Hit rate, latency, and memory are the minimum; add them on day one, not month six. Without metrics, the next stale-data incident takes 4 hours instead of 4 minutes to debug.
## Quarterly Cache Review
Caches drift. Build the review.
The quarterly cache review:
1. Hit-rate audit
- Per cached endpoint: hit rate this quarter
- Drops indicate: TTL too short, invalidation bug, data shape change
- Investigate any rate <50%
2. Memory pressure
- Are we approaching Redis memory limits?
- Top 20 keys by memory consumption
- Eviction rate: any keys getting evicted before TTL?
3. Stale-data incidents
- Customer reports of stale data this quarter
- Root cause per incident
- Pattern detection (always same endpoint? always same write path?)
4. New cache opportunities
- Slow endpoints that aren''t cached yet
- New features that should be cached on launch
5. TTL review
- Are TTLs still appropriate?
- Has data update frequency changed?
6. Tenant-isolation audit
- New cache keys added this quarter
- Any missing tenant scope?
Cache decommissioning:
Some caches outlive their utility. Quarterly: which caches can we remove?
- Rarely-hit (<1% of requests)
- Underlying query is now fast (DB optimization made cache redundant)
- Data flow changed (cache now wrong layer)
Removing dead caches simplifies code + reduces stale-data risk.
Output:
- The QBR template
- The owner (eng lead)
- The decision log
The biggest review-cadence mistake: **never reviewing.** Caches added in year 1 might be stale assumptions in year 2. A cache designed for 1K req/sec might be wrong at 100K req/sec. Quarterly review keeps the cache layer aligned with current reality. Without it, you're carrying assumptions that compound into bugs.
---
## What "Done" Looks Like
A working caching strategy in 2026 has:
- 5 layers explicitly assigned roles (browser / CDN / app / distributed / DB)
- TTL per data type with empirical justification (not vibes)
- Tenant-scoped cache keys for all private data
- Invalidation wired into every write path
- Stale-while-revalidate or single-flight for hot keys
- Hit-rate / latency / memory monitoring + alerts
- Type-safe cache-key builder
- Quarterly review baked in
The hidden cost of weak caching: **either a slow product (no cache; database overloaded) or stale-data bugs (cache without invalidation discipline).** Both kill trust. The middle ground — explicit layers, tenant-scoped keys, write-time invalidation, monitored hit rates — is more work upfront but compounds into a fast, correct, debuggable system. Skip it; pay later.
## See Also
- [Performance Optimization](performance-optimization-chat.md) — broader perf context
- [Database Migrations](database-migrations-chat.md) — schema changes affect cache shape
- [Multi-Tenancy](multi-tenancy-chat.md) — tenant-isolation principles
- [Audit Logs](audit-logs-chat.md) — cache invalidation events
- [Rate Limiting & Abuse](rate-limiting-abuse-chat.md) — cache + rate-limit overlap
- [Public API](public-api-chat.md) — API caching strategy
- [Service Level Agreements](service-level-agreements-chat.md) — SLA depends on cache reliability
- [Real-Time Collaboration](real-time-collaboration-chat.md) — when NOT to cache
- [Backups & Disaster Recovery](backups-disaster-recovery-chat.md) — cache loss recovery
- [VibeReference: CDN Providers](https://www.vibereference.com/cloud-and-hosting/cdn-providers) — CDN layer
- [VibeReference: Database Providers](https://www.vibereference.com/backend-and-data/database-providers) — Redis / Postgres
- [VibeReference: Vercel Functions](https://www.vibereference.com/cloud-and-hosting/vercel-functions) — Vercel runtime cache
- [VibeReference: Error Monitoring Providers](https://www.vibereference.com/devops-and-tools/error-monitoring-providers) — observability
- [LaunchWeek: SEO Strategy](https://www.launchweek.com/2-content/seo-strategy) — TTFB / Core Web Vitals depend on cache
[⬅️ Day 6: Grow Overview](README.md)