WebSocket & Server-Sent Events (SSE) Implementation: Real-Time Connections That Don't Wake You Up at 3am
If you're building a SaaS in 2026 with any real-time feature — live notifications, presence indicators, status updates, streaming AI responses, dashboards that update without refresh — you need persistent connections. Most founders default to "we'll just use WebSocket for everything," then six months in discover that WebSockets break behind corporate proxies, your serverless host charges per-second-connected, scaling beyond 10K concurrent connections requires re-architecture, and reconnection logic is its own engineering project.
A working real-time-connections strategy answers: WebSocket vs SSE vs long-polling, how do we authenticate connections, how do we reconnect on failure, how do we scale beyond a single server, and how do we know when something's broken. Done well, real-time features feel magical and stay invisible. Done badly, you're debugging "why didn't the customer get the update?" tickets every week.
This guide is the implementation playbook for real-time connections — picking the right pattern, authentication, reconnection, scaling, observability, and cost discipline. Distinct from Real-Time Collaboration (CRDT-based multiplayer specifically).
## Pick the Right Pattern: WebSocket vs SSE vs Polling
Each pattern has different tradeoffs.
Help me pick a real-time pattern.
The four options:
**1. WebSocket (bidirectional, persistent)**
Client ←→ Server (continuous)
- Bidirectional (client can send AND receive)
- Persistent connection
- Lower overhead per message (after connection)
- Requires upgrade from HTTP
Pros:
- Bidirectional (chat, collaboration)
- Lowest per-message latency
- Industry standard for real-time
Cons:
- Connection state on server (memory)
- Doesn't work behind some proxies
- More complex to scale
- Authentication trickier
Use for:
- Chat / messaging
- Collaboration / multiplayer
- Bidirectional control (terminals, games)
- Anything truly real-time + interactive
**2. Server-Sent Events / SSE (server-to-client only)**
Client ← Server (one-way stream)
- One-way (server pushes to client)
- HTTP-based (works through proxies)
- Native browser API (`EventSource`)
- Auto-reconnects
Pros:
- Simpler than WebSocket
- HTTP-friendly (proxies, CDN)
- Auto-reconnect built in
- Easy authentication (cookies / headers)
Cons:
- One-way only (server → client)
- 6-connection-per-domain browser limit (HTTP/1.1; effectively lifted over HTTP/2)
- Less common; some tooling weaker
Use for:
- Notifications (server pushes)
- Live updates / dashboards
- AI streaming responses
- Any server-pushed updates without need to receive from client
**3. Long-polling (HTTP request that holds open)**
Client → Server (request); the server holds the request open for N seconds; returns when data is available or on timeout; the client immediately sends a new request. (A minimal client loop is sketched after this pattern's lists.)
Pros:
- Pure HTTP (works everywhere)
- Simple
- Compatible with all proxies
Cons:
- Higher latency than WS / SSE
- More overhead per message
- Server must handle long-held connections
Use for:
- Fallback when WS / SSE not supported
- Very low message rate
- Legacy compatibility
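A minimal client loop for that flow, sketched under assumptions (the `/api/poll` endpoint, `since` cursor, and response shape are all illustrative):

```typescript
// Long-polling loop: the server holds each request open until data
// arrives or a timeout, then the client immediately re-polls.
async function longPoll(onMessage: (msg: unknown) => void) {
  let since = '0';
  while (true) {
    try {
      const res = await fetch(`/api/poll?since=${since}`);
      if (res.status === 200) {
        const { messages, lastId } = await res.json();
        messages.forEach(onMessage);
        since = lastId;
      }
      // A 204 (held open, timed out with no data) just re-polls immediately
    } catch {
      await new Promise((r) => setTimeout(r, 2000)); // brief backoff on network error
    }
  }
}
```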
**4. Short polling (request every N seconds)**
Client → Server (every 5 seconds); see the sketch after this list.
Pros:
- Trivially simple
- Stateless
- Cacheable
Cons:
- Latency = polling interval
- Wasteful at scale
- Battery drain on mobile
Use for:
- Status polls (every 30-60 seconds)
- When sub-second latency not needed
- Trivial implementation worth it
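And the short-polling version (endpoint illustrative):

```typescript
// Poll a job status every 30s; trivially simple and stateless.
const timer = setInterval(async () => {
  const res = await fetch('/api/jobs/123/status');
  const { status } = await res.json();
  if (status === 'done') clearInterval(timer);
}, 30_000);
```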
**The pattern-by-use-case matrix**:
| Use case | Best pattern |
|---|---|
| Chat / messaging | WebSocket |
| Collaboration (CRDT) | WebSocket (per [real-time-collaboration](real-time-collaboration-chat.md)) |
| Live notifications | SSE |
| AI streaming response | SSE |
| Dashboard updates | SSE |
| Presence indicators | WebSocket (bidirectional heartbeat) |
| Game / terminal | WebSocket |
| Job status check | Polling |
| Webhook delivery | Long-polling fallback |
**The 90% answer for indie SaaS**:
- SSE for server-to-client (notifications, AI streaming)
- WebSocket for bidirectional (chat, collaboration)
- Polling for low-frequency status
Skip long-polling unless legacy compatibility forces it.
For my product:
- Real-time use cases inventory
- Pattern per use case
- Current implementation
Output:
1. The use-case inventory
2. The pattern choice per use case
3. The migration plan if needed
The biggest unforced error: **WebSocket for everything.** "Real-time = WebSocket" — but most "real-time" needs are server-to-client (notifications); SSE is simpler, more reliable, and easier to scale. The fix: default to SSE; use WebSocket only for truly bidirectional needs.
## SSE Implementation (the underused choice)
For most "push from server" use cases, SSE is right.
Help me implement SSE.
The basic pattern:
**Server (Node.js / Hono / Express)**:
```typescript
app.get('/api/events', authenticate, async (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no', // disable nginx buffering
  });

  const userId = req.user.id;
  const channel = `events:user:${userId}`;

  // Subscribe to Redis pub/sub
  const subscriber = redis.duplicate();
  await subscriber.subscribe(channel);
  subscriber.on('message', (_channel, message) => {
    res.write(`data: ${message}\n\n`);
  });

  // Heartbeat every 30s
  const heartbeat = setInterval(() => {
    res.write(': heartbeat\n\n');
  }, 30000);

  // Cleanup on disconnect
  req.on('close', () => {
    clearInterval(heartbeat);
    subscriber.unsubscribe();
    subscriber.quit();
  });
});
```

**Client (browser)**:

```typescript
const eventSource = new EventSource('/api/events');

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // Handle update
};

eventSource.onerror = () => {
  // EventSource auto-reconnects; just log
  console.log('SSE error; will reconnect');
};
```
The "X-Accel-Buffering: no" trick:
Nginx and other proxies buffer responses by default. SSE needs streaming.
Set X-Accel-Buffering: no to disable buffering. Without this, messages arrive in batches.
The "heartbeat" pattern:
Send a periodic comment line (`: heartbeat\n\n`) every 30s:
- Keeps connection alive through proxies
- Detects broken connections (writes fail)
- Detects client disconnect (close event)
Vercel-specific limitations:
Vercel Functions have execution time limits (300s default in 2026). Long-running SSE connections will be terminated.
Solutions:
- Reconnect from client (EventSource does this automatically)
- Use Vercel's extended runtime modes
- Or: deploy to traditional server for true long-lived connections
- Or: use platform with dedicated WebSocket support (Pusher, Ably)
The "event names" pattern:
SSE supports event types:
```
event: notification
data: {"id": "1", "text": "Hello"}

event: presence
data: {"user_id": "123", "status": "online"}
```
Client:
```typescript
eventSource.addEventListener('notification', (e) => { /* ... */ });
eventSource.addEventListener('presence', (e) => { /* ... */ });
```
Cleaner than putting type in payload.
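Server side, a named event is just an `event:` line written before `data:`. A minimal helper, sketched against the Express handler above (the `sendEvent` name and loose `res` typing are illustrative):

```typescript
// Write a named SSE event: `event:` line, then `data:`, then a blank line.
// `res` is the still-open streaming response from the /api/events handler.
function sendEvent(res: { write(chunk: string): void }, event: string, data: unknown) {
  res.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);
}

sendEvent(res, 'notification', { id: '1', text: 'Hello' });
sendEvent(res, 'presence', { user_id: '123', status: 'online' });
```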
Authentication:
EventSource sends cookies automatically. Use cookie-based auth.
For token-based: pass via query string (less secure) OR use proxy that adds auth header:
```typescript
new EventSource('/api/events?token=' + jwt);
```
For my SSE:
- Endpoints to implement
- Backend pub/sub layer
- Reconnection strategy
Output:
- The SSE implementation
- The authentication approach
- The reconnection plan
The biggest SSE mistake: **forgetting to disable proxy buffering.** Messages arrive in 10-message batches every 30 seconds; "real-time" feels broken. The fix: `X-Accel-Buffering: no` header.
## WebSocket Implementation
For bidirectional use cases, WebSocket is right.
Help me implement WebSocket.
The basic pattern:
Server (using the `ws` library on Node.js):

```typescript
import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', async (ws, req) => {
  // Authenticate from the upgrade request's query string
  const token = new URL(req.url, 'http://localhost').searchParams.get('token');
  const user = await verifyToken(token);
  if (!user) {
    ws.close(1008, 'Unauthorized');
    return;
  }
  ws.userId = user.id;

  // Subscribe to user-specific channel
  const subscriber = redis.duplicate();
  await subscriber.subscribe(`events:user:${user.id}`);
  subscriber.on('message', (_channel, message) => {
    ws.send(message);
  });

  ws.on('message', async (data) => {
    const msg = JSON.parse(data);
    // Handle client message (e.g., chat, action)
    await processMessage(user.id, msg);
  });

  // Heartbeat (ping/pong)
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });

  ws.on('close', () => {
    subscriber.unsubscribe();
    subscriber.quit();
  });
});

// Detect dead connections: terminate anything that missed the last ping
const interval = setInterval(() => {
  wss.clients.forEach((ws) => {
    if (!ws.isAlive) return ws.terminate();
    ws.isAlive = false;
    ws.ping();
  });
}, 30000);

wss.on('close', () => clearInterval(interval));
```
Client:

```typescript
class ReconnectingWebSocket {
  private ws!: WebSocket;
  private reconnectAttempts = 0;

  constructor(private url: string) {
    this.connect();
  }

  private connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      this.reconnectAttempts = 0;
    };

    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      this.handleMessage(data);
    };

    this.ws.onclose = () => {
      // Exponential backoff, capped at 30s
      const delay = Math.min(1000 * Math.pow(2, this.reconnectAttempts), 30000);
      setTimeout(() => {
        this.reconnectAttempts++;
        this.connect();
      }, delay);
    };
  }

  private handleMessage(data: unknown) {
    // Application-specific message handling
  }
}

const ws = new ReconnectingWebSocket(`wss://api.example.com/ws?token=${jwt}`);
```
The "ws library" choice:
Popular options:
ws: most-popular Node.js WebSocket librarysocket.io: higher-level; fallbacks; multi-roomuWebSockets.js: high-performance C++-bound
For most: ws (simple) or socket.io (feature-rich).
Vercel limitations:
Vercel Functions don't support persistent WebSocket connections (functions time out at 300s).
Use one of:
- Vercel + external WebSocket provider (Pusher, Ably, PartyKit)
- Self-hosted WebSocket server (Soketi, Centrifugo)
- Traditional server for WebSocket (separate from Vercel app)
The room / channel pattern:
Group connections by room:
```typescript
const rooms = new Map<string, Set<WebSocket>>();

ws.on('message', (data) => {
  const { type, room } = JSON.parse(data);
  if (type === 'join') {
    if (!rooms.has(room)) rooms.set(room, new Set());
    rooms.get(room)!.add(ws);
    ws.room = room;
  }
});

// Remove the socket from its room on 'close',
// or the Set leaks dead connections.

function broadcast(room: string, message: any) {
  rooms.get(room)?.forEach(ws => {
    ws.send(JSON.stringify(message));
  });
}
```
Or: use Redis pub/sub for distributed broadcasting.
The reconnection state:
After reconnect:
- Re-authenticate
- Re-subscribe to channels
- Sync missed messages (request "since last seen ID")
Without state-aware reconnection: client misses messages during disconnect.
For my WebSocket:
- Server-side library
- Auth approach
- Reconnection state-restore
Output:
- The WebSocket implementation
- The Vercel-friendly architecture
- The reconnection strategy
The biggest WebSocket mistake: **deploying WebSocket on serverless.** Vercel function timeout terminates connections; client constantly reconnects; awful UX. The fix: external WebSocket provider OR traditional server for the WebSocket layer.
## Authentication for Persistent Connections
Auth is trickier for WS / SSE than for HTTP.
Help me handle authentication.
The challenges:
- WS upgrade is HTTP; can use headers
- After upgrade: no headers per message
- SSE: native EventSource doesn't support custom headers
- Token expiration during long-lived connection
SSE auth options:
Option 1: Cookie-based (best for browsers)
```typescript
// Cookie set on auth
res.cookie('session', sessionToken, { httpOnly: true, secure: true });

// EventSource sends the cookie automatically
new EventSource('/api/events');
```
Option 2: Query-string token
```typescript
new EventSource(`/api/events?token=${jwt}`);

// Server
const token = req.query.token;
const user = verifyJWT(token);
```
Less secure (logged in URL); but only option for non-browser clients.
Option 3: Polyfill EventSource with headers
Libraries like eventsource-polyfill add header support.
WebSocket auth options:
Option 1: Subprotocol-based
```typescript
// Client: smuggle the token in as a subprotocol
const ws = new WebSocket('wss://api.example.com', ['v1', `token-${jwt}`]);

// Server: read it back from the upgrade request
const token = req.headers['sec-websocket-protocol']
  .split(',')
  .find(p => p.trim().startsWith('token-'));
```
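Note that the handshake only succeeds if the server selects and echoes back one of the offered subprotocols. With the `ws` library that's the `handleProtocols` option (in ws 8.x it receives a `Set` of offered protocols); a sketch:

```typescript
import { WebSocketServer } from 'ws';

// Echo back the version subprotocol so the browser accepts the handshake;
// the token-* entry is consumed server-side for auth only.
const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: (protocols) => (protocols.has('v1') ? 'v1' : false),
});
```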
Option 2: Query-string token
```typescript
const ws = new WebSocket(`wss://api.example.com?token=${jwt}`);
```
Option 3: First-message auth
```typescript
// On connection: client sends auth message first
ws.onopen = () => {
  ws.send(JSON.stringify({ type: 'auth', token: jwt }));
};

// Server: refuse to process anything until authenticated
// (consider also closing connections that never auth within a few seconds)
ws.on('message', async (data) => {
  if (!ws.authenticated) {
    const { type, token } = JSON.parse(data);
    if (type !== 'auth') {
      ws.close(1008, 'Auth required');
      return;
    }
    const user = await verifyToken(token);
    if (!user) {
      ws.close(1008, 'Bad token');
      return;
    }
    ws.userId = user.id;
    ws.authenticated = true;
    return;
  }
  // Handle other messages
});
```
Token expiration:
JWTs expire (often after an hour), and a long-lived connection outlives its token.
Options:
- Server checks expiration on each message; closes if expired
- Client refreshes token; sends new token via message
- Server enforces max connection duration (force reconnect)
Best: token refresh + reconnect cycle.
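A sketch of the refresh half of that cycle (the `auth:refresh` message type, refresh endpoint, and response shape are illustrative, not a standard protocol):

```typescript
// Client: refresh the JWT shortly before expiry and push the new token
// over the open socket so the server can re-validate without a reconnect.
function scheduleTokenRefresh(ws: WebSocket, expiresInMs: number) {
  setTimeout(async () => {
    const res = await fetch('/api/auth/refresh', { method: 'POST' });
    const { token, expiresIn } = await res.json(); // expiresIn assumed in seconds
    ws.send(JSON.stringify({ type: 'auth:refresh', token }));
    scheduleTokenRefresh(ws, expiresIn * 1000);
  }, Math.max(expiresInMs - 60_000, 0)); // refresh ~1 minute before expiry
}
```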
Tenant isolation:
For multi-tenant SaaS (per multi-tenancy-chat):
```typescript
// Subscribe to tenant-scoped channel
const channel = `events:tenant:${user.tenantId}:user:${user.id}`;
```
NEVER:
- Allow client to specify channel name (could subscribe to other tenant)
- Skip tenant isolation in real-time layer
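A sketch of enforcing this server-side: derive allowed channels from the verified session and never trust a client-supplied channel name (helper names are illustrative):

```typescript
// Channels are computed from the authenticated user, never taken raw
// from the client. Channel naming mirrors the examples above.
function channelsFor(user: { id: string; tenantId: string }): string[] {
  return [
    `events:tenant:${user.tenantId}:user:${user.id}`,
    `events:tenant:${user.tenantId}:broadcast`,
  ];
}

function canSubscribe(user: { id: string; tenantId: string }, requested: string): boolean {
  // Reject any channel outside the user's own tenant scope
  return channelsFor(user).includes(requested);
}
```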
For my auth:
- Token strategy
- Refresh approach
- Tenant isolation
Output:
- The auth flow
- The token refresh
- The tenant-isolation enforcement
The biggest auth mistake: **token in URL logged in proxy logs.** JWT in `?token=...` ends up in nginx access logs forever. The fix: cookie-based for browser; subprotocol or first-message for clients that need it.
## Scaling Real-Time Connections
Around 10K concurrent connections is where a single server breaks.
Help me scale real-time.
The single-server limit:
A Node.js process can handle:
- ~10K concurrent connections (memory + CPU)
- ~50K with optimizations
- 100K+ requires specialized stack (Go, Erlang, Rust)
Beyond single-server: distributed.
Horizontal scaling pattern:
```
Client ─┬─ Server 1 ─┐
        ├─ Server 2 ─┼─ Redis (pub/sub)
        └─ Server 3 ─┘
```
- Multiple servers
- Redis (or NATS / Kafka) for pub/sub
- Client connects to any server
- Server publishes; all servers receive; deliver to their connected clients
Implementation:
```typescript
// Server-side: wildcard subscriptions require PSUBSCRIBE (plain SUBSCRIBE
// does not expand patterns); pattern messages arrive as 'pmessage'
const subscriber = redis.duplicate();
await subscriber.psubscribe('events:*');
subscriber.on('pmessage', (_pattern, channel, message) => {
  // Find connected clients for this channel
  const clients = getLocalClientsForChannel(channel);
  clients.forEach(ws => ws.send(message));
});

// Publishing
async function notifyUser(userId, message) {
  await redis.publish(`events:user:${userId}`, JSON.stringify(message));
  // All servers receive; only the server holding this user's connection delivers
}
```
Sticky sessions:
If you're using a load balancer, sticky sessions help (the client always reconnects to the same server). Less critical with Redis pub/sub, but they improve performance.
Connection caps per server:
Set max connections per instance:
- Node.js: tune Node memory; ~5-10K typical
- Beyond: scale out
Hosted alternatives (skip the scaling problem):
- Pusher — managed real-time channels
- Ably — modern alternative
- PartyKit (Cloudflare) — edge real-time
- Soketi — self-hosted Pusher-compatible
- Centrifugo — modern self-hosted
Pros: zero scaling work.
Cons: per-message or per-connection cost.
For most indie SaaS: hosted is right until volume justifies self-host.
The "channels" architecture:
Organize messages by channel:
```
events:user:{user_id}       → personal notifications
events:tenant:{tenant_id}   → tenant-wide events
events:room:{room_id}       → chat rooms
events:dashboard:{dash_id}  → dashboard updates
```
Each connection subscribes to relevant channels.
Avoid the "broadcast everything" anti-pattern:
Don't broadcast every event to every connection; filter server-side based on each connection's subscriptions.
For my system:
- Current connection count
- Scaling needs
- Hosted vs self-host
Output:
- The architecture
- The scaling plan
- The hosted alternative
The biggest scaling mistake: **vertical scaling forever.** "Just bigger server" works to ~10K connections; breaks beyond. The fix: horizontal scaling + Redis pub/sub OR hosted real-time provider. Plan from day one.
## Reconnection: The 80% of Real-Time Code
Connections drop. Plan for it.
Help me handle reconnection.
The patterns:
Auto-reconnect with backoff:
```typescript
class ReconnectingClient {
  private ws!: WebSocket;
  private delays = [1000, 2000, 5000, 10000, 30000];
  private attempt = 0;

  constructor(private url: string) {
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => { this.attempt = 0; };
    this.ws.onclose = () => {
      // Walk up the delay ladder, capping at the last entry
      const delay = this.delays[Math.min(this.attempt, this.delays.length - 1)];
      setTimeout(() => {
        this.attempt++;
        this.connect();
      }, delay);
    };
  }
}
```
Exponential backoff prevents thundering herd.
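Adding random jitter helps further: if a server restart drops thousands of clients at once, pure exponential backoff reconnects them in synchronized waves. A sketch of full-jitter delays (function name illustrative):

```typescript
// Full-jitter backoff: random delay up to the exponential cap, so mass
// disconnects don't produce a synchronized reconnect storm.
function reconnectDelay(attempt: number, baseMs = 1000, capMs = 30_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp;
}
```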
Resubscribe after reconnect:
```typescript
this.ws.onopen = () => {
  this.attempt = 0;
  // Restore subscriptions
  for (const channel of this.subscribedChannels) {
    this.ws.send(JSON.stringify({ type: 'subscribe', channel }));
  }
};
```
Catch up missed messages:
```typescript
this.ws.onopen = () => {
  // Send last-seen-message-id
  this.ws.send(JSON.stringify({
    type: 'sync',
    last_id: localStorage.getItem('last_message_id'),
  }));
};

this.ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  localStorage.setItem('last_message_id', msg.id);
  // Process
};
```
Server tracks recent messages (Redis stream or DB); replays from last-id on reconnect.
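A sketch of the server half using a Redis Stream per channel (ioredis `xadd`/`xrange`; key naming and message shape are assumptions, and the exclusive `(`-prefixed range needs Redis 6.2+):

```typescript
// Persist each published message to a stream so reconnecting clients
// can replay what they missed during the disconnect.
async function publishWithHistory(channel: string, payload: object) {
  // XADD returns the auto-generated stream ID; reuse it as the message id
  const id = await redis.xadd(`stream:${channel}`, '*', 'data', JSON.stringify(payload));
  await redis.publish(channel, JSON.stringify({ id, ...payload }));
}

// On a 'sync' message, replay everything after the client's last-seen ID
async function replaySince(ws: WebSocket, channel: string, lastId: string) {
  const entries = await redis.xrange(`stream:${channel}`, `(${lastId}`, '+');
  for (const [id, fields] of entries) {
    ws.send(JSON.stringify({ id, ...JSON.parse(fields[1]) })); // fields = ['data', json]
  }
}
```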
The "online / offline" UX:
Show users connection status:
```tsx
const [connected, setConnected] = useState(false);

useEffect(() => {
  ws.onopen = () => setConnected(true);
  ws.onclose = () => setConnected(false);
}, []);

return (
  <div>
    {!connected && <Banner>Reconnecting...</Banner>}
    {/* ... */}
  </div>
);
```
Better than silent failure.
Network change handling:
When user switches networks (WiFi → cellular):
- The browser fires `online` / `offline` events
- Force a reconnect on `online`

```typescript
// Inside the reconnecting client: reconnect as soon as the network returns
window.addEventListener('online', () => {
  if (this.ws.readyState !== WebSocket.OPEN) {
    this.connect();
  }
});
```
Heartbeat to detect zombie connections:
```typescript
// lastMessageReceived is updated on every incoming message or heartbeat
setInterval(() => {
  if (lastMessageReceived < Date.now() - 60000) {
    this.ws.close();
    this.connect();
  }
}, 30000);
```

Without it, zombie connections stay "open" silently for hours.
For my client:
- Reconnection strategy
- Resubscribe + catch-up
- Connection-status UX
Output:
- The reconnection client
- The catch-up strategy
- The UX patterns
The biggest reconnection mistake: **no reconnection logic at all.** Connection drops; never reconnects; user thinks app is broken; refreshes. The fix: auto-reconnect with backoff + resubscribe + missed-message catchup.
## Observability for Real-Time
Real-time bugs are harder to debug. Monitor.
Help me observe real-time connections.
The metrics:
Connection metrics:
- Active connection count
- Connections per second (rate)
- Average connection duration
- Disconnect rate
- Reconnect rate per client
Message metrics:
- Messages sent per second (server → client)
- Messages received per second (client → server)
- Average message latency
- Message size distribution
Error metrics:
- Connection errors / failures
- Authentication failures
- Message parsing errors
- Send failures
Per-channel metrics:
- Subscribers per channel
- Messages per channel
Tools:
- Datadog real-time monitoring
- Pusher / Ably built-in dashboards
- PostHog event tracking
- Custom dashboard via Redis stats (sketched below)
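A minimal sketch of custom connection metrics using prom-client (metric names are illustrative; assumes the `wss` server from earlier):

```typescript
import client from 'prom-client';

// Gauge for currently-open connections; counter for message throughput
const activeConnections = new client.Gauge({
  name: 'ws_active_connections',
  help: 'Currently open WebSocket connections',
});
const messages = new client.Counter({
  name: 'ws_messages_total',
  help: 'Messages handled, labeled by direction',
  labelNames: ['direction'],
});

wss.on('connection', (ws) => {
  activeConnections.inc();
  ws.on('message', () => messages.inc({ direction: 'in' }));
  ws.on('close', () => activeConnections.dec());
});
```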
The "ping-pong heartbeat" log:
Log heartbeat success/failure (see the sketch after this list) to detect:
- Network issues
- Server overload
- Client browser tab backgrounded (browsers throttle)
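A sketch of measuring ping → pong round-trip time per connection (a timing-aware variant of the heartbeat loop shown earlier; logger fields illustrative):

```typescript
// Record when each ping went out; slow or missing pongs suggest server
// overload, network trouble, or a throttled background tab.
const pingSentAt = new WeakMap<WebSocket, number>();

setInterval(() => {
  wss.clients.forEach((ws) => {
    pingSentAt.set(ws, Date.now());
    ws.ping();
  });
}, 30_000);

wss.on('connection', (ws) => {
  ws.on('pong', () => {
    const sent = pingSentAt.get(ws);
    if (sent) console.log('ws.pong', { rtt_ms: Date.now() - sent });
  });
});
```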
The "disconnect reason" tracking:
```typescript
ws.on('close', (code, reason) => {
  log.info('ws.close', { user_id, code, reason: reason.toString() });
});
```
Standard codes:
- 1000: normal close
- 1001: going away (browser closed)
- 1006: abnormal close (network)
- 1008: policy violation (auth fail)
- 1011: server error
Per-connection-cost tracking:
If using hosted (Pusher / Ably):
- Per-message cost
- Per-channel cost
- Concurrent connection cost
Track to avoid surprise bills.
For my observability:
- Metrics tracked
- Tools used
- Alert thresholds
Output:
- The metrics dashboard
- The alerting
- The cost tracking
The biggest observability mistake: **no real-time-specific metrics.** Connection issues stay invisible; you end up debugging "why didn't the customer get the update?" with no data. The fix: connection / message / error metrics; per-channel breakdown.
## Avoid Common Pitfalls
Recognizable failure patterns.
The real-time mistake checklist.
Mistake 1: WebSocket for everything
- SSE simpler for server-push
- Fix: SSE for one-way; WS for bidirectional
Mistake 2: WebSocket on serverless
- Function timeout kills connections
- Fix: external provider or traditional server
Mistake 3: No reconnection logic
- Drop = silent failure
- Fix: auto-reconnect with backoff
Mistake 4: No proxy-buffering disable
- SSE messages batched
- Fix: `X-Accel-Buffering: no`
Mistake 5: No heartbeat
- Zombie connections
- Fix: ping/pong every 30s
Mistake 6: No tenant isolation
- Cross-tenant data leak
- Fix: tenant-scoped channels
Mistake 7: Token in URL logged
- Security risk
- Fix: cookie or subprotocol
Mistake 8: Single-server scaling
- Breaks at 10K connections
- Fix: Redis pub/sub + horizontal
Mistake 9: No catch-up on reconnect
- Lost messages
- Fix: resync from last-id
Mistake 10: No observability
- Bugs invisible
- Fix: metrics dashboard
The quality checklist:
- Pattern matches use case (SSE / WS / polling)
- Authentication mechanism appropriate
- Reconnection with backoff
- Heartbeat / dead-connection detection
- Tenant isolation in channels
- Token refresh strategy
- Pub/sub for horizontal scaling (or hosted)
- Catch-up on reconnect
- Connection-status UX
- Observability dashboard
For my system:
- Audit
- Top 3 fixes
Output:
- Audit results
- Top 3 fixes
- The "v2 real-time" plan
The single most-common mistake: **assuming real-time is the same as request-response.** Persistent connections have completely different operational characteristics: scaling, auth, reconnection, observability. The fix: treat real-time as its own discipline; learn the patterns; plan from day one.
---
## What "Done" Looks Like
A working real-time-connection system in 2026 has:
- Pattern matched to use case (SSE for one-way; WS for bidirectional)
- Authentication appropriate (cookie / subprotocol / first-message)
- Reconnection with exponential backoff + catch-up
- Heartbeat / dead-connection detection
- Tenant-isolated channels
- Token refresh during long-lived connections
- Horizontal scaling via pub/sub OR hosted provider
- Connection-status UX (online / offline / reconnecting)
- Observability metrics (connections / messages / errors)
- Cost discipline (per-message tracking if hosted)
The hidden cost of weak real-time: **silent failures that erode trust.** The customer doesn't see updates; assumes the feature is broken; refreshes; eventually stops trusting "real-time" in the product. Real-time done right is invisible; done wrong, it feels broken. Plan from day one; treat it as its own discipline; the magic of "live updates" pays off in user delight.
## See Also
- [Real-Time Collaboration](real-time-collaboration-chat.md) — CRDT-based multiplayer
- [In-App Notifications](in-app-notifications-chat.md) — common SSE use case
- [Outbound Webhooks](outbound-webhooks-chat.md) — adjacent push pattern
- [Inbound Webhooks](inbound-webhooks-chat.md) — adjacent
- [AI Features Implementation](ai-features-implementation-chat.md) — SSE for AI streaming
- [Caching Strategies](caching-strategies-chat.md) — adjacent
- [Database Connection Pooling](database-connection-pooling-chat.md) — adjacent connection-management
- [Service Level Agreements](service-level-agreements-chat.md) — uptime depends on real-time
- [Performance Optimization](performance-optimization-chat.md) — real-time perf
- [Multi-Tenancy](multi-tenancy-chat.md) — tenant-isolation in channels
- [Audit Logs](audit-logs-chat.md) — adjacent
- [VibeReference: Vercel Functions](https://www.vibereference.com/cloud-and-hosting/vercel-functions) — Vercel runtime
- [VibeReference: Vercel Queues](https://www.vibereference.com/cloud-and-hosting/vercel-queues) — adjacent eventing
- [VibeReference: Background Jobs Providers](https://www.vibereference.com/backend-and-data/background-jobs-providers) — adjacent
- [VibeReference: Database Providers](https://www.vibereference.com/backend-and-data/database-providers) — Redis pub/sub