# Outbound Webhooks: Send Events to Your Customers' Systems Without Losing Data

## Outbound Webhook Strategy for Your New SaaS

Goal: Ship outbound webhooks that customers actually trust — every event delivered exactly once (or knowably-once), signed, retried with backoff, replayable from a UI, and observable end-to-end. Avoid the failure modes where founders fire-and-forget HTTP POSTs from inside a request handler (every flaky customer endpoint takes your app down with it), skip signing (customers' security teams reject the integration), or never build retry / replay (the first 4xx from a customer endpoint loses events forever).

Process: Follow this chat pattern with your AI coding tool such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: Webhook sender pattern shipped in 1-2 days. Subscription management UI + signing + retry in week 1. Replay UI + customer-facing logs in week 2. Quarterly health review baked in from launch onward.


## Why Most Founder Outbound Webhooks Are Broken

Three failure modes hit founders the same way:

  • Fire-and-forget from the request handler. Founder writes `await fetch(customer.webhookUrl, …)` inside the API endpoint that triggers the event. The customer's endpoint returns a 504 after 28 seconds, and your API endpoint times out with it. Worse: the customer's endpoint is fine but slow, and your app's request queue fills up with handlers waiting on customer infrastructure. One slow customer can take your whole app down.
  • No signing. Customer's security team asks "how do we verify these came from you?" You shrug. They reject the integration. Or worse — they accept it, an attacker forges events, and the customer's system creates fraudulent records. Your name is on the post-mortem.
  • No retry, no replay, no log. Customer's endpoint returns 503 once during deploy; the event is lost. Customer asks "where's the event for invoice X?" You can't answer because there's no record. Or you have a record but no UI to replay; you spend 2 hours writing one-off SQL during an outage.

The version that works is structured: enqueue events for delivery, sign them, deliver via a worker pool with exponential-backoff retry, log every attempt, expose a customer-facing log, and provide internal replay tooling for incident recovery.

This guide assumes you have already done Public API (outbound webhooks pair with the API surface), have shipped Inbound Webhooks (the symmetric receive side), have considered Background Jobs Providers (the queue layer this guide depends on), and have shipped Audit Logs (webhook events feed audit).


---

## 1. Decide Which Events Customers Can Subscribe To

Designing the event catalog comes first. Get this wrong and you'll regret it for years.

Help me design the outbound-webhook event catalog for [your product] at [your-domain.com].

The events I'm considering:

**Lifecycle events** (most common starting set):
- `[resource].created` — a new [resource] was created
- `[resource].updated` — a [resource] was updated
- `[resource].deleted` — a [resource] was deleted

**Domain-specific events** (the valuable ones):
- For payments: `payment.succeeded`, `payment.failed`, `refund.created`
- For onboarding: `user.signed_up`, `user.activated`, `user.invited`
- For your domain: [list 5-10 events specific to your product]

**Critical design questions**:

1. **Granularity**: should `subscription.updated` fire on every column change, or should I split into `subscription.cancelled`, `subscription.upgraded`, etc.? Granular events are more useful but more work to implement and maintain.
2. **Payload shape**: do I send the full resource, just the diff, or just the ID and let the customer fetch the resource via API? Full resources are easiest for customers but expose internal fields; ID-only forces customers to make extra API calls.
3. **Versioning**: how do I evolve the payload schema without breaking customers? Per-event versioning (`v1`, `v2`) or per-API versioning?
4. **Internal vs customer-facing**: do I send every internal event, or only the ones that have a customer-meaningful interpretation?

**Naming conventions** (pick one and stick to it):
- `resource.action` (Stripe-style: `customer.subscription.updated`)
- `resource_action` (snake-case: `subscription_updated`)
- `RESOURCE_ACTION` (constants: `SUBSCRIPTION_UPDATED`)

**Anti-patterns to avoid**:
- Mixing styles across events
- Renaming events post-launch (this is a breaking change)
- Sending events for internal-only state changes that customers can't interpret
- Events that fire dozens of times per minute (creates noise; consider batching or aggregating)

**Output**:
1. The full event catalog as a table: name, payload shape, when it fires, frequency expectation, criticality
2. The naming convention you're locking in
3. The versioning strategy
4. The first 5 events to ship in v1

The single most undervalued upfront work: drawing the line between what's a customer-facing event and what's internal noise. Most teams ship too many events early, then can't deprecate without breaking customers.
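Whatever catalog you land on, a versioned envelope keeps schema evolution tractable: the event type, schema version, and payload are separate fields, so you can evolve one event's payload without touching the others. A sketch — the field names here are illustrative, not a standard:

```typescript
import { randomUUID } from 'node:crypto'

// One common envelope shape: id for customer-side dedupe, type from the
// catalog, an explicit schema version, and the payload under `data`.
interface WebhookEvent<T = unknown> {
  id: string         // unique per event (customers dedupe on this)
  type: string       // e.g. 'invoice.created'
  apiVersion: string // payload schema version, e.g. '2026-01-01'
  createdAt: string  // ISO 8601 timestamp
  data: T            // the event payload itself
}

function makeEvent<T>(type: string, data: T, apiVersion = '2026-01-01'): WebhookEvent<T> {
  return {
    id: 'evt_' + randomUUID().replace(/-/g, ''),
    type,
    apiVersion,
    createdAt: new Date().toISOString(),
    data,
  }
}
```

Putting the payload under `data` (rather than at the top level) means you can later add envelope-level fields without colliding with payload fields.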


---

## 2. Sign Every Event

Skipping signing is a security hole AND a sales blocker. Every customer security team asks about it.

Help me implement event signing for outbound webhooks.

**The pattern (Stripe-compatible HMAC-SHA256)**:

For each event delivery:
1. Compose the signed payload: `<timestamp>.<json_body>`
2. Compute HMAC-SHA256 using the customer's webhook secret
3. Set headers:
   - `[YourProduct]-Signature: t=<timestamp>,v1=<hmac>`
   - `[YourProduct]-Webhook-Id: <unique_event_delivery_id>`
   - `[YourProduct]-Event-Type: <event_type>`
4. POST to the customer's endpoint URL
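On the sending side, composing those headers might look like the sketch below. The `Acme-` header prefix is a stand-in for your product name, and passing an array of secrets is how the rotation window (rule 3 below) works — during rotation you sign with both old and new secrets:

```typescript
import { createHmac } from 'node:crypto'

// Sign a delivery body with one or more secrets. More than one secret is
// passed during a rotation window so the customer can verify with either
// the old or the new secret.
function signDelivery(body: string, secrets: string[], now = Date.now()): Record<string, string> {
  const timestamp = Math.floor(now / 1000)
  const signatures = secrets.map(
    (secret) =>
      'v1=' + createHmac('sha256', secret).update(`${timestamp}.${body}`).digest('hex'),
  )
  return {
    'Acme-Signature': `t=${timestamp},${signatures.join(',')}`,
    'Content-Type': 'application/json',
  }
}
```

Note that the HMAC covers `<timestamp>.<body>`, exactly what the customer's verifier recomputes — if the two sides disagree on this string by even one byte, every signature fails.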

**Customer's verification code (publish in your docs)**:
```ts
import crypto from 'crypto'

function verifyWebhook(rawBody: string, signatureHeader: string, secret: string) {
  const parts = signatureHeader.split(',').reduce((acc, p) => {
    const [k, v] = p.split('=')
    acc[k] = v
    return acc
  }, {} as Record<string, string>)

  const timestamp = parts.t
  const signature = parts.v1
  if (!timestamp || !signature) throw new Error('Malformed signature header')

  // Replay protection: reject if older than 5 minutes
  const age = Date.now() / 1000 - parseInt(timestamp, 10)
  if (age > 300) throw new Error('Replay attack: timestamp too old')

  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex')

  // timingSafeEqual throws if the buffers differ in length, so check first
  const expectedBuf = Buffer.from(expected)
  const signatureBuf = Buffer.from(signature)
  if (
    expectedBuf.length !== signatureBuf.length ||
    !crypto.timingSafeEqual(expectedBuf, signatureBuf)
  ) {
    throw new Error('Invalid signature')
  }
}
```

Critical implementation rules:

  1. Per-customer (or per-endpoint) secrets. Never share a single secret across customers. Each subscription gets its own secret, generated at subscription creation, retrievable once.
  2. Show the secret once in the UI, but keep the raw value server-side. Unlike API keys, you cannot store only a hash — the delivery worker needs the raw secret to sign every event. Keep it encrypted in a secrets store the worker can decrypt at delivery time, and never display it again after creation or rotation.
  3. Support secret rotation. Customer must be able to roll a secret. During rotation, send each event signed by both old and new secrets (for a rotation window) so they can switch verification without missing events.
  4. Include the timestamp inside the signed payload. Replay attacks are trivial without it.
  5. Use timing-safe comparison. Document this in your customer docs too.
  6. Document signing in customer-facing docs. Include sample verification code in TypeScript, Python, Ruby, Go, PHP — the languages your customers use.

Don't:

  • Use HTTP basic auth alone
  • Use the same secret for staging and production
  • Skip the timestamp ("we'll just trust the request")
  • Ship without docs for verification — security teams will reject the integration

Output:

  1. The signing code in [your language]
  2. The secret-generation flow at subscription creation
  3. The rotation procedure (UI + API)
  4. The customer-facing verification docs in 3 languages

Three principles:

- **Sign every event.** Non-negotiable. Customers will ask. Their security teams will block deals over this.
- **Per-customer secrets, never shared.** Compromise of one customer doesn't compromise others.
- **Document verification in their language.** A customer who can't verify in 5 minutes won't integrate. Sample code in TypeScript / Python / Ruby is table stakes.

---

## 3. Send Asynchronously, Never Inline

The most consequential pattern. Decouples your app from customer infrastructure.

Help me design the enqueue-then-deliver pattern for outbound webhooks.

The flow:

Phase 1: Enqueue (inside your request handler, <50ms target)

  • The business event happens (subscription created, payment succeeded, etc.)
  • Look up active subscriptions for the customer/event-type
  • For each subscription, INSERT a row into webhook_deliveries:
    CREATE TABLE webhook_deliveries (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      subscription_id UUID NOT NULL REFERENCES webhook_subscriptions(id),
      event_type TEXT NOT NULL,
      event_payload JSONB NOT NULL,
      status TEXT NOT NULL DEFAULT 'pending',  -- pending / delivering / delivered / failed
      attempt_count INT NOT NULL DEFAULT 0,
      next_attempt_at TIMESTAMP NOT NULL DEFAULT NOW(),
      last_attempt_at TIMESTAMP,
      last_response_status INT,
      last_response_body TEXT,
      last_response_headers JSONB,
      last_error TEXT,
      delivered_at TIMESTAMP,
      created_at TIMESTAMP NOT NULL DEFAULT NOW()
    );
    CREATE INDEX idx_pending_deliveries ON webhook_deliveries(next_attempt_at)
      WHERE status IN ('pending', 'failed');
    
  • Enqueue a background job per delivery (per Background Jobs Providers)
  • Return from the request handler immediately
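The subscription fan-out in Phase 1 is worth keeping as a pure function — one event in, one delivery row per matching active subscription out — so it's trivially testable and the caller owns the transaction. A sketch (the `'*'` wildcard for "all events" is an assumption, matching the "all events" option in section 7):

```typescript
interface Subscription {
  id: string
  eventTypes: string[] // subscribed types, or ['*'] for all events
  active: boolean
}

interface DeliveryRow {
  subscriptionId: string
  eventType: string
  eventPayload: unknown
  status: 'pending'
}

// Fan one business event out to every matching active subscription.
// The caller INSERTs these rows and enqueues one job per row, inside
// the same transaction that committed the business event.
function buildDeliveryRows(eventType: string, payload: unknown, subs: Subscription[]): DeliveryRow[] {
  return subs
    .filter((s) => s.active && (s.eventTypes.includes(eventType) || s.eventTypes.includes('*')))
    .map((s) => ({ subscriptionId: s.id, eventType, eventPayload: payload, status: 'pending' as const }))
}
```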

Phase 2: Deliver (background worker)

  • Worker picks up the delivery
  • Marks it delivering
  • Composes the signed request and POSTs to the customer's endpoint
  • On 2xx: marks delivered, sets delivered_at
  • On 4xx (other than 429): marks failed permanently — customer's endpoint is misconfigured; don't retry forever
  • On 5xx, 429, network error, timeout: increments attempt_count, schedules retry with exponential backoff
  • Captures response status, body (first 4KB), headers, and time-to-respond
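The branch logic in Phase 2 reduces to one small classifier, which might be sketched like this (3xx is treated as permanently failed, consistent with the retry rules in section 4; `undefined` stands for a network error or timeout, where there is no status code at all):

```typescript
type Outcome = 'delivered' | 'failed' | 'retry'

// Map a delivery attempt's HTTP result to the next delivery state.
function classifyAttempt(status: number | undefined): Outcome {
  if (status === undefined) return 'retry'            // network error or timeout
  if (status >= 200 && status < 300) return 'delivered'
  if (status === 429 || status >= 500) return 'retry' // transient; honor Retry-After on 429
  return 'failed'                                     // 3xx and other 4xx: endpoint misconfigured
}
```

Keeping this pure means the worker's side effects (marking rows, scheduling retries) can be tested separately from the decision itself.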

Why this pattern matters:

  • Customer endpoint flakiness doesn't affect your app. Their 30-second timeout never blocks your request handler.
  • Retryable: every delivery is in the table; retries pick from next_attempt_at < NOW() with exponential backoff.
  • Replayable: every event is in the table; manual replay is `UPDATE webhook_deliveries SET status='pending', next_attempt_at=NOW() WHERE id=?`.
  • Observable: dashboards show delivery rate, latency p50/p95/p99, failure rate per customer.

Critical rules:

  • Never call fetch() to a customer endpoint inside the request handler
  • Never trust that the queue is "fast enough" — the source request must complete without it
  • Never retry forever; cap retries (typical: 24 attempts over 3 days, exponential backoff)
  • Always persist the response — it's the only way to debug customer-side issues

Output:

  1. The webhook_deliveries schema migration
  2. The Phase 1 (enqueue) helper code
  3. The Phase 2 (delivery worker) code
  4. The retry / dead-letter policy
  5. Dashboard queries for webhook health per customer

The single most important insight: **the customer's endpoint is part of their infrastructure, not yours.** Treat every delivery as an external HTTP call that can fail in arbitrary ways. Never let their failures cascade into your request path.

---

## 4. Retry With Exponential Backoff and a Hard Cap

Retry is what makes webhooks reliable. Without it, every transient blip loses an event.

Design the retry strategy.

Retry schedule (Stripe-compatible default):

| Attempt | Delay from previous |
| --- | --- |
| 1 | immediate |
| 2 | 5 minutes |
| 3 | 30 minutes |
| 4 | 2 hours |
| 5 | 5 hours |
| 6 | 10 hours |
| 7 | 1 day |
| … | up to attempt 24 over 3 days |
| Final | mark permanently failed |
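A delay function approximating this schedule might look like the sketch below. The exact constants are illustrative (the later-attempt spacing keeps the total window in the roughly three-to-four-day range) — publish whatever schedule you actually implement:

```typescript
// Delay in seconds to wait before attempt `n`; null once the hard cap
// is reached, meaning the delivery should be marked permanently failed.
const DELAYS = [0, 300, 1800, 7200, 18000, 36000, 86400] // attempts 1-7, per the table
const MAX_ATTEMPTS = 24
const LATER_DELAY = 10800 // ~3h between attempts 8-24

function delayBeforeAttempt(n: number): number | null {
  if (n > MAX_ATTEMPTS) return null
  return n <= DELAYS.length ? DELAYS[n - 1] : LATER_DELAY
}
```

The worker uses the return value to set `next_attempt_at`; a `null` moves the row to its terminal failed state and triggers the dead-letter notification.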

Retry triggers (retry on these):

  • 5xx response status
  • 429 response status (respect Retry-After header if present)
  • Network error (connection refused, DNS failure, TCP reset)
  • Timeout (set to 10-30 seconds; never longer)

Don't retry on (mark failed permanently):

  • 2xx (success — done)
  • 3xx (redirects — customer's endpoint moved; require them to update)
  • 4xx other than 429 (customer's endpoint is misconfigured; retrying won't help)

Implementation:

  • Use a job queue with delayed jobs (BullMQ, Inngest, Temporal, Trigger.dev)
  • Or use SQL polling: `SELECT … FROM webhook_deliveries WHERE next_attempt_at < NOW() AND status IN ('pending', 'failed') LIMIT 100 FOR UPDATE SKIP LOCKED`
  • Cap concurrent retries per customer (one slow customer shouldn't starve others)

Critical rules:

  1. Per-customer concurrency limits. Otherwise one customer's slow endpoint blocks all workers.
  2. Respect Retry-After. Customer's rate limiter is telling you something. Honor it.
  3. Hard cap on attempts. 24 attempts over 3 days is typical. Past that, it's noise; mark dead.
  4. Notify customer when delivery permanently fails. Email or in-app: "We couldn't deliver event X to your endpoint. It's been retried for 3 days. Check your endpoint and replay from the dashboard."
  5. Don't retry forever even on 5xx. A customer's broken endpoint shouldn't fill your queue indefinitely.

Per-customer circuit breaker (advanced, recommended):

  • If a customer's endpoint has returned 5xx for >50% of last 100 requests, pause new deliveries for 5 minutes
  • This prevents one customer's outage from creating a backlog that drowns all workers
  • Resume gradually after the circuit half-opens
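A minimal version of that breaker, tracked per endpoint, might look like this — a sketch only (a production breaker would also probe gradually in the half-open state rather than resuming all traffic at once):

```typescript
// Rolling-window circuit breaker for one customer endpoint. Opens when
// the failure rate over the last `windowSize` results exceeds `threshold`;
// stays open for `cooldownMs`, then lets traffic resume.
class EndpointBreaker {
  private results: boolean[] = []
  private openedAt: number | null = null

  constructor(
    private windowSize = 100,
    private threshold = 0.5,
    private cooldownMs = 5 * 60_000,
  ) {}

  record(success: boolean, now = Date.now()): void {
    this.results.push(success)
    if (this.results.length > this.windowSize) this.results.shift()
    const failures = this.results.filter((ok) => !ok).length
    // Require a minimum sample before opening, so one early failure
    // doesn't trip the breaker.
    if (this.results.length >= 10 && failures / this.results.length > this.threshold) {
      this.openedAt = now
    }
  }

  allows(now = Date.now()): boolean {
    if (this.openedAt === null) return true
    if (now - this.openedAt >= this.cooldownMs) {
      this.openedAt = null // cool-down elapsed: close the breaker
      return true
    }
    return false
  }
}
```

The worker calls `allows()` before each delivery to that endpoint and `record()` after; a blocked delivery just keeps its `next_attempt_at` in the future.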

Output:

  1. The retry-schedule code
  2. The per-customer concurrency limiter
  3. The Retry-After header handler
  4. The dead-letter notification email
  5. The circuit-breaker (if implementing)

The biggest mistake: **retrying every failure forever.** Customer's endpoint returns 404? They typo'd the URL. Retrying for 3 days won't help and will fill your queue with junk. Cap retries; surface failures to the customer.

---

## 5. Build Customer-Facing Logs From Day 1

Customers need to debug their own integrations. The logs UI is the difference between "this works" and "support ticket every week."

Design the customer-facing webhook logs UI.

A page at /dashboard/webhooks/[subscription-id]/deliveries that shows:

For each delivery:

  • Event type and ID
  • Timestamp
  • Endpoint URL it was sent to
  • HTTP status code returned (color: green for 2xx, red for 4xx/5xx, yellow for retrying)
  • Latency
  • Attempt number (1 of N)
  • Next retry time (if pending)
  • "View payload" — full JSON of the event
  • "View response" — status code + first 4KB of response body
  • "Replay" button — re-enqueues the delivery

Filters:

  • Status: all / delivered / failed / pending
  • Event type: dropdown of subscribed event types
  • Date range: last 1h / 24h / 7d / 30d
  • Endpoint URL (if customer has multiple subscriptions)

Aggregate metrics at the top:

  • Total events sent (last 24h / 7d)
  • Success rate (% delivered on first attempt vs % requiring retry vs % permanently failed)
  • p50 / p95 latency
  • Most common failure status

Why this matters:

  • Customer can debug their own endpoint without filing a support ticket
  • Reduces support load by 70%+ for webhook-related questions
  • Customer trust: they see exactly what you sent and what their endpoint returned
  • Self-service replay: customer can recover from their own outages without your help

Implementation notes:

  • Retain delivery records for at least 30 days (some products do 90 days for paying tiers)
  • Truncate response bodies to first 4KB (more is noise)
  • Redact sensitive fields if your event payloads contain them (rare; outbound payloads should be customer-owned data)
  • Provide a JSON / CSV export for customer's incident reports
  • Make the replay button rate-limited (1 per 5 seconds) so customers don't accidentally DDoS themselves
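The replay rate limit from the last bullet needs nothing fancy — a per-subscription minimum interval is enough. An in-memory sketch (assumes a single web process; use Redis or your database for the timestamp if you run more than one):

```typescript
// At most one replay per `intervalMs` per subscription.
// Returns true if the replay is allowed, false if it's too soon.
function makeReplayLimiter(intervalMs = 5000) {
  const lastReplay = new Map<string, number>()
  return (subscriptionId: string, now = Date.now()): boolean => {
    const last = lastReplay.get(subscriptionId)
    if (last !== undefined && now - last < intervalMs) return false // too soon
    lastReplay.set(subscriptionId, now)
    return true
  }
}
```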

Output:

  1. The logs UI component
  2. The delivery-detail modal
  3. The replay endpoint (with rate limit)
  4. The aggregate-metrics queries
  5. The retention policy (30d / 90d / etc.)

The single biggest reduction in support ticket volume: **let the customer see their own webhook logs.** Without it, every "did the event fire?" becomes a support ticket. With it, customers debug in 30 seconds.

---

## 6. Build Internal Replay Tooling

When something goes wrong on your side, you need to replay events.

Design the internal admin webhook replay tool.

Per Internal Admin Tools, build a page at /admin/webhook-deliveries that shows:

Filters:

  • Customer / subscription
  • Event type
  • Status (all / delivered / failed / pending)
  • Date range
  • Endpoint URL

For each delivery:

  • Same fields as customer-facing logs
  • Plus: customer ID, internal event ID, retry history
  • Plus: "Force replay" button (works even on delivered deliveries — for recovery scenarios)
  • Plus: "Cancel future retries" button (when a customer's endpoint is broken and you need to stop the queue from filling)

Bulk operations:

  • Replay all failed deliveries for a customer in date range
  • Replay all deliveries of an event type that fired during a specific bug window
  • Cancel all pending deliveries to a misconfigured endpoint (with confirmation)

Use cases:

  • A bug shipped that caused events not to enqueue for 2 hours; backfill them by replaying from the source events table
  • A customer's endpoint was down for a day; replay all events that landed in dead-letter
  • Internal data correction: an event payload had a wrong field; replay with the corrected payload (mark as "corrected" in the audit log)

Critical rules:

  1. Replay is destructive in customer-visible ways. Confirm twice for bulk operations.
  2. Audit every replay. Per Audit Logs, log who replayed what and why.
  3. Rate-limit bulk replays. Don't enqueue 10,000 deliveries in one second; spread them.
  4. Communicate with customers when you do bulk replays. "We re-sent 47 events from yesterday's outage; check your endpoint for duplicates."

Output:

  1. The admin-only replay UI
  2. The bulk-replay endpoint with safeguards
  3. The audit-log entries for every replay
  4. The customer notification template for bulk replays

The single most important capability during an incident: **the ability to replay events at scale, with audit, without writing one-off SQL.** Build it before you need it; you will need it.

---

## 7. Manage Subscriptions

Customers create subscriptions. The CRUD UI is unglamorous but essential.

Design the customer-facing webhook subscription management.

A page at /dashboard/webhooks that lets customers:

Create a subscription:

  • Endpoint URL (validate: HTTPS only; basic URL validation; optional ping test before saving)
  • Event types to subscribe to (multi-select; "all events" option for simplicity)
  • Description (free-text; helps customers track which integration owns this)
  • On save: generate the signing secret; show it once; require copy confirmation

View subscriptions:

  • Endpoint URL
  • Events subscribed
  • Active / paused
  • Created date
  • Last successful delivery
  • Health status (% success rate over last 100 deliveries)

Edit a subscription:

  • Change events
  • Change description
  • Pause / resume
  • Rotate secret (generate new; show once; old secret valid for 24h to allow customer-side rollover)

Delete a subscription:

  • Confirm
  • Soft-delete (keep delivery history for 30d for customer's records)

Send a test event:

  • "Send test event" button
  • Sends a test.ping event with a known payload
  • Lets customer verify their endpoint is wired correctly before relying on real events

Critical UX details:

  • Show the signing secret ONLY at creation and rotation; never on the list page
  • Provide copy-to-clipboard with a "I've copied this" confirmation before dismissing
  • Validate the endpoint URL is HTTPS (HTTP rejected; insecure)
  • Reject endpoint URLs on private CIDR ranges (10.x, 172.16.x, 192.168.x, 127.x) — SSRF protection
  • Limit to N subscriptions per customer (typical: 5-10) to prevent abuse

Output:

  1. The subscription list page
  2. The create / edit modal
  3. The secret-display flow (one-time view)
  4. The test event endpoint
  5. The SSRF protection on endpoint URL

The biggest security gotcha: **SSRF via webhook endpoint URLs.** A malicious customer registers a subscription with `http://169.254.169.254/latest/meta-data/iam/security-credentials/` (AWS metadata endpoint) and your worker dutifully POSTs your IAM credentials. Block private CIDRs at registration AND at delivery time.
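A registration-time check might look like the sketch below. This is a string-level check only — at delivery time you must ALSO resolve DNS and re-check the resulting IP, or a hostname that resolves to a private address dodges this entirely:

```typescript
// Reject webhook URLs pointing at private or link-local address space.
// Returns true if the URL must be rejected.
function isBlockedWebhookUrl(raw: string): boolean {
  let url: URL
  try { url = new URL(raw) } catch { return true }   // unparseable
  if (url.protocol !== 'https:') return true         // HTTPS only
  const host = url.hostname
  if (host === 'localhost' || host === '::1') return true
  const octets = host.split('.').map(Number)
  if (octets.length === 4 && octets.every((o) => Number.isInteger(o) && o >= 0 && o <= 255)) {
    const [a, b] = octets
    if (a === 10 || a === 127) return true           // 10/8, 127/8
    if (a === 172 && b >= 16 && b <= 31) return true // 172.16/12
    if (a === 192 && b === 168) return true          // 192.168/16
    if (a === 169 && b === 254) return true          // 169.254/16 (cloud metadata)
  }
  return false
}
```

Run the same check in the delivery worker against the resolved address, not just at registration — DNS rebinding between the two moments is the classic bypass.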

---

## 8. Monitor Webhook Health

Without monitoring, the first time you learn about a problem is when a customer files a ticket.

Design webhook monitoring per PostHog Setup and your alerting stack.

Metrics to track:

  • webhook.event.enqueued — count of events queued per minute, by event type
  • webhook.delivery.attempted — count of delivery attempts, by customer
  • webhook.delivery.succeeded — count of 2xx responses
  • webhook.delivery.failed — count of permanent failures (after retries exhausted)
  • webhook.delivery.latency — histogram of customer-endpoint response time
  • webhook.queue.depth — current pending deliveries (alert if growing unboundedly)
  • webhook.retry.rate — % of deliveries requiring retry (high rate = customer endpoint problems)

Alerts:

  1. Queue backup: pending deliveries > 10K → page on-call (worker died or queue stalled)
  2. Customer endpoint down for >24h: notify customer (their integration is broken)
  3. Failure rate spike: company-wide failure rate >5% over 5 minutes → investigate (ours might be broken)
  4. Per-customer failure rate spike: customer's failure rate >50% in last hour → notify customer
  5. Worker errors: any unhandled exception in the worker → log + alert (bug in delivery code)
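Most of these alert rules are plain threshold checks over a metrics snapshot. A sketch of the evaluation (the metric shape and alert names are illustrative — wire the returned names into whatever pager or notifier you use):

```typescript
interface WebhookMetrics {
  pendingDeliveries: number                    // current queue depth
  globalFailureRate: number                    // 0..1 over the last 5 minutes
  perCustomerFailureRate: Map<string, number>  // 0..1 over the last hour
}

// Evaluate the alert rules above against one metrics snapshot.
function evaluateAlerts(m: WebhookMetrics): string[] {
  const alerts: string[] = []
  if (m.pendingDeliveries > 10_000) alerts.push('queue_backup')
  if (m.globalFailureRate > 0.05) alerts.push('global_failure_spike')
  for (const [customer, rate] of m.perCustomerFailureRate) {
    if (rate > 0.5) alerts.push(`customer_failure:${customer}`)
  }
  return alerts
}
```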

Dashboards:

  • Per-customer webhook health: success rate, p95 latency, failure breakdown
  • Global health: queue depth, deliveries/min, retry rate, dead-letter rate
  • Slow-customer ranking: customers with highest p95 latency (these are the candidates for circuit-breaker)
  • Top failure reasons: status codes, error types

Output:

  1. The metrics emission code in the delivery worker
  2. The alert rules
  3. The Grafana / PostHog dashboards
  4. The on-call runbook for webhook incidents

The single most-watched metric: **queue depth over time.** It should trend toward 0. If it's growing, something's wrong (worker dead, customer endpoint slow, code bug). Alert on this above all else.

---

## 9. Document for Customers

Customers won't integrate if the docs aren't great. Treat docs as part of the v1 ship.

Help me draft the customer-facing webhook documentation.

Sections:

Overview

  • What webhooks are (one paragraph; assume customer hasn't used them before)
  • What events are available
  • Link to event reference

Event reference

  • For each event type:
    • When it fires
    • Full payload schema with example JSON
    • Frequency expectation
    • Versioning notes

Setup guide

  • How to create a subscription in the dashboard
  • How to choose events
  • How to retrieve the signing secret

Verifying signatures

  • The signing scheme (HMAC-SHA256, timestamp + body)
  • Sample verification code in TypeScript, Python, Ruby (and Go if your customers use it)
  • Common verification mistakes (parsed body vs raw body; non-timing-safe comparison)

Handling deliveries

  • Respond with 2xx within 10 seconds
  • Idempotency: how to handle duplicate deliveries (you'll occasionally retry; customer must dedupe)
  • Order: events are NOT guaranteed in order; design for any order
  • Rate: link to expected event frequency per event type
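The idempotency advice deserves a snippet in your docs. A customer-side dedupe sketch, keyed on the webhook-id header you send with every delivery (an in-memory `Set` works for one process; customers should use their database or Redis with a TTL in production):

```typescript
// Process each webhook id at most once; retried deliveries are
// acknowledged as duplicates. Either way the endpoint should respond 2xx,
// or the sender keeps retrying.
function makeDedupingHandler(process: (event: unknown) => void) {
  const seen = new Set<string>()
  return (webhookId: string, event: unknown): 'processed' | 'duplicate' => {
    if (seen.has(webhookId)) return 'duplicate'
    seen.add(webhookId)
    process(event)
    return 'processed'
  }
}
```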

Retries

  • Your retry schedule (publish it; customers plan around it)
  • What you retry on (5xx, 429, network errors)
  • What you don't retry on (4xx)
  • Where to find delivery logs

Replay

  • How to replay from the dashboard
  • When to replay (their endpoint outage; your bug; testing)

Best practices

  • Use a queue on their side too
  • Respond fast (move work to async)
  • Don't bake signing secrets into client code
  • Rotate secrets periodically

FAQ

  • What if my endpoint goes down?
  • Can I test without affecting production?
  • Can I replay historical events?
  • What's the rate limit on event delivery?
  • How long are delivery logs retained?

Output:

  1. The docs page structure
  2. The event reference table
  3. Sample verification code in 3 languages
  4. The retry schedule published as a public commitment
  5. The FAQ

The biggest predictor of integration adoption: **5-minute time-to-first-event.** If a customer can register a URL, send a test event, and verify a signature in 5 minutes, they'll integrate. If it takes 30 minutes of doc-reading, most won't.

---

## 10. Review Quarterly

Outbound webhooks rot. Quarterly review keeps them healthy.

The quarterly outbound-webhook review checklist.

Health metrics:

  • What's the company-wide delivery success rate? Target: >99% on first attempt.
  • p95 customer-endpoint latency? If trending up, why?
  • Dead-letter rate? Are we losing customers to broken integrations?
  • Top 5 customers by event volume — are their endpoints healthy?

Coverage metrics:

  • What % of paying customers have at least one subscription? Target trends up over time.
  • What % of customers who created a subscription are still receiving events 30 days later?
  • Average events per active subscription per day?

Event catalog hygiene:

  • Are there events nobody subscribes to? (Candidates for deprecation.)
  • Are there events with high volume that customers complain about? (Candidates for batching.)
  • Are there missing events customers ask for? (Candidates for v1.X.)

Security review:

  • Has anyone rotated secrets recently? Encourage it.
  • Are there subscriptions with stale endpoints (>90d no successful delivery)? Notify customers; auto-pause.
  • Are there subscriptions sending to suspicious URLs? Audit.
  • Is the SSRF protection still effective? Test with private-CIDR and metadata-endpoint URLs.

Cost review:

  • Worker hours spent on retries — are we burning cycles on dead endpoints?
  • Storage of delivery logs — is retention right-sized?

Documentation review:

  • Are docs current with the event catalog?
  • Are sample-code snippets still working?

Output:

  • Health snapshot
  • 3 actions to improve next quarter
  • 2 deprecations to communicate
  • 1 v1.X event to add

---

## What "Done" Looks Like

A working outbound webhook system in 2026 has:

- **A documented event catalog** with naming convention locked in
- **Per-customer signing secrets** with rotation
- **Async delivery via queue** — never inline
- **Exponential-backoff retry** with a hard cap (typically 24 attempts over 3 days)
- **Customer-facing delivery logs** with replay
- **Internal admin replay tooling** for bulk operations during incidents
- **SSRF protection** at registration and delivery
- **Per-customer concurrency limits** so one slow endpoint doesn't starve others
- **Monitoring + alerting** on queue depth, failure rate, p95 latency
- **Public docs** with sample verification code in 3+ languages
- **Quarterly health review** baked into the team rhythm

The webhook system you build in week 1 will look broken by year 2 if you don't review it. The teams that invest in tooling, docs, and monitoring keep customer trust; the teams that fire-and-forget lose customers to integration outages they never even noticed.

---

## See Also

- [Inbound Webhooks](inbound-webhooks-chat.md) — the symmetric receive side
- [Public API](public-api-chat.md) — outbound webhooks pair with the API surface
- [Audit Logs](audit-logs-chat.md) — every replay logged
- [Internal Admin Tools](internal-admin-tools-chat.md) — where the replay UI lives
- [Background Jobs Providers](https://www.vibereference.com/backend-and-data/background-jobs-providers) — the queue layer
- [Notification Providers](https://www.vibereference.com/backend-and-data/notification-providers) — for dead-letter customer notifications
- [Error Monitoring Providers](https://www.vibereference.com/devops-and-tools/error-monitoring-providers) — for worker exception tracking

[⬅️ Growth Overview](README.md)