# HTTP Retry & Backoff: Make Your Third-Party API Calls Survive Without DDoS-ing Anyone
If your SaaS calls Stripe, Resend, OpenAI, Twilio, or any third-party API in 2026, your code WILL hit transient failures — rate limits, 502s, network blips, brief downtime. Naive code crashes; slightly-less-naive code retries instantly (and gets banned for hammering); correct code uses exponential backoff with jitter, classifies errors, respects Retry-After headers, and gives up after a sensible cap. Most indie SaaS gets this wrong and silently loses 0.5-3% of operations to transient errors that should have succeeded. Worse: aggressive retry logic can get your IP banned or your account rate-limited.
A working retry strategy answers: which errors are retryable (5xx and network errors yes; 4xx mostly no), how many attempts (3-5 typical), how long to wait between attempts (exponential backoff with jitter), what the timeout is per attempt and in total, what the idempotency strategy is (don't double-charge), and how you observe it all (metrics + alerts on retry storms).
This guide is the implementation playbook for retry / backoff / circuit breaker patterns. Companion to Idempotency Patterns, Rate Limiting & Abuse, Webhook Signature Verification, Outbound Webhooks, and Background Jobs.
## Why Retry Matters
Get the failure modes clear first.
Help me understand transient failures.
The categories of "should retry" failures:
**1. Network errors**
- Connection reset / refused
- Timeout (connect or read)
- DNS resolution failure
- TLS handshake failure
Retryable: YES. Almost always transient.
**2. HTTP 5xx**
- 500 Internal Server Error: server bug or transient state
- 502 Bad Gateway: upstream issue
- 503 Service Unavailable: overloaded/maintenance
- 504 Gateway Timeout: upstream timed out
Retryable: YES, with backoff.
**3. HTTP 429 Too Many Requests**
- Rate limit hit
Retryable: YES, but RESPECT Retry-After header. No backoff = ban.
**4. HTTP 408 Request Timeout**
- Server didn't get full request
Retryable: YES, assuming the operation is idempotent.
**5. Connection drops mid-response**
- TCP RST during data
- Body truncation
Retryable: YES (idempotency-key required).
**The "do NOT retry" failures**:
- 400 Bad Request: your request is wrong; retry won't fix
- 401 Unauthorized: auth issue; retry won't help
- 403 Forbidden: permission issue
- 404 Not Found: doesn't exist
- 422 Unprocessable: validation failed
- Logic errors in your code
The "ambiguous, situation-dependent":
- 409 Conflict: maybe retry (optimistic concurrency); often won't help
- 423 Locked: maybe wait + retry
- 425 Too Early: yes retry
For my app:
- Which APIs / endpoints
- What error rate you see today
Output:
1. Error classification per API
2. Retry policy per category
3. Test cases needed
The biggest unforced error: retrying 4xx errors. They won't succeed; you waste latency, generate log spam, and possibly burn money (paid API calls). Classify carefully.
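One way to make the classification concrete is a small status-to-policy map. This is a sketch with assumed names and defaults, not any particular API's rules:

```typescript
type RetryPolicy = { retry: boolean; respectRetryAfter?: boolean };

// Hypothetical default classifier by status code; tune per upstream
function defaultPolicy(status: number): RetryPolicy {
  if (status === 429) return { retry: true, respectRetryAfter: true }; // rate limited
  if (status === 408 || status === 425) return { retry: true };        // timeout / too early
  if (status >= 500 && status < 600) return { retry: true };           // transient server errors
  return { retry: false }; // remaining 4xx: fix the request, don't retry
}
```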
## Exponential Backoff with Jitter
Help me get the math right.
The naive (wrong) approach:
```typescript
// Naive: short fixed delay, no jitter, and the final error is swallowed
for (let i = 0; i < 3; i++) {
  try {
    return await fetch(url);
  } catch (e) {
    await sleep(1000); // Fixed delay
  }
}
```
Two problems: (1) a short fixed delay is too aggressive when the upstream is already struggling; (2) no jitter = thundering herd on shared failure.
The right approach: exponential with jitter:
```typescript
// sleep helper used throughout this guide
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: {
    maxAttempts: number;
    baseDelayMs: number;
    maxDelayMs: number;
    jitter: 'full' | 'equal' | 'none';
  } = {
    maxAttempts: 5,
    baseDelayMs: 200,
    maxDelayMs: 30_000,
    jitter: 'full',
  }
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Stop on non-retryable errors or once the final attempt has failed
      // (isRetryable is defined in the classifier section below)
      if (!isRetryable(err) || attempt === options.maxAttempts - 1) throw err;

      // Exponential growth, capped at maxDelayMs
      const exponential = Math.min(
        options.baseDelayMs * Math.pow(2, attempt),
        options.maxDelayMs
      );

      let delay: number;
      switch (options.jitter) {
        case 'full':
          delay = Math.random() * exponential;
          break;
        case 'equal':
          delay = exponential / 2 + Math.random() * (exponential / 2);
          break;
        case 'none':
          delay = exponential;
          break;
      }
      await sleep(delay);
    }
  }
  throw lastError;
}
```
The progression (baseDelay=200ms, jitter=full):
```
Attempt 0 fails → wait 0-200ms  → retry
Attempt 1 fails → wait 0-400ms  → retry
Attempt 2 fails → wait 0-800ms  → retry
Attempt 3 fails → wait 0-1600ms → retry
Attempt 4 fails → throw
```
Why jitter (the AWS thundering-herd insight):
Without jitter: 1000 clients hit a flaky service; all fail; all retry exactly 1s later; service crushed again; everyone fails again; retry exactly 2s later; same.
With jitter: failures spread across the wait window. Service recovers gracefully.
Three jitter modes:
- Full (0 to base × 2^attempt) — most spread; recommended default
- Equal (half base + random half) — guarantees minimum delay
- None — no jitter; only if your traffic is rare/serial
Rule of thumb: full jitter for high-volume; equal jitter for low-volume.
The cap (max delay):
Without a cap, attempt 10 would wait 200ms × 2^10 ≈ 205s. The caller has almost certainly given up by then. Default cap: 30s; anything longer is unrecoverable from the caller's perspective.
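A quick usage sketch, assuming the retryWithBackoff function above, a hypothetical endpoint, and a small HttpError wrapper (an assumed class that just carries the status code) to promote bad statuses into thrown errors:

```typescript
const data = await retryWithBackoff(
  async () => {
    const resp = await fetch('https://api.example.com/v1/items');
    // fetch only rejects on network errors, so turn bad statuses into errors ourselves
    if (!resp.ok) throw new HttpError(resp.status);
    return resp.json();
  },
  { maxAttempts: 4, baseDelayMs: 200, maxDelayMs: 10_000, jitter: 'full' }
);
```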
For my code:
- Languages
- HTTP client
Output:
- Backoff function
- Jitter mode pick
- Defaults table (maxAttempts, baseDelay, maxDelay)
The single most-impactful detail: **jitter**. Many engineers skip it because the math feels arbitrary. It isn't — jitter is what saves your upstream from cascading failures. Always jitter.
## Respecting Retry-After Headers
Help me handle Retry-After.
Servers communicate retry timing via headers:
```
Retry-After: 60                               (seconds)
Retry-After: Thu, 30 Apr 2026 10:00:00 GMT    (HTTP-date)
X-RateLimit-Reset: 1714478400                 (Stripe-style; unix seconds)
```
Implementation:
```typescript
function parseRetryAfter(headers: Headers): number | null {
  // Standard Retry-After: either delta-seconds or an HTTP-date
  const retryAfter = headers.get('retry-after');
  if (retryAfter) {
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds)) return seconds * 1000;

    // Date format
    const date = new Date(retryAfter);
    if (!isNaN(date.getTime())) {
      return Math.max(0, date.getTime() - Date.now());
    }
  }

  // Stripe-style reset timestamp (unix seconds)
  const reset = headers.get('x-ratelimit-reset');
  if (reset) {
    return Math.max(0, parseInt(reset, 10) * 1000 - Date.now());
  }
  return null;
}
```
```typescript
// exponentialBackoff(attempt) is the capped, jittered delay from the previous section
async function retry(fn: () => Promise<Response>): Promise<Response> {
  let lastResp: Response | undefined;
  for (let attempt = 0; attempt < 5; attempt++) {
    const resp = await fn();
    if (resp.ok) return resp;
    lastResp = resp;

    if (resp.status === 429 || resp.status === 503) {
      // Prefer the server's own timing over our backoff
      const retryAfter = parseRetryAfter(resp.headers);
      const delay = retryAfter ?? exponentialBackoff(attempt);
      await sleep(delay);
      continue;
    }
    if (resp.status >= 500 && resp.status < 600) {
      await sleep(exponentialBackoff(attempt));
      continue;
    }
    return resp; // Non-retryable
  }
  return lastResp!; // Out of attempts: surface the last failed response
}
```
The rule: if server says "wait N seconds," wait N seconds. Don't second-guess.
Stripe gives Retry-After on 429s; OpenAI gives it via x-ratelimit-* headers; AWS uses Retry-After or X-Amz-Retry-After.
The cap for Retry-After:
Sometimes the server says "wait 6 hours." In that case you should usually give up or defer the work, not block waiting.
```typescript
const MAX_RETRY_AFTER_MS = 60_000; // 1 minute

if (retryAfter > MAX_RETRY_AFTER_MS) {
  // Too long to block on inline: don't retry; raise to caller / queue for later
  throw new RetryDelayTooLongError(); // custom error class
}
const delay = Math.min(retryAfter, MAX_RETRY_AFTER_MS);
```
Decision per use case: short wait → retry; long wait → defer.
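A sketch of the "defer" branch, assuming a generic job queue; enqueue is a placeholder for whatever you use (BullMQ, SQS, a jobs table):

```typescript
const retryAfterMs = parseRetryAfter(resp.headers);

if (retryAfterMs !== null && retryAfterMs > MAX_RETRY_AFTER_MS) {
  // Hand the operation to a background queue instead of blocking the caller
  await enqueue('send-invoice-email', payload, { delayMs: retryAfterMs });
  return { deferred: true };
}
```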
For my integrations: [Stripe, OpenAI, etc.]
Output:
- Per-API header parsing
- Cap rules
- Defer-to-queue handling
The mistake to avoid: **ignoring Retry-After and using your own backoff**. The server knows when capacity will recover; you don't. Trust the header.
## Idempotency: Retry Without Double-Charging
Help me make retries safe.
The problem: retried request may have already succeeded — you just didn't see the response.
Example: charge $100. Network drops mid-response. Retry. Did you charge $100 once or twice?
The solution: idempotency keys.
Pattern (Stripe / modern APIs):
```typescript
import { v4 as uuidv4 } from 'uuid';

async function chargeWithRetry(amount: number) {
  // Generated ONCE per operation; every retry of this charge reuses the same key
  const idempotencyKey = uuidv4();

  return retryWithBackoff(() =>
    fetch('https://api.stripe.com/v1/charges', {
      method: 'POST',
      headers: {
        'Idempotency-Key': idempotencyKey, // SAME on every retry
        Authorization: `Bearer ${STRIPE_KEY}`,
      },
      body: new URLSearchParams({ amount: String(amount), currency: 'usd' }),
    })
  );
}
```
The server uses idempotency key to recognize "this is the same operation" and return the original result on retry.
The discipline:
- Generate idempotency key at the top of the operation (NOT per retry)
- Persist the key (so if your service restarts mid-retry, you keep using the same key)
- Use the key as long as you want "same operation" semantics (typically 24h server-side)
Storage pattern:
```sql
CREATE TABLE pending_charges (
  id UUID PRIMARY KEY,
  idempotency_key UUID NOT NULL UNIQUE,
  amount INT NOT NULL,
  status VARCHAR(20),
  attempt_count INT DEFAULT 0,
  last_attempt_at TIMESTAMPTZ,
  result_json JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```
On retry: check status; if succeeded, return cached result; if pending, retry with same key; if failed_terminal, give up.
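A recovery sketch against that table; db and chargeWithIdempotencyKey are placeholders, not a specific library:

```typescript
async function resumeCharge(pendingId: string) {
  const row = await db.pendingCharges.find(pendingId); // placeholder data-access call

  if (row.status === 'succeeded') return row.result_json; // already done: return cached result
  if (row.status === 'failed_terminal') throw new Error('Charge permanently failed');

  // Still pending: retry with the SAME persisted idempotency key
  const result = await chargeWithIdempotencyKey(row.idempotency_key, row.amount);
  await db.pendingCharges.update(pendingId, { status: 'succeeded', result_json: result });
  return result;
}
```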
For YOUR API:
If you offer an API, accept Idempotency-Key headers. Save callers from this pain.
For my integrations:
- Which APIs support idempotency?
- Storage of pending operations
Output:
- Idempotency-key strategy per API
- Storage schema
- Recovery patterns
The unforgiving lesson: **double-charge bugs are unforgivable**. One retry that succeeds twice = customer support ticket + chargeback + bad review. Idempotency keys are the only defense; use them anywhere money / state changes.
## Timeouts (the Other Half)
Help me set timeouts.
Without timeouts, "retry" is meaningless — a hung request waits forever.
Two timeouts per request:
1. Connection timeout (how long to establish TCP/TLS):
- Default: 10s
- Tighter for local-net: 1-3s
- Looser for high-latency endpoints: 30s
2. Read timeout / total (how long to receive response):
- Default: 30s
- Stripe: 80s default
- OpenAI streaming: 120s+ acceptable
- Critical user-facing: 5-10s
The HTTP client setup:
Node.js fetch:
```typescript
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 30_000);
try {
  const resp = await fetch(url, { signal: controller.signal });
  return resp;
} finally {
  clearTimeout(timeoutId);
}
```
Axios:
```typescript
axios.get(url, { timeout: 30000 });
```
ky / undici:
```typescript
ky.get(url, { timeout: 30000 });
```
Total timeout (across retries):
If user-facing operation has 30s budget and per-attempt timeout is 30s, you can't retry. So:
- Per-attempt timeout: 5s
- Max attempts: 3
- Backoff between attempts: up to 1s + 2s = 3s
- Total worst case: 5 + 1 + 5 + 2 + 5 = 18s
This fits under a 30s user budget.
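A sketch of that budget in code: a per-attempt AbortController timeout inside a loop that also respects a total deadline (the numbers mirror the example above; names are assumptions):

```typescript
async function fetchWithBudget(url: string): Promise<Response> {
  const deadline = Date.now() + 30_000;   // total user-facing budget
  const perAttemptMs = 5_000;             // per-attempt timeout
  let lastError: unknown;

  for (let attempt = 0; attempt < 3; attempt++) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) break;            // out of total budget: stop retrying

    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), Math.min(perAttemptMs, remaining));
    try {
      return await fetch(url, { signal: controller.signal });
    } catch (err) {
      lastError = err;                    // timed out or network error: back off, then retry
      await sleep(Math.random() * 200 * 2 ** attempt);
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError ?? new Error('Request budget exhausted');
}
```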
For background work (no user waiting):
- Per-attempt: 30-60s
- Max attempts: 5-10
- Total: minutes-hours OK
Cancellation propagation:
If user closes browser, your server should cancel upstream requests:
```typescript
// Vercel Functions / Next.js route handler
export async function POST(req: Request) {
  const upstreamCtrl = new AbortController();
  // If the client disconnects, abort the upstream request too
  req.signal.addEventListener('abort', () => upstreamCtrl.abort());
  return await fetch(upstream, { signal: upstreamCtrl.signal });
}
```
Saves wasted work when user gives up.
For my requests:
- User-facing budget per call
- Background acceptable timeouts
Output:
- Timeout config per type of call
- Total timeout math
- Cancellation pattern
The bug most teams ship: **no timeout = thread / connection leak**. One slow upstream = pile of stuck requests. Eventually OOM or pool exhaustion. Always set both timeouts.
## Circuit Breaker: When to Stop Trying
Help me set up circuit breaker.
The idea: if the upstream is consistently failing, stop hammering it. Give it a chance to recover.
The pattern:
```
Closed (normal)
  ↓ failures pile up (e.g. 10 failures in 60s)
Open (block all requests; fail fast)
  ↓ wait period (30-60s)
Half-Open (allow 1 trial request)
  ↓ trial succeeds → Closed
  ↓ trial fails → back to Open
```
Implementation (using opossum or similar):
```typescript
import CircuitBreaker from 'opossum';

const breaker = new CircuitBreaker(callStripeAPI, {
  timeout: 5000,                 // per-call timeout
  errorThresholdPercentage: 50,  // trip when over half the calls fail
  resetTimeout: 30000,           // 30s in Open state
  rollingCountTimeout: 10000,    // 10s window for stats
});

breaker.on('open', () => console.log('Circuit OPEN'));
breaker.on('halfOpen', () => console.log('Trial request'));
breaker.on('close', () => console.log('Circuit CLOSED'));

await breaker.fire(args);
```
Per-upstream circuit breakers:
Don't share one breaker across all third parties. If Stripe is down, Resend is fine.
```typescript
const stripeBreaker = new CircuitBreaker(stripeCall, options);
const resendBreaker = new CircuitBreaker(resendCall, options);
```
When circuit is open: fast-fail or fallback:
- Fast-fail: throw error; let caller decide
- Fallback: return cached data; queue for later; degrade gracefully
For email: queue + retry later. For payments: tell user "try again in a moment" — don't fall back. For a recommendation API: serve cached / generic.
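With opossum, a fallback can be attached to the breaker itself; a sketch for the recommendation-API case (getCachedRecommendations is a hypothetical helper):

```typescript
const recsBreaker = new CircuitBreaker(fetchRecommendations, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
});

// When the circuit is open (or a call fails), serve cached / generic results instead
recsBreaker.fallback((userId: string) => getCachedRecommendations(userId));

const recs = await recsBreaker.fire(userId);
```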
When NOT to use circuit breaker:
- Single-call operations (overkill)
- Truly idempotent + cheap retries (just retry more)
- Operations that can't tolerate "circuit open" (no fallback exists)
For my system:
- Where to add breakers
- Fallback strategies
Output:
- Per-upstream circuit setup
- Thresholds
- Fallback per service
The pragmatic advice: **add circuit breakers to your top 3 third-party calls only**. Adding everywhere is overhead; adding to none is fragile. Pick the ones where outages are common + impact is high.
## Distinguishing Retryable Errors
Help me classify errors.
The classifier:
```typescript
// HttpError is assumed to be your own wrapper around non-ok responses (it carries .status)
function isRetryable(err: unknown): boolean {
  // Network errors (Node fetch / undici). Note: undici's fetch surfaces network
  // failures as TypeError with the real code on err.cause, so check both places.
  if (err instanceof Error) {
    const code = (err as any).code ?? (err as any).cause?.code;
    if (['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND'].includes(code)) {
      return true;
    }
    if (err.name === 'AbortError') return false; // We aborted (timeout or cancellation)
    if (err.name === 'TypeError') return false;  // Bad input (network TypeErrors handled above via cause)
  }

  // Response-based
  if (err instanceof HttpError) {
    const status = err.status;
    if (status === 408) return true;
    if (status === 425) return true;
    if (status === 429) return true;
    if (status >= 500 && status < 600) return true;
    return false;
  }

  // Unknown — be conservative; don't retry
  return false;
}
```
Errors you might THINK are retryable but aren't:
- 401 (token expired) — should refresh and retry, but with new token; not blind retry
- 409 (conflict) — usually means "your update was overwritten"; retry probably fails
- 422 (validation) — your data is wrong; retry won't help
Errors that look transient but require special handling:
- 503 with no Retry-After: backoff
- 429 in payment APIs: respect rate limit; this is critical
- 502 from CDN: probably the origin; check
- TLS errors: connection issue; retry
Custom error logic per API:
Stripe: wraps HTTP errors in its own error types, and its SDK does retry classification for you. Use it.
OpenAI: 429s are common; respect rate-limit headers strictly.
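For example, the official stripe-node SDK can retry transient failures itself; a sketch assuming its maxNetworkRetries option (check the SDK docs for your version):

```typescript
import Stripe from 'stripe';

// stripe-node retries network errors and retryable API errors on its own,
// reusing the same idempotency key across attempts
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!, {
  maxNetworkRetries: 2,
});
```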
For my integrations:
- Per-API quirks
- Custom classifier
Output:
- Classifier function
- Per-API customizations
The bug that surfaces in production: **retrying 401 / token-expired errors blindly**. Each retry has the same expired token. Either refresh-and-retry or give up; don't loop.
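A minimal refresh-and-retry sketch for the 401 case; getAccessToken and refreshAccessToken are hypothetical helpers:

```typescript
async function callWithAuth(url: string): Promise<Response> {
  let token = await getAccessToken();
  let resp = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });

  if (resp.status === 401) {
    // Refresh exactly once, then retry with the NEW token; never loop on 401
    token = await refreshAccessToken();
    resp = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
  }
  return resp;
}
```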
## Observing Retries
Help me observe retry behavior.
Metrics to emit:
Per-API:
- request_count{api, endpoint, status}
- request_duration_seconds{api, endpoint}
- retry_count{api, endpoint}
- retry_after_value{api, endpoint}
- circuit_breaker_state{api}
Aggregated:
- retry_storm: requests retried >3 times last hour
- failed_terminal: gave up after max attempts
- 5xx_rate
- 429_rate (rate-limited rate)
Alerts:
- retry_storm threshold (e.g. >10/min) → suspicious; investigate
- circuit_breaker_open → upstream down; alert
- 5xx_rate > 5% sustained → upstream issue
- 429_rate > 1% → you're being rate-limited; need higher tier or slow down
Logging discipline:
Don't log every retry at INFO; logs explode.
```
INFO:  HTTP request {api, endpoint, status, duration}
WARN:  HTTP retry {api, endpoint, attempt=2, status=503}
ERROR: HTTP failed terminal {api, endpoint, attempts=5}
```
INFO for normal; WARN for retries; ERROR for terminal failures.
Tracing:
Distributed tracing (OpenTelemetry) shows retries as spans:
```
parent span: charge_user
  child span: stripe_charge_attempt_1 (FAILED)
  child span: stripe_charge_attempt_2 (FAILED)
  child span: stripe_charge_attempt_3 (SUCCEEDED)
```
This reveals retry storms and per-API timing without log spelunking.
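A sketch of emitting those spans with @opentelemetry/api, wrapping each attempt in a child span; the tracer name and callStripeAPI are assumptions:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('http-retry');

async function chargeUser() {
  return tracer.startActiveSpan('charge_user', async (parent) => {
    try {
      return await retryWithBackoff(() =>
        tracer.startActiveSpan('stripe_charge_attempt', async (span) => {
          try {
            return await callStripeAPI();
          } catch (err) {
            span.setStatus({ code: SpanStatusCode.ERROR }); // mark failed attempts
            throw err;
          } finally {
            span.end();
          }
        })
      );
    } finally {
      parent.end();
    }
  });
}
```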
For my stack: [observability tools]
Output:
- Metrics list
- Alert rules
- Tracing setup
The single most useful alert: **circuit-breaker-open**. When your code gives up on an upstream, somebody should know within minutes. Catches "the entire upstream is down" before customers notice.
## Common Retry Pitfalls
Help me avoid retry pitfalls.
The 10 mistakes:
1. **No jitter.** Thundering herd; you DDoS the upstream when it's already weak.
2. **Retrying 4xx.** Wastes time; possibly costs money (paid APIs); fills logs.
3. **No timeout per attempt.** A hung request never retries; the total operation hangs.
4. **No max attempts.** Retry-forever scenarios.
5. **No idempotency key on POST retries.** Double-charging customers.
6. **Synchronous retry on the user request path.** A 30-second wait while you retry 5 times; the user already gave up.
7. **Logging every retry attempt at INFO.** Log explosion masks real issues.
8. **Reusing the same retry policy across very different APIs.** Stripe (5s timeout fine) vs LLM streaming (60s+) need different policies.
9. **Ignoring the Retry-After header.** The server says wait; you retry early; you get banned.
10. **No circuit breaker on consistently-failing dependencies.** Hammering a dead service; wasting resources; cascading failures.
For my system: [risks]
Output:
- Top 3 risks
- Mitigations
- Tests to add
The mistake that bites at scale: **retrying everything synchronously on user-facing path**. User clicks → you retry 5 times → 30s later user sees error. Move retries to background queue; respond fast; retry async; notify on outcome.
## Tooling: Don't Roll Your Own
Help me pick a library.
Modern HTTP clients with built-in retry:
Node.js / TypeScript:
- ky — fetch wrapper; built-in retry; small bundle
- axios — popular; retry via axios-retry
- undici — Node native; built-in retry-on-error
- got — feature-rich
- opossum — circuit breaker
Python:
- httpx — modern; transport-level retry
- tenacity — generic retry decorator
- requests + urllib3 — adapter-based retry
- pybreaker — circuit breaker
Go:
- hashicorp/go-retryablehttp
- avast/retry-go
- sony/gobreaker — circuit breaker
Patterns to keep DIY:
- Idempotency-key generation + storage
- Per-API custom error classification
- Business-logic retry (e.g. "retry only if user still owns this resource")
Don't reinvent:
- Backoff math
- Jitter
- Circuit breaker state machine
These are well-trodden; use the library.
For my stack:
- Language
- HTTP client today
Output:
- Library pick
- Migration plan if rolling-your-own
- Wrapper for business logic
The 2026 default in TypeScript: **ky for HTTP + opossum for circuit breakers**. Combined: ~20KB; covers 90% of needs; saves writing 200 lines of retry boilerplate.
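A hedged ky configuration sketch; option names reflect recent ky versions, so confirm against the docs for the version you install:

```typescript
import ky from 'ky';

const api = ky.create({
  timeout: 10_000, // per-attempt timeout
  retry: {
    limit: 3,
    methods: ['get', 'put', 'delete'],            // retry idempotent methods
    statusCodes: [408, 429, 500, 502, 503, 504],  // retryable statuses
    backoffLimit: 30_000,                         // cap the exponential delay
  },
});

const user = await api.get('https://api.example.com/v1/user').json();
```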
## What Done Looks Like
A working retry strategy delivers:
- Errors classified (retryable vs not) per upstream
- Exponential backoff with full jitter
- Retry-After / X-RateLimit-Reset headers respected
- Idempotency keys on POST/PUT/DELETE retries
- Per-attempt timeout + total timeout enforced
- Max attempts capped (3-5 typical)
- Circuit breaker on top-3 upstreams
- Fallback / fast-fail when circuit open
- Metrics: retry_count, circuit_state, terminal_failures
- Alerts: circuit-open, retry-storm, sustained-5xx
- Library used (not custom retry boilerplate)
- Test cases: simulated 503, 429, network errors, timeout
The proof you got it right: an upstream goes down for 5 minutes; your service gracefully degrades (queue retries, serve stale data, fast-fail) without DDoSing the upstream; recovers automatically when upstream returns. No customer-facing outage; alert fires; metrics show what happened.
## See Also
- [Idempotency Patterns](idempotency-patterns-chat.md) — the protection against double-execution
- [Rate Limiting & Abuse](rate-limiting-abuse-chat.md) — protection on YOUR endpoints
- [Webhook Signature Verification](webhook-signature-verification-chat.md) — companion delivery concern
- [Outbound Webhooks](outbound-webhooks-chat.md) — webhook delivery is this same retry problem seen from the sender's side
- [Inbound Webhooks](inbound-webhooks-chat.md) — receiving webhooks; idempotency required
- [Logging Strategy & Structured Logs](logging-strategy-structured-logs-chat.md) — log retries structured
- [Incident Response](incident-response-chat.md) — circuit-breaker-open triggers incident response
- [Service Level Agreements](service-level-agreements-chat.md) — your SLA depends on upstream availability
- [VibeReference: Background Jobs Providers](https://vibereference.dev/backend-and-data/background-jobs-providers) — async retry infrastructure
- [VibeReference: Observability Providers](https://vibereference.dev/devops-and-tools/observability-providers) — metrics + tracing