# HTTP Retry & Backoff: Make Your Third-Party API Calls Survive Without DDoS-ing Anyone
If your SaaS calls Stripe, Resend, OpenAI, Twilio, or any third-party API in 2026, your code WILL hit transient failures — rate limits, 502s, network blips, brief downtime. Naive code crashes; slightly-less-naive code retries instantly (and gets banned for hammering); correct code uses exponential backoff with jitter, classifies errors, respects Retry-After headers, and gives up after a sensible cap. Most indie SaaS gets this wrong and silently loses 0.5-3% of operations to transient errors that should have succeeded. Worse: aggressive retry logic can get your IP banned or your account rate-limited.
A working retry strategy answers: which errors are retryable (5xx and network errors yes; 4xx mostly no), how many attempts (3-5 typical), how long to wait between attempts (exponential backoff with jitter), what the timeout is per attempt and in total, what the idempotency strategy is (don't double-charge), and how you observe it all (metrics + alerts on retry storms).
This guide is the implementation playbook for retry / backoff / circuit breaker patterns. Companion to Idempotency Patterns, Rate Limiting & Abuse, Webhook Signature Verification, Outbound Webhooks, and Background Jobs.
## Why Retry Matters
Get the failure modes clear first.
Help me understand transient failures.
The categories of "should retry" failures:
**1. Network errors**
- Connection reset / refused
- Timeout (connect or read)
- DNS resolution failure
- TLS handshake failure
Retryable: YES. Almost always transient.
**2. HTTP 5xx**
- 500 Internal Server Error: server bug or transient state
- 502 Bad Gateway: upstream issue
- 503 Service Unavailable: overloaded/maintenance
- 504 Gateway Timeout: upstream timed out
Retryable: YES, with backoff.
**3. HTTP 429 Too Many Requests**
- Rate limit hit
Retryable: YES, but RESPECT Retry-After header. No backoff = ban.
**4. HTTP 408 Request Timeout**
- Server didn't get full request
Retryable: YES, assuming the operation is idempotent.
**5. Connection drops mid-response**
- TCP RST during data
- Body truncation
Retryable: YES (idempotency-key required).
**The "do NOT retry" failures**:
- 400 Bad Request: your request is wrong; retry won't fix
- 401 Unauthorized: auth issue; retry won't help
- 403 Forbidden: permission issue
- 404 Not Found: doesn't exist
- 422 Unprocessable: validation failed
- Logic errors in your code
The "ambiguous, situation-dependent":
- 409 Conflict: maybe retry (optimistic concurrency); often won't help
- 423 Locked: maybe wait + retry
- 425 Too Early: yes retry
For my app:
- Which APIs / endpoints
- What error rate you see today
Output:
1. Error classification per API
2. Retry policy per category
3. Test cases needed
The biggest unforced error: retrying 4xx errors. They won't succeed; you waste latency, generate log spam, and possibly burn money (paid API calls). Classify carefully.
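One way to make the classification concrete is a small status-to-policy map. This is a sketch with assumed names and defaults, not any particular API's rules:

```typescript
type RetryPolicy = { retry: boolean; respectRetryAfter?: boolean };

// Hypothetical default classifier by status code; tune per upstream
function defaultPolicy(status: number): RetryPolicy {
  if (status === 429) return { retry: true, respectRetryAfter: true }; // rate limited
  if (status === 408 || status === 425) return { retry: true };        // timeout / too early
  if (status >= 500 && status < 600) return { retry: true };           // transient server errors
  return { retry: false }; // remaining 4xx: fix the request, don't retry
}
```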
## Exponential Backoff with Jitter
Help me get the math right.
The naive (wrong) approach:
```typescript
// Naive: short fixed delay, no jitter, and the final error is swallowed
for (let i = 0; i < 3; i++) {
  try {
    return await fetch(url);
  } catch (e) {
    await sleep(1000); // Fixed delay
  }
}
```
Two problems: (1) a short fixed delay is too aggressive when the upstream is already struggling; (2) no jitter = thundering herd on shared failure.
The right approach: exponential with jitter:
```typescript
// sleep helper used throughout this guide
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: {
    maxAttempts: number;
    baseDelayMs: number;
    maxDelayMs: number;
    jitter: 'full' | 'equal' | 'none';
  } = {
    maxAttempts: 5,
    baseDelayMs: 200,
    maxDelayMs: 30_000,
    jitter: 'full',
  }
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Stop on non-retryable errors or once the final attempt has failed
      // (isRetryable is defined in the classifier section below)
      if (!isRetryable(err) || attempt === options.maxAttempts - 1) throw err;

      // Exponential growth, capped at maxDelayMs
      const exponential = Math.min(
        options.baseDelayMs * Math.pow(2, attempt),
        options.maxDelayMs
      );

      let delay: number;
      switch (options.jitter) {
        case 'full':
          delay = Math.random() * exponential;
          break;
        case 'equal':
          delay = exponential / 2 + Math.random() * (exponential / 2);
          break;
        case 'none':
          delay = exponential;
          break;
      }
      await sleep(delay);
    }
  }
  throw lastError;
}
```
The progression (baseDelay=200ms, jitter=full):
```
Attempt 0 fails → wait 0-200ms  → retry
Attempt 1 fails → wait 0-400ms  → retry
Attempt 2 fails → wait 0-800ms  → retry
Attempt 3 fails → wait 0-1600ms → retry
Attempt 4 fails → throw
```
Why jitter (the AWS thundering-herd insight):
Without jitter: 1000 clients hit a flaky service; all fail; all retry exactly 1s later; service crushed again; everyone fails again; retry exactly 2s later; same.
With jitter: failures spread across the wait window. Service recovers gracefully.
Three jitter modes:
- Full (0 to base × 2^attempt) — most spread; recommended default
- Equal (half base + random half) — guarantees minimum delay
- None — no jitter; only if your traffic is rare/serial
Rule of thumb: full jitter for high-volume; equal jitter for low-volume.
The cap (max delay):
Without a cap, attempt 10 would wait 200ms × 2^10 ≈ 205s. The caller has almost certainly given up by then. Default cap: 30s; anything longer is unrecoverable from the caller's perspective.
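A quick usage sketch, assuming the retryWithBackoff function above, a hypothetical endpoint, and a small HttpError wrapper (an assumed class that just carries the status code) to promote bad statuses into thrown errors:

```typescript
const data = await retryWithBackoff(
  async () => {
    const resp = await fetch('https://api.example.com/v1/items');
    // fetch only rejects on network errors, so turn bad statuses into errors ourselves
    if (!resp.ok) throw new HttpError(resp.status);
    return resp.json();
  },
  { maxAttempts: 4, baseDelayMs: 200, maxDelayMs: 10_000, jitter: 'full' }
);
```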
For my code:
- Languages
- HTTP client
Output:
- Backoff function
- Jitter mode pick
- Defaults table (maxAttempts, baseDelay, maxDelay)
The single most-impactful detail: **jitter**. Many engineers skip it because the math feels arbitrary. It isn't — jitter is what saves your upstream from cascading failures. Always jitter.
## Respecting Retry-After Headers
Help me handle Retry-After.
Servers communicate retry timing via headers:
```
Retry-After: 60                               (seconds)
Retry-After: Thu, 30 Apr 2026 10:00:00 GMT    (HTTP-date)
X-RateLimit-Reset: 1714478400                 (Stripe-style; unix seconds)
```
Implementation:
```typescript
function parseRetryAfter(headers: Headers): number | null {
  // Standard Retry-After: either delta-seconds or an HTTP-date
  const retryAfter = headers.get('retry-after');
  if (retryAfter) {
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds)) return seconds * 1000;

    // Date format
    const date = new Date(retryAfter);
    if (!isNaN(date.getTime())) {
      return Math.max(0, date.getTime() - Date.now());
    }
  }

  // Stripe-style reset timestamp (unix seconds)
  const reset = headers.get('x-ratelimit-reset');
  if (reset) {
    return Math.max(0, parseInt(reset, 10) * 1000 - Date.now());
  }
  return null;
}
```
```typescript
// exponentialBackoff(attempt) is the capped, jittered delay from the previous section
async function retry(fn: () => Promise<Response>): Promise<Response> {
  let lastResp: Response | undefined;
  for (let attempt = 0; attempt < 5; attempt++) {
    const resp = await fn();
    if (resp.ok) return resp;
    lastResp = resp;

    if (resp.status === 429 || resp.status === 503) {
      // Prefer the server's own timing over our backoff
      const retryAfter = parseRetryAfter(resp.headers);
      const delay = retryAfter ?? exponentialBackoff(attempt);
      await sleep(delay);
      continue;
    }
    if (resp.status >= 500 && resp.status < 600) {
      await sleep(exponentialBackoff(attempt));
      continue;
    }
    return resp; // Non-retryable
  }
  return lastResp!; // Out of attempts: surface the last failed response
}
```
The rule: if server says "wait N seconds," wait N seconds. Don't second-guess.
Stripe gives Retry-After on 429s; OpenAI gives it via x-ratelimit-* headers; AWS uses Retry-After or X-Amz-Retry-After.
The cap for Retry-After:
Sometimes the server says "wait 6 hours." In that case you should usually give up or defer the work, not block waiting.
```typescript
const MAX_RETRY_AFTER_MS = 60_000; // 1 minute

if (retryAfter > MAX_RETRY_AFTER_MS) {
  // Too long to block on inline: don't retry; raise to caller / queue for later
  throw new RetryDelayTooLongError(); // custom error class
}
const delay = Math.min(retryAfter, MAX_RETRY_AFTER_MS);
```
Decision per use case: short wait → retry; long wait → defer.
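A sketch of the "defer" branch, assuming a generic job queue; enqueue is a placeholder for whatever you use (BullMQ, SQS, a jobs table):

```typescript
const retryAfterMs = parseRetryAfter(resp.headers);

if (retryAfterMs !== null && retryAfterMs > MAX_RETRY_AFTER_MS) {
  // Hand the operation to a background queue instead of blocking the caller
  await enqueue('send-invoice-email', payload, { delayMs: retryAfterMs });
  return { deferred: true };
}
```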
For my integrations: [Stripe, OpenAI, etc.]
Output:
- Per-API header parsing
- Cap rules
- Defer-to-queue handling
The mistake to avoid: **ignoring Retry-After and using your own backoff**. The server knows when capacity will recover; you don't. Trust the header.
## Idempotency: Retry Without Double-Charging
Help me make retries safe.
The problem: retried request may have already succeeded — you just didn't see the response.
Example: charge $100. Network drops mid-response. Retry. Did you charge $100 once or twice?
The solution: idempotency keys.
Pattern (Stripe / modern APIs):
```typescript
import { v4 as uuidv4 } from 'uuid';

async function chargeWithRetry(amount: number) {
  // Generated ONCE per operation; every retry of this charge reuses the same key
  const idempotencyKey = uuidv4();

  return retryWithBackoff(() =>
    fetch('https://api.stripe.com/v1/charges', {
      method: 'POST',
      headers: {
        'Idempotency-Key': idempotencyKey, // SAME on every retry
        Authorization: `Bearer ${STRIPE_KEY}`,
      },
      body: new URLSearchParams({ amount: String(amount), currency: 'usd' }),
    })
  );
}
```
The server uses idempotency key to recognize "this is the same operation" and return the original result on retry.
The discipline:
- Generate idempotency key at the top of the operation (NOT per retry)
- Persist the key (so if your service restarts mid-retry, you keep using the same key)
- Use the key as long as you want "same operation" semantics (typically 24h server-side)
Storage pattern:
```sql
CREATE TABLE pending_charges (
  id UUID PRIMARY KEY,
  idempotency_key UUID NOT NULL UNIQUE,
  amount INT NOT NULL,
  status VARCHAR(20),
  attempt_count INT DEFAULT 0,
  last_attempt_at TIMESTAMPTZ,
  result_json JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```
On retry: check status; if succeeded, return cached result; if pending, retry with same key; if failed_terminal, give up.
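A recovery sketch against that table; db and chargeWithIdempotencyKey are placeholders, not a specific library:

```typescript
async function resumeCharge(pendingId: string) {
  const row = await db.pendingCharges.find(pendingId); // placeholder data-access call

  if (row.status === 'succeeded') return row.result_json; // already done: return cached result
  if (row.status === 'failed_terminal') throw new Error('Charge permanently failed');

  // Still pending: retry with the SAME persisted idempotency key
  const result = await chargeWithIdempotencyKey(row.idempotency_key, row.amount);
  await db.pendingCharges.update(pendingId, { status: 'succeeded', result_json: result });
  return result;
}
```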
For YOUR API:
If you offer an API, accept Idempotency-Key headers. Save callers from this pain.
For my integrations:
- Which APIs support idempotency?
- Storage of pending operations
Output:
- Idempotency-key strategy per API
- Storage schema
- Recovery patterns
The unforgiving lesson: **double-charge bugs are unforgivable**. One retry that succeeds twice = customer support ticket + chargeback + bad review. Idempotency keys are the only defense; use them anywhere money / state changes.
## Timeouts (the Other Half)
Help me set timeouts.
Without timeouts, "retry" is meaningless — a hung request waits forever.
Two timeouts per request:
1. Connection timeout (how long to establish TCP/TLS):
- Default: 10s
- Tighter for local-net: 1-3s
- Looser for high-latency endpoints: 30s
2. Read timeout / total (how long to receive response):
- Default: 30s
- Stripe: 80s default
- OpenAI streaming: 120s+ acceptable
- Critical user-facing: 5-10s
The HTTP client setup:
Node.js fetch:
```typescript
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 30_000);
try {
  const resp = await fetch(url, { signal: controller.signal });
  return resp;
} finally {
  clearTimeout(timeoutId);
}
```
Axios:
```typescript
axios.get(url, { timeout: 30000 });
```
ky / undici:
```typescript
ky.get(url, { timeout: 30000 });
```
Total timeout (across retries):
If user-facing operation has 30s budget and per-attempt timeout is 30s, you can't retry. So:
- Per-attempt timeout: 5s
- Max attempts: 3
- Backoff between attempts: up to 1s + 2s = 3s
- Total worst case: 5 + 1 + 5 + 2 + 5 = 18s
This fits under a 30s user budget.
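A sketch of that budget in code: a per-attempt AbortController timeout inside a loop that also respects a total deadline (the numbers mirror the example above; names are assumptions):

```typescript
async function fetchWithBudget(url: string): Promise<Response> {
  const deadline = Date.now() + 30_000;   // total user-facing budget
  const perAttemptMs = 5_000;             // per-attempt timeout
  let lastError: unknown;

  for (let attempt = 0; attempt < 3; attempt++) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) break;            // out of total budget: stop retrying

    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), Math.min(perAttemptMs, remaining));
    try {
      return await fetch(url, { signal: controller.signal });
    } catch (err) {
      lastError = err;                    // timed out or network error: back off, then retry
      await sleep(Math.random() * 200 * 2 ** attempt);
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError ?? new Error('Request budget exhausted');
}
```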
For background work (no user waiting):
- Per-attempt: 30-60s
- Max attempts: 5-10
- Total: minutes-hours OK
Cancellation propagation:
If user closes browser, your server should cancel upstream requests:
```typescript
// Vercel Functions / Next.js route handler
export async function POST(req: Request) {
  const upstreamCtrl = new AbortController();
  // If the client disconnects, abort the upstream request too
  req.signal.addEventListener('abort', () => upstreamCtrl.abort());
  return await fetch(upstream, { signal: upstreamCtrl.signal });
}
```
Saves wasted work when user gives up.
For my requests:
- User-facing budget per call
- Background acceptable timeouts
Output:
- Timeout config per type of call
- Total timeout math
- Cancellation pattern
The bug most teams ship: **no timeout = thread / connection leak**. One slow upstream = pile of stuck requests. Eventually OOM or pool exhaustion. Always set both timeouts.
## Circuit Breaker: When to Stop Trying
Help me set up circuit breaker.
The idea: if the upstream is consistently failing, stop hammering it. Give it a chance to recover.
The pattern:
```
Closed (normal)
  ↓ failures pile up (e.g. 10 failures in 60s)
Open (block all requests; fail fast)
  ↓ wait period (30-60s)
Half-Open (allow 1 trial request)
  ↓ trial succeeds → Closed
  ↓ trial fails → back to Open
```
Implementation (using opossum or similar):
```typescript
import CircuitBreaker from 'opossum';

const breaker = new CircuitBreaker(callStripeAPI, {
  timeout: 5000,                 // per-call timeout
  errorThresholdPercentage: 50,  // trip when over half the calls fail
  resetTimeout: 30000,           // 30s in Open state
  rollingCountTimeout: 10000,    // 10s window for stats
});

breaker.on('open', () => console.log('Circuit OPEN'));
breaker.on('halfOpen', () => console.log('Trial request'));
breaker.on('close', () => console.log('Circuit CLOSED'));

await breaker.fire(args);
```
Per-upstream circuit breakers:
Don't share one breaker across all third parties. If Stripe is down, Resend is fine.
```typescript
const stripeBreaker = new CircuitBreaker(stripeCall, options);
const resendBreaker = new CircuitBreaker(resendCall, options);
```
When circuit is open: fast-fail or fallback:
- Fast-fail: throw error; let caller decide
- Fallback: return cached data; queue for later; degrade gracefully
For email: queue + retry later. For payments: tell user "try again in a moment" — don't fall back. For a recommendation API: serve cached / generic.
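With opossum, a fallback can be attached to the breaker itself; a sketch for the recommendation-API case (getCachedRecommendations is a hypothetical helper):

```typescript
const recsBreaker = new CircuitBreaker(fetchRecommendations, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000,
});

// When the circuit is open (or a call fails), serve cached / generic results instead
recsBreaker.fallback((userId: string) => getCachedRecommendations(userId));

const recs = await recsBreaker.fire(userId);
```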
When NOT to use circuit breaker:
- Single-call operations (overkill)
- Truly idempotent + cheap retries (just retry more)
- Operations that can't tolerate "circuit open" (no fallback exists)
For my system:
- Where to add breakers
- Fallback strategies
Output:
- Per-upstream circuit setup
- Thresholds
- Fallback per service
The pragmatic advice: **add circuit breakers to your top 3 third-party calls only**. Adding everywhere is overhead; adding to none is fragile. Pick the ones where outages are common + impact is high.
## Distinguishing Retryable Errors
Help me classify errors.
The classifier:
```typescript
// HttpError is assumed to be your own wrapper around non-ok responses (it carries .status)
function isRetryable(err: unknown): boolean {
  // Network errors (Node fetch / undici). Note: undici's fetch surfaces network
  // failures as TypeError with the real code on err.cause, so check both places.
  if (err instanceof Error) {
    const code = (err as any).code ?? (err as any).cause?.code;
    if (['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND'].includes(code)) {
      return true;
    }
    if (err.name === 'AbortError') return false; // We aborted (timeout or cancellation)
    if (err.name === 'TypeError') return false;  // Bad input (network TypeErrors handled above via cause)
  }

  // Response-based
  if (err instanceof HttpError) {
    const status = err.status;
    if (status === 408) return true;
    if (status === 425) return true;
    if (status === 429) return true;
    if (status >= 500 && status < 600) return true;
    return false;
  }

  // Unknown — be conservative; don't retry
  return false;
}
```
Errors you might THINK are retryable but aren't:
- 401 (token expired) — should refresh and retry, but with new token; not blind retry
- 409 (conflict) — usually means "your update was overwritten"; retry probably fails
- 422 (validation) — your data is wrong; retry won't help
Errors that look transient but require special handling:
- 503 with no Retry-After: backoff
- 429 in payment APIs: respect rate limit; this is critical
- 502 from CDN: probably the origin; check
- TLS errors: connection issue; retry
Custom error logic per API:
Stripe: wraps HTTP errors in its own error types, and its SDK does retry classification for you. Use it.
OpenAI: 429s are common; respect rate-limit headers strictly.
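For example, the official stripe-node SDK can retry transient failures itself; a sketch assuming its maxNetworkRetries option (check the SDK docs for your version):

```typescript
import Stripe from 'stripe';

// stripe-node retries network errors and retryable API errors on its own,
// reusing the same idempotency key across attempts
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!, {
  maxNetworkRetries: 2,
});
```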
For my integrations:
- Per-API quirks
- Custom classifier
Output:
- Classifier function
- Per-API customizations
The bug that surfaces in production: **retrying 401 / token-expired errors blindly**. Each retry has the same expired token. Either refresh-and-retry or give up; don't loop.
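A minimal refresh-and-retry sketch for the 401 case; getAccessToken and refreshAccessToken are hypothetical helpers:

```typescript
async function callWithAuth(url: string): Promise<Response> {
  let token = await getAccessToken();
  let resp = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });

  if (resp.status === 401) {
    // Refresh exactly once, then retry with the NEW token; never loop on 401
    token = await refreshAccessToken();
    resp = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
  }
  return resp;
}
```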
## Observing Retries
Help me observe retry behavior.
Metrics to emit:
Per-API:
- request_count{api, endpoint, status}
- request_duration_seconds{api, endpoint}
- retry_count{api, endpoint}
- retry_after_value{api, endpoint}
- circuit_breaker_state{api}
Aggregated:
- retry_storm: requests retried >3 times last hour
- failed_terminal: gave up after max attempts
- 5xx_rate
- 429_rate (rate-limited rate)
Alerts:
- retry_storm threshold (e.g. >10/min) → suspicious; investigate
- circuit_breaker_open → upstream down; alert
- 5xx_rate > 5% sustained → upstream issue
- 429_rate > 1% → you're being rate-limited; need higher tier or slow down
Logging discipline:
Don't log every retry at INFO; logs explode.
```
INFO:  HTTP request {api, endpoint, status, duration}
WARN:  HTTP retry {api, endpoint, attempt=2, status=503}
ERROR: HTTP failed terminal {api, endpoint, attempts=5}
```
INFO for normal; WARN for retries; ERROR for terminal failures.
Tracing:
Distributed tracing (OpenTelemetry) shows retries as spans:
```
parent span: charge_user
  child span: stripe_charge_attempt_1 (FAILED)
  child span: stripe_charge_attempt_2 (FAILED)
  child span: stripe_charge_attempt_3 (SUCCEEDED)
```
This reveals retry storms and per-API timing without log spelunking.
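A sketch of emitting those spans with @opentelemetry/api, wrapping each attempt in a child span; the tracer name and callStripeAPI are assumptions:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('http-retry');

async function chargeUser() {
  return tracer.startActiveSpan('charge_user', async (parent) => {
    try {
      return await retryWithBackoff(() =>
        tracer.startActiveSpan('stripe_charge_attempt', async (span) => {
          try {
            return await callStripeAPI();
          } catch (err) {
            span.setStatus({ code: SpanStatusCode.ERROR }); // mark failed attempts
            throw err;
          } finally {
            span.end();
          }
        })
      );
    } finally {
      parent.end();
    }
  });
}
```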
For my stack: [observability tools]
Output:
- Metrics list
- Alert rules
- Tracing setup
The single most useful alert: **circuit-breaker-open**. When your code gives up on an upstream, somebody should know within minutes. Catches "the entire upstream is down" before customers notice.
## Common Retry Pitfalls
Help me avoid retry pitfalls.
The 10 mistakes:
1. **No jitter.** Thundering herd; you DDoS the upstream when it's already weak.
2. **Retrying 4xx.** Wastes time; possibly costs money (paid APIs); fills logs.
3. **No timeout per attempt.** A hung request never retries; the total operation hangs.
4. **No max attempts.** Retry-forever scenarios.
5. **No idempotency key on POST retries.** Double-charging customers.
6. **Synchronous retry on the user request path.** A 30-second wait while you retry 5 times; the user already gave up.
7. **Logging every retry attempt at INFO.** Log explosion masks real issues.
8. **Reusing the same retry policy across very different APIs.** Stripe (5s timeout fine) vs LLM streaming (60s+) need different policies.
9. **Ignoring the Retry-After header.** The server says wait; you retry early; you get banned.
10. **No circuit breaker on consistently-failing dependencies.** Hammering a dead service; wasting resources; cascading failures.
For my system: [risks]
Output:
- Top 3 risks
- Mitigations
- Tests to add
The mistake that bites at scale: **retrying everything synchronously on user-facing path**. User clicks → you retry 5 times → 30s later user sees error. Move retries to background queue; respond fast; retry async; notify on outcome.
## Tooling: Don't Roll Your Own
Help me pick a library.
Modern HTTP clients with built-in retry:
Node.js / TypeScript:
- ky — fetch wrapper; built-in retry; small bundle
- axios — popular; retry via axios-retry
- undici — Node native; built-in retry-on-error
- got — feature-rich
- opossum — circuit breaker
Python:
- httpx — modern; transport-level retry
- tenacity — generic retry decorator
- requests + urllib3 — adapter-based retry
- pybreaker — circuit breaker
Go:
- hashicorp/go-retryablehttp
- avast/retry-go
- sony/gobreaker — circuit breaker
Patterns to keep DIY:
- Idempotency-key generation + storage
- Per-API custom error classification
- Business-logic retry (e.g. "retry only if user still owns this resource")
Don't reinvent:
- Backoff math
- Jitter
- Circuit breaker state machine
These are well-trodden; use the library.
For my stack:
- Language
- HTTP client today
Output:
- Library pick
- Migration plan if rolling-your-own
- Wrapper for business logic
The 2026 default in TypeScript: **ky for HTTP + opossum for circuit breakers**. Combined: ~20KB; covers 90% of needs; saves writing 200 lines of retry boilerplate.
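A hedged ky configuration sketch; option names reflect recent ky versions, so confirm against the docs for the version you install:

```typescript
import ky from 'ky';

const api = ky.create({
  timeout: 10_000, // per-attempt timeout
  retry: {
    limit: 3,
    methods: ['get', 'put', 'delete'],            // retry idempotent methods
    statusCodes: [408, 429, 500, 502, 503, 504],  // retryable statuses
    backoffLimit: 30_000,                         // cap the exponential delay
  },
});

const user = await api.get('https://api.example.com/v1/user').json();
```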
## What Done Looks Like
A working retry strategy delivers:
- Errors classified (retryable vs not) per upstream
- Exponential backoff with full jitter
- Retry-After / X-RateLimit-Reset headers respected
- Idempotency keys on POST/PUT/DELETE retries
- Per-attempt timeout + total timeout enforced
- Max attempts capped (3-5 typical)
- Circuit breaker on top-3 upstreams
- Fallback / fast-fail when circuit open
- Metrics: retry_count, circuit_state, terminal_failures
- Alerts: circuit-open, retry-storm, sustained-5xx
- Library used (not custom retry boilerplate)
- Test cases: simulated 503, 429, network errors, timeout
The proof you got it right: an upstream goes down for 5 minutes; your service gracefully degrades (queue retries, serve stale data, fast-fail) without DDoSing the upstream; recovers automatically when upstream returns. No customer-facing outage; alert fires; metrics show what happened.
## See Also
- [Idempotency Patterns](idempotency-patterns-chat.md) — the protection against double-execution
- [Rate Limiting & Abuse](rate-limiting-abuse-chat.md) — protection on YOUR endpoints
- [Webhook Signature Verification](webhook-signature-verification-chat.md) — companion delivery concern
- [Outbound Webhooks](outbound-webhooks-chat.md) — webhook delivery is this same retry problem seen from the sender's side
- [Inbound Webhooks](inbound-webhooks-chat.md) — receiving webhooks; idempotency required
- [Logging Strategy & Structured Logs](logging-strategy-structured-logs-chat.md) — log retries structured
- [Incident Response](incident-response-chat.md) — circuit-breaker-open triggers incident response
- [Service Level Agreements](service-level-agreements-chat.md) — your SLA depends on upstream availability
- [VibeReference: Background Jobs Providers](https://vibereference.dev/backend-and-data/background-jobs-providers) — async retry infrastructure
- [VibeReference: Observability Providers](https://vibereference.dev/devops-and-tools/observability-providers) — metrics + tracing