Background Jobs & Queue Management

If you're building a B2B SaaS in 2026 with anything async — sending emails, processing uploads, generating reports, calling slow APIs, running AI inference, scheduled tasks, webhooks — you need background jobs / queues. The naive approach: setTimeout on the request handler; hope it finishes. The structured approach: durable queue (Vercel Queues / BullMQ / Inngest / Trigger.dev / Temporal), worker process, retries, dead letter queue, monitoring, idempotency. Background jobs are infrastructure that's invisible until it breaks; bad implementation = lost emails, half-processed orders, mysterious failures. (See cron-scheduled-tasks-chat.md for cron-specific; this is general async work.)

1. Decide queue platform

Pick queue / background job platform.

Options:

Vercel Queues (recommended for Vercel users):
- Built into Vercel (public beta as of 2026)
- At-least-once delivery
- Durable; integrated with Functions
- Pricing: per-message + Function compute
- See vercel:queues skill

Inngest:
- Modern serverless background jobs
- Step functions + workflows
- TypeScript-first
- Pricing: free tier; usage-based
- Strong for: complex workflows

Trigger.dev:
- OSS background jobs
- Hosted + self-host
- TypeScript-focused
- Pricing: free OSS + cloud paid

Temporal:
- Workflow orchestration
- Durable execution
- Complex but powerful
- Pricing: cloud + self-host options
- Strong for: enterprise long-running

BullMQ + Redis:
- Node.js library
- DIY queue management
- Free; you host Redis
- Strong for: simple custom queues

AWS SQS:
- AWS-native
- Cheap + reliable
- Less developer-friendly
- Strong for: teams already on AWS

Cloud Tasks (GCP):
- GCP-native
- Similar to SQS

QStash (Upstash):
- HTTP-based queue
- Serverless-friendly
- Webhook delivery focus

For 2026 stack:

Vercel-deployed app → Vercel Queues + Inngest hybrid
Other deploys → Inngest or Trigger.dev for modern; Temporal for complex

Decision factors:

Complexity:
- Simple "send email later" → SQS / QStash / Vercel Queues
- Multi-step workflow → Inngest / Trigger.dev
- Long-running orchestration → Temporal
- Real-time processing → BullMQ + Redis

Stack:
- Vercel → Vercel Queues
- Serverless-first → Inngest
- Self-host → Trigger.dev or BullMQ
- Cloud-native → SQS / Cloud Tasks

For [USE CASE], output:
1. Recommendation
2. Stack alignment
3. Cost estimate
4. Migration path
5. Operational complexity

The 2026 default for Next.js + Vercel: Vercel Queues for simple jobs + Inngest for workflows. Hybrid covers most cases.
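
As a concrete shape for the Inngest half, a minimal function plus the event that triggers it might look like this sketch (the app id, event name, and sendWelcomeEmail helper are placeholders, not an existing codebase):

```typescript
// Sketch: an Inngest background function triggered by an event.
// "my-saas", "app/user.signup", and sendWelcomeEmail are placeholders.
import { Inngest } from "inngest";

const inngest = new Inngest({ id: "my-saas" });

// Hypothetical email helper; swap in your provider call (Resend, SES, etc.).
async function sendWelcomeEmail(to: string): Promise<void> {
  // call your email provider here
}

export const sendWelcome = inngest.createFunction(
  { id: "send-welcome-email", retries: 3 },   // platform-managed retries
  { event: "app/user.signup" },
  async ({ event, step }) => {
    // step.run wraps a durable, retryable unit of work
    await step.run("send-email", () => sendWelcomeEmail(event.data.email));
  }
);

// In a request handler: enqueue by sending the event, then return immediately.
export async function onSignup(email: string) {
  await inngest.send({ name: "app/user.signup", data: { email } });
}
```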

2. Job patterns — what to queue

Identify what should be queued.

Should queue:

Slow operations (>1s):
- API calls to slow services
- Image / video processing
- Document parsing / OCR
- Email sending
- LLM inference (long ones)

Outbound webhooks:
- Delivery depends on the customer's server, not ours; don't block the request on it
- Retry on failure

Bulk operations:
- Bulk email
- Bulk import (CSV)
- Mass notifications

Scheduled tasks:
- Daily digests
- Weekly reports
- Cron jobs

Multi-step workflows:
- Onboarding sequence
- Order fulfillment
- AI agent runs

Don't queue:

Sub-100ms operations:
- DB queries
- Simple validation
- In-memory operations

User-facing real-time:
- Login authentication
- Forms requiring immediate response
- Sync API calls

Trade-offs:

Synchronous:
- Pro: simple; user sees result immediately
- Con: blocks request; timeouts

Asynchronous:
- Pro: scalable; resilient
- Con: more complexity; UX must surface status updates

Hybrid (queue + status):
- Submit job → return job ID
- User polls or subscribes for completion
- Standard for non-trivial jobs

For [USE CASE], output:
1. Operations to queue
2. Operations to keep sync
3. Hybrid candidates
4. UX for async (status / progress)
5. Failure handling

The "queue anything >1s" rule: requests that take >1 second risk timeout, frustration, retry storms. Queue + status response.

3. Job lifecycle — schema + states

Design job lifecycle.

States:

Pending:
- Job submitted; not yet processing
- In queue

Processing:
- Worker picked up
- Currently running

Completed:
- Successfully finished
- Result available

Failed:
- Failed beyond retries
- In dead letter queue

Retrying:
- Transient failure; will retry

Cancelled:
- User / system cancelled

Schema:

jobs table:
- id (UUID)
- type (string; 'send_email' / 'process_image' / etc.)
- payload (JSON)
- status (enum)
- created_at, updated_at
- started_at, completed_at, failed_at
- error_message
- retry_count, max_retries
- result (JSON; on completion)
- user_id, org_id (multi-tenant scoping)

Indexes:
- (status, created_at) for queue queries
- (user_id, status) for user queries

For Vercel Queues / Inngest:
- They manage state internally
- You query their API for status

For DIY (BullMQ):
- Manage in Redis or DB
- Query by job ID

Status API for users:

GET /api/jobs/:id
- Returns: status, progress, result (if done), error (if failed)
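
A sketch of that status endpoint, assuming the jobs table above and a hypothetical getJob lookup (the params signature follows the App Router convention and varies slightly by Next.js version):

```typescript
// Status endpoint sketch: app/api/jobs/[id]/route.ts
// getJob is a hypothetical lookup against the jobs table above.
import { NextResponse } from "next/server";

type Job = {
  id: string;
  status: "pending" | "processing" | "completed" | "failed" | "retrying" | "cancelled";
  result?: unknown;
  error_message?: string | null;
};

declare function getJob(id: string): Promise<Job | null>;

export async function GET(_req: Request, { params }: { params: { id: string } }) {
  const job = await getJob(params.id);
  if (!job) return NextResponse.json({ error: "not found" }, { status: 404 });

  return NextResponse.json({
    status: job.status,
    result: job.status === "completed" ? job.result : undefined,
    error: job.status === "failed" ? job.error_message : undefined,
  });
}
```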

User UX:
- Show progress bar / spinner
- Email when done (long jobs)
- In-product notification

Output:
1. State machine
2. Schema
3. Status API
4. UX patterns
5. Cleanup (delete old completed jobs)

The cleanup rule: completed jobs accumulate. Delete after 30 days; keep failed for 90 days for debugging.

4. Retries + idempotency

Failures happen. Plan for them.

Implement retries + idempotency.

Retry strategies:

Exponential backoff:
- 1 sec, 4 sec, 16 sec, 64 sec...
- Cap at reasonable max (5 min)

Linear:
- 30 sec, 60 sec, 90 sec...
- Simpler; less optimal

Custom per error type:
- Network: retry quickly
- Rate limit: respect Retry-After
- 5xx: retry
- 4xx: don't retry (won't fix itself)
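
A back-of-the-envelope backoff helper combining the exponential schedule, the cap, and the per-error-type rules above (the error shape is illustrative, not from a specific library):

```typescript
// Backoff sketch: exponential (1s, 4s, 16s, 64s, ...) with jitter and a 5-minute cap.
// Returns null for errors that should not be retried.
const MAX_DELAY_MS = 5 * 60 * 1000;

function nextRetryDelayMs(
  attempt: number, // 0 for the first retry
  err: { status?: number; retryAfterSeconds?: number }
): number | null {
  // 4xx (except 429): the request won't fix itself, don't retry
  if (err.status && err.status >= 400 && err.status < 500 && err.status !== 429) return null;

  // Rate limited: respect Retry-After if the upstream provides it
  if (err.status === 429 && err.retryAfterSeconds) return err.retryAfterSeconds * 1000;

  const base = Math.min(1000 * 4 ** attempt, MAX_DELAY_MS);
  return base + Math.random() * 1000; // jitter avoids synchronized retry storms
}
```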

Max retries:

Standard: 3-5 retries
- After: dead letter queue
- Alert on DLQ

Per-operation:
- Critical: 10+ retries
- Best-effort: 1-2

Idempotency:

Why critical:
- Job might run twice (at-least-once delivery)
- Double-charge customer = bad
- Duplicate email = bad

Implementation:

Idempotency key:
- Each job has unique key
- DB constraint: unique on (type, idempotency_key)
- If duplicate: skip (treat as a no-op)

Operations to make idempotent:

Charge customer:
- Stripe idempotency key
- Use job's idempotency key

Send email:
- Track sent emails by message ID
- Skip if already sent

Update record:
- Use UPSERT semantics
- Or: check current state before update

Side-effect operations:
- Webhooks: log + dedupe before sending
- Notifications: track sent

Anti-patterns:

Naive retry:
- Blindly retrying causes double-charges
- Pair retries with idempotency

No idempotency key:
- Hard to dedupe
- Add at job creation

Output:
1. Retry strategy per job type
2. Idempotency keys
3. Side-effect protection
4. DLQ + alerting
5. Test idempotency (run job 2x; assert no double-effect)

The Stripe idempotency-key pattern: API requests with key X are deduplicated server-side. Pass your job's idempotency key; Stripe handles the rest.
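
A sketch of that pattern with the official stripe-node client, reusing the job's idempotency key; the job shape and amounts are illustrative:

```typescript
// Idempotent charge sketch: a retried job reuses the same idempotency key,
// so Stripe dedupes the request instead of charging twice.
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

type ChargeJob = {
  idempotencyKey: string;
  payload: { customerId: string; amountCents: number };
};

export async function chargeCustomerJob(job: ChargeJob) {
  await stripe.paymentIntents.create(
    {
      customer: job.payload.customerId,
      amount: job.payload.amountCents,
      currency: "usd",
    },
    { idempotencyKey: job.idempotencyKey } // deduplicated server-side by Stripe
  );
}
```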

5. Dead letter queue (DLQ)

When jobs fail beyond retries.

Implement DLQ.

What goes to DLQ:

Permanently failed:
- Beyond max retries
- Non-retryable errors (4xx)
- Timeouts beyond limit

Process:

Job fails N times → DLQ
Alert: Slack + on-call
Investigate: error logs + payload
Action: manual rerun OR fix bug + rerun

Storage:

Same `jobs` table with status='failed'
Or: separate dlq_jobs table

Visibility:

Dashboard:
- Count of DLQ jobs
- Per-type breakdown
- Trend (rising = problem)

Alerting:

Threshold:
- >10 DLQ jobs in 1 hour → page
- >100 in 24 hours → page

Per-type:
- Email DLQ → email team
- Webhook DLQ → integration team

Replay:

Manual replay:
- Admin clicks "Retry" on DLQ job
- Useful after fixing bug

Bulk replay:
- "Retry all email DLQs"
- After deploy fix

Auto-replay:
- After successful test
- Risky; humans usually decide
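
A replay sketch for the manual path, assuming the jobs table from section 3; db and enqueueJob are hypothetical helpers:

```typescript
// DLQ replay sketch: reset a failed job's state, then put it back on the queue.
// db and enqueueJob are placeholders for your DB client and queue helper.
declare const db: {
  getJob(id: string): Promise<{ id: string; type: string; payload: unknown; status: string } | null>;
  updateJob(id: string, patch: Record<string, unknown>): Promise<void>;
};
declare function enqueueJob(type: string, payload: unknown, opts?: { jobId?: string }): Promise<void>;

export async function replayDlqJob(jobId: string) {
  const job = await db.getJob(jobId);
  if (!job || job.status !== "failed") return; // only replay DLQ jobs

  await db.updateJob(jobId, { status: "pending", retry_count: 0, error_message: null });
  await enqueueJob(job.type, job.payload, { jobId }); // same payload, same id
}
```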

Anti-patterns:

DLQ ignored:
- Failed jobs pile up
- Customer-facing failures unnoticed

No alerting:
- Failures invisible
- Discovered weeks later

No replay path:
- Failures permanent
- Lost work

Output:
1. DLQ implementation
2. Alerting thresholds
3. Investigation workflow
4. Replay mechanism
5. Cleanup policy

The DLQ-as-canary pattern: DLQ count is health signal. Spike = something broken. Daily review = catch issues early.

6. Concurrency + rate limiting

Don't overwhelm downstream.

Manage concurrency.

Worker pool:

Size:
- Match downstream capacity
- Too many workers: overwhelm DB / API
- Too few: slow processing

Per-job-type:
- Email: 50 concurrent (high; Resend handles)
- LLM calls: 10 concurrent (expensive; rate-limited)
- Webhooks: 100 concurrent (parallelizable)

Rate limiting per upstream:

Stripe: 100 req/sec
OpenAI: 60 req/min on tier 1
Twilio: 1 msg/sec (per phone)

Implementation:

In-job rate limiter:
- Token bucket
- Sleep / delay if quota exhausted

Or: queue throttling
- Process N jobs per second max
- BullMQ rate limiter
- Inngest concurrency controls
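
A BullMQ sketch of queue-level throttling plus worker concurrency, using the LLM numbers above; the queue name, connection details, and runLlmCall helper are placeholders:

```typescript
// Concurrency + rate limiting sketch with BullMQ:
// 10 jobs in flight, at most 60 started per minute (matching a 60 req/min upstream).
import { Worker } from "bullmq";

declare function runLlmCall(payload: unknown): Promise<unknown>;

const worker = new Worker(
  "llm-jobs",
  async (job) => runLlmCall(job.data),
  {
    connection: { host: "localhost", port: 6379 },
    concurrency: 10,                        // parallel jobs per worker process
    limiter: { max: 60, duration: 60_000 }, // queue-wide throughput cap
  }
);

worker.on("failed", (job, err) => {
  console.error(`job ${job?.id} failed:`, err.message);
});
```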

Backpressure:

If queue grows fast:
- Add more workers (auto-scale)
- Or: reject new jobs (temporary)
- Or: alert ops

Priority:

Multi-tier:
- High priority (paid customers): faster processing
- Low priority (free): can wait

Implementation:
- Separate queues per priority
- Workers prefer high-priority

Per-tenant fairness:

Avoid noisy neighbor:
- One customer's job shouldn't block others
- Per-tenant rate limit
- Round-robin across tenants

Output:
1. Worker pool sizing
2. Rate limiting
3. Backpressure
4. Priority
5. Multi-tenant fairness

The "per-tenant fairness" rule: one customer with 10K queued jobs shouldn't block other tenants' single jobs. Round-robin or weighted-fair scheduling.

7. Monitoring + observability

Background jobs are invisible without instrumentation.

Monitor background jobs.

Metrics:

Per-job-type:
- Throughput (jobs / minute)
- Success rate
- Failure rate
- Avg duration
- p50 / p95 / p99 duration

Queue health:
- Queue depth (pending jobs)
- Age of oldest job
- Stuck jobs (processing >2x avg)

Worker health:
- Active workers
- Idle workers
- Crashed workers

Trends:
- Hourly / daily volume
- Sudden spikes / drops
- Compare to baseline

Alerts:

Queue backup:
- Queue depth > 1000 → alert
- Oldest job > 1 hour → alert

Failure rate spike:
- Failure rate > 5% → alert
- Per-type sudden change → alert

Stuck jobs:
- Processing > 2x avg → investigate
- Worker not heartbeating → restart
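
A queue-health check sketch against the jobs table from section 3, using the thresholds above; the sql tag and alert function stand in for your DB client and pager:

```typescript
// Queue-health sketch: depth + age of oldest pending job, alert past thresholds.
// sql and alert are placeholders for your DB client and alerting integration.
declare function sql<T>(strings: TemplateStringsArray, ...values: unknown[]): Promise<T[]>;
declare function alert(message: string): Promise<void>;

export async function checkQueueHealth() {
  const [row] = await sql<{ depth: number; oldest_age_sec: number }>`
    SELECT count(*)::int AS depth,
           coalesce(extract(epoch FROM now() - min(created_at)), 0)::int AS oldest_age_sec
    FROM jobs
    WHERE status = 'pending'
  `;

  if (row.depth > 1000) await alert(`queue depth ${row.depth} exceeds 1000`);
  if (row.oldest_age_sec > 3600) await alert(`oldest pending job is ${row.oldest_age_sec}s old`);
}
```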

Tools:

Built-in:
- Vercel dashboard for Vercel Queues
- Inngest dashboard for Inngest
- Bull Board for BullMQ

Integrations:
- Send metrics to Datadog / Grafana / New Relic
- Logs to Sentry / Datadog Logs
- Traces (OpenTelemetry)

Custom dashboard:
- BI tool (Looker / Mode)
- Per-team metrics

Anti-patterns:

No metrics:
- Failures invisible
- Performance unknown

Metric overload:
- 100 metrics; nobody looks
- Pick 5-10 priority KPIs instead

Logs only:
- Hard to spot trends
- Metrics surface patterns

Output:
1. Metric framework
2. Alerting thresholds
3. Tooling
4. Dashboard
5. On-call playbook

The "queue depth as canary" rule: queue grows = workers can't keep up. Either scale up workers or fix slow jobs. Critical metric.

8. Status updates to users

Long jobs need user feedback.

Communicate job status.

Patterns:

Immediate response:
- Job submitted; return job ID
- "Processing your request..."

Polling:
- Client polls status endpoint
- 1-5 sec interval
- Stop on completion

WebSocket / SSE:
- Server pushes updates
- Real-time progress
- For sub-second updates

Webhook (back to user's system):
- For customer integrations
- "Notify me when done at this URL"

Email when done:
- For long jobs (>30 sec)
- "We'll email you when ready"

Progress UI:

Spinner:
- Indeterminate; "working..."
- For unknown duration

Progress bar:
- "27 of 100 items processed"
- For known progress

Step indicator:
- "Step 2 of 5: Validating data"
- For multi-step

Time estimate:
- "~2 minutes remaining"
- For predictable

Cancellation:

Cancel button:
- For long-running
- Mark job as cancelled
- Worker checks; aborts gracefully

Constraints:
- Some jobs are uncancellable (side effects already partially applied)
- Be explicit
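
A cooperative-cancellation sketch for a bulk job: check the cancelled flag between items and stop cleanly; isCancelled and processItem are placeholders:

```typescript
// Cancellation sketch: the worker checks job status between items and aborts gracefully.
// Items already processed keep their side effects, which is why some jobs are uncancellable.
declare function isCancelled(jobId: string): Promise<boolean>;
declare function processItem(item: unknown): Promise<void>;

export async function runBulkJob(jobId: string, items: unknown[]) {
  for (const item of items) {
    if (await isCancelled(jobId)) return { cancelled: true };
    await processItem(item);
  }
  return { cancelled: false };
}
```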

Anti-patterns:

Silent processing:
- User waits; no feedback
- Frustration; abandons

Misleading progress:
- "99%" stuck for minutes
- Worse than no progress

No completion signal:
- User wonders if done
- Always notify

Output:
1. Status communication pattern per job
2. UI per pattern
3. Cancellation handling
4. Notifications (email / in-app)
5. Long-job UX (background + email when done)

The "30-second rule" for UX: jobs >30s should show progress + offer "we'll email you." Anything else feels broken.

9. Local development + testing

Background jobs are hard to debug locally.

Run background jobs in local dev.

Options:

Vercel Queues:
- vercel dev runs locally
- Queues simulated locally
- Limited (single worker; no real queue)

Inngest:
- Inngest dev server
- Local UI for inspecting jobs
- Excellent dev experience

Trigger.dev:
- Local dev server
- Hot reload of jobs

BullMQ + Redis:
- Run Redis locally
- Worker process locally
- More setup; full feature set

Mock:
- In tests, mock the queue
- Don't run jobs through a real queue
- Invoke job logic synchronously instead

Testing:

Unit tests:
- Test job logic in isolation
- Mock external calls
- Fast

Integration tests:
- Submit job; wait for completion
- Test with real queue (test mode)
- Slower but catches integration bugs
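
A unit-test sketch that runs job logic synchronously with mocked dependencies and asserts idempotency by running it twice; vitest is assumed as the runner and the handler is hypothetical:

```typescript
// Test sketch: invoke the job handler directly (no real queue), run it twice,
// and assert the side effect happens once. sendEmailJob is a hypothetical handler.
import { describe, expect, it, vi } from "vitest";

type Deps = {
  alreadySent: (key: string) => Promise<boolean>;
  markSent: (key: string) => Promise<void>;
  send: (to: string) => Promise<void>;
};

async function sendEmailJob(deps: Deps, payload: { key: string; to: string }) {
  if (await deps.alreadySent(payload.key)) return; // dedupe on idempotency key
  await deps.send(payload.to);
  await deps.markSent(payload.key);
}

describe("sendEmailJob", () => {
  it("does not send twice for the same idempotency key", async () => {
    const sent = new Set<string>();
    const deps: Deps = {
      alreadySent: async (key) => sent.has(key),
      markSent: async (key) => { sent.add(key); },
      send: vi.fn(async () => {}),
    };

    await sendEmailJob(deps, { key: "k1", to: "a@example.com" });
    await sendEmailJob(deps, { key: "k1", to: "a@example.com" });

    expect(deps.send).toHaveBeenCalledTimes(1);
  });
});
```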

Test fixtures:

Test data:
- Sample payloads
- Seeded DB state

Cleanup:
- Reset queue between tests
- Reset DB

Debugging tips:

Log payload + result:
- Always log job inputs + outputs
- Easier to reproduce

Replay capability:
- Save failed payloads
- Replay locally to debug

Inspect tool:
- Inngest / Trigger.dev / Bull Board show queue state
- Critical for debugging

Output:
1. Local dev setup
2. Testing strategy
3. Mock approaches
4. Logging / debugging
5. Replay workflow

The Inngest dev server: best-in-class for local debugging. Visual inspector; replay; step-through. If you use Inngest, install + use it.

10. Migration from sync to async

Existing sync code → async background.

Migrate sync to async.

Steps:

1. Identify candidate operations
- Slow >1s
- Side-effecting (emails, webhooks)
- Bulk

2. Wrap in job
- Same logic; queue submission
- Return job ID instead of result

3. Update API
- Was: synchronous response
- Becomes: job ID + status URL

4. Update frontend
- Show "processing" state
- Poll or subscribe for completion

5. Migrate gradually
- Feature flag: sync vs async
- Roll out to subset
- Compare performance

6. Cut over
- All users on async
- Remove sync path

Compatibility:

Backward compat:
- Old API keeps working (responds synchronously; wraps the async job internally)
- New API explicit async
- Sunset old after migration

API design:

Sync API (legacy):
POST /api/process
Response: result (after wait)

Async API (new):
POST /api/process
Response: { job_id, status_url }

GET /api/jobs/:id
Response: { status, result?, error? }

Webhook (advanced):
Customer registers webhook URL
We POST result on completion
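
A feature-flag sketch of the cutover: the same endpoint serves the legacy sync path and the new async contract; flags.isEnabled, enqueueJob, and processSync are hypothetical helpers:

```typescript
// Gradual-migration sketch: a flag decides sync vs async per user.
// flags, enqueueJob, and processSync are placeholders for your own helpers.
import { NextResponse } from "next/server";

declare const flags: { isEnabled(flag: string, userId: string): Promise<boolean> };
declare function enqueueJob(type: string, payload: unknown): Promise<{ id: string }>;
declare function processSync(payload: unknown): Promise<unknown>;

// POST /api/process
export async function POST(req: Request) {
  const { userId, ...payload } = await req.json();

  if (await flags.isEnabled("async-processing", userId)) {
    const job = await enqueueJob("process", payload);
    return NextResponse.json(
      { job_id: job.id, status_url: `/api/jobs/${job.id}` },
      { status: 202 }
    );
  }

  // Legacy sync path: keep until the rollout hits 100%, then sunset.
  const result = await processSync(payload);
  return NextResponse.json({ result });
}
```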

Anti-patterns:

Big-bang migration:
- Risky; long downtime potential
- Gradual + feature flag better

No backward compat:
- Breaks existing customers
- Always provide path

Output:
1. Migration plan
2. API design (sync + async)
3. Feature flag strategy
4. Rollout plan
5. Sunset timeline

The "feature flag + gradual rollout" pattern: 1% of users on async; monitor; 10%; 50%; 100%. Catches issues early.

What Done Looks Like

A v1 background job system for B2B SaaS in 2026:

  • Queue platform chosen (Vercel Queues / Inngest / Trigger.dev)
  • Job lifecycle + states defined
  • Retry + exponential backoff
  • Idempotency keys
  • Dead letter queue + alerting
  • Concurrency + rate limiting per job type
  • Per-tenant fairness
  • Metrics + alerting
  • User-facing status (polling / WebSocket / email)
  • Local dev experience
  • Test coverage

Add later when product is mature:

  • Multi-step workflows (Inngest / Temporal)
  • Auto-scaling workers
  • Priority queues
  • Cancellation
  • Replay tooling
  • Per-tenant SLA differentiation

The mistake to avoid: synchronous slow operations. Block requests; timeout; bad UX.

The second mistake: no idempotency. Retries cause double-charges / double-emails.

The third mistake: DLQ ignored. Failures pile up; customers affected silently.

See Also