Background Jobs & Queue Management
If you're building a B2B SaaS in 2026 with anything async — sending emails, processing uploads, generating reports, calling slow APIs, running AI inference, scheduled tasks, webhooks — you need background jobs / queues. The naive approach: setTimeout on the request handler; hope it finishes. The structured approach: durable queue (Vercel Queues / BullMQ / Inngest / Trigger.dev / Temporal), worker process, retries, dead letter queue, monitoring, idempotency. Background jobs are infrastructure that's invisible until it breaks; bad implementation = lost emails, half-processed orders, mysterious failures. (See cron-scheduled-tasks-chat.md for cron-specific; this is general async work.)
1. Decide queue platform
Pick queue / background job platform.
Options:
Vercel Queues (recommended for Vercel users):
- Built into Vercel (public beta as of 2026)
- At-least-once delivery
- Durable; integrated with Functions
- Pricing: per-message + Function compute
- See vercel:queues skill
Inngest:
- Modern serverless background jobs
- Step functions + workflows
- TypeScript-first
- Pricing: free tier; usage-based
- Strong for: complex workflows
Trigger.dev:
- OSS background jobs
- Hosted + self-host
- TypeScript-focused
- Pricing: free OSS + cloud paid
Temporal:
- Workflow orchestration
- Durable execution
- Complex but powerful
- Pricing: cloud + self-host options
- Strong for: enterprise long-running
BullMQ + Redis:
- Node.js library
- DIY queue management
- Free; you host Redis
- Strong for: simple custom queues
AWS SQS:
- AWS-native
- Cheap + reliable
- Less developer-friendly
- Strong for: AWS-native
Cloud Tasks (GCP):
- GCP-native
- Similar to SQS
QStash (Upstash):
- HTTP-based queue
- Serverless-friendly
- Webhook delivery focus
For 2026 stack:
Vercel-deployed app → Vercel Queues + Inngest hybrid
Other deploys → Inngest or Trigger.dev for modern; Temporal for complex
Decision factors:
Complexity:
- Simple "send email later" → SQS / QStash / Vercel Queues
- Multi-step workflow → Inngest / Trigger.dev
- Long-running orchestration → Temporal
- Real-time processing → BullMQ + Redis
Stack:
- Vercel → Vercel Queues
- Serverless-first → Inngest
- Self-host → Trigger.dev or BullMQ
- Cloud-native → SQS / Cloud Tasks
For [USE CASE], output:
1. Recommendation
2. Stack alignment
3. Cost estimate
4. Migration path
5. Operational complexity
The 2026 default for Next.js + Vercel: Vercel Queues for simple jobs + Inngest for workflows. Hybrid covers most cases.
2. Job patterns — what to queue
Identify what should be queued.
Should queue:
Slow operations (>1s):
- API calls to slow services
- Image / video processing
- Document parsing / OCR
- Email sending
- LLM inference (long ones)
Outbound webhooks:
- Don't block your request on servers you don't control (customer endpoints)
- Retry on failure
Bulk operations:
- Bulk email
- Bulk import (CSV)
- Mass notifications
Scheduled tasks:
- Daily digests
- Weekly reports
- Cron jobs
Multi-step workflows:
- Onboarding sequence
- Order fulfillment
- AI agent runs
Don't queue:
Sub-100ms operations:
- DB queries
- Simple validation
- In-memory operations
User-facing real-time:
- Login authentication
- Forms requiring immediate response
- Sync API calls
Trade-offs:
Synchronous:
- Pro: simple; user sees result immediately
- Con: blocks the request; risks timeouts
Asynchronous:
- Pro: scalable; resilient
- Con: more complexity; UX must surface status updates
Hybrid (queue + status):
- Submit job → return job ID
- User polls or subscribes for completion
- Standard for non-trivial jobs
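A minimal sketch of the hybrid pattern, assuming a Next.js App Router route and a hypothetical `enqueueJob` helper that persists the job and pushes it to whatever queue you chose — adjust to your platform:

```ts
// app/api/reports/route.ts — hypothetical path; enqueueJob() is an assumed
// wrapper around your queue (Vercel Queues, Inngest, BullMQ, ...).
import { randomUUID } from "node:crypto";
import { enqueueJob } from "@/lib/queue"; // assumption: your own helper

export async function POST(req: Request) {
  const payload = await req.json();
  const jobId = randomUUID();

  // Submit the job; the slow work happens in a worker, not here.
  await enqueueJob({ id: jobId, type: "generate_report", payload });

  // Hybrid pattern: respond immediately with a handle the client can poll.
  return Response.json(
    { job_id: jobId, status_url: `/api/jobs/${jobId}` },
    { status: 202 }
  );
}
```

The 202 Accepted status signals "received, not finished" — the client follows `status_url` from there.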
For [USE CASE], output:
1. Operations to queue
2. Operations to keep sync
3. Hybrid candidates
4. UX for async (status / progress)
5. Failure handling
The "queue anything >1s" rule: requests that take >1 second risk timeout, frustration, retry storms. Queue + status response.
3. Job lifecycle — schema + states
Design job lifecycle.
States:
Pending:
- Job submitted; not yet processing
- In queue
Processing:
- Worker picked up
- Currently running
Completed:
- Successfully finished
- Result available
Failed:
- Failed beyond retries
- In dead letter queue
Retrying:
- Transient failure; will retry
Cancelled:
- User / system cancelled
Schema:
jobs table:
- id (UUID)
- type (string; 'send_email' / 'process_image' / etc.)
- payload (JSON)
- status (enum)
- created_at, updated_at
- started_at, completed_at, failed_at
- error_message
- retry_count, max_retries
- result (JSON; on completion)
- user_id, org_id (multi-tenant scoping)
Indexes:
- (status, created_at) for queue queries
- (user_id, status) for user queries
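The state machine and schema above, sketched as TypeScript types (documentation only — field names mirror the table columns; map them to your ORM of choice):

```ts
type JobStatus =
  | "pending"
  | "processing"
  | "retrying"
  | "completed"
  | "failed"
  | "cancelled";

interface Job<TPayload = unknown, TResult = unknown> {
  id: string;            // UUID
  type: string;          // 'send_email', 'process_image', ...
  payload: TPayload;     // JSON
  status: JobStatus;
  createdAt: Date;
  updatedAt: Date;
  startedAt?: Date;
  completedAt?: Date;
  failedAt?: Date;
  errorMessage?: string;
  retryCount: number;
  maxRetries: number;
  result?: TResult;      // set on completion
  userId?: string;       // multi-tenant scoping
  orgId?: string;
}
```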
For Vercel Queues / Inngest:
- They manage state internally
- You query their API for status
For DIY (BullMQ):
- Manage in Redis or DB
- Query by job ID
Status API for users:
GET /api/jobs/:id
- Returns: status, progress, result (if done), error (if failed)
User UX:
- Show progress bar / spinner
- Email when done (long jobs)
- In-product notification
Output:
1. State machine
2. Schema
3. Status API
4. UX patterns
5. Cleanup (delete old completed jobs)
The cleanup rule: completed jobs accumulate. Delete after 30 days; keep failed for 90 days for debugging.
4. Retries + idempotency
Failures happen. Plan for them.
Implement retries + idempotency.
Retry strategies:
Exponential backoff:
- 1 sec, 4 sec, 16 sec, 64 sec...
- Cap at a reasonable max (5 min); see the backoff sketch after this list
Linear:
- 30 sec, 60 sec, 90 sec...
- Simpler; less optimal
Custom per error type:
- Network: retry quickly
- Rate limit: respect Retry-After
- 5xx: retry
- 4xx: don't retry (won't fix itself)
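A sketch of the exponential-backoff schedule above (1s, 4s, 16s, 64s, capped), with jitter added so many failing jobs don't retry in lockstep — the 4x multiplier and 20% jitter are assumptions, tune to taste:

```ts
function retryDelayMs(attempt: number, capMs = 5 * 60 * 1000): number {
  const base = 1000 * Math.pow(4, attempt);  // attempt 0 → 1s, 1 → 4s, 2 → 16s, 3 → 64s
  const jitter = Math.random() * 0.2 * base; // up to +20% to de-synchronize retries
  return Math.min(base + jitter, capMs);     // never wait more than the cap (5 min)
}
```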
Max retries:
Standard: 3-5 retries
- After: dead letter queue
- Alert on DLQ
Per-operation:
- Critical: 10+ retries
- Best-effort: 1-2
Idempotency:
Why critical:
- Job might run twice (at-least-once delivery)
- Double-charge customer = bad
- Duplicate email = bad
Implementation:
Idempotency key:
- Each job has unique key
- DB constraint: unique on (type, idempotency_key)
- If duplicate: skip OR no-op
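A minimal enqueue-once sketch, assuming a Postgres `jobs` table with an `idempotency_key` column and a unique constraint on (type, idempotency_key); `db.query` stands in for your Postgres client (pg, Drizzle, etc.):

```ts
async function enqueueOnce(
  db: { query: (sql: string, params: unknown[]) => Promise<{ rowCount: number }> },
  job: { id: string; type: string; idempotencyKey: string; payload: unknown }
): Promise<boolean> {
  const res = await db.query(
    `INSERT INTO jobs (id, type, idempotency_key, payload, status)
     VALUES ($1, $2, $3, $4, 'pending')
     ON CONFLICT (type, idempotency_key) DO NOTHING`,
    [job.id, job.type, job.idempotencyKey, JSON.stringify(job.payload)]
  );
  // rowCount === 0 means a job with this key already exists: skip / no-op.
  return res.rowCount > 0;
}
```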
Operations to make idempotent:
Charge customer:
- Stripe idempotency key
- Use job's idempotency key
Send email:
- Track sent emails by message ID
- Skip if already sent
Update record:
- Use UPSERT semantics
- Or: check current state before update
Side-effect operations:
- Webhooks: log + dedupe before sending
- Notifications: track sent
Anti-patterns:
Naive retry:
- Retrying blindly causes double charges
- Pair retries with idempotency keys
No idempotency key:
- Hard to dedupe
- Add at job creation
Output:
1. Retry strategy per job type
2. Idempotency keys
3. Side-effect protection
4. DLQ + alerting
5. Test idempotency (run job 2x; assert no double-effect)
The Stripe idempotency-key pattern: API requests with key X are deduplicated server-side. Pass your job's idempotency key; Stripe handles the rest.
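A sketch of that pattern with the Stripe Node SDK — the payment parameters are placeholders; the important part is passing the job's key as the request's idempotency key so a retried job can't charge twice:

```ts
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

async function chargeCustomerJob(job: {
  idempotencyKey: string;
  payload: { customerId: string; amountCents: number };
}) {
  return stripe.paymentIntents.create(
    {
      customer: job.payload.customerId,
      amount: job.payload.amountCents,
      currency: "usd",
    },
    { idempotencyKey: job.idempotencyKey } // Stripe dedupes retries server-side
  );
}
```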
5. Dead letter queue (DLQ)
When jobs fail beyond retries.
Implement DLQ.
What goes to DLQ:
Permanently failed:
- Beyond max retries
- Non-retryable errors (4xx)
- Timeouts beyond limit
Process:
Job fails N times → DLQ
Alert: Slack + on-call
Investigate: error logs + payload
Action: manual rerun OR fix bug + rerun
Storage:
Same `jobs` table with status='failed'
Or: separate dlq_jobs table
Visibility:
Dashboard:
- Count of DLQ jobs
- Per-type breakdown
- Trend (rising = problem)
Alerting:
Threshold:
- >10 DLQ jobs in 1 hour → page
- >100 in 24 hours → page
Per-type:
- Email DLQ → email team
- Webhook DLQ → integration team
Replay:
Manual replay:
- Admin clicks "Retry" on DLQ job
- Useful after fixing bug
Bulk replay:
- "Retry all email DLQs"
- After deploy fix
Auto-replay:
- After successful test
- Risky; humans usually decide
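A bulk-replay sketch for the `jobs`-table approach above, assuming your own `db` and `enqueueJob` helpers: reset failed rows of one type to pending and re-enqueue them after the fix ships.

```ts
async function replayFailedJobs(
  db: { query: (sql: string, params: unknown[]) => Promise<{ rows: any[] }> },
  enqueueJob: (job: { id: string; type: string; payload: unknown }) => Promise<void>,
  type: string
): Promise<number> {
  const { rows } = await db.query(
    `UPDATE jobs
        SET status = 'pending', retry_count = 0, error_message = NULL
      WHERE status = 'failed' AND type = $1
      RETURNING id, type, payload`,
    [type]
  );
  for (const row of rows) {
    await enqueueJob({ id: row.id, type: row.type, payload: row.payload });
  }
  return rows.length; // how many DLQ jobs were replayed
}
```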
Anti-patterns:
DLQ ignored:
- Failed jobs pile up
- Customer-facing failures unnoticed
No alerting:
- Failures invisible
- Discovered weeks later
No replay path:
- Failures permanent
- Lost work
Output:
1. DLQ implementation
2. Alerting thresholds
3. Investigation workflow
4. Replay mechanism
5. Cleanup policy
The DLQ-as-canary pattern: DLQ count is health signal. Spike = something broken. Daily review = catch issues early.
6. Concurrency + rate limiting
Don't overwhelm downstream.
Manage concurrency.
Worker pool:
Size:
- Match downstream capacity
- Too many workers: overwhelm DB / API
- Too few: slow processing
Per-job-type:
- Email: 50 concurrent (high; Resend handles)
- LLM calls: 10 concurrent (expensive; rate-limited)
- Webhooks: 100 concurrent (parallelizable)
Rate limiting per upstream:
Stripe: 100 req/sec
OpenAI: 60 req/min on tier 1
Twilio: 1 msg/sec (per phone)
Implementation:
In-job rate limiter:
- Token bucket
- Sleep / delay if quota exhausted
Or: queue throttling
- Process N jobs per second max
- BullMQ rate limiter
- Inngest concurrency controls
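A minimal in-process token-bucket sketch for in-job throttling (e.g. "60 requests/minute" to a downstream API). Queue-level limiters (BullMQ, Inngest) are usually better across multiple workers, but this shows the idea:

```ts
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }

  private refill() {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
  }

  // Resolves once a token is available; call before each downstream request.
  async acquire(): Promise<void> {
    for (;;) {
      this.refill();
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      await new Promise((r) => setTimeout(r, 1000 / this.refillPerSec));
    }
  }
}

// Example: roughly 60 requests/minute → bucket of 60, refilled 1 token/second
const downstreamLimiter = new TokenBucket(60, 1);
```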
Backpressure:
If queue grows fast:
- Add more workers (auto-scale)
- Or: temporarily reject new jobs
- Or: alert ops
Priority:
Multi-tier:
- High priority (paid customers): faster processing
- Low priority (free): can wait
Implementation:
- Separate queues per priority
- Workers prefer high-priority
Per-tenant fairness:
Avoid noisy neighbor:
- One customer's job shouldn't block others
- Per-tenant rate limit
- Round-robin across tenants
Output:
1. Worker pool sizing
2. Rate limiting
3. Backpressure
4. Priority
5. Multi-tenant fairness
The "per-tenant fairness" rule: one customer with 10K queued jobs shouldn't block other tenants' single jobs. Round-robin or weighted-fair scheduling.
7. Monitoring + observability
Background jobs are invisible without instrumentation.
Monitor background jobs.
Metrics:
Per-job-type:
- Throughput (jobs / minute)
- Success rate
- Failure rate
- Avg duration
- p50 / p95 / p99 duration
Queue health:
- Queue depth (pending jobs)
- Age of oldest job
- Stuck jobs (processing >2x avg)
Worker health:
- Active workers
- Idle workers
- Crashed workers
Trends:
- Hourly / daily volume
- Sudden spikes / drops
- Compare to baseline
Alerts:
Queue backup:
- Queue depth > 1000 → alert
- Oldest job > 1 hour → alert
Failure rate spike:
- Failure rate > 5% → alert
- Per-type sudden change → alert
Stuck jobs:
- Processing > 2x avg → investigate
- Worker not heartbeating → restart
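A scheduled health-check sketch implementing the two queue-backup thresholds above, assuming the `jobs` table and your own `db` / `alert` helpers:

```ts
async function checkQueueHealth(
  db: { query: (sql: string) => Promise<{ rows: any[] }> },
  alert: (message: string) => Promise<void>
) {
  const { rows } = await db.query(
    `SELECT count(*)::int AS depth, min(created_at) AS oldest
       FROM jobs
      WHERE status = 'pending'`
  );
  const { depth, oldest } = rows[0];

  if (depth > 1000) {
    await alert(`Queue depth is ${depth} pending jobs (threshold: 1000)`);
  }

  const oldestAgeMs = oldest ? Date.now() - new Date(oldest).getTime() : 0;
  if (oldestAgeMs > 60 * 60 * 1000) {
    await alert(`Oldest pending job is ${Math.round(oldestAgeMs / 60_000)} min old (threshold: 60)`);
  }
}
```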
Tools:
Built-in:
- Vercel dashboard for Vercel Queues
- Inngest dashboard for Inngest
- Bull Board for BullMQ
Integrations:
- Send metrics to Datadog / Grafana / New Relic
- Logs to Sentry / Datadog Logs
- Traces (OpenTelemetry)
Custom dashboard:
- BI tool (Looker / Mode)
- Per-team metrics
Anti-patterns:
No metrics:
- Failures invisible
- Performance unknown
Metric overload:
- 100 metrics; nobody looks at any of them
- Track 5-10 priority KPIs instead
Logs only:
- Hard to spot trends
- Metrics surface patterns
Output:
1. Metric framework
2. Alerting thresholds
3. Tooling
4. Dashboard
5. On-call playbook
The "queue depth as canary" rule: queue grows = workers can't keep up. Either scale up workers or fix slow jobs. Critical metric.
8. Status updates to users
Long jobs need user feedback.
Communicate job status.
Patterns:
Immediate response:
- Job submitted; return job ID
- "Processing your request..."
Polling:
- Client polls status endpoint
- 1-5 sec interval
- Stop on completion (see the polling sketch after this list)
WebSocket / SSE:
- Server pushes updates
- Real-time progress
- For sub-second updates
Webhook (back to user's system):
- For customer integrations
- "Notify me when done at this URL"
Email when done:
- For long jobs (>30 sec)
- "We'll email you when ready"
Progress UI:
Spinner:
- Indeterminate; "working..."
- For unknown duration
Progress bar:
- "27 of 100 items processed"
- For known progress
Step indicator:
- "Step 2 of 5: Validating data"
- For multi-step
Time estimate:
- "~2 minutes remaining"
- For predictable durations
Cancellation:
Cancel button:
- For long-running
- Mark job as cancelled
- Worker checks; aborts gracefully
Constraints:
- Some jobs can't be cancelled (side effects already partially applied)
- Be explicit about which
Anti-patterns:
Silent processing:
- User waits; no feedback
- Frustration; users abandon
Misleading progress:
- "99%" stuck for minutes
- Worse than no progress
No completion signal:
- User wonders if done
- Always notify
Output:
1. Status communication pattern per job
2. UI per pattern
3. Cancellation handling
4. Notifications (email / in-app)
5. Long-job UX (background + email when done)
The "30-second rule" for UX: jobs >30s should show progress + offer "we'll email you." Anything else feels broken.
9. Local development + testing
Background jobs are hard to debug locally.
Run background jobs in local dev.
Options:
Vercel Queues:
- vercel dev runs locally
- Queues simulated locally
- Limited (single worker; no real queue)
Inngest:
- Inngest dev server
- Local UI for inspecting jobs
- Excellent dev experience
Trigger.dev:
- Local dev server
- Hot reload of jobs
BullMQ + Redis:
- Run Redis locally
- Worker process locally
- More setup; full feature set
Mock:
- In tests: mock the queue
- Don't push real jobs through it
- Call job logic directly (synchronously)
Testing:
Unit tests:
- Test job logic in isolation
- Mock external calls
- Fast
Integration tests:
- Submit job; wait for completion
- Test with real queue (test mode)
- Slower but catches integration bugs
Test fixtures:
Test data:
- Sample payloads
- Seeded DB state
Cleanup:
- Reset queue between tests
- Reset DB
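A "run it twice" idempotency test sketch in vitest syntax — `processJob` and `countSentEmails` are hypothetical helpers standing in for your job handler and a fixture that counts the side effect:

```ts
import { test, expect } from "vitest";
import { processJob } from "@/jobs/send-welcome-email";   // assumption: your job handler
import { countSentEmails } from "./helpers/email-fixture"; // assumption: test helper

test("welcome email job is idempotent", async () => {
  const job = {
    id: "job-1",
    type: "send_welcome_email",
    idempotencyKey: "user-42-welcome",
    payload: { userId: "user-42" },
  };

  await processJob(job);
  await processJob(job); // simulate at-least-once redelivery

  // The side effect happened exactly once despite running twice.
  expect(await countSentEmails("user-42")).toBe(1);
});
```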
Debugging tips:
Log payload + result:
- Always log job inputs + outputs
- Easier to reproduce
Replay capability:
- Save failed payloads
- Replay locally to debug
Inspect tool:
- Inngest / Trigger.dev / Bull Board show queue state
- Critical for debugging
Output:
1. Local dev setup
2. Testing strategy
3. Mock approaches
4. Logging / debugging
5. Replay workflow
The Inngest dev server: best-in-class for local debugging. Visual inspector; replay; step-through. If you use Inngest, install + use it.
10. Migration from sync to async
Existing sync code → async background.
Migrate sync to async.
Steps:
1. Identify candidate operations
- Slow >1s
- Side-effecting (emails, webhooks)
- Bulk
2. Wrap in job
- Same logic; queue submission
- Return job ID instead of result
3. Update API
- Was: synchronous response
- Becomes: job ID + status URL
4. Update frontend
- Show "processing" state
- Poll or subscribe for completion
5. Migrate gradually
- Feature flag: sync vs async
- Roll out to subset
- Compare performance
6. Cut over
- All users on async
- Remove sync path
Compatibility:
Backward compat:
- Old API still works (responds synchronously; internally enqueues the job and waits)
- New API explicit async
- Sunset old after migration
API design:
Sync API (legacy):
POST /api/process
Response: result (after wait)
Async API (new):
POST /api/process
Response: { job_id, status_url }
GET /api/jobs/:id
Response: { status, result?, error? }
Webhook (advanced):
Customer registers webhook URL
We POST result on completion
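A sketch of the status endpoint backing the async API above (Next.js App Router shape — the handler signature varies by Next.js version; `getJob` is an assumed lookup against the jobs table or your queue provider's API):

```ts
// app/api/jobs/[id]/route.ts — hypothetical path
import { getJob } from "@/lib/jobs"; // assumption: your own helper

export async function GET(
  _req: Request,
  { params }: { params: { id: string } }
) {
  const job = await getJob(params.id);
  if (!job) return Response.json({ error: "not_found" }, { status: 404 });

  return Response.json({
    status: job.status,
    result: job.status === "completed" ? job.result : undefined,
    error: job.status === "failed" ? job.errorMessage : undefined,
  });
}
```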
Anti-patterns:
Big-bang migration:
- Risky; long downtime potential
- Gradual + feature flag better
No backward compat:
- Breaks existing customers
- Always provide path
Output:
1. Migration plan
2. API design (sync + async)
3. Feature flag strategy
4. Rollout plan
5. Sunset timeline
The "feature flag + gradual rollout" pattern: 1% of users on async; monitor; 10%; 50%; 100%. Catches issues early.
What Done Looks Like
A v1 background job system for B2B SaaS in 2026:
- Queue platform chosen (Vercel Queues / Inngest / Trigger.dev)
- Job lifecycle + states defined
- Retry + exponential backoff
- Idempotency keys
- Dead letter queue + alerting
- Concurrency + rate limiting per job type
- Per-tenant fairness
- Metrics + alerting
- User-facing status (polling / WebSocket / email)
- Local dev experience
- Test coverage
Add later when product is mature:
- Multi-step workflows (Inngest / Temporal)
- Auto-scaling workers
- Priority queues
- Cancellation
- Replay tooling
- Per-tenant SLA differentiation
The mistake to avoid: running slow operations synchronously. They block requests, hit timeouts, and create bad UX.
The second mistake: no idempotency. Retries cause double-charges / double-emails.
The third mistake: DLQ ignored. Failures pile up; customers affected silently.
See Also
- Cron & Scheduled Tasks — adjacent (cron-specific)
- HTTP Retry & Backoff — retry patterns
- Idempotency Patterns — idempotency
- Outbound Webhooks — webhook delivery (often async)
- Inbound Webhooks — receiving webhooks
- Webhook Signature Verification — security
- Email Deliverability — email sending
- Image Upload Processing Pipeline — async image processing
- PDF Generation in App — async PDF
- AI Features Implementation — AI workloads (often async)
- LLM Cost Optimization — adjacent
- Performance Optimization — perf
- Multi-Region Deployment — adjacent
- Logging Strategy & Structured Logs — observability
- Metrics & OpenTelemetry Instrumentation — metrics
- Incident Response — when jobs fail at scale
- Quotas, Limits & Plan Enforcement — per-tenant limits
- VibeReference: Vercel Queues — Vercel native
- VibeReference: Vercel Workflow — durable workflows
- VibeReference: Vercel Functions — Functions
- VibeReference: Background Jobs Providers — Inngest / Trigger.dev / BullMQ
- VibeReference: Container & PaaS Platforms — long-running services
- VibeReference: Observability Providers — Datadog / New Relic
- VibeReference: Error Monitoring Providers — Sentry