Bulk Operations & Batch Processing: Let Customers Edit 5,000 Things at Once Without Breaking Your Database
Bulk Operations Strategy for Your New SaaS
Goal: Ship bulk operations (multi-select edit, bulk delete, mass tag, batch import-and-process) that let customers act on hundreds or thousands of items at once — with progress UI, partial-success handling, undo capability, and back-end safety so the database doesn't lock during the operation. Avoid the failure modes where founders ship "select all" without backend support (UI spinning forever as 50K rows update synchronously), no progress indicator (customer thinks it's broken), no undo (one accidental "delete all" call ruins the customer's day), or no rate limit (one bulk op blocks every other customer's requests).
Process: Follow this chat pattern with your AI coding tool such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.
Timeframe: Backend bulk-action API + progress UI shipped in week 1. Partial-success handling + undo in week 2. Rate limits + audit + edge cases in week 3. Quarterly review baked in.
## Why Most Founder Bulk Ops Are Broken
Three failure modes hit founders the same way:
- Sync UI processing. Founder writes "select all → click delete" that calls a single endpoint with 5,000 IDs. Backend processes inline; HTTP timeout at 30 seconds; customer sees error; some rows deleted, others not; state is inconsistent.
- No partial-success handling. Bulk operation fails on row 47 of 5,000 and rolls back everything; customer is angry that 46 valid edits were lost. Or it doesn't roll back, the customer doesn't know which rows succeeded, and they reconcile state by hand.
- No undo. Customer accidentally clicks "Delete all" with the wrong filter applied; loses 5,000 records; ends up in tears at support. No rollback option; angry tweet incoming.
The version that works is structured: async backend processing with progress, partial-success per row with detailed status, undo for destructive operations within a window, rate limiting per workspace, and audit logging for accountability.
This guide assumes you have already done Authentication (bulk ops are user-scoped), have shipped Multi-Tenant Data Isolation (bulk respects workspace boundaries), have shipped Roles & Permissions (RBAC) (some bulk ops are admin-only), have considered Background Jobs Providers (the queue layer), have shipped Audit Logs (bulk ops are high-value events), and have shipped Rate Limiting & Abuse Prevention.
## 1. Decide Which Operations Need Bulk
Not every action needs a bulk version. Decide deliberately.
Help me decide which actions need bulk variants.
The candidates:
**Always need bulk** (most products):
- Delete (multi-select; one of the most-requested)
- Tag / un-tag (batch labeling)
- Move / archive (organizational)
- Status / state changes (mark-as-done; mark-as-shipped; etc.)
- Assignment changes (assign to user X)
**Sometimes need bulk**:
- Export (per [account-deletion-data-export](account-deletion-data-export-chat.md) — bulk export = data takeout)
- Edit fields (multi-select edit one field across rows)
- Apply template / preset
**Rarely need bulk**:
- Create (bulk creation usually = [CSV import](csv-import-chat.md) instead)
- Detailed-edit (bulk edit on most fields creates conflicts)
**Don't need bulk**:
- Per-record operations that have side effects (sending emails — too risky in bulk)
- Workflow transitions that require per-record review
For my product:
- The 5-10 actions that customers do repeatedly
- Which ones would benefit from "do this to N items at once"
- Which ones are too risky for bulk
Output:
1. The bulk-action catalog
2. The risk assessment per action
3. The "skip bulk" list with reasons
The biggest unforced error: shipping bulk on every action because "users want it". Some actions (sending emails to N customers; permanent deletes without undo) are too risky for bulk without serious safeguards. Triage: is the bulk version actually needed, and what safeguards must come with it?
## 2. Process Asynchronously With Progress
The HTTP request that "kicks off" a bulk op shouldn't process the whole thing inline. Use a queue.
Design the async processing pipeline.
The pattern:
**Phase 1: Receive (HTTP handler)**
1. Customer selects N items + clicks bulk action
2. Frontend POSTs `{ action: 'delete', ids: [...] }` to `/api/bulk/operations`
3. Backend validates: user has permission for this action; all IDs in same workspace; quota OK
4. Backend creates a `bulk_operations` record:
- operation_id (UUID)
- user_id, workspace_id
- action type
- target IDs
- status: 'pending'
- total_count, processed_count: 0
5. Backend enqueues a background job (per [background-jobs-providers](https://www.vibereference.com/backend-and-data/background-jobs-providers))
6. Returns: `{ operation_id, status_url }`
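To make Phase 1 concrete, here is a minimal, framework-agnostic sketch of the receive step. The `enqueue` callback and `maxItemsPerOp` parameter are hypothetical stand-ins for your real queue client and quota check, not a specific library's API:

```typescript
import { randomUUID } from "node:crypto";

type BulkRequest = { action: string; ids: string[] };
type BulkOperation = {
  id: string;
  action: string;
  targetIds: string[];
  status: "pending";
  totalCount: number;
  processedCount: number;
};

function receiveBulkRequest(
  req: BulkRequest,
  maxItemsPerOp: number,
  enqueue: (op: BulkOperation) => void,
): { operationId: string; statusUrl: string } {
  // Validate before anything is persisted or enqueued
  if (req.ids.length === 0) throw new Error("empty selection");
  if (req.ids.length > maxItemsPerOp) throw new Error("selection exceeds per-op limit");
  const op: BulkOperation = {
    id: randomUUID(),
    action: req.action,
    targetIds: [...req.ids], // persist the IDs now; never re-run the query later
    status: "pending",
    totalCount: req.ids.length,
    processedCount: 0,
  };
  enqueue(op); // hand off to the background worker immediately
  return { operationId: op.id, statusUrl: `/api/bulk/operations/${op.id}` };
}
```

The key property: the handler returns in milliseconds regardless of how many IDs were selected, because all real work happens in Phase 2.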
**Phase 2: Process (background worker)**
1. Worker picks up the operation
2. Marks status: 'processing'
3. For each ID in the target list:
- Apply the action
- Update per-item status (success / failure / reason)
- Increment processed_count
4. On completion: status: 'completed' (or 'completed_with_errors')
**Phase 3: Notify**
1. Frontend polls `status_url` (or websocket)
2. Shows progress: "Processed 47 of 5,000 (1%)"
3. On completion: success summary or error report
**Storage**:
```sql
CREATE TABLE bulk_operations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
workspace_id UUID NOT NULL REFERENCES workspaces(id),
action TEXT NOT NULL, -- 'delete', 'tag', 'archive', etc.
target_count INT NOT NULL,
processed_count INT NOT NULL DEFAULT 0,
succeeded_count INT NOT NULL DEFAULT 0,
failed_count INT NOT NULL DEFAULT 0,
status TEXT NOT NULL DEFAULT 'pending', -- pending / processing / completed / failed / cancelled
parameters JSONB, -- action-specific params (new tag value, etc.)
started_at TIMESTAMP,
completed_at TIMESTAMP,
error_message TEXT,
is_undoable BOOLEAN NOT NULL DEFAULT FALSE,
undo_until TIMESTAMP,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE TABLE bulk_operation_items (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
operation_id UUID NOT NULL REFERENCES bulk_operations(id) ON DELETE CASCADE,
target_id UUID NOT NULL,
status TEXT NOT NULL DEFAULT 'pending', -- pending / succeeded / failed / skipped
error_message TEXT,
previous_state JSONB, -- for undo
processed_at TIMESTAMP
);
CREATE INDEX idx_bulk_op_items ON bulk_operation_items(operation_id);
```
Critical implementation rules:
- Never block the HTTP request on long-running bulk ops.
- Persist target IDs. Don't re-fetch from a query (state may change mid-op).
- Process in batches within the worker (e.g., 100 at a time; transaction per batch).
- Update progress incrementally. Don't wait until the end to surface state.
- Handle worker death gracefully. Resume from the last processed item.
Don't:
- Run the operation inline in the HTTP request
- Lock the entire dataset (use small transactions)
- Forget partial-success state
Output:
- The bulk_operations schema
- The endpoint code
- The worker code with batching
- The progress-polling endpoint
The biggest single performance win: **batched async processing.** A 5,000-item bulk op processed in 50 batches of 100 each (each in its own transaction) completes in seconds without locking; the same op processed inline in one transaction can lock the table for minutes and fail under load.
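Here is a minimal sketch of that batched worker loop. The `applyAction` and `onProgress` callbacks are hypothetical stand-ins for your real action handler and progress writer; in production each batch would also run inside its own database transaction:

```typescript
type ItemResult = { id: string; ok: boolean; error?: string };

// Split a target list into fixed-size batches
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

async function processBulkOperation(
  targetIds: string[],
  applyAction: (id: string) => Promise<void>,
  onProgress: (processed: number, total: number) => void,
  batchSize = 100,
): Promise<ItemResult[]> {
  const results: ItemResult[] = [];
  for (const batch of chunk(targetIds, batchSize)) {
    // In a real worker, open one transaction per batch here
    for (const id of batch) {
      try {
        await applyAction(id);
        results.push({ id, ok: true }); // per-item success
      } catch (err) {
        results.push({ id, ok: false, error: String(err) }); // per-item failure, op continues
      }
    }
    onProgress(results.length, targetIds.length); // incremental progress after each batch
  }
  return results;
}
```

Note that one failing item records an error and moves on; it never aborts the surrounding batch or the operation.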
---
## 3. Handle Partial Success
Bulk ops can fail on individual items. Don't roll back the whole thing.
Design partial-success handling.
The pattern:
Per-item result tracking:
For each item in the bulk op:
- Apply the action
- If success: mark `bulk_operation_items.status = 'succeeded'`
- If failure: mark `'failed'` + reason
- Increment counters on the parent `bulk_operations` record
Common failure reasons:
- Item already deleted (race with another operation)
- Item lacks required state (e.g., can't archive an active subscription)
- Permission lost during operation (user demoted mid-op)
- External service failure (e.g., search index update failed)
The customer-facing result UI:
After completion:
- "Processed 5,000 items: 4,950 succeeded, 50 failed"
- Expandable: "View failed items" → list with reasons
- "Retry failed" button (re-runs only failed items)
Aggregating failure reasons:
If 50 of 5,000 fail, all with the same reason: surface it as one error type ("50 items couldn't be archived because they're currently active"). Don't list 50 individual errors.
The "all-or-nothing" mode (rare but useful):
Some operations should be atomic:
- "Move all selected items to project X" — if X is full, fail all
- Use a transaction; rollback on first error
- Fewer use cases; document explicitly
Critical implementation rules:
- Default to per-item independence. One failure shouldn't cancel others.
- Surface failures clearly. Customer needs to know what failed and why.
- Make retry-failed possible. Don't make the customer re-select.
- Rate-limit failures. If 1,000 fail in a row, halt and alert (something's wrong).
Don't:
- Roll back successes when one item fails
- Hide failures
- Treat failures as "the operation failed" globally
Output:
- The per-item status tracking
- The aggregation logic
- The result UI with retry-failed
- The "circuit breaker" for cascading failures
The biggest customer-trust factor: **showing exactly which items failed and why.** A customer who can see "47 failed because [reason]" can fix and retry; a customer who just sees "operation failed" has no path forward.
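A sketch of that aggregation step: group failed items by their recorded reason so the result UI shows one line per distinct error rather than thousands of rows. The `ItemOutcome` shape mirrors `bulk_operation_items` but is illustrative:

```typescript
type ItemOutcome = {
  targetId: string;
  status: "succeeded" | "failed";
  errorMessage?: string;
};

// Collapse per-item failures into grouped, human-readable summaries
function summarizeFailures(items: ItemOutcome[]): { message: string; count: number }[] {
  const byReason = new Map<string, number>();
  for (const item of items) {
    if (item.status !== "failed") continue;
    const reason = item.errorMessage ?? "unknown error";
    byReason.set(reason, (byReason.get(reason) ?? 0) + 1);
  }
  // One summary line per distinct reason, most common first
  return [...byReason.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([message, count]) => ({ message, count }));
}
```

The UI renders each summary as "{count} items failed: {message}", with the raw per-item list behind an expandable "View failed items" control.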
---
## 4. Build Undo for Destructive Operations
Bulk delete is the most-feared operation. Make it undoable.
Design undo.
The pattern:
For destructive operations (delete, archive, mass-state-change):
- Store previous state in `bulk_operation_items.previous_state`
- Mark `is_undoable: true` on the `bulk_operations` record
- Set `undo_until` (typically 30 minutes to 24 hours)
Undo UI:
After successful destructive op:
- Banner / toast: "Deleted 5,000 items. [Undo]"
- Persistent until dismissed or undo_until expires
- Click "Undo" → restore items
- Or: visit operation detail page; click "Undo this operation"
Undo execution:
When customer clicks undo:
- Background job processes each item in the operation
- Restores from `previous_state`
- Marks the original op as undone
- Notifies customer when complete
Undo storage:
```jsonc
// previous_state for a delete:
{
  "deleted": false,
  "data": { "title": "...", "content": "...", ... }
}

// previous_state for an archive:
{
  "archived": false,
  "archived_at": null
}

// previous_state for a tag change:
{
  "tags": ["old-tag-1", "old-tag-2"]
}
```
The undo window:
- Short (30 min): low storage cost; risks surprising customers
- Long (7 days): higher storage cost; customer-friendly
- Default: 24 hours
Critical implementation rules:
- Capture previous state BEFORE applying changes. Once changed, you can't recover it otherwise.
- Store enough to fully restore. Don't store just IDs; store the fields needed to rebuild.
- Clean up after the undo window. Stale previous_state consumes storage.
- Audit undo events. Per Audit Logs: when an undo runs, log it.
Edge cases:
- Undo after time has passed: the customer deletes X, time passes, then undoes; restore X as it was at delete time. If a conflicting record was created in the meantime, surface the conflict rather than silently overwriting.
- Cascade restore: delete a project; cascade-deletes 50 tasks. Undo: restore project AND tasks. Track relationship.
- External service involvement: bulk op also called external API (e.g., Stripe). Undo may not be possible there. Document.
For non-undoable ops:
Some ops can't be undone (e.g., already-sent webhooks, already-charged payments). Surface explicitly: "This action cannot be undone. Type 'CONFIRM' to proceed."
Don't:
- Make undo unavailable for destructive ops (always provide unless impossible)
- Use a global undo (per-operation is clearer)
- Forget cascade restoration
Output:
- The undo storage schema
- The undo execution worker
- The undo UI / banner
- The undo-window cleanup job
The single biggest customer-trust feature: **a 30-minute undo window after bulk delete.** A customer who panic-deletes 1,000 items at 11pm clicks undo at 11:01pm; everything restored. Without undo, they file a frantic support ticket and you spend hours restoring from backup.
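The capture-then-restore mechanic can be sketched as follows. The field names (`archived`, `archivedAt`) are illustrative; the point is the ordering, snapshot before mutate:

```typescript
type Item = { id: string; archived: boolean; archivedAt: string | null };

// Snapshot only the fields the action will touch, BEFORE mutating them
function archiveWithSnapshot(item: Item): Partial<Item> {
  const previousState: Partial<Item> = {
    archived: item.archived,
    archivedAt: item.archivedAt,
  };
  item.archived = true;
  item.archivedAt = new Date().toISOString();
  return previousState; // persist this as bulk_operation_items.previous_state
}

// Undo: write the snapshot back over the current state
function undoFromSnapshot(item: Item, previousState: Partial<Item>): void {
  Object.assign(item, previousState);
}
```

In production the snapshot is the JSONB `previous_state` column, and the undo worker replays `undoFromSnapshot` per item, respecting the `undo_until` deadline.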
---
## 5. Rate-Limit and Quota Bulk Ops
One customer running a 100K-item op can starve every other customer. Constrain.
Design rate limits.
The pattern:
Per-workspace concurrency:
- Max N concurrent bulk operations per workspace (typically 1-3)
- New op requested while N already running: queue OR reject with "too many in-flight"
- Prevents one workspace from monopolizing workers
Per-action volume limits:
| Action | Per-tier limit | Why |
|---|---|---|
| Delete | 10K items/day | Protects against accidental mass-delete |
| Tag | 100K items/day | Cheap operation; high tolerance |
| Export | 1 export/hour | Prevents data exfiltration spam |
| Bulk-edit | 50K items/day | Complex operation |
Per-tier scaling:
- Free tier: lower limits (1K items/op max)
- Paid tier: standard limits
- Enterprise: customer-configurable
Worker concurrency:
- Total worker pool: e.g., 10 workers
- Per-customer hard cap: e.g., 2 workers
- Prevents one customer from using all workers
Detection of abusive patterns:
- Spike in bulk-op volume from one customer
- Repeated failures (cascading errors)
- Off-hours volume (3am bulk delete from new account)
Trigger alerts; pause if needed.
Critical implementation rules:
- Document limits publicly. Customers should know.
- Surface limits clearly when hit. "You've hit the daily bulk-delete limit. Try again tomorrow OR contact support to raise it."
- Don't silently throttle. The customer should know.
For very large operations:
- Customer wants to delete 1M items
- Standard limits prevent it in one op
- Solution: enterprise feature; manual review; scheduled execution
- Or: customer can do across multiple days
Don't:
- Allow unlimited bulk ops (DDoS risk)
- Skip per-customer quotas (one bad actor ruins it for all)
- Forget worker concurrency caps
Output:
- The per-action limit table
- The concurrency cap
- The detection rules
- The customer-facing rate-limit messaging
The biggest invisible cost: **one customer's bulk op blocking all other customers.** Without per-customer worker caps, a 100K-item delete starves the worker pool; everyone else's smaller ops queue indefinitely. Per-customer limits prevent this.
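The per-workspace concurrency cap can be sketched as a small gate. This version tracks counts in memory for clarity; a real deployment would back it with Redis or a database row so it survives restarts and works across processes:

```typescript
// Per-workspace gate: at most maxConcurrent bulk ops in flight at once
class BulkOpGate {
  private inFlight = new Map<string, number>();

  constructor(private maxConcurrent = 2) {}

  // Returns false when the workspace is at its cap; caller queues or rejects
  tryStart(workspaceId: string): boolean {
    const current = this.inFlight.get(workspaceId) ?? 0;
    if (current >= this.maxConcurrent) return false;
    this.inFlight.set(workspaceId, current + 1);
    return true;
  }

  // Called when an operation completes, fails, or is cancelled
  finish(workspaceId: string): void {
    const current = this.inFlight.get(workspaceId) ?? 0;
    this.inFlight.set(workspaceId, Math.max(0, current - 1));
  }
}
```

One workspace hitting its cap never affects another workspace's slots, which is exactly the isolation property the section above calls for.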
---
## 6. Surface Progress in the UI
A customer who clicks "delete 5,000" and sees a spinner for 30 seconds thinks the app is broken.
Design the progress UX.
The pattern:
Immediate response:
- Optimistic UI: items removed from view immediately (show "Deleting...")
- OR: blocking modal: "Processing your bulk action..."
Progress indicator:
- Progress bar: "Processed 1,247 of 5,000 (25%)"
- Time estimate: "About 30 seconds remaining"
- Cancellable button: "Cancel" (if supported)
Background-processing pattern:
- Customer can navigate away mid-op
- Banner persists at top: "Bulk operation in progress: 1,247 / 5,000"
- Click banner: open detail / cancel
Real-time updates:
- WebSocket / SSE for sub-second updates
- Or: polling every 1-2 seconds
- Update counter / progress bar
On completion:
- Toast: "5,000 items deleted. [Undo]"
- Or: persistent banner with details
- Failed-items expandable list
Mobile considerations:
- Most bulk ops happen on desktop (better selection UX)
- Mobile: simpler progress UI; allow background completion
Critical implementation rules:
- Feedback within 1 second of click (even just "Starting...").
- Show progress, not just spinner. Spinner = unknown state; progress bar = known state.
- Allow navigation away. Don't trap the customer.
- Clear completion signal. Don't make the customer guess whether it's done.
The cancel option:
For long-running ops:
- Cancel button on progress UI
- Cancel signals worker; worker stops processing remaining items
- Items already processed are not undone (but undo flow available afterward)
- Mark op as cancelled; customer sees "Cancelled at X of Y"
Don't:
- Block the UI for the entire op
- Show only "Loading..."
- Hide failures during progress
Output:
- The progress UI component
- The polling / WebSocket update flow
- The cancel mechanism
- The completion notification
The biggest UX moment: **the first second after the customer clicks bulk action.** If they see immediate feedback (item count appearing in the progress, items disappearing from view), they trust the system. If they see a frozen spinner, they panic.
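The percent and time-remaining figures fall straight out of the counters the worker already updates. A sketch, assuming roughly steady per-item cost (a reasonable approximation for homogeneous batches):

```typescript
// Derive display values from processed_count / total_count and elapsed time
function progressSummary(
  processed: number,
  total: number,
  elapsedMs: number,
): { percent: number; remainingMs: number | null } {
  const percent = total === 0 ? 100 : Math.floor((processed / total) * 100);
  // No estimate until at least one item has completed
  const remainingMs =
    processed === 0 ? null : Math.round((elapsedMs / processed) * (total - processed));
  return { percent, remainingMs };
}
```

The polling endpoint (or WebSocket push) returns these two numbers alongside the raw counts, and the UI renders "Processed 1,247 of 5,000 (24%), about 30 seconds remaining".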
---
## 7. Audit Bulk Operations
Bulk ops are high-leverage; small mistakes affect many records. Audit thoroughly.
Design the audit integration.
Per Audit Logs, log:
Bulk op lifecycle:
- `bulk_operation.started` — who, what action, what targets, count
- `bulk_operation.completed` — success/failure counts
- `bulk_operation.cancelled` — by whom, at what point
- `bulk_operation.undone` — when undo executed
- `bulk_operation.failed` — if globally failed
Per-item events (sample at high volume):
For destructive ops, log per-item:
- `item.bulk_deleted` — record ID + previous state hash
- `item.bulk_archived` — etc.
For 5,000-item ops, that's 5,000 audit entries. Use:
- Sample 1% for non-destructive ops
- Log 100% for destructive ops (cheap insurance)
The "who" and "context":
- Acting user
- Workspace
- IP / user agent
- Operation ID linking back to bulk_operations
- Trigger (UI / API / scheduled)
Customer-facing audit feed:
In /admin/audit (per audit-logs):
- Show bulk operations prominently
- Filterable by action type
- Linkable to operation detail (what was affected)
Compliance / regulator support:
- Audit logs answer "what happened to data X on date Y"
- Especially important for destructive ops
- Retain per account-deletion-data-export: 7-year retention typical for high-value events
Don't:
- Skip audit logging on bulk ops (these are high-leverage events)
- Log raw sensitive data in audit (mask PII)
- Forget to surface customer-visible audit views
Output:
- The audit-event schema for bulk ops
- The sampling strategy
- The customer-facing audit view
- The retention policy
The biggest "we need audit logs" moment: **a customer claims their data was wrongly deleted.** Without audit, you can't investigate. With it: "On [date] at [time], user X ran bulk-delete operation [ID] affecting these 5,000 items. Here's the full record." Audit is your defense and the customer's recourse.
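One way to sketch the sampling rule above: always log destructive actions at 100%, and sample the rest deterministically by hashing the target ID, so a retried operation logs the same items. The destructive-action list and the hash function are illustrative choices, not a prescribed scheme:

```typescript
// Decide whether to emit a per-item audit entry for this bulk-op item
function shouldLogItem(action: string, targetId: string, sampleRate = 0.01): boolean {
  const destructive = new Set(["delete", "archive"]);
  if (destructive.has(action)) return true; // 100% for destructive ops

  // Cheap deterministic 31-polynomial hash of the target ID
  let h = 0;
  for (const ch of targetId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;

  // Sample: the same ID always gets the same decision at a given rate
  return h % 10000 < sampleRate * 10000;
}
```

Deterministic sampling matters here: if the same bulk op is retried, the audit trail stays consistent instead of logging a different random 1% each time.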
---
## 8. Handle Edge Cases
Real bulk ops have weird cases. Plan.
The edge case checklist.
Edge case 1: Customer selects "all" but means "all on this page"
- "Select all" checkbox ambiguous: is this 50 or 5,000?
- Fix: explicit "Select all 5,000" link after page-select
- Or: cap "select all" at visible items; require explicit "Select all matching filter"
Edge case 2: Filter changes mid-selection
- Customer selects items; changes filter; "selected" set is now stale
- Fix: capture IDs at click-time; ignore filter changes
- Or: warn "Filter changed; selection cleared"
Edge case 3: Permissions change during op
- User has permission to delete; mid-op, role changed
- Fix: check permission per-item; skip unauthorized items; report them
Edge case 4: Item modified during op
- Item modified by another user mid-bulk-op
- Fix: optimistic locking with version IDs
- On conflict: skip; report
- Customer sees "47 items skipped (modified by others)"
Edge case 5: Cascade impacts
- Bulk-deleting 100 projects cascades to 10K tasks
- Fix: surface impact in confirmation dialog; "This will affect 10,047 records"
Edge case 6: External service fails during op
- Bulk update of customer records also calls Stripe API
- Stripe down: bulk op partially succeeds (DB updated; Stripe didn't)
- Fix: retry external calls; surface as failure if persistent
Edge case 7: Workspace-wide vs filtered "all"
- "Delete all" without filter = 100K items
- Fix: confirm explicitly with count; require typed confirmation
Edge case 8: Confirmation dialog text
- "Are you sure?" is weak
- Better: "You are about to delete 5,000 items. Type DELETE to confirm."
Edge case 9: API access to bulk ops
- Customers calling bulk endpoints via API
- Same rate limits; require auth scope per api-keys-chat
- Document; provide examples
Edge case 10: Scheduled bulk operations
- "Delete items older than 90 days" weekly
- Different from interactive bulk; less risky if scoped carefully
- Audit + alert on unusual volumes
Output:
- Handling per edge case
- The confirmation-dialog patterns
- The locking strategy
- The cascade-impact warning
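Edge case 4's optimistic-locking fix can be sketched directly, assuming each row carries a version counter incremented on every write (the field name `version` is illustrative):

```typescript
type VersionedItem = { id: string; version: number; archived: boolean };

// Apply the action only if the item is unchanged since the customer selected it
function applyIfUnchanged(
  item: VersionedItem,
  versionAtSelection: number,
  mutate: (item: VersionedItem) => void,
): "succeeded" | "skipped" {
  // Another user modified the item mid-op: skip and report, don't overwrite
  if (item.version !== versionAtSelection) return "skipped";
  mutate(item);
  item.version += 1; // every write bumps the version
  return "succeeded";
}
```

In SQL this is the classic `UPDATE ... WHERE id = $1 AND version = $2` pattern; a zero-row update means "skipped", which feeds the "47 items skipped (modified by others)" message above.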
---
## 9. Test Bulk Operations Carefully
Bulk ops affect many records; bugs are amplified. Test thoroughly.
Design the test suite.
The tests:
Unit tests:
- Per-action handler: applies correctly
- Permission checks
- Rate limits
Integration tests:
- End-to-end: trigger bulk op; verify all items processed correctly
- Partial failure: some items fail; verify others succeed
- Race conditions: items modified mid-op
- Cancellation mid-op
- Undo restores correctly
Load tests:
- 1K item op: timing, memory
- 10K item op: same
- 100K item op: same
- Concurrent ops from multiple workspaces
Failure-mode tests:
- Worker crashes mid-op: resume correctly
- Database connection drops: retry
- External service fails: handles gracefully
Customer-facing tests:
- Manual exploratory: as a real user, click bulk action; verify UI matches expectations
Test data:
- Sample data with realistic volumes
- Edge cases: items with weird states, missing fields
- Multi-workspace scenarios
Don't:
- Skip load tests (you'll find out at scale)
- Forget failure-mode tests
- Test with too-small datasets
Output:
- The test suite
- The load-test scenarios
- The failure-injection tests
- The customer-facing exploratory test plan
---
## 10. Quarterly Review
Bulk ops accumulate edge cases. Quarterly review.
Quarterly review.
Usage metrics:
- Bulk ops triggered per period
- Volume distribution (small / medium / large)
- Most-common actions
- Failure rate per action
Performance:
- Avg processing time per item
- Worker utilization
- Concurrency caps hit?
Customer impact:
- Support tickets about bulk ops
- Undo usage rate (high = customer regret pattern)
- Cancellation rate (high = ops too slow)
Quality:
- Most-common failure reasons
- Items skipped due to race conditions
New actions:
- Customer-requested bulk variants we don''t have
- Action requests that should NOT have bulk
Output:
- Snapshot
- 1-2 fixes
- 1 process improvement
---
## What "Done" Looks Like
A working bulk operations system in 2026 has:
- A bulk-action catalog with risk-assessed scope
- Async processing via background workers (per [background jobs](https://www.vibereference.com/backend-and-data/background-jobs-providers))
- Per-item status tracking with partial-success handling
- Undo for destructive operations within 24-hour window
- Per-workspace rate limits + concurrency caps
- Real-time progress UI (polling or WebSocket)
- Cancellation support during long ops
- Audit logging on every bulk event
- Confirmation dialogs for destructive ops with typed-confirmation
- Edge-case handling (filter-change / permission-change / cascade / external services)
- Test coverage including load tests
- Quarterly review baked in
The hidden cost of weak bulk ops: **a single accidental "delete all" that nukes a customer's data.** Without undo + confirmation + audit, the recovery is hours of restoring from backup AND eroded customer trust forever. The infrastructure (async workers + undo storage + audit) is small; the protection it provides is enormous. Bulk operations are the highest-leverage product surface; build them with the most care.
---
## See Also
- [CSV Import Flows](csv-import-chat.md) — bulk creation flow
- [Account Deletion & Data Export](account-deletion-data-export-chat.md) — bulk export
- [Multi-Tenant Data Isolation](multi-tenancy-chat.md) — workspace boundary
- [Roles & Permissions (RBAC)](roles-permissions-chat.md) — permission per bulk action
- [Audit Logs](audit-logs-chat.md) — high-value events logged
- [Rate Limiting & Abuse Prevention](rate-limiting-abuse-chat.md) — bulk ops are abuse vector
- [API Keys & PATs](api-keys-chat.md) — programmatic bulk access
- [In-App Notifications](in-app-notifications-chat.md) — notify on completion
- [Real-Time Collaboration](real-time-collaboration-chat.md) — bulk ops affect collaborators
- [Background Jobs Providers](https://www.vibereference.com/backend-and-data/background-jobs-providers) — queue layer
- [Database Providers](https://www.vibereference.com/backend-and-data/database-providers) — bulk ops stress the DB
[⬅️ Growth Overview](README.md)