# Content Moderation Pipeline: Stop Bad Content Before It Stops Your Business
If your SaaS lets users post anything in 2026 — comments, images, profiles, reviews, custom prompts to AI features, file uploads — you need a moderation pipeline. The threat isn't theoretical: AI-driven spam farms target SaaS at scale; CSAM (child sexual abuse material) is a legal liability for any host; harassment and brigading have killed once-promising platforms. Most indie founders ship "let users post anything; we'll moderate when there's a problem" and pay for it later in support tickets, App Store removal, advertiser flight, or the visit from federal investigators. The fix isn't human review of every item — it's a tiered automated system with human escalation.
A working moderation pipeline answers: what content types need moderation (text / image / video / audio / user actions), what's the threat model per type (spam / harassment / illegal / off-policy), what's auto-handled vs human-reviewed, how do you handle user reports, what's the appeal process, what's logged for legal, and what tools (PhotoDNA / vision APIs / LLM classifiers) do the heavy lifting.
This guide is the implementation playbook for moderation. Companion to CAPTCHA & Bot Protection, Image Upload & Processing Pipeline, Rate Limiting & Abuse, Audit Logs, and Roles & Permissions.
## Why Moderation Matters
Get the threat model clear first.
Help me understand the threats.
The categories:
**1. Illegal content (legal risk; mandatory action)**
- CSAM (child sexual abuse material)
- Terrorism / inciting violence
- IP / copyright violations (DMCA)
- Doxing / personally-identifying info weaponized
Required: detect, remove, report (NCMEC for CSAM in US).
Failure mode: criminal liability; site shutdown.
**2. Off-policy / TOS violations (brand risk)**
- NSFW / adult (depending on TOS)
- Hate speech / harassment
- Spam / scam content
- Bot-generated noise
- Promotion of harmful behavior
Required: enforce TOS; remove violations.
Failure mode: brand damage; user flight; advertiser flight.
**3. Quality issues (UX risk)**
- Low-quality posts diluting feed
- AI-slop content (auto-generated low-effort posts)
- Off-topic content in scoped communities
- Duplicate posts
Required: lower visibility / require human review at scale.
Failure mode: platform feel deteriorates; users leave.
**4. AI-prompt abuse (NEW in 2024-26)**
- Jailbreak attempts in prompts
- Generating harmful content via your AI features
- Prompt-injection attacks
Required: input filtering + output filtering on AI features.
Failure mode: AI generates illegal or off-policy content; you're liable.
**5. Targeted harassment**
- Coordinated brigading
- Stalking via your product
- Mass-reporting weaponization
Required: per-user rate limits + pattern detection.
Failure mode: vulnerable users harmed; PR disaster.
For my app:
- Content types
- User base risk profile
- Compliance requirements
Output:
1. Top threats
2. Coverage gaps today
3. Priority order
The biggest unforced error: assuming "we don't have user-generated content" when you do. Profile photos, names, custom prompts to AI features, support tickets, billing addresses — any free-text or media field is moderation surface. Audit your fields; pick a coverage strategy.
## The Pipeline Architecture
Help me design the pipeline.
The 4-stage pipeline:
User submits → Pre-publish filtering → Post-publish monitoring → User reports → Human review. Each stage's outcome:
- Pre-publish filtering → Block / Hold
- Post-publish monitoring → Flag / Remove
- User reports → Triage queue
- Human review → Decision + appeal
**Stage 1: Pre-publish (synchronous)**
Block obviously-bad content before it goes live.
- Hash matching (CSAM via PhotoDNA): block + report
- High-confidence NSFW detection: block + warn
- Spam classifier: hold for review
- Disposable email / new account: friction (CAPTCHA)
Latency budget: 100ms-2s. User waits.
**Stage 2: Post-publish (async)**
For content that passed Stage 1 but needs deeper analysis.
- Run additional moderation (slower / more expensive models)
- Embedding-based clustering (find spam rings)
- AI scoring across all signals
- If flagged: hide pending review
**Stage 3: User reports**
Users flag content they see as problematic.
- "Report" button on every post / comment / profile
- Categorized reasons (harassment / spam / illegal / off-topic)
- Reports go to triage queue
- Auto-action when N reports cross threshold
**Stage 4: Human review queue**
For content escalated by stages 1-3.
- Trust & safety team / contractor reviews
- Decisions: keep / remove / ban user / escalate to legal
- Track reviewer accuracy
- 24-72h SLA on review
- Appeal process for users
**Decision flow per item**:
Item submitted
↓ Pre-publish check
- High-confidence harmful → BLOCK + log
- Medium confidence → ALLOW but flag for Stage 2
- Low confidence / clean → ALLOW
↓ Post-publish check (async, 1-60 min)
- High confidence → REMOVE (if visible) + flag user
- Medium → SHADOW BAN (visible to author only)
- Low → leave alone
↓ User reports surface
- 1 report → QUEUE for review
- 5+ reports OR trusted reporter → AUTO-HIDE pending review
↓ Human review
- Confirm: decision sticks
- Reverse: restore + warn reporter (if false-flagging)
For my product: [content types]
Output:
1. Per-content-type pipeline
2. Latency budgets
3. Tooling per stage
The principle: automated does the volume; humans do the nuance. Auto-blocking high-confidence violations is essential at scale. Auto-blocking medium-confidence kills legit posts. Tier the system.
## Text Moderation: The 2026 Stack
Help me set up text moderation.
The 2026 layered approach:
**Layer 1: Allowlist / Denylist (cheap)**
Keyword filters; regex; URL blocklists.
Pros: instant; deterministic
Cons: brittle; bypassable; misses nuance
Use for: known-bad URLs (phishing); obvious slurs (with context-awareness).
**Layer 2: ML classifier (medium)**
OpenAI Moderation API (free), Google Perspective API, AWS Comprehend.
```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const moderation = await openai.moderations.create({
  input: userText,
});

// Returns flags by category: hate, sexual, violence, self-harm, harassment
const result = moderation.results[0];
if (result.flagged && result.category_scores.sexual > 0.9) {
  // Block
}
```
Pros: free / cheap (OpenAI Moderation is free); good enough for ~85% of cases.
Cons: false positives on edge cases; misses context.
Use for: first-pass filtering; catching obvious violations.
**Layer 3: LLM-based contextual moderation (expensive)**
For ambiguous cases, run an LLM:
```typescript
const decision = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{
    role: 'system',
    content: `You are a content moderator for a SaaS platform.
Categories of policy violations:
- Spam (commercial unsolicited)
- Harassment (targeted abuse)
- Off-topic
- Hateful (slurs, dehumanization)
Reply with JSON: { violation: bool, category: string, confidence: 0-1 }`,
  }, {
    role: 'user',
    content: `Moderate this content:\n\n${text}`
  }],
  response_format: { type: 'json_object' },
});
```
Pros: nuanced; understands context; multilingual.
Cons: $0.001-0.01/check; latency; potential prompt injection.
Use for: borderline cases; multi-lingual content; tone-sensitive judgments.
**Layer 4: Embedding-based clustering (offline)**
Embed all posts → cluster → identify spam rings (1000 posts, all near-identical embeddings, all from new accounts).
```typescript
// Periodically (e.g. an hourly batch job).
// getEmbeddings and dbScan are app-specific helpers: embed the posts
// (see the sketch below) and cluster with any density-based algorithm.
const embeddings = await getEmbeddings(recentPosts);
const clusters = await dbScan(embeddings);
const suspiciousClusters = clusters.filter(c =>
  c.size > 50 && c.uniqueAuthors < 10 && c.medianAccountAge < 7
);
```
Catches coordinated spam that individual checks miss.
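The embedding step itself is one batched API call. A minimal sketch of a `getEmbeddings` helper, assuming the OpenAI embeddings endpoint and an app-specific `Post` shape:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface Post { id: string; text: string; }

// Embed recent posts in one batched call; vectors come back in input order.
async function getEmbeddings(posts: Post[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: posts.map(p => p.text.slice(0, 8000)), // rough guard against the token limit
  });
  return res.data.map(d => d.embedding);
}
```

The clustering side (the `dbScan` call above) can be any density-based library or even a simple cosine-similarity grouping; the signal you care about is many near-identical vectors coming from a few young accounts.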
The orchestration:
```typescript
// denylist, getTopCategory, isUserSuspicious, and llmModerate are app-specific helpers.
async function moderateText(text: string, userId: string) {
  // Layer 1: cheap denylist
  if (denylist.matches(text)) {
    return { decision: 'block', reason: 'denylist' };
  }

  // Layer 2: ML classifier (free)
  const ml = await openai.moderations.create({ input: text });
  if (ml.results[0].flagged) {
    const top = getTopCategory(ml.results[0]);
    if (top.score > 0.95) return { decision: 'block', reason: top.category };
    if (top.score > 0.7) return { decision: 'review', reason: top.category };
  }

  // Layer 3: LLM only if borderline AND from suspicious user
  if (isUserSuspicious(userId) && ml.results[0].flagged) {
    return await llmModerate(text);
  }

  return { decision: 'allow' };
}
```
For my use case: [text types]
Output:
- Layer config
- Tools per layer
- Cost estimate per 1K items
The cost-killer: **running LLM on every post**. Layer cheaper checks first; reserve LLM for the 5-10% of borderline cases. Total moderation cost should be < 1% of revenue at indie scale.
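Illustrative arithmetic with the figures above: at 10,000 posts/month, Layers 1-2 cost roughly nothing, and if ~10% of posts are borderline enough to reach the LLM layer at ~$0.005/check, that's 1,000 × $0.005 ≈ $5/month, noise next to almost any paid product's revenue.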
## Image / Video Moderation
Help me set up image moderation.
The 2026 stack:
**Layer 1: Hash matching (CSAM)**
PhotoDNA (Microsoft, free for qualified providers) hashes every uploaded image; matches against known CSAM database.
Action on match:
- Block immediately
- Preserve evidence (DON'T delete the file; preserve for law enforcement)
- Report to NCMEC (in US; legal requirement; auto via PhotoDNA partner)
- Suspend user account
This is non-optional for any host of user-uploaded images.
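A sketch of the upload-time flow. `photoDnaMatch` and the other helpers are hypothetical stand-ins (the real integration depends on which PhotoDNA / hash-matching provider approves you); what matters is the order of operations:

```typescript
// Sketch only: these declarations are app-specific stand-ins, not a real PhotoDNA SDK.
type HashMatch = { isMatch: boolean; matchId?: string };

declare function photoDnaMatch(image: Buffer): Promise<HashMatch>;
declare function quarantineFile(image: Buffer, meta: object): Promise<void>;
declare function fileNcmecReport(meta: object): Promise<void>;
declare function suspendAccount(userId: string, reason: string): Promise<void>;

async function checkUploadedImage(image: Buffer, userId: string) {
  const match = await photoDnaMatch(image);
  if (!match.isMatch) {
    return { decision: 'continue' as const }; // proceed to Layer 2+ (NSFW / policy checks)
  }
  // Known-CSAM hash match: act immediately and preserve evidence.
  await quarantineFile(image, { userId, matchId: match.matchId }); // never delete the file
  await fileNcmecReport({ userId, matchId: match.matchId });       // US legal requirement
  await suspendAccount(userId, 'csam-hash-match');
  return { decision: 'block' as const };                           // content never becomes visible
}
```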
**Layer 2: NSFW detection**
- AWS Rekognition Moderation: $1/1000 images
- Google Cloud Vision SafeSearch
- Hive Moderation
- Sightengine
- Cloudflare Images (built-in basic moderation)
Returns categories with confidence:
- Explicit nudity
- Suggestive
- Violence
- Drugs
- Gore
Per category: action threshold.
```typescript
const labels = await rekognition.detectModerationLabels({
  Image: { Bytes: buffer },
  MinConfidence: 60,
});

const blocked = labels.ModerationLabels?.some(l =>
  ['Explicit Nudity', 'Sexual Activity'].includes(l.Name ?? '') && (l.Confidence ?? 0) > 90
);
```
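One way to express the "per category: action threshold" idea as data instead of inline conditions. Category names and cutoffs here are illustrative, not Rekognition's exact taxonomy; tune them to your TOS:

```typescript
const IMAGE_THRESHOLDS: Record<string, { block: number; review: number }> = {
  'Explicit Nudity':  { block: 90, review: 60 },
  'Sexual Activity':  { block: 90, review: 60 },
  'Graphic Violence': { block: 95, review: 70 },
  'Drugs':            { block: 98, review: 80 },
};

function decideFromLabels(labels: { Name?: string; Confidence?: number }[]) {
  let decision: 'allow' | 'review' | 'block' = 'allow';
  for (const l of labels) {
    const t = l.Name ? IMAGE_THRESHOLDS[l.Name] : undefined;
    if (!t || l.Confidence === undefined) continue;
    if (l.Confidence >= t.block) return 'block';        // any blocking label wins immediately
    if (l.Confidence >= t.review) decision = 'review';  // otherwise hold for a human
  }
  return decision;
}
```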
**Layer 3: Custom violations (your TOS)**
Train your own classifier or use Hive's custom categories (logos, weapons, drugs, etc.).
Pre-trained for:
- Brand marks / logos (impersonation detection)
- Weapons
- Drug paraphernalia
- ID documents (privacy)
- Selfies of others (revenge porn)
**Layer 4: Video moderation**
Sample frames every N seconds; run image moderation on samples.
For long videos: extract every 2-5s; run; aggregate decisions.
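A frame-sampling sketch, assuming ffmpeg is on the PATH and a `moderateImage` helper that wraps whichever image API you picked above (both are assumptions, not part of any vendor SDK):

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import { mkdtemp, readdir, readFile, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import path from 'node:path';

const run = promisify(execFile);

declare function moderateImage(frame: Buffer): Promise<'allow' | 'review' | 'block'>;

// Extract one frame every `everySeconds`, moderate each, return the worst decision.
async function moderateVideo(videoPath: string, everySeconds = 3) {
  const dir = await mkdtemp(path.join(tmpdir(), 'frames-'));
  try {
    await run('ffmpeg', ['-i', videoPath, '-vf', `fps=1/${everySeconds}`, path.join(dir, 'f-%04d.jpg')]);
    let worst: 'allow' | 'review' | 'block' = 'allow';
    for (const file of await readdir(dir)) {
      const decision = await moderateImage(await readFile(path.join(dir, file)));
      if (decision === 'block') return 'block';   // one bad frame fails the whole video
      if (decision === 'review') worst = 'review';
    }
    return worst;
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```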
Tools:
- Hive Video — handles full video
- AWS Rekognition Video — frame sampling built-in
- Sightengine Video
Animated detection:
GIFs and short clips count as video: sample frames and moderate them. Most tools handle animated formats automatically; just enable it.
For my pipeline: [image types]
Output:
- Per-layer setup
- Tool picks
- Action thresholds
The practical must-do: **PhotoDNA (or equivalent hash matching) on every user-uploaded image**. Apply for free access (https://www.microsoft.com/photodna). Implement at upload time. Reporting discovered CSAM to NCMEC is the legal mandate; PhotoDNA partners automate that reporting.
## User Reports & Triage Queue
Help me handle user reports.
The system:
The Report button:
Visible on every piece of user-generated content. Click → modal with:
- Category dropdown (Spam / Harassment / NSFW / Illegal / Other)
- Optional comment box
- Report submitted; thank-you message
Schema:
```sql
CREATE TABLE content_reports (
  id UUID PRIMARY KEY,
  reporter_user_id UUID NOT NULL REFERENCES users,
  reported_content_id UUID NOT NULL,
  reported_content_type VARCHAR(50) NOT NULL,
  reported_user_id UUID REFERENCES users,
  category VARCHAR(50) NOT NULL,
  comment TEXT,
  status VARCHAR(20) DEFAULT 'pending', -- pending | reviewed | actioned | dismissed
  reviewed_by UUID REFERENCES users,
  reviewed_at TIMESTAMPTZ,
  decision VARCHAR(50), -- 'remove', 'keep', 'warn', 'ban'
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON content_reports (status, created_at);
CREATE INDEX ON content_reports (reported_content_id);
```
Triage rules:
1 report → queue for review (low priority)
3+ reports → priority queue
5+ reports OR trusted reporter → auto-hide pending review
Report from verified-legal-counsel → highest priority + legal review
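A minimal sketch of those rules as code; the `Reporter` shape and thresholds mirror the list above and the reporter-weighting section that follows:

```typescript
type Reporter = { accuracy: number; isLegalCounsel?: boolean }; // accuracy: 0-1, see below
type Triage = { priority: 'low' | 'high' | 'legal'; autoHide: boolean };

function triage(reportCount: number, reporters: Reporter[]): Triage {
  const trusted = reporters.some(r => r.accuracy >= 0.9);
  const legal = reporters.some(r => r.isLegalCounsel);

  if (legal) return { priority: 'legal', autoHide: true };
  if (reportCount >= 5 || trusted) return { priority: 'high', autoHide: true };
  if (reportCount >= 3) return { priority: 'high', autoHide: false };
  return { priority: 'low', autoHide: false };
}
```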
Reporter weighting:
Track reporter accuracy:
- Reports that led to action: positive signal
- Reports dismissed: neutral
- Repeated false reports: suspicious; deweight
```sql
ALTER TABLE users ADD COLUMN reporter_accuracy DECIMAL(3,2) DEFAULT 0.5;
```
Update on each review decision.
A user with 90% report accuracy = trusted; their reports auto-flag. A user with 10% accuracy = malicious reporter; their reports deweighted or ignored.
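One way to do the update, assuming node-postgres and the `content_reports` schema above; the +1/+2 prior keeps new reporters near the 0.5 default instead of swinging to 0 or 1 on their first report:

```typescript
import { Pool } from 'pg';

const pool = new Pool(); // connection settings from PG* environment variables

// Recompute a reporter's accuracy after each review decision:
// the share of their reviewed reports that led to action, with a weak prior.
async function updateReporterAccuracy(reporterUserId: string) {
  await pool.query(
    `UPDATE users
        SET reporter_accuracy = sub.accuracy
       FROM (
         SELECT (COUNT(*) FILTER (WHERE decision IN ('remove', 'warn', 'ban')) + 1)::decimal
                / (COUNT(*) + 2) AS accuracy
           FROM content_reports
          WHERE reporter_user_id = $1
            AND status IN ('actioned', 'dismissed')
       ) AS sub
      WHERE users.id = $1`,
    [reporterUserId]
  );
}
```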
Anti-weaponization:
- Per-user report rate limit (e.g. 10/day)
- Cooldown on reporting same user (1/week per target)
- Patterns: "User A reports User B 50x in 24h" = auto-flag for harassment investigation
Response to reporter:
After review:
- "Thanks for your report. We took action: [removed]"
- Or "We reviewed and didn't find a violation. Thanks anyway."
This validates the system; encourages future reports.
For my product: [reporter base]
Output:
- Reports schema
- Triage rules
- Anti-weaponization
- Reporter feedback
The discipline: **respond to every report within 48 hours**, even if "no action." Silence makes users feel reports are ignored; they stop reporting; bad content piles up. Automated "report received" + human review with status update wins.
## Human Review Queue
Help me run a review queue.
The queue UI (simple):
[Item]
[Reporter info: trusted / new / repeat reporter]
[Categories flagged]
[Auto-moderation scores]
[Author history: account age, prior actions]
[Buttons: Keep | Hide | Remove | Ban User | Escalate]
[Notes box]
Reviewer per-item time: 30-60 seconds for clear cases; 2-5 min for ambiguous.
SLA targets:
- Illegal content: <4 hours
- Active harassment: <12 hours
- TOS violations: <48 hours
- Quality / spam: <72 hours
Staffing:
Coverage hours:
- Pre-revenue: founder reviews; 30-60 min/day
- $100K-1M ARR: contractor 4-8h/week
- $1M-10M ARR: 1-2 part-time T&S contractors
- $10M+ ARR: T&S team
Outsourcing:
- Hive Trust & Safety — managed reviewer pool
- Pactera Edge — outsourced moderation
- Concentrix — bigger BPO
Platforms if you run the queue yourself:
- Modulate.ai — voice moderation
- Spectrum Labs — text + image moderation platforms
- Hive Trust & Safety — review queue UI + AI assist
Reviewer training:
- Day 1: TOS reading; calibration on 50 example items
- Day 2-7: shadowed reviews; lead reviews own
- Ongoing: weekly calibration on new edge cases
- Mental health: rotation; not 8h/day on harmful content; counseling access
Reviewer accuracy tracking:
Random-sample reviewed items; senior reviewer audits. Score reviewers; coach low-accuracy.
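A sampling sketch, assuming node-postgres and the `moderation_decisions` table defined just below: pull a handful of random decisions per reviewer each week for a senior audit pass.

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// N random decisions per reviewer from the last 7 days, for senior-reviewer audit.
async function auditSample(perReviewer = 10) {
  const { rows } = await pool.query(
    `SELECT *
       FROM (
         SELECT d.*,
                ROW_NUMBER() OVER (PARTITION BY reviewer_id ORDER BY random()) AS rn
           FROM moderation_decisions d
          WHERE reviewed_at > now() - interval '7 days'
       ) sampled
      WHERE rn <= $1`,
    [perReviewer]
  );
  return rows;
}
```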
Decision logging:
```sql
CREATE TABLE moderation_decisions (
  id UUID PRIMARY KEY,
  content_id UUID NOT NULL,
  reviewer_id UUID NOT NULL,
  decision VARCHAR(50) NOT NULL,
  reason VARCHAR(100),
  notes TEXT,
  reviewed_at TIMESTAMPTZ DEFAULT NOW(),
  appealed_at TIMESTAMPTZ,
  appeal_decision VARCHAR(50)
);
```
For my T&S team: [stage]
Output:
- Queue UI requirements
- SLA per category
- Staffing plan
- Training docs
The hidden cost: **secondary trauma** for reviewers. Burnout is high; turnover is high; counseling is non-optional. Even at small scale, watch for it. Outsourcing CSAM review specifically is the norm — don't make a junior engineer do it.
## Appeals Process
Help me set up appeals.
The legal context:
EU Digital Services Act (DSA): requires a meaningful appeal mechanism for content removals. California / NYC laws: similar requirements emerging. Best practice everywhere: appeals build user trust.
The flow:
Content removed → User notified → "Appeal" button
↓
User explains why
↓
Different reviewer (not original)
↓
Decision: uphold / overturn
↓
User notified
The notification:
Subject: Your content was removed
Hi [user],
We removed your [post/comment/image] for [reason].
If you believe this was an error, you can appeal:
[Appeal button]
Your appeal will be reviewed by a different team member within [N] business days.
Read our content policy: [link]
Read our enforcement guidelines: [link]
Appeal review:
- Different reviewer than original
- Reviews original content + reporter info + appeal reasoning
- Decision: uphold (final) or overturn (restore + apologize)
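A sketch of enforcing the different-reviewer rule at assignment time; the `Reviewer` shape and load balancing are assumptions, and the exclusion of the original decision-maker is the point:

```typescript
type Reviewer = { id: string; activeAppeals: number };

// Assign the appeal to the least-loaded reviewer who is NOT the original decision-maker.
function assignAppealReviewer(originalReviewerId: string, reviewers: Reviewer[]): Reviewer {
  const eligible = reviewers.filter(r => r.id !== originalReviewerId);
  if (eligible.length === 0) {
    throw new Error('No second reviewer available; escalate to an external reviewer');
  }
  return eligible.reduce((a, b) => (a.activeAppeals <= b.activeAppeals ? a : b));
}
```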
Track:
- Appeal volume
- Overturn rate (>15% = original moderation too aggressive)
- Reviewer agreement rate (low = inconsistent rubric)
Repeat appeals:
When an appeal overturns the original reviewer's decision, track it against that reviewer. If a pattern emerges (one reviewer with a high overturn rate), retrain or remove them.
Bad-faith appeals:
Limit appeals per user per month. After 5 dismissed appeals: deweight future ones.
For my product:
- Appeal volume expected
- Reviewer process
Output:
- Appeal flow
- Notification template
- Tracking
The DSA-compliant discipline: **a different reviewer handles the appeal**. The original reviewer can't review their own decision. Documented; auditable; defensible to regulators.
## Logging for Legal & Compliance
Help me log for legal.
The retention:
For ALL moderation actions, log:
- Content ID + content snapshot (preserve even after deletion for legal)
- Reviewer ID + decision + timestamp
- Reason / category
- Appeals + outcomes
```sql
CREATE TABLE moderation_audit (
  id UUID PRIMARY KEY,
  event VARCHAR(50) NOT NULL, -- 'submitted', 'auto-blocked', 'reviewed', 'removed', 'appealed', 'restored'
  content_id UUID,
  user_id UUID,
  reviewer_id UUID,
  decision VARCHAR(50),
  reason VARCHAR(100),
  metadata JSONB,
  ip INET,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```
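A write-path sketch against that schema, using node-postgres; event names follow the comment in the table definition, and `gen_random_uuid()` assumes PostgreSQL 13+ (or the pgcrypto extension):

```typescript
import { Pool } from 'pg';

const pool = new Pool();

type AuditEvent = 'submitted' | 'auto-blocked' | 'reviewed' | 'removed' | 'appealed' | 'restored';

// Append-only: rows in moderation_audit are never updated or deleted.
async function logModerationEvent(opts: {
  event: AuditEvent;
  contentId?: string;
  userId?: string;
  reviewerId?: string;
  decision?: string;
  reason?: string;
  metadata?: Record<string, unknown>;
  ip?: string;
}) {
  await pool.query(
    `INSERT INTO moderation_audit
       (id, event, content_id, user_id, reviewer_id, decision, reason, metadata, ip)
     VALUES (gen_random_uuid(), $1, $2, $3, $4, $5, $6, $7, $8)`,
    [opts.event, opts.contentId ?? null, opts.userId ?? null, opts.reviewerId ?? null,
     opts.decision ?? null, opts.reason ?? null, JSON.stringify(opts.metadata ?? {}), opts.ip ?? null]
  );
}
```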
Retention:
- Action logs: 7 years (evidence)
- Removed content: at least as long as legally required (varies; preserve CSAM evidence until law enforcement says otherwise)
- User reports: 3 years
Specific legal compliances:
CSAM (US): mandatory NCMEC reporting of any discovered material (federal law); PhotoDNA is the standard detection tool. Preserve evidence; don't delete.
EU DSA: Transparency reports (volume of removals, categories, appeals); Trusted Flaggers; meaningful appeal.
EU generally: align GDPR + DSA obligations; non-EU companies serving the EU need a designated legal representative.
UK Online Safety Act: similar to DSA; requires risk assessments.
Section 230 (US): protects platforms from liability for user content WITH "good faith" moderation. Document moderation; don't appear to encourage violations.
For my company:
- Jurisdictions
- Content types
Output:
- Audit schema
- Retention policy
- Compliance per jurisdiction
- Reporting cadence
The legal must-do: **retain CSAM evidence**. If you discover CSAM, do NOT delete. Block from view; preserve file; report to NCMEC; await law enforcement direction. Deleting destroys evidence; carries criminal liability.
## Common Moderation Mistakes
Help me avoid mistakes.
The 10 mistakes:
1. "We don't have UGC" denial You do. Profile photos, names, AI prompts, support tickets. Audit.
2. Skipping PhotoDNA on image upload CSAM legal liability; non-negotiable.
3. No human review escalation path Auto-only = both over- and under-removal at scale.
4. Reviewer burnout / no rotation T&S work is traumatic; rotation + counseling are required.
5. No appeal process DSA / CA / NY laws + user trust both demand appeals.
6. Reporter weaponization Mass-reporting one user shouldn't auto-ban; build resilience.
7. Not logging removed content Removed → deleted → no evidence; legal exposure.
8. Treating moderation as "TOS update" TOS is paper; pipeline is operational. Need both.
9. Public transparency report missing Required by DSA + good for trust; publish quarterly.
10. Ignoring AI-prompt abuse If users prompt YOUR AI features, harmful generations are YOUR liability.
For my product: [risks]
Output:
- Top 3 risks
- Mitigations
- Process changes
The mistake that ends careers: **delaying CSAM action**. Discover → act in minutes, not hours. Preserve evidence; report; suspend. Document everything. Federal investigators don't care about your sprint planning.
## What Done Looks Like
A working moderation pipeline delivers:
- Pre-publish + post-publish + user-report + human-review tiers
- PhotoDNA + NCMEC for CSAM (mandatory)
- ML classifier (OpenAI / AWS / Google) for first-pass text
- Vision API (AWS / Google / Hive) for first-pass image
- LLM for borderline / nuanced cases
- User report button on every piece of UGC
- Triage queue with SLA (4h illegal / 48h TOS)
- Different-reviewer appeal process
- Logging + 7-year retention for legal
- Quarterly transparency report
- Reviewer training + rotation + counseling
- Reporter accuracy weighting; weaponization detection
The proof you got it right: harmful content is rare; reports get responded to within SLA; the appeal overturn rate is under 15%; the transparency report shows the system working; and you can survive the day a journalist asks "what's your moderation process?" without panic.
## See Also
- [CAPTCHA & Bot Protection](captcha-bot-protection-chat.md) — companion abuse-protection layer
- [Image Upload & Processing Pipeline](image-upload-processing-pipeline-chat.md) — moderation at upload time
- [Rate Limiting & Abuse](rate-limiting-abuse-chat.md) — companion rate-limit layer
- [Audit Logs](audit-logs-chat.md) — moderation actions feed audit
- [Roles & Permissions](roles-permissions-chat.md) — reviewer permissions
- [Account Deletion & Data Export](account-deletion-data-export-chat.md) — banned-user cleanup
- [Customer Support Chat](customer-support-chat.md) — appeals flow into support
- [Internal Admin Tools](internal-admin-tools-chat.md) — moderation queue UI
- [Logging Strategy & Structured Logs](logging-strategy-structured-logs-chat.md) — moderation events
- [Customer Support Tools](https://vibereference.dev/product-and-design/customer-support-tools) — appeals routing
- [VibeReference: Bot Detection Providers](https://vibereference.dev/devops-and-tools/bot-detection-providers) — companion vendor landscape