
# Content Moderation Pipeline: Stop Bad Content Before It Stops Your Business


If your SaaS lets users post anything in 2026 — comments, images, profiles, reviews, custom prompts to AI features, file uploads — you need a moderation pipeline. The threat isn't theoretical: AI-driven spam farms target SaaS at scale; CSAM (child sexual abuse material) is a legal liability for any host; harassment and brigading have killed once-promising platforms. Most indie founders ship "let users post anything; we'll moderate when there's a problem" and pay for it later in support tickets, App Store removal, advertiser flight, or the visit from federal investigators. The fix isn't human review of every item — it's a tiered automated system with human escalation.

A working moderation pipeline answers: what content types need moderation (text / image / video / audio / user actions), what's the threat model per type (spam / harassment / illegal / off-policy), what's auto-handled vs human-reviewed, how do you handle user reports, what's the appeal process, what's logged for legal, and what tools (PhotoDNA / vision APIs / LLM classifiers) do the heavy lifting.

This guide is the implementation playbook for moderation. Companion to CAPTCHA & Bot Protection, Image Upload & Processing Pipeline, Rate Limiting & Abuse, Audit Logs, and Roles & Permissions.

## Why Moderation Matters

Get the threat model clear first.

Help me understand the threats.

The categories:

**1. Illegal content (legal risk; mandatory action)**
- CSAM (child sexual abuse material)
- Terrorism / inciting violence
- IP / copyright violations (DMCA)
- Doxing / personally-identifying info weaponized

Required: detect, remove, report (NCMEC for CSAM in US).
Failure mode: criminal liability; site shutdown.

**2. Off-policy / TOS violations (brand risk)**
- NSFW / adult (depending on TOS)
- Hate speech / harassment
- Spam / scam content
- Bot-generated noise
- Promotion of harmful behavior

Required: enforce TOS; remove violations.
Failure mode: brand damage; user flight; advertiser flight.

**3. Quality issues (UX risk)**
- Low-quality posts diluting feed
- AI-slop content (auto-generated low-effort posts)
- Off-topic content in scoped communities
- Duplicate posts

Required: lower visibility / require human review at scale.
Failure mode: platform feel deteriorates; users leave.

**4. AI-prompt abuse (NEW in 2024-26)**
- Jailbreak attempts in prompts
- Generating harmful content via your AI features
- Prompt-injection attacks

Required: input filtering + output filtering on AI features.
Failure mode: AI generates illegal or off-policy content; you're liable.

**5. Targeted harassment**
- Coordinated brigading
- Stalking via your product
- Mass-reporting weaponization

Required: per-user rate limits + pattern detection.
Failure mode: vulnerable users harmed; PR disaster.

For my app:
- Content types
- User base risk profile
- Compliance requirements

Output:
1. Top threats
2. Coverage gaps today
3. Priority order

The biggest unforced error: assuming "we don't have user-generated content" when you do. Profile photos, names, custom prompts to AI features, support tickets, billing addresses — any free-text or media field is moderation surface. Audit your fields; pick a coverage strategy.

## The Pipeline Architecture

Help me design the pipeline.

The 4-stage pipeline:

```
User submits → Pre-publish filtering → Post-publish monitoring → User reports → Human review
                      ↓                         ↓                      ↓              ↓
                 Block / Hold             Flag / Remove          Triage queue   Decision + appeal
```


**Stage 1: Pre-publish (synchronous)**

Block obviously-bad content before it goes live.

- Hash matching (CSAM via PhotoDNA): block + report
- High-confidence NSFW detection: block + warn
- Spam classifier: hold for review
- Disposable email / new account: friction (CAPTCHA)

Latency budget: 100ms-2s. User waits.

**Stage 2: Post-publish (async)**

For content that passed Stage 1 but needs deeper analysis.

- Run additional moderation (slower / more expensive models)
- Embedding-based clustering (find spam rings)
- AI scoring across all signals
- If flagged: hide pending review
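
A minimal sketch of the Stage 2 hand-off, assuming a Postgres-backed job table; `moderation_jobs`, the `posts.visibility` column, and `deepModerate` are placeholder names for this sketch, not a specific library:

```typescript
import { Pool } from 'pg';

const db = new Pool();

// Placeholder for the slower pass: bigger models, embeddings, cross-signal scoring.
declare function deepModerate(contentId: string): Promise<{ flagged: boolean }>;

// Called right after Stage 1 allows the content to go live.
async function enqueuePostPublishCheck(contentId: string) {
  await db.query(
    `INSERT INTO moderation_jobs (content_id, status) VALUES ($1, 'pending')`,
    [contentId]
  );
}

// Background worker, run every minute or on a queue trigger.
async function runPostPublishCheck() {
  const { rows } = await db.query(
    `UPDATE moderation_jobs SET status = 'running'
     WHERE id = (SELECT id FROM moderation_jobs WHERE status = 'pending'
                 ORDER BY created_at LIMIT 1 FOR UPDATE SKIP LOCKED)
     RETURNING content_id`
  );
  if (rows.length === 0) return;

  const verdict = await deepModerate(rows[0].content_id);
  if (verdict.flagged) {
    // Hide pending human review rather than hard-deleting.
    await db.query(
      `UPDATE posts SET visibility = 'hidden_pending_review' WHERE id = $1`,
      [rows[0].content_id]
    );
  }
  await db.query(`UPDATE moderation_jobs SET status = 'done' WHERE content_id = $1`,
    [rows[0].content_id]);
}
```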

**Stage 3: User reports**

Users flag content they see as problematic.

- "Report" button on every post / comment / profile
- Categorized reasons (harassment / spam / illegal / off-topic)
- Reports go to triage queue
- Auto-action when N reports cross threshold

**Stage 4: Human review queue**

For content escalated by stages 1-3.

- Trust & safety team / contractor reviews
- Decisions: keep / remove / ban user / escalate to legal
- Track reviewer accuracy
- 24-72h SLA on review
- Appeal process for users

**Decision flow per item**:

```
Item submitted
      ↓
Pre-publish check
  • High-confidence harmful → BLOCK + log
  • Medium confidence → ALLOW but flag for stage 2
  • Low confidence / clean → ALLOW
      ↓
Post-publish check (async, 1-60 min)
  • High confidence → REMOVE (if visible) + flag user
  • Medium → SHADOW BAN (visible to author only)
  • Low → leave alone
      ↓
User reports surface
  • 1 report → QUEUE for review
  • 5+ reports OR trusted reporter → AUTO-HIDE pending review
      ↓
Human review
  • Confirm: decision sticks
  • Reverse: restore + warn reporter (if false-flagging)
```

For my product: [content types]

Output:
1. Per-content-type pipeline
2. Latency budgets
3. Tooling per stage

The principle: automated does the volume; humans do the nuance. Auto-blocking high-confidence violations is essential at scale. Auto-blocking medium-confidence kills legit posts. Tier the system.

## Text Moderation: The 2026 Stack

Help me set up text moderation.

The 2026 layered approach:

**Layer 1: Allowlist / Denylist (cheap)**

Keyword filters; regex; URL blocklists.

Pros: instant; deterministic
Cons: brittle; bypassable; misses nuance

Use for: known-bad URLs (phishing); obvious slurs (with context-awareness).
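
A minimal sketch of what Layer 1 looks like in practice; the term lists and pattern names here are placeholders, not a real blocklist:

```typescript
// Keep the lists small and high-precision; Layer 1 should only catch the unambiguous stuff.
const blockedTerms: string[] = ['known-scam-phrase', 'known-slur'];              // placeholders
const blockedUrlPatterns: RegExp[] = [/bit\.ly\/known-phish/i, /cheap-pills\.example/i];

function denylistCheck(text: string): { hit: boolean; reason?: string } {
  const lower = text.toLowerCase();
  for (const term of blockedTerms) {
    if (lower.includes(term)) return { hit: true, reason: `term:${term}` };
  }
  for (const pattern of blockedUrlPatterns) {
    if (pattern.test(text)) return { hit: true, reason: `url:${pattern.source}` };
  }
  return { hit: false };
}
```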

**Layer 2: ML classifier (medium)**

OpenAI Moderation API (free), Google Perspective API, AWS Comprehend.

```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const moderation = await openai.moderations.create({
  input: userText,
});

// returns flags by category:
// hate, sexual, violence, self-harm, harassment
const result = moderation.results[0];
if (result.flagged && result.category_scores.sexual > 0.9) {
  // Block
}
```

Pros: free / cheap (the OpenAI Moderation endpoint is free); good enough for ~85% of cases
Cons: false positives on edge cases; misses context

Use for: first-pass filtering; catching obvious violations.

**Layer 3: LLM-based contextual moderation (expensive)**

For ambiguous cases, run an LLM:

```typescript
const decision = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{
    role: 'system',
    content: `You are a content moderator for a SaaS platform.
    Categories of policy violations:
    - Spam (commercial unsolicited)
    - Harassment (targeted abuse)
    - Off-topic
    - Hateful (slurs, dehumanization)
    Reply with JSON: { violation: bool, category: string, confidence: 0-1 }`,
  }, {
    role: 'user',
    content: `Moderate this content:\n\n${text}`
  }],
  response_format: { type: 'json_object' },
});
```

Pros: nuanced; understands context; multi-lingual
Cons: $0.001-0.01/check; latency; potential prompt injection

Use for: borderline cases; multi-lingual content; tone-sensitive judgments.

**Layer 4: Embedding-based clustering (offline)**

Embed all posts → cluster → identify spam rings (1000 posts, all near-identical embeddings, all from new accounts).

```typescript
// Periodically:
const embeddings = await getEmbeddings(recentPosts);
const clusters = await dbScan(embeddings);
const suspiciousClusters = clusters.filter(c =>
  c.size > 50 && c.uniqueAuthors < 10 && c.medianAccountAge < 7
);
```

Catches coordinated spam that individual checks miss.

The orchestration:

```typescript
async function moderateText(text: string, userId: string) {
  // Layer 1: cheap denylist
  if (denylist.matches(text)) {
    return { decision: 'block', reason: 'denylist' };
  }

  // Layer 2: ML classifier (free)
  const ml = await openai.moderations.create({ input: text });
  if (ml.results[0].flagged) {
    const top = getTopCategory(ml.results[0]);
    if (top.score > 0.95) return { decision: 'block', reason: top.category };
    if (top.score > 0.7) return { decision: 'review', reason: top.category };
  }

  // Layer 3: LLM only if borderline AND from suspicious user
  if (isUserSuspicious(userId) && ml.results[0].flagged) {
    return await llmModerate(text);
  }

  return { decision: 'allow' };
}
```

For my use case: [text types]

Output:

  1. Layer config
  2. Tools per layer
  3. Cost estimate per 1K items

The cost-killer: **running LLM on every post**. Layer cheaper checks first; reserve LLM for the 5-10% of borderline cases. Total moderation cost should be < 1% of revenue at indie scale.

## Image / Video Moderation

Help me set up image moderation.

The 2026 stack:

**Layer 1: Hash matching (CSAM)**

PhotoDNA (Microsoft, free for qualified providers) hashes every uploaded image; matches against known CSAM database.

Action on match:

  1. Block immediately
  2. Preserve evidence (DON'T delete the file; preserve for law enforcement)
  3. Report to NCMEC (in US; legal requirement; auto via PhotoDNA partner)
  4. Suspend user account

This is non-optional for any host of user-uploaded images.
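
A sketch of the action-on-match flow. `photoDnaMatch`, `evidenceStore`, `reportToNcmec`, and `suspendUser` are hypothetical wrappers; the real PhotoDNA and NCMEC integrations come through Microsoft's partner onboarding, not a public npm package:

```typescript
// Hypothetical wrappers around the partner-provided services (assumptions, not real APIs).
declare function photoDnaMatch(image: Buffer): Promise<{ isMatch: boolean; matchId?: string }>;
declare const evidenceStore: { preserve(image: Buffer, meta: object): Promise<string> };
declare function reportToNcmec(evidenceRef: string): Promise<void>;
declare function suspendUser(userId: string): Promise<void>;

async function screenUpload(image: Buffer, userId: string): Promise<'accepted' | 'blocked'> {
  const result = await photoDnaMatch(image);
  if (!result.isMatch) return 'accepted';

  // 1. Block: the file never becomes visible to anyone.
  // 2. Preserve: write to a locked-down evidence store; do NOT delete the original.
  const evidenceRef = await evidenceStore.preserve(image, { userId, matchId: result.matchId });
  // 3. Report to NCMEC (in practice automated via the PhotoDNA partner integration).
  await reportToNcmec(evidenceRef);
  // 4. Suspend the uploading account.
  await suspendUser(userId);
  return 'blocked';
}
```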

**Layer 2: NSFW detection**

  • AWS Rekognition Moderation: $1/1000 images
  • Google Cloud Vision SafeSearch
  • Hive Moderation
  • Sightengine
  • Cloudflare Images (built-in basic moderation)

Returns categories with confidence:

  • Explicit nudity
  • Suggestive
  • Violence
  • Drugs
  • Gore

Per category: action threshold.

```typescript
// AWS SDK v2 style; v3 uses RekognitionClient + DetectModerationLabelsCommand.
import { Rekognition } from 'aws-sdk';

const rekognition = new Rekognition();

const labels = await rekognition.detectModerationLabels({
  Image: { Bytes: buffer },
  MinConfidence: 60,
}).promise();

const blocked = labels.ModerationLabels?.some(l =>
  ['Explicit Nudity', 'Sexual Activity'].includes(l.Name ?? '') && (l.Confidence ?? 0) > 90
);
```

**Layer 3: Custom violations (your TOS)**

Train your own classifier or use Hive's custom categories (logos, weapons, drugs, etc.).

Pre-trained for:

  • Brand marks / logos (impersonation detection)
  • Weapons
  • Drug paraphernalia
  • ID documents (privacy)
  • Intimate images shared without consent (revenge porn)

**Layer 4: Video moderation**

Sample frames every N seconds; run image moderation on samples.

For long videos: extract every 2-5s; run; aggregate decisions.
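
A sketch of the DIY version, assuming ffmpeg is installed; `moderateImage` stands in for the same Rekognition/Hive call used on still images in Layer 2:

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import { readdir, readFile } from 'node:fs/promises';
import { join } from 'node:path';

const run = promisify(execFile);

// Placeholder for the single-image moderation call from Layer 2.
declare function moderateImage(image: Buffer): Promise<{ blocked: boolean }>;

async function moderateVideo(videoPath: string, framesDir: string) {
  // fps=1/3 → one frame every 3 seconds
  await run('ffmpeg', ['-i', videoPath, '-vf', 'fps=1/3', join(framesDir, 'frame_%04d.jpg')]);

  const frames = await readdir(framesDir);
  const verdicts = await Promise.all(
    frames.map(async f => moderateImage(await readFile(join(framesDir, f))))
  );
  // Aggregate: one bad frame is enough to block the whole video.
  return verdicts.some(v => v.blocked) ? 'blocked' : 'allowed';
}
```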

Tools:

  • Hive Video — handles full video
  • AWS Rekognition Video — frame sampling built-in
  • Sightengine Video

Animated detection:

GIFs and short videos need animated-content moderation. The tools above handle it automatically; just enable it.

For my pipeline: [image types]

Output:

  1. Per-layer setup
  2. Tool picks
  3. Action thresholds

The legal baseline: **PhotoDNA on every user-uploaded image**. Reporting discovered CSAM to NCMEC is a federal legal requirement; proactive hash matching is the industry-standard way to meet it, and practically non-optional. Apply for free access (https://www.microsoft.com/photodna). Implement at upload time. NCMEC reporting is automated by PhotoDNA partners.

## User Reports & Triage Queue

Help me handle user reports.

The system:

The Report button:

Visible on every piece of user-generated content. Click → modal with:

  • Category dropdown (Spam / Harassment / NSFW / Illegal / Other)
  • Optional comment box
  • Report submitted; thank-you message

Schema:

```sql
CREATE TABLE content_reports (
  id UUID PRIMARY KEY,
  reporter_user_id UUID NOT NULL REFERENCES users,
  reported_content_id UUID NOT NULL,
  reported_content_type VARCHAR(50) NOT NULL,
  reported_user_id UUID REFERENCES users,
  category VARCHAR(50) NOT NULL,
  comment TEXT,
  status VARCHAR(20) DEFAULT 'pending',  -- pending | reviewed | actioned | dismissed
  reviewed_by UUID REFERENCES users,
  reviewed_at TIMESTAMPTZ,
  decision VARCHAR(50),  -- 'remove', 'keep', 'warn', 'ban'
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON content_reports (status, created_at);
CREATE INDEX ON content_reports (reported_content_id);
```

Triage rules:

1 report → queue for review (low priority)
3+ reports → priority queue
5+ reports OR trusted reporter → auto-hide pending review
Report from verified-legal-counsel → highest priority + legal review
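
The same rules as a minimal sketch; the thresholds mirror the list above:

```typescript
type Priority = 'low' | 'high' | 'urgent';

interface TriageResult {
  priority: Priority;
  autoHide: boolean;
  legalReview: boolean;
}

function triage(reportCount: number, trustedReporter: boolean, fromLegalCounsel: boolean): TriageResult {
  if (fromLegalCounsel) return { priority: 'urgent', autoHide: true, legalReview: true };
  if (reportCount >= 5 || trustedReporter) return { priority: 'high', autoHide: true, legalReview: false };
  if (reportCount >= 3) return { priority: 'high', autoHide: false, legalReview: false };
  return { priority: 'low', autoHide: false, legalReview: false };
}
```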

Reporter weighting:

Track reporter accuracy:

  • Reports that led to action: positive signal
  • Reports dismissed: neutral
  • Repeated false reports: suspicious; deweight

```sql
ALTER TABLE users ADD COLUMN reporter_accuracy DECIMAL(3,2) DEFAULT 0.5;
```

Update on each review decision.

A user with 90% report accuracy = trusted; their reports auto-flag. A user with 10% accuracy = malicious reporter; their reports deweighted or ignored.
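
One way to maintain the score, sketched as an exponential moving average so recent behavior dominates; the 0.2 weight is an arbitrary choice for illustration, not a recommendation:

```typescript
import { Pool } from 'pg';

const db = new Pool();

async function updateReporterAccuracy(
  reporterId: string,
  outcome: 'actioned' | 'dismissed' | 'false_report'
) {
  if (outcome === 'dismissed') return;               // neutral: no change
  const target = outcome === 'actioned' ? 1 : 0;     // actioned pulls toward 1, false reports toward 0
  await db.query(
    `UPDATE users
     SET reporter_accuracy = ROUND(reporter_accuracy * 0.8 + $2::numeric * 0.2, 2)
     WHERE id = $1`,
    [reporterId, target]
  );
}
```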

Anti-weaponization:

  • Per-user report rate limit (e.g. 10/day)
  • Cooldown on reporting same user (1/week per target)
  • Patterns: "User A reports User B 50x in 24h" = auto-flag for harassment investigation
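
A sketch of the first two checks against the `content_reports` table above; the limits mirror the list, and the pattern detection is better run as a periodic query than inline:

```typescript
import { Pool } from 'pg';

declare const db: Pool; // same pool as the previous sketch

async function canSubmitReport(reporterId: string, targetUserId: string): Promise<boolean> {
  // Cap: 10 reports per reporter per day.
  const daily = await db.query(
    `SELECT COUNT(*) AS n FROM content_reports
     WHERE reporter_user_id = $1 AND created_at > NOW() - INTERVAL '1 day'`,
    [reporterId]
  );
  if (Number(daily.rows[0].n) >= 10) return false;

  // Cooldown: one report per week against the same target.
  const sameTarget = await db.query(
    `SELECT COUNT(*) AS n FROM content_reports
     WHERE reporter_user_id = $1 AND reported_user_id = $2
       AND created_at > NOW() - INTERVAL '7 days'`,
    [reporterId, targetUserId]
  );
  return Number(sameTarget.rows[0].n) < 1;
}
```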

Response to reporter:

After review:

  • "Thanks for your report. We took action: [removed]"
  • Or "We reviewed and didn't find a violation. Thanks anyway."

This validates the system; encourages future reports.

For my product: [reporter base]

Output:

  1. Reports schema
  2. Triage rules
  3. Anti-weaponization
  4. Reporter feedback

The discipline: **respond to every report within 48 hours**, even if "no action." Silence makes users feel reports are ignored; they stop reporting; bad content piles up. Automated "report received" + human review with status update wins.

## Human Review Queue

Help me run a review queue.

The queue UI (simple):

[Item]
[Reporter info: trusted / new / repeat reporter]
[Categories flagged]
[Auto-moderation scores]
[Author history: account age, prior actions]
[Buttons: Keep | Hide | Remove | Ban User | Escalate]
[Notes box]

Reviewer per-item time: 30-60 seconds for clear cases; 2-5 min for ambiguous.

SLA targets:

  • Illegal content: <4 hours
  • Active harassment: <12 hours
  • TOS violations: <48 hours
  • Quality / spam: <72 hours
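
A small sketch that turns those targets into a due-by timestamp for ordering the queue; the category names are assumptions about how you label reports:

```typescript
const SLA_HOURS: Record<string, number> = {
  illegal: 4,
  harassment: 12,
  tos_violation: 48,
  quality: 72,
};

function dueBy(category: string, reportedAt: Date): Date {
  const hours = SLA_HOURS[category] ?? 48;                         // default to the TOS SLA
  return new Date(reportedAt.getTime() + hours * 60 * 60 * 1000);
}

// Sort the review queue by dueBy ascending so the tightest SLAs surface first.
```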

Staffing:

Coverage hours:

  • Pre-revenue: founder reviews; 30-60 min/day
  • $100K-1M ARR: contractor 4-8h/week
  • $1M-10M ARR: 1-2 part-time T&S contractors
  • $10M+ ARR: T&S team

Outsourcing:

  • Hive Trust & Safety — managed reviewer pool
  • Pactera Edge — outsourced moderation
  • Concentrix — bigger BPO

DIY platforms:

  • Modulate.ai — voice moderation
  • Spectrum Labs — text + image moderation platforms
  • Hive Trust & Safety — review queue UI + AI assist

Reviewer training:

  • Day 1: TOS reading; calibration on 50 example items
  • Day 2-7: shadowed reviews; a lead audits the new reviewer's own decisions
  • Ongoing: weekly calibration on new edge cases
  • Mental health: rotation; not 8h/day on harmful content; counseling access

Reviewer accuracy tracking:

Random-sample reviewed items; senior reviewer audits. Score reviewers; coach low-accuracy.

Decision logging:

```sql
CREATE TABLE moderation_decisions (
  id UUID PRIMARY KEY,
  content_id UUID NOT NULL,
  reviewer_id UUID NOT NULL,
  decision VARCHAR(50) NOT NULL,
  reason VARCHAR(100),
  notes TEXT,
  reviewed_at TIMESTAMPTZ DEFAULT NOW(),
  appealed_at TIMESTAMPTZ,
  appeal_decision VARCHAR(50)
);
```

For my T&S team: [stage]

Output:

  1. Queue UI requirements
  2. SLA per category
  3. Staffing plan
  4. Training docs

The hidden cost: **secondary trauma** for reviewers. Burnout is high; turnover is high; counseling is non-optional. Even at small scale, watch for it. Outsourcing CSAM review specifically is the norm — don't make a junior engineer do it.

## Appeals Process

Help me set up appeals.

The legal context:

EU Digital Services Act (DSA): requires a meaningful appeal mechanism for content removals. California / New York laws: similar requirements emerging. Best practice everywhere: appeals build user trust.

The flow:

```
Content removed → User notified → "Appeal" button
              ↓
       User explains why
              ↓
       Different reviewer (not original)
              ↓
       Decision: uphold / overturn
              ↓
       User notified
```

The notification:

```
Subject: Your content was removed

Hi [user],

We removed your [post/comment/image] for [reason].

If you believe this was an error, you can appeal:
[Appeal button]

Your appeal will be reviewed by a different team member within [N] business days.

Read our content policy: [link]
Read our enforcement guidelines: [link]
```

Appeal review:

  • Different reviewer than original
  • Reviews original content + reporter info + appeal reasoning
  • Decision: uphold (final) or overturn (restore + apologize)
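
A sketch of routing the appeal away from the original reviewer; `moderation_decisions` is the table defined earlier, while the `reviewers` table and its columns are assumptions for this sketch:

```typescript
import { Pool } from 'pg';

const db = new Pool();

async function assignAppealReviewer(contentId: string): Promise<string> {
  // Who made the original call?
  const original = await db.query(
    `SELECT reviewer_id FROM moderation_decisions
     WHERE content_id = $1 ORDER BY reviewed_at DESC LIMIT 1`,
    [contentId]
  );
  const originalReviewerId = original.rows[0]?.reviewer_id ?? null;

  // Pick the least-loaded active reviewer who isn't that person.
  const candidate = await db.query(
    `SELECT id FROM reviewers
     WHERE active = true AND ($1::uuid IS NULL OR id <> $1)
     ORDER BY open_case_count ASC LIMIT 1`,
    [originalReviewerId]
  );
  if (candidate.rows.length === 0) throw new Error('No alternate reviewer available');
  return candidate.rows[0].id;
}
```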

Track:

  • Appeal volume
  • Overturn rate (>15% = original moderation too aggressive)
  • Reviewer agreement rate (low = inconsistent rubric)

Repeat appeals:

When an appeal overturns the original reviewer's decision, track it. If a pattern emerges (one reviewer with a high overturn rate), retrain or remove that reviewer.

Bad-faith appeals:

Limit appeals per user per month. After 5 dismissed appeals: deweight future ones.

For my product:

  • Appeal volume expected
  • Reviewer process

Output:

  1. Appeal flow
  2. Notification template
  3. Tracking

The DSA-compliant discipline: **a different reviewer handles the appeal**. The original reviewer can't review their own decision. Documented; auditable; defensible to regulators.

## Logging for Legal & Compliance

Help me log for legal.

The retention:

For ALL moderation actions, log:

  • Content ID + content snapshot (preserve even after deletion for legal)
  • Reviewer ID + decision + timestamp
  • Reason / category
  • Appeals + outcomes

```sql
CREATE TABLE moderation_audit (
  id UUID PRIMARY KEY,
  event VARCHAR(50) NOT NULL,  -- 'submitted', 'auto-blocked', 'reviewed', 'removed', 'appealed', 'restored'
  content_id UUID,
  user_id UUID,
  reviewer_id UUID,
  decision VARCHAR(50),
  reason VARCHAR(100),
  metadata JSONB,
  ip INET,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
```

Retention:

  • Action logs: 7 years (evidence)
  • Removed content: the minimum required for legal purposes (varies; CSAM evidence is retained until law enforcement says otherwise)
  • User reports: 3 years
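
A sketch of a scheduled retention job under the policy above; it deliberately never touches removed-content snapshots or preserved CSAM evidence, which follow legal holds rather than a timer:

```typescript
import { Pool } from 'pg';

const db = new Pool();

// Run from a daily cron. Deletes only what the retention policy allows.
async function enforceRetention() {
  // User reports: 3 years.
  await db.query(
    `DELETE FROM content_reports WHERE created_at < NOW() - INTERVAL '3 years'`
  );
  // Moderation action logs: keep 7 years, purge only older rows.
  await db.query(
    `DELETE FROM moderation_audit WHERE created_at < NOW() - INTERVAL '7 years'`
  );
}
```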

Specific legal compliances:

CSAM (US): once CSAM is discovered, reporting to NCMEC is mandatory under federal law; PhotoDNA hash matching is the standard way to discover it. Preserve evidence; don't delete.

EU DSA: Transparency reports (volume of removals, categories, appeals); Trusted Flaggers; meaningful appeal.

EU generally: align GDPR and DSA obligations; non-EU companies serving EU users need a designated representative in the EU.

UK Online Safety Act: similar to DSA; requires risk assessments.

Section 230 (US): protects platforms from liability for user content, and separately protects good-faith moderation decisions. Document your moderation; don't appear to encourage violations.

For my company:

  • Jurisdictions
  • Content types

Output:

  1. Audit schema
  2. Retention policy
  3. Compliance per jurisdiction
  4. Reporting cadence

The legal must-do: **retain CSAM evidence**. If you discover CSAM, do NOT delete. Block from view; preserve file; report to NCMEC; await law enforcement direction. Deleting destroys evidence; carries criminal liability.

## Common Moderation Mistakes

Help me avoid mistakes.

The 10 mistakes:

1. "We don't have UGC" denial You do. Profile photos, names, AI prompts, support tickets. Audit.

2. Skipping PhotoDNA on image upload CSAM legal liability; non-negotiable.

3. No human review escalation path Auto-only = both over- and under-removal at scale.

4. Reviewer burnout / no rotation T&S work is traumatic; rotation + counseling are required.

5. No appeal process DSA / CA / NY laws + user trust both demand appeals.

6. Reporter weaponization Mass-reporting one user shouldn't auto-ban; build resilience.

7. Not logging removed content Removed → deleted → no evidence; legal exposure.

8. Treating moderation as "TOS update" TOS is paper; pipeline is operational. Need both.

9. Public transparency report missing Required by DSA + good for trust; publish quarterly.

10. Ignoring AI-prompt abuse If users prompt YOUR AI features, harmful generations are YOUR liability.

For my product: [risks]

Output:

  1. Top 3 risks
  2. Mitigations
  3. Process changes

The mistake that ends careers: **delaying CSAM action**. Discover → act in minutes, not hours. Preserve evidence; report; suspend. Document everything. Federal investigators don't care about your sprint planning.

## What Done Looks Like

A working moderation pipeline delivers:
- Pre-publish + post-publish + user-report + human-review tiers
- PhotoDNA + NCMEC for CSAM (mandatory)
- ML classifier (OpenAI / AWS / Google) for first-pass text
- Vision API (AWS / Google / Hive) for first-pass image
- LLM for borderline / nuanced cases
- User report button on every piece of UGC
- Triage queue with SLA (4h illegal / 48h TOS)
- Different-reviewer appeal process
- Logging + 7-year retention for legal
- Quarterly transparency report
- Reviewer training + rotation + counseling
- Reporter accuracy weighting; weaponization detection

The proof you got it right: harmful content is rare; reports get responses within SLA; the appeal overturn rate is under 15%; the transparency report shows the system working; and you can survive the day a journalist asks "what's your moderation process?" without panic.

## See Also

- [CAPTCHA & Bot Protection](captcha-bot-protection-chat.md) — companion abuse-protection layer
- [Image Upload & Processing Pipeline](image-upload-processing-pipeline-chat.md) — moderation at upload time
- [Rate Limiting & Abuse](rate-limiting-abuse-chat.md) — companion rate-limit layer
- [Audit Logs](audit-logs-chat.md) — moderation actions feed audit
- [Roles & Permissions](roles-permissions-chat.md) — reviewer permissions
- [Account Deletion & Data Export](account-deletion-data-export-chat.md) — banned-user cleanup
- [Customer Support Chat](customer-support-chat.md) — appeals flow into support
- [Internal Admin Tools](internal-admin-tools-chat.md) — moderation queue UI
- [Logging Strategy & Structured Logs](logging-strategy-structured-logs-chat.md) — moderation events
- [Customer Support Tools](https://vibereference.dev/product-and-design/customer-support-tools) — appeals routing
- [VibeReference: Bot Detection Providers](https://vibereference.dev/devops-and-tools/bot-detection-providers) — companion vendor landscape