
Video Upload & Processing Pipeline: Ship User-Generated Video Without Building Your Own Encoder


Video Pipeline Strategy for Your New SaaS

Goal: Let users upload video into your product and play it back smoothly across devices and connection speeds — without building a transcoding farm, debugging FFmpeg flags at 2 a.m., or accidentally serving 4K originals to phones on cellular. Pick a managed video platform that fits your scale and use case (Mux, Cloudflare Stream, Bunny Stream, AWS MediaConvert + S3 + CloudFront, or DIY FFmpeg only as a last resort), design the upload + processing + playback flow as one coherent pipeline (not four bolted-on services), and treat captions, thumbnails, analytics, and DRM as first-class concerns from day one — not retrofits. Avoid the founder failure modes where you store originals and stream them as-is (broken on mobile, eats bandwidth), where you build "a quick FFmpeg microservice" (it sprawls into a full encoder farm), or where you ship without captions/accessibility (legal exposure plus ~30% of users watch with sound off).

Process: Follow this chat pattern with your AI coding tool such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: Working upload → playback flow with a managed platform in 2-3 days. Full pipeline (multi-bitrate, captions, thumbnails, analytics, security) in 1-2 weeks. DIY pipeline with FFmpeg + S3: weeks-to-months and ongoing operational tax.


Why "Just Upload to S3 and Play It" Doesn't Work

Four failure modes hit founders the same way:

  • Serving raw uploads as MP4 from S3. Works for 1080p clips watched on Wi-Fi laptops. Falls apart for: anyone on cellular (massive bandwidth cost; stuttering); seeking (slow or unreliable when the MP4's index sits at the end of the file, as it often does in phone recordings, because the browser has to fetch much more of the file before it can jump); large files (the user uploads a 4GB phone video; you serve it to everyone forever); accessibility (no captions, no transcripts).
  • "We'll add transcoding later." "Later" means: a year of bigger bandwidth bills, broken playback complaints in support, customers churning because mobile playback stutters. Retrofitting transcoding into a working product is harder than doing it from day one.
  • DIY FFmpeg microservice. Starts as a Cloud Run job. Six months later it's a spot-fleet of GPUs, a queue, retries, dead-letter handling, captions extraction, thumbnail generation, format detection, codec failures from weird phone recordings, and a permanent on-call rotation. You've built the wrong company.
  • No captions / accessibility. ADA / EN-301-549 / Section 508 expose you to legal risk. ~30% of users watch on mute. Auto-captions exist as a managed feature; not turning them on is leaving free wins on the table.

The version that works: pick a managed video platform, design upload-direct-to-platform (don't proxy through your servers), define adaptive bitrate output, enable captions, instrument analytics, secure playback (signed URLs / DRM where required), and iterate on the parts that matter to YOUR product (player skin, thumbnails) instead of the parts that don't (codecs, encoder ladders).

This guide assumes you have already considered File Uploads, Image Upload Processing Pipeline (similar pattern, simpler problem), File Storage / Multi-Region, Background Jobs / Queue Management, Long-Running Operations / Job Status UI, and Multi-Tenancy (video assets must be tenant-scoped). Cross-reference Public Share Links / Permissioned Sharing (often the playback flow), API HTTP Caching (CDN considerations), and Customer Notes / Internal Annotations (transcripts often integrate with notes/search).


1. Decide What "Video" Means for Your Product

Different video shapes need different platforms.

Help me decide which video shape my product is. The shapes:

**Shape 1: User-generated short clips (asynchronous)**
- Users upload videos that other users (or themselves) watch later
- Examples: video messages, recorded demos, training videos, marketplace
  product videos, social posts
- Length: typically <5 minutes
- Volume per user: low to moderate
- Stakes: medium (users notice playback quality)

**Shape 2: Long-form recorded content**
- Webinars, courses, recorded meetings, podcast episodes with video
- Length: 10-60+ minutes
- Volume: lower per upload but bigger files
- Stakes: high (users won't tolerate stuttering on long videos)

**Shape 3: Live streaming**
- Real-time broadcast: live workshops, town halls, live customer support,
  events
- Stakes: extreme (a glitch during a live event is brutal)
- Common adjacent need: chat, reactions, live polling

**Shape 4: Real-time video calls (1:1 or N:N)**
- WebRTC video conferencing
- Different infrastructure entirely (LiveKit, Daily, 100ms, Twilio Video,
  Agora) — NOT this article's focus

**Shape 5: Short-loop video (auto-play, muted, looping)**
- Hero videos, product demos on landing pages, GIF replacements
- Often loaded as `<video autoplay muted loop>`
- Want: fast first-frame, small file, no audio track

My product:
- What's the use case? [shape from above]
- Who watches the video? [the uploader / other users / public]
- Length distribution? [...]
- Volume: how many uploads per day across all users? [...]
- Mobile vs. desktop balance? [...]
- Accessibility / regulatory requirements? [ADA / Section 508 / EU EAA / none]
- DRM / piracy concerns? [yes — premium content / no]
- Live: do I need it on day one or can I add later? [...]

Tell me:
1. The shape my product fits
2. Whether I should build live + on-demand together or sequence them
3. The minimum viable video pipeline for shape #1 / #2
4. The thing I should explicitly NOT build for v1

Picking heuristic: Shape 1 + 2 (on-demand video) is the common B2B SaaS need; managed platforms solve it cleanly. Shape 3 (live) is its own complexity tier — do it only when the product demands it. Shape 4 (calls) is a different stack entirely.


2. Pick a Managed Video Platform

Don't build FFmpeg infrastructure. The managed platforms are mature.

Help me pick a managed video platform. My constraints:
[fill in from the checklist below]

The leading options and their sweet spots:

**Mux**
- Developer-first; great DX; extensive analytics built in (Mux Data)
- Direct uploads from the browser supported (Mux Direct Upload)
- Per-minute pricing — clean for low-to-mid volume
- Strong API + great docs
- HLS-first; signed URLs; live + on-demand
- Best for: indie / mid-market wanting strong DX and analytics

**Cloudflare Stream**
- Bundled with Cloudflare CDN (already cheap egress); flat per-minute pricing
- Streams stored & served from Cloudflare; no separate origin cost
- Watermarking, signed URLs, live, basic analytics
- Less feature-rich than Mux on analytics; cheaper at scale
- Best for: cost-sensitive workloads; teams already on Cloudflare

**Bunny Stream**
- Aggressive pricing; runs on bunny.net's CDN
- Good for high-volume, lower-margin use cases
- Less polished than Mux on advanced features
- Best for: cost-driven; long-tail / high-volume / less-premium content

**AWS MediaConvert + S3 + CloudFront (DIY but managed pieces)**
- Use AWS-native services: MediaConvert for transcoding, S3 for storage,
  CloudFront for delivery
- More work to glue together; gives you full control
- Pricing is usage-based; can be cheap at scale; complex to estimate
- Best for: existing AWS-heavy stacks; specialized requirements

**api.video**
- France-based; strong in EU; good DX
- Player + analytics + transcoding bundled
- Best for: EU residency / regional play

**Vimeo OTT / Wistia**
- More marketing-video / business-video oriented
- Higher prices; less developer-flexible
- Best for: video as a product (online courses, paid memberships) — not
  as a feature inside another product

**Pexip / Theta / DeepWave / Coconut / Bitmovin Encoding**
- B2B encoder-as-a-service (you'd still serve via your own CDN)
- Use when you want unbundled encoder + your own delivery infra

**DIY FFmpeg + your own infra**
- Last resort; only if you have unusual codec requirements or
  cost-at-scale forces it
- Budget: 1-2 senior engineers ongoing

My situation:
- Volume per month (minutes uploaded and minutes watched): [...]
- Budget for video infra per month: [...]
- DRM / signed URLs required: [...]
- Live streaming required: [day one / later / never]
- EU residency required: [...]
- Existing CDN: [Cloudflare / CloudFront / Fastly / none]
- Team experience with media stacks: [zero / some / extensive]

Tell me:
1. Top pick + 1 backup
2. The wrong pick I should explicitly avoid given my constraints
3. The sign-up + integration steps for the top pick
4. The pricing trap to watch for at my expected scale

Decision heuristic:

  • Default for indie / startup with strong DX: Mux
  • On Cloudflare already + cost-sensitive: Cloudflare Stream
  • High-volume / cheaper margins: Bunny Stream
  • Heavy AWS shop / complex requirements: MediaConvert + S3 + CloudFront
  • EU residency: api.video or Cloudflare Stream (with EU options)

3. Build the Upload Flow (Direct-to-Platform)

Don't proxy uploads through your servers. Have the client upload directly to the video platform.

I want to ship video upload that:
- Goes DIRECTLY from the user's browser to the video platform
- Doesn't proxy through my server (saves bandwidth, time, complexity)
- Supports resumable uploads for big files (tus protocol, or platform-specific
  resumable APIs)
- Shows accurate progress in the UI
- Works on mobile (iOS Safari is picky)
- Validates file size + format before upload starts

Architecture:
1. **Server creates an upload URL**: my server hits the video platform's
   "create direct upload" API, gets back a one-time upload URL + asset ID,
   stores the asset ID in my DB tied to the user/workspace, returns the
   upload URL to the client. The user never sees my video platform's API key.
2. **Client uploads directly** to the platform URL using the browser's
   XMLHttpRequest (fetch() has no built-in upload progress events), OR a tus
   client library (uppy.io is the gold standard).
3. **Platform processes the video asynchronously**: transcodes, generates
   thumbnails, extracts captions.
4. **Platform notifies my server via webhook** when ready: I update my DB
   record with status=ready and the playback URL/ID.
5. **Client polls my server** OR receives a real-time push (WebSocket / SSE)
   when the asset is ready; UI transitions from "processing" to "ready."

Build me:
- The `POST /assets/upload-init` endpoint that creates the upload URL
- The client uploader using uppy.io (or equivalent) with progress + retry
- The webhook handler `POST /webhooks/video-platform` that verifies signatures
  and updates DB state
- A polling fallback for clients that can't use WebSockets
- The DB schema: assets (id, workspace_id, owner_id, source_filename,
  file_size_bytes, duration_seconds, status, platform_asset_id,
  playback_id, thumbnail_url, error, created_at, processed_at)

Validation BEFORE upload:
- Max file size (e.g., 5GB) — stop the upload before it starts
- Allowed formats: mp4, mov, mkv, webm, avi (let the platform handle codec
  weirdness; just block obviously wrong types like .exe)
- User quota check: has this user/workspace exceeded their plan's video
  storage quota? (See [Quotas / Limits / Plan Enforcement](quotas-limits-plan-enforcement-chat.md))

Mobile-specific:
- iOS Safari: videos recorded with the camera are often HEVC (H.265) in a
  .mov container; the managed platform handles these, but make sure your
  file-input accept attr allows them
- Mobile uploads can drop on connection changes; tus resumable protocol is
  the answer

Critical: never expose the video platform's API key to the client. Only the
one-time signed upload URL.

Trap to flag: many founders proxy uploads through their own server "for security." Don't. It doubles bandwidth costs, doubles latency, and creates a useless bottleneck. Use one-time signed upload URLs.
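
A minimal sketch of the "Validation BEFORE upload" checks, meant to run inside the upload-init endpoint before your server asks the platform for an upload URL. The function name, the `UploadCheck` type, and the choice to measure quota in bytes are assumptions for illustration; the 5 GB cap and the extension list come from the checklist above.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Limits taken from the checklist above; 5 GB is interpreted as binary GB (assumption).
MAX_UPLOAD_BYTES = 5 * 1024**3
ALLOWED_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm", ".avi"}


@dataclass(frozen=True)
class UploadCheck:
    ok: bool
    reason: str = ""


def validate_upload(filename: str, size_bytes: int,
                    used_bytes: int, quota_bytes: int) -> UploadCheck:
    """Reject obviously bad uploads before requesting a platform upload URL."""
    ext = PurePath(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return UploadCheck(False, f"unsupported file type: {ext or 'none'}")
    if size_bytes <= 0:
        return UploadCheck(False, "file is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        return UploadCheck(False, "file exceeds the 5 GB limit")
    # Hypothetical quota model: bytes already stored plus this upload.
    if used_bytes + size_bytes > quota_bytes:
        return UploadCheck(False, "workspace storage quota exceeded")
    return UploadCheck(True)
```

The extension check only blocks obviously wrong types; the client-reported size and name are untrusted, so the platform's own limits remain the real enforcement.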


4. The Processing Pipeline (Adaptive Bitrate, Captions, Thumbnails)

Once uploaded, the platform transcodes. Configure what it produces.

Configure the video processing pipeline. The required outputs:

**Adaptive bitrate (ABR) HLS**
- Multiple renditions: 240p, 360p, 480p, 720p, 1080p (skip 1440p/4K unless
  your product justifies it — costs scale)
- Player switches automatically based on connection
- Format: HLS (.m3u8 manifest + .ts or fMP4 segments) — the universal default
- DASH as a secondary option only if your players need it

**Captions / subtitles**
- Auto-generated captions (most managed platforms, including Mux and
  Cloudflare Stream, offer this; the speech-to-text engine differs by platform)
- Languages: English baseline; other languages by detection or by user
  selection
- Format: WebVTT (.vtt) — universal browser support
- Editable: surface a UI for the uploader (or admin) to fix captions
  errors; persist the edited VTT
- Compliance: required for ADA, EN-301-549, Section 508 in many jurisdictions

**Transcript**
- Full-text transcript extracted from captions
- Use cases: search across video content, SEO, "find the part where they
  said X"
- Store as JSONB or in your search index ([Search](search-chat.md))

**Thumbnails**
- Auto-extract: poster frame at 0:00 (or first non-black frame); strip of
  thumbnails at fixed intervals for the seek-bar preview
- Custom: let the user pick a custom poster frame from the strip
- Animated GIF / preview: a short (3-6s) animated preview for hover states

**Audio extraction**
- Often you also want the audio-only version (podcast feed, transcription,
  archive)
- Most platforms produce this automatically; opt-in if not

**Optional: DRM**
- Widevine (Chrome / Android), FairPlay (Safari / iOS), PlayReady (Edge /
  Xbox)
- Required only for premium content / piracy-sensitive workloads
- Adds significant cost + complexity; skip unless explicitly needed

**Optional: Watermarking**
- Visible (logo overlay, user ID stamp) for piracy deterrence
- Forensic / per-user watermarking (different watermark for each viewer) for
  content traceability — premium feature

**Configure what NOT to produce**
- Skip 4K renditions unless 4K matters to your product (encoding + storage
  + bandwidth all scale)
- Skip multi-language captions for v1 unless required
- Skip DRM for v1 unless required

For my use case:
[describe and adjust the above settings]

Build me:
- The platform-side configuration (Mux setup / Cloudflare Stream config /
  MediaConvert preset)
- The DB schema for caption tracks, transcripts, thumbnails
- The UI for editing auto-captions
- The transcript-search integration

Cost trap: every additional rendition + caption language + DRM provider compounds your per-minute cost. Pick the minimum viable bitrate ladder and language list.
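
To make the "transcript extracted from captions" step concrete, here is a small sketch that turns a WebVTT caption file into timed cues and a flat transcript string you could store or index. It handles only the common case (optional cue identifiers, `NOTE` blocks, inline tags); the function names are illustrative and it is not a full WebVTT parser.

```python
import re

# Matches "00:00:01.000 --> 00:00:03.500" and the short "01:02.000 --> 01:04.000" form.
_TIMING = re.compile(
    r"^\s*((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)
_TAGS = re.compile(r"<[^>]+>")  # strips inline markup such as <b> or <v Speaker>


def _to_seconds(stamp: str) -> float:
    parts = stamp.split(":")
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def parse_vtt(vtt: str) -> list[dict]:
    """Return cues as {"start", "end", "text"}; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.match(line)
            if not match:
                continue
            text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
            text = _TAGS.sub("", text)
            if text:
                cues.append({"start": _to_seconds(match.group(1)),
                             "end": _to_seconds(match.group(2)),
                             "text": text})
            break
    return cues


def transcript_text(cues: list[dict]) -> str:
    """Flatten cues into one searchable string."""
    return " ".join(cue["text"] for cue in cues)
```

Keeping the cue start times alongside the text is what makes "find the part where they said X" possible: a search hit can link straight to the matching timestamp.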


5. The Player and Playback Flow

The player is what users see. Design it deliberately.

Build the playback experience. Components:

**Player choice**
- Mux Player (if on Mux): pre-built, themable, supports HLS + captions +
  analytics; ~50KB
- Cloudflare Stream Player: works out-of-box; less customizable
- Video.js: open-source player with rich plugin ecosystem; works with any
  HLS source
- Plyr: lightweight, design-friendly OSS player
- HLS.js: low-level if you want a fully custom player UI
- Native HTML5 `<video>`: works for MP4, and plays HLS natively in Safari; other browsers need a library such as HLS.js

For most products: use the platform's player on day one (fastest to ship);
swap to a custom player later if needed.

**Playback URL strategy**
- Public videos: direct playback URL works
- Private videos (this is most B2B SaaS): use signed playback URLs OR signed
  cookies — never expose the raw playback URL to non-authorized users
- Signed playback URLs: server generates a short-lived signed URL when a
  user is authorized to play; client uses that URL
- Token expiry: 1-4 hours typical; long enough for the user to start
  watching, short enough that a leaked link expires fast
- Per-user tokens: tie the URL to the user (so they can't share); check on
  request

**Player UX**
- Custom poster image (from the thumbnail strip)
- Loading skeleton matching your design system
- Captions toggle visible by default
- Playback speed control (0.5x to 2x)
- Quality selector (auto + manual override)
- Fullscreen
- Picture-in-picture
- Keyboard shortcuts (space=play/pause, arrows=seek, m=mute, f=fullscreen)
- Resume-from-where-you-left-off (track playback position per user; resume
  on next view)
- Watch progress indicator (for course/training video products)

**Mobile considerations**
- iOS plays inline by default if you set `playsinline`; otherwise it goes
  fullscreen — this matters for embedded video UI
- Android hardware decoding varies; prefer H.264 baseline + HEVC for newer
  devices
- Test on real devices, not just simulators

**Accessibility**
- Captions on by default for users who have enabled captions in their
  accessibility preferences (note that `prefers-reduced-motion` governs
  animation and autoplay, not captions)
- Keyboard navigation for all controls
- Screen-reader labels on all controls
- Audio descriptions for videos where visual context matters

Build me:
- The player component with the controls above
- The signed-URL generation function on the server
- The progress-tracking endpoint (POST /assets/:id/progress with seconds_watched)
- The "resume where you left off" UX
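
As an illustration of the short-lived, per-user signed playback URL idea, the sketch below signs and verifies a token with HMAC-SHA256. This is a generic scheme for a playback endpoint you control, not any platform's actual format: managed platforms define their own signing mechanisms, so treat the function names and claim layout here as assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def sign_playback_token(secret: bytes, asset_id: str, user_id: str,
                        ttl_seconds: int = 2 * 3600,
                        now: Optional[float] = None) -> str:
    """Return "<payload>.<signature>", valid for ttl_seconds (default 2 hours)."""
    issued = time.time() if now is None else now
    claims = {"asset": asset_id, "user": user_id, "exp": int(issued) + ttl_seconds}
    payload = json.dumps(claims, separators=(",", ":"), sort_keys=True).encode()
    body = base64.urlsafe_b64encode(payload).rstrip(b"=")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{signature}"


def verify_playback_token(secret: bytes, token: str, asset_id: str, user_id: str,
                          now: Optional[float] = None) -> bool:
    """Check the signature first, then asset, user, and expiry."""
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    padded = body + "=" * (-len(body) % 4)  # restore the stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(padded))
    current = time.time() if now is None else now
    return (claims["asset"] == asset_id
            and claims["user"] == user_id
            and claims["exp"] > current)
```

The server should only call `sign_playback_token` after confirming the user may view the asset; the token proves that check happened, it does not replace it.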

6. Security and Access Control

Video pipelines have specific access-control concerns.

Help me secure the video pipeline.

**1. Tenant isolation**
- Every asset is scoped by workspace_id (and optionally user_id)
- Playback authorization: my server checks the requesting user has access
  to the asset BEFORE generating a signed playback URL
- Test: two users in different workspaces cannot guess each other's playback
  IDs and play

**2. Signed playback URLs**
- Default for private content
- Server-side: generate token signed with a secret, scope to the asset_id
  + user_id + expiry
- Client receives the URL; uses it; expires
- Never embed the secret in the client

**3. Hot-link protection**
- Even with signed URLs, set Referrer policies; restrict allowed domains
  on the player; configure the platform's CORS / referer rules to your
  domain only

**4. Watermarking (per-viewer, visible)**
- For premium / leak-sensitive content: per-viewer watermark (user email
  overlay) deters sharing
- Trade-off: visible watermarks degrade UX

**5. DRM**
- Only when truly required (paid content, regulated content)
- Adds: license-server costs, compatibility burden (Safari needs FairPlay,
  Chrome needs Widevine), debugging complexity
- Most B2B SaaS DOESN'T need DRM; signed URLs are sufficient

**6. Upload abuse**
- Rate limit uploads per user per day
- Rate limit total bandwidth per workspace per month (plan-tier enforcement)
- Detect abusive content: scan uploads via [AI Moderation Trust Safety](https://www.vibereference.com/ai-development/ai-moderation-trust-safety-platforms)
  or the platform's built-in or partner moderation where it exists (verify
  what your platform actually offers before relying on it)

**7. Deletion / expiration**
- Soft-delete assets first (tombstone in DB; mark on platform as deleted)
- Platform-side: set retention rules; assets older than X / unused for Y
  auto-archive or delete (per workspace policy)
- DSAR / GDPR: when a user requests deletion, hard-delete their videos
  including transcripts and captions

**8. Audit trail**
- Log every playback request: who watched, when, from where, how long
- Log every upload, edit, delete
- See [Audit Logs](audit-logs-chat.md) for the broader pattern

Build me:
- The signed-URL function with tenant + user check
- The upload abuse rate limiter
- The deletion path that wipes asset + transcripts + thumbnails
- The audit log integration
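
One possible shape for the per-user upload rate limit is a sliding-window counter. The sketch below keeps state in process memory purely for illustration; a real deployment with more than one server would need shared storage (a database or cache), and the class name and defaults are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class UploadRateLimiter:
    """Allow at most max_uploads per user within a sliding time window."""

    def __init__(self, max_uploads: int, window_seconds: float = 86_400):
        self.max_uploads = max_uploads
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        current = time.time() if now is None else now
        events = self._events[user_id]
        # Drop upload timestamps that have aged out of the window.
        while events and events[0] <= current - self.window_seconds:
            events.popleft()
        if len(events) >= self.max_uploads:
            return False
        events.append(current)
        return True
```

Call `allow(user_id)` in the upload-init endpoint before creating an upload URL, so a rejected request never reaches the video platform.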

7. Analytics — Know What Users Watch

Without analytics, you can't tell whether your video product works.

Set up video analytics. The metrics that matter:

**Per-asset metrics**
- Views (unique + total)
- Average watch time (seconds + % of total length)
- Completion rate (% who watch to the end; define "end" as 95% of duration
  so viewers who leave during the outro or end credits still count)
- Drop-off curve: percentage of viewers still watching at each second
- Quality of experience: rebuffer rate, time-to-first-frame, error rate

**Per-viewer metrics**
- How many videos this user has watched
- Total minutes watched in the last 7/30 days
- Most-watched topics (if you tag content)
- Drop-off pattern (do they finish or bail?)

**Cohort metrics**
- Activation: % of new users who watch at least one video
- Engagement: 7-day return-rate of users who watch
- Retention: are video-watchers more likely to retain than non-watchers?

**Quality-of-experience (QoE) metrics — these are your video health KPIs**
- Rebuffer ratio: fraction of playback time spent rebuffering
- Startup time: ms from play-click to first frame
- Bitrate: average bitrate served (higher = better quality)
- Errors: playback errors per session

**Platforms**
- Mux Data: best-in-class video analytics; built into Mux platform; can
  also be used standalone with other players
- Cloudflare Stream Analytics: built-in; less detailed
- Self-instrumented: send playback events to Posthog / Amplitude / Segment
  with custom event names

**Dashboards**
- For the founder/PM: top videos by views + completion rate
- For ops: QoE health (rebuffer rate by region, error rate trends)
- For customer success: per-customer engagement (which customers watch a
  lot, which don't)

Build me:
- Mux Data integration (or equivalent)
- A dashboard query for top-performing assets by completion rate
- An alert when QoE degrades (rebuffer rate >2% sustained)
- Per-user / per-workspace usage reports
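
If you self-instrument rather than rely on a platform's analytics, the per-asset metrics above reduce to simple aggregations over viewing sessions. The sketch below assumes a hypothetical `ViewSession` record and defines rebuffer ratio as stalled time divided by playing plus stalled time; vendors may define it slightly differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewSession:
    max_position_s: float  # furthest point in the video the viewer reached
    playing_s: float       # seconds spent actually playing
    rebuffering_s: float   # seconds spent stalled mid-playback


def completion_rate(sessions: list[ViewSession], duration_s: float,
                    threshold: float = 0.95) -> float:
    """Share of sessions that reached at least `threshold` of the duration."""
    if not sessions:
        return 0.0
    done = sum(1 for s in sessions if s.max_position_s >= threshold * duration_s)
    return done / len(sessions)


def drop_off_curve(sessions: list[ViewSession], duration_s: float,
                   step_s: int = 1) -> list[float]:
    """Fraction of viewers who reached each time step (0, step_s, 2*step_s, ...)."""
    if not sessions:
        return []
    total = len(sessions)
    return [sum(1 for s in sessions if s.max_position_s >= t) / total
            for t in range(0, int(duration_s) + 1, step_s)]


def rebuffer_ratio(sessions: list[ViewSession]) -> float:
    """Stalled time as a fraction of total (playing + stalled) time."""
    stalled = sum(s.rebuffering_s for s in sessions)
    watched = sum(s.playing_s + s.rebuffering_s for s in sessions)
    return stalled / watched if watched else 0.0
```

An alert for the ">2% sustained" rule could compare `rebuffer_ratio` over consecutive time buckets and fire only when several in a row exceed 0.02.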

8. Cost Modeling — Don't Get Surprised

Video bills surprise teams. Model the costs before you ship.

Help me model my video infra costs at three growth stages.

**Cost components for a managed platform (Mux / Cloudflare Stream / Bunny
Stream)**
- Storage: per minute of video stored, per month
- Encoding: per minute of video uploaded (transcoded once)
- Delivery: per minute of video streamed (variable based on bitrate and
  region) — this is the dominant cost at scale
- Live streaming: per concurrent viewer + per minute (separate from VOD)
- Analytics: usually included; check per-event quotas

**Stage 1: Beta / early adopter (100 users, low usage)**
- 50 hours uploaded per month
- 200 hours watched per month
- Estimate cost: $20-100/month on most platforms

**Stage 2: Growth (1000 users)**
- 500 hours uploaded per month
- 5000 hours watched per month
- Estimate cost: $200-2000/month depending on platform + bitrate

**Stage 3: Scale (10,000 users)**
- 5000 hours uploaded per month
- 75,000 hours watched per month
- Estimate cost: $5000-30,000/month
- At this scale, comparison-shop annually; rough order Cloudflare Stream <
  Bunny Stream < Mux on raw cost; analytics + features differ

**Cost reduction levers (in order)**
1. Drop unnecessary renditions from the bitrate ladder (skip 4K, maybe skip 1440p)
2. Drop low-played renditions (if 240p never plays, remove it)
3. Set storage retention (auto-delete unused old uploads after N days/months)
4. Use efficient codecs (H.264 as the compatible baseline; HEVC / AV1 where
   supported, at the cost of narrower device compatibility)
5. Compare platforms; switching is non-trivial but not impossible
6. Consider DIY pipeline only at very high scale where engineering cost
   amortizes

**Plan-tier billing (your customers)**
- Storage per workspace: cap as part of plan; charge for overage
- Bandwidth per workspace: cap and overage
- Transcoding minutes: cap and overage
- Surface usage in [Quotas / Limits / Plan Enforcement](quotas-limits-plan-enforcement-chat.md)
- Dashboard for the customer to see their video usage

Build me:
- A cost-projection spreadsheet calculator with my expected volumes
- An alert when monthly platform cost exceeds budget
- A per-workspace usage tracker that I can surface in admin
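
The three cost components (encoding, storage, delivery) can be projected with a few lines of arithmetic. The sketch below is a starting point for the calculator; the per-minute prices in `PLACEHOLDER_PRICES` are made-up placeholders, not any vendor's rates, so substitute the numbers from your platform's current pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    encode_per_min: float       # $ per minute of video uploaded (one-time)
    store_per_min_month: float  # $ per minute of video stored, per month
    deliver_per_min: float      # $ per minute of video streamed


# Placeholder numbers for illustration only; NOT real vendor pricing.
PLACEHOLDER_PRICES = PriceSheet(encode_per_min=0.02,
                                store_per_min_month=0.004,
                                deliver_per_min=0.001)


def monthly_cost(prices: PriceSheet, uploaded_hours: float,
                 stored_hours: float, watched_hours: float) -> dict[str, float]:
    """Break a month's bill into encoding, storage, delivery, and total."""
    encoding = uploaded_hours * 60 * prices.encode_per_min
    storage = stored_hours * 60 * prices.store_per_min_month
    delivery = watched_hours * 60 * prices.deliver_per_min
    return {"encoding": encoding, "storage": storage, "delivery": delivery,
            "total": encoding + storage + delivery}


def over_budget(cost: dict[str, float], budget: float) -> bool:
    """True when the projected total exceeds the monthly budget."""
    return cost["total"] > budget
```

With the placeholder prices, the Stage 2 volumes (500 hours uploaded and stored, 5000 hours watched) come to 600 + 120 + 300 = $1,020 per month. Note that `stored_hours` is cumulative, so storage grows every month unless a retention policy deletes old uploads.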

9. What Done Looks Like

You have shipped a real video pipeline when:

  • A user can upload a 1GB video from their phone, see real-time progress, leave the page, come back and find it processing or ready, and play it back at adaptive bitrate.
  • Captions are auto-generated and editable; videos are ADA / EN-301-549 compliant.
  • Playback is fast (time to first frame <2s on typical connections) and stable (rebuffer ratio <2%).
  • Private videos require signed URLs; cross-tenant access is impossible (verified by tests).
  • Analytics dashboard shows views, watch time, completion rate, and QoE per asset and per cohort.
  • A user can request deletion; their videos + transcripts + thumbnails are removed within 24h.
  • Cost per workspace is tracked and surfaced; plan-tier limits enforced.
  • Mobile + desktop + accessible playback all tested with real users.
  • A new engineer can read this doc + your platform config and understand the upload-to-playback flow end-to-end.
  • Ops alerts fire when QoE degrades (regional rebuffer spikes, error rate jumps, processing-job lag).

Mistakes to Avoid

  • Building your own FFmpeg pipeline. Becomes a sub-company. Adopt a managed platform until scale truly forces DIY.
  • Proxying uploads through your server. Doubles bandwidth, doubles latency, useless. Use direct-to-platform uploads with signed URLs.
  • Storing originals and serving them as-is. Adaptive bitrate exists for a reason; you must transcode.
  • Skipping captions. Legal + accessibility risk; ~30% of users watch on mute.
  • Encoding 4K renditions you don't need. Cost without value.
  • Public playback URLs for private content. Sign every playback URL for private assets.
  • No tenant scoping. Cross-workspace video leakage; among the worst data leaks possible.
  • No QoE monitoring. "It plays fine on my laptop" is not enough. Watch rebuffer ratio + startup time + error rate per region.
  • Adding DRM you don't need. Adds cost, complexity, support burden. Skip unless premium content requires it.
  • No deletion path. GDPR / DSAR compliance failure waiting to happen.
  • Ignoring mobile. Especially iOS Safari; test there; respect playsinline.
  • No upload-side validation. Users will upload .exe files. Validate file size + format before the upload starts.
  • Webhook signature not verified. Fake webhook calls can mark assets as ready (or as processing forever).
  • No retention policy. Old uploads accumulate; storage cost grows; users don't notice; finance does.
  • Obsessively surfacing cost per minute to the customer. Some customers will then optimize for low cost (deleting everything, never trying features). Build cost tiers + usage limits; don't shame users with line-item invoices.
  • Skipping accessibility. Captions, screen-reader labels, keyboard nav, audio descriptions where appropriate. Required by law in many jurisdictions, common-sense everywhere.
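
The unverified-webhook mistake in the list above is cheap to prevent. The sketch below shows a common pattern: an HMAC-SHA256 signature over a timestamp plus the raw request body, with a replay window. The exact header names, signed-string format, and encoding differ per platform, so everything here is an assumption to be replaced with your vendor's documented scheme.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_webhook(secret: bytes, raw_body: bytes, timestamp: str,
                   signature_hex: str, tolerance_s: int = 300,
                   now: Optional[float] = None) -> bool:
    """Accept the webhook only if the signature matches and the timestamp is fresh."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    # Assumed signed-string format: "<timestamp>.<raw body>".
    signed = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

Verify against the raw request bytes, before any JSON parsing; re-serializing the body can change whitespace or key order and break the signature.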

See Also