Markdown Rendering & HTML Sanitization


If you're displaying user-authored content in your B2B SaaS — comments, notes, descriptions, AI chat responses, knowledge-base articles, customer support replies — you need to render markdown / rich text safely. The naive approach: dangerouslySetInnerHTML={{ __html: userContent }} and pray. The structured approach: pick a markdown renderer (react-markdown / remark / marked), sanitize the output (DOMPurify / sanitize-html), allowlist specific HTML elements + attributes, prevent XSS, render code blocks with syntax highlighting, and handle edge cases (tables, embedded images, links). Get this wrong and you ship XSS vulnerabilities.

1. Decide rendering pipeline

Pick a markdown rendering pipeline.

React stack:

react-markdown (recommended default):
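- Renders markdown straight to React elements (no dangerouslySetInnerHTML)
- Plugin ecosystem (remark + rehype)
- Safe by default: raw HTML is not rendered; add rehype-sanitize if you enable raw HTML or want an explicit allowlist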
- Most-used in 2026 React projects

marked:
- Generic JS markdown parser → HTML string
- Pair with DOMPurify for sanitization
- Used in non-React contexts

markdown-it:
- Configurable parser
- Plugin ecosystem
- Pair with DOMPurify

mdx (markdown + JSX):
- For static content (docs, blogs)
- Build-time processing
- Don't use for user input (user can't write JSX safely)

remark / rehype (plumbing):
- Lower-level toolkit
- Plugins for syntax highlighting, math, footnotes, GFM tables, etc.
- Used by react-markdown and many docs frameworks

Pipeline:
- Source: markdown string
- Parse: remark (markdown → mdast)
- Transform: remark plugins (GFM, math, etc.)
- Convert: remark-rehype (mdast → hast)
- Transform HTML: rehype plugins (sanitize, syntax highlight, etc.)
- Output: React tree (react-markdown) OR HTML string (rehype-stringify)

Recommendation for 2026 React:
- User-generated content → react-markdown + remark-gfm + rehype-sanitize
- Static docs → MDX + Next.js / Astro
- Server-side rendering → unified pipeline server-side, hydrate client-side

Output:
1. Recommended pipeline for [USE CASE]
2. Plugin choices
3. Bundle-size estimate
4. Security posture (sanitize before render)
5. SSR compatibility

The 2026 default for B2B SaaS displaying user content: react-markdown + remark-gfm + rehype-sanitize. Three packages; ~50KB; covers 95% of needs.

2. Sanitize HTML — defense against XSS

The single most-important rule: never trust user-submitted HTML.

Sanitize HTML output.

Threat model:
- User submits markdown that converts to HTML
- HTML may contain <script>, <iframe>, on* attributes (onerror, onclick), javascript: URLs
- Without sanitization: XSS — attacker JS runs in victim's browser

Sanitization libraries:

DOMPurify (recommended):
- Mature; maintained; broadly used
- Works in browser + Node (with jsdom)
- Allowlist-based by default
- ~50KB

sanitize-html (Node):
- Server-side sanitization
- Deep configuration
- ~30KB

rehype-sanitize:
- Within the unified pipeline
- Schema-based allowlist
- Integrated with react-markdown

Server-vs-client:
- Best practice: sanitize on both sides (defense in depth)
- Server: sanitize before storing OR before sending
- Client: sanitize before rendering (catches anything that slipped through)

Allowlist approach:
- DEFAULT-DENY (safer)
- Explicitly allow: safe elements (<p>, <strong>, <em>, <ul>, <ol>, <li>, <h1-6>, <a>, <code>, <pre>, <blockquote>, <table>, <img> w/ restricted src)
- Explicitly allow: safe attributes (class, id, href, src, alt, title)
- Block: <script>, <iframe>, <object>, <embed>, <style>
- Block: on* event handlers (onclick, onerror, etc.)
- Block: javascript: URLs in href / src

URL filtering:
- Allow: http://, https://, mailto:, relative paths
- Block: javascript:, data: (most), vbscript:
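
The scheme rules above can be sketched as a small predicate. This is a minimal illustration, not a replacement for DOMPurify or rehype-sanitize: `isSafeUrl` and the placeholder base URL are names invented here, and the allowlist simply mirrors the list above. Parsing with the standard `URL` class matters because it normalizes case and strips the embedded tabs and newlines that naive string checks miss.

```typescript
// Schemes taken from the allowlist above; anything else is rejected (default-deny).
const ALLOWED_PROTOCOLS = new Set(["http:", "https:", "mailto:"]);

// Hypothetical base; it exists only so relative paths can be parsed.
const PLACEHOLDER_BASE = "https://placeholder.invalid/";

// Returns true when the URL is relative or uses an allowlisted scheme.
function isSafeUrl(raw: string): boolean {
  try {
    // The URL parser lowercases the scheme and removes tabs/newlines,
    // so "JaVaScRiPt:" and "java\tscript:" both resolve to "javascript:".
    const url = new URL(raw, PLACEHOLDER_BASE);
    return ALLOWED_PROTOCOLS.has(url.protocol);
  } catch {
    return false; // unparseable input is treated as unsafe
  }
}
```

A check like this fits in a custom link or image component as a last line of defense; the sanitizer's own URL filtering should still do the primary work.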

Edge cases:
- SVG: allow but sanitize (SVG can contain script)
- Embedded images (data: URLs): block by default; allow if intentional
- HTML in markdown: react-markdown does not render raw HTML unless you add rehype-raw; if you add it, run rehype-sanitize after it

Output:
1. Sanitization library + config
2. Allowlist (elements + attributes + URLs)
3. Server + client sanitization strategy
4. Test cases (XSS payloads to verify)
5. Audit cadence

The defense-in-depth rule: sanitize server-side before storing, sanitize client-side before rendering. If one layer fails, the other catches it.

3. Allowlist tables, code blocks, embeds — common requests

Users want more than basic markdown. Plan extensions.

Configure markdown extensions.

GitHub Flavored Markdown (GFM) extras:
- Tables (| col | col |)
- Strikethrough (~~text~~)
- Task lists (- [ ] item)
- Autolinks (http://...)
- Footnotes (included in recent remark-gfm versions)

Plugin: remark-gfm (one line; covers all)

Code blocks with syntax highlighting:

Options:
- Prism.js (popular; ~6KB + per-language)
- highlight.js (popular; ~30KB minified)
- Shiki (uses the same TextMate grammars and themes as VS Code; high quality; larger bundle but server-renderable)

Recommendation:
- Server-side highlight at build time (Shiki) → no client-side cost
- Or client-side with Prism (smaller bundle)
- Lazy-load language packs (don't bundle all)

Math expressions:
- KaTeX (fast; client-side render)
- MathJax (more features; larger)
- Plugin: remark-math + rehype-katex

Diagrams:
- Mermaid (flowcharts, sequence, gantt)
- Plugin: remark-mermaid

Embed support:
- YouTube / Vimeo / Twitter / Loom embeds
- Server-side: parse URLs; render as <iframe> with allowlist + sandbox
- Beware: iframes are XSS surface; only embed allowlisted domains
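
One way to implement the embed allowlist is to never pass the user's URL to the iframe at all, and instead rebuild the src from validated pieces. The sketch below assumes two providers; `EMBED_BUILDERS` and `toEmbedSrc` are hypothetical names, and the host list is illustrative rather than complete (youtu.be short links, for example, are not handled).

```typescript
// Illustrative allowlist: each trusted host maps to a function that builds the
// embed URL from validated parts, so user text never reaches the iframe src directly.
const EMBED_BUILDERS = new Map<string, (url: URL) => string | null>([
  ["www.youtube.com", (url) => {
    const id = url.searchParams.get("v");
    // YouTube video ids are 11 characters of letters, digits, "_" or "-".
    return id !== null && /^[\w-]{11}$/.test(id)
      ? `https://www.youtube-nocookie.com/embed/${id}`
      : null;
  }],
  ["vimeo.com", (url) => {
    const match = /^\/(\d+)$/.exec(url.pathname);
    return match ? `https://player.vimeo.com/video/${match[1]}` : null;
  }],
]);

// Returns a safe iframe src for allowlisted hosts, or null for everything else.
function toEmbedSrc(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not an absolute URL
  }
  if (url.protocol !== "https:") return null;
  const build = EMBED_BUILDERS.get(url.hostname);
  return build ? build(url) : null;
}
```

When the result is non-null, render the iframe with a sandbox attribute that grants only what the provider needs; when it is null, fall back to a plain link.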

Custom components (when MDX or react-markdown):
- Override h1, h2, code, etc. with custom React components
- Insert <Callout>, <Tabs>, <CodeBlock> with own logic

Anti-patterns:
- Allow arbitrary HTML in markdown (defeats sanitization)
- Allow user-supplied iframe src (open redirect / XSS)
- Bundle all syntax-highlight languages (massive)

Output:
1. Extensions to enable for [USE CASE]
2. Plugin chain (remark-gfm + remark-math + remark-mermaid as needed)
3. Syntax-highlight choice + bundle strategy
4. Embed allowlist (which domains are OK)
5. Custom components for branded touch

The server-side syntax highlighting trend: Shiki at build time → output styled HTML → no client-side JS for highlighting. Big bundle savings, at the cost of longer build time.

4. Render markdown safely with react-markdown

Implement safe markdown rendering with react-markdown.

Install:
- npm install react-markdown remark-gfm rehype-sanitize rehype-highlight

Basic usage:

import ReactMarkdown from 'react-markdown';
import remarkGfm from 'remark-gfm';
import rehypeSanitize from 'rehype-sanitize';
import rehypeHighlight from 'rehype-highlight';

<ReactMarkdown
  remarkPlugins={[remarkGfm]}
  rehypePlugins={[
    rehypeSanitize,  // SECURITY: sanitize first
    rehypeHighlight, // syntax highlight after
  ]}
>
  {userContent}
</ReactMarkdown>

Custom components (override defaults):

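// CodeBlock and Image are your own components; they are not part of react-markdown.
<ReactMarkdown
  components={{
    // Spread props first so target and rel cannot be overridden.
    a: ({ node, href, children, ...props }) => (
      <a
        {...props}
        href={href}
        target="_blank"
        rel="noopener noreferrer"
      >
        {children}
      </a>
    ),
    // react-markdown v9 removed the `inline` prop. Fenced blocks with a language
    // carry a language-* class, so use that to tell block code from inline code.
    code: ({ node, className, children, ...props }) => (
      /language-/.test(className || '')
        ? <CodeBlock className={className}>{children}</CodeBlock>
        : <code {...props}>{children}</code>
    ),
    img: ({ src, alt }) => <Image src={src} alt={alt || ''} loading="lazy" />,
  }}
>
  {userContent}
</ReactMarkdown>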

Custom sanitize schema (rehype-sanitize):

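// rehype-sanitize re-exports defaultSchema, so no extra dependency is needed.
import rehypeSanitize, { defaultSchema } from 'rehype-sanitize';

const schema = {
  ...defaultSchema,
  attributes: {
    ...defaultSchema.attributes,
    // The default schema only allows language-* classes on <code>.
    // This widens it to any class; keep it narrower if you can.
    code: [...(defaultSchema.attributes?.code || []), 'className'], // for syntax highlighting
  },
};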

<ReactMarkdown rehypePlugins={[[rehypeSanitize, schema]]}>
  {userContent}
</ReactMarkdown>

Output:
1. Install commands
2. Basic safe-render component
3. Custom components for links / code / images
4. Sanitize schema customization
5. Test XSS payloads

The link-target gotcha: by default, rendered <a> tags don't have target="_blank" + rel="noopener noreferrer". User-content links should open in a new tab and be opener-isolated for security.

5. Performance — render large markdown efficiently

Large markdown documents (knowledge base articles, docs, AI chat outputs) can slow render.

Optimize markdown rendering performance.

Bundle size:
- react-markdown: ~25KB
- remark-gfm: ~10KB
- rehype-sanitize: ~8KB
- rehype-highlight: ~30KB + language packs
- Total: ~75KB+ for full pipeline

Lazy load:
- Don't bundle all languages for syntax highlighting (highlight.js core ~30KB; with all langs ~600KB)
- Load specific languages on demand
- Or use Shiki server-side at build time

Rendering large docs:
- 10K+ word docs slow react-markdown
- Solutions:
  - Virtualize sections (react-virtuoso)
  - Server-render and hydrate (Next.js / RSC)
  - Cache rendered HTML (key by content hash)

AI chat streaming:
- LLM tokens stream in
- Partial markdown rendering
- Re-render on each chunk (or debounce)
- Performance: memo on stable parts; only re-render last paragraph
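
The "only re-render the last paragraph" idea can be sketched by splitting the stream buffer into finished blocks and a still-growing tail. `splitStableBlocks` is a hypothetical helper, and the block detection is deliberately simple: it splits on blank lines and treats any line starting with a code fence as a toggle, which will mis-split loose lists and nested fences.

```typescript
interface StreamSplit {
  stable: string[]; // finished blocks; safe to render once and memoize
  tail: string;     // the block still receiving tokens; re-render on each chunk
}

// Splits a streaming buffer on blank lines, ignoring blank lines inside code fences.
function splitStableBlocks(buffer: string): StreamSplit {
  const stable: string[] = [];
  let current: string[] = [];
  let inFence = false;

  for (const line of buffer.split("\n")) {
    if (/^\s*(```|~~~)/.test(line)) inFence = !inFence;
    if (line.trim() === "" && !inFence) {
      // A blank line outside a fence ends the current block.
      if (current.length > 0) {
        stable.push(current.join("\n"));
        current = [];
      }
    } else {
      current.push(line);
    }
  }
  return { stable, tail: current.join("\n") };
}
```

Each entry in `stable` can then be rendered by a memoized component keyed by its index, so new tokens only cause the tail to be re-parsed.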

Memoization:
- React.memo on ReactMarkdown wrapper
- useMemo on plugin array (otherwise a new array is passed on every render and memoization is defeated)
- Cache key: content + plugin config

For [USE CASE]:
1. Bundle budget
2. Lazy-load strategy
3. SSR vs CSR
4. Memoization plan
5. Streaming-render strategy (for AI chat)

The plugin-instance memoization gotcha: <ReactMarkdown remarkPlugins={[remarkGfm]}> creates a new array every render, so props never compare equal: React.memo is bypassed and the markdown is re-processed each time. Memoize the array or define it outside the component.

6. Streaming markdown — AI chat use case

LLM responses arrive as streaming tokens. Render progressively without flickering.

Render streaming markdown.

Pattern:
- LLM streams tokens (SSE / WebSocket)
- Buffer accumulates: "The answer..." then "The answer is..." then "The answer is 42."
- Re-render markdown on each chunk

Challenges:
- Incomplete markdown ("**bold without close" mid-stream)
- Code blocks open at start, close later
- Lists / tables span multiple lines

Solutions:

Auto-close incomplete syntax (recommended):
- Detect open ** or open ``` mid-stream
- Append closing tokens before parsing
- Library: streaming-markdown or DIY
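
A DIY version of auto-closing might look like the sketch below. It covers only the two cases named above (an open code fence and an open `**`), and `closeOpenMarkdown` is a name invented for this example. It counts bold markers on the last line only, which is a simplification: emphasis that wraps across lines in one paragraph is not handled.

```typescript
// Appends closing tokens to a partial markdown string so it parses cleanly mid-stream.
function closeOpenMarkdown(partial: string): string {
  const lines = partial.split("\n");

  // An odd number of fence lines means a code block is still open.
  let inFence = false;
  for (const line of lines) {
    if (/^\s*```/.test(line)) inFence = !inFence;
  }
  if (inFence) {
    return partial.endsWith("\n") ? partial + "```" : partial + "\n```";
  }

  // Outside fences, balance ** on the last line, ignoring inline code spans.
  const lastLine = lines[lines.length - 1];
  const withoutInlineCode = lastLine.replace(/`[^`]*`/g, "");
  const boldMarkers = (withoutInlineCode.match(/\*\*/g) || []).length;
  return boldMarkers % 2 === 1 ? partial + "**" : partial;
}
```

Apply this to a copy of the buffer just before parsing and keep the raw buffer untouched, so the appended tokens never accumulate between chunks.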

Buffer until stable boundaries:
- Wait for newline before parsing previous line
- Adds latency; reduces flicker

Render-on-debounce:
- Debounce 50-100ms
- Reduce re-render churn

Visual cues:
- Cursor at end of stream (▋)
- Disabled state during stream
- "Thinking..." placeholder

Anti-patterns:
- Flash of unparsed markdown (raw asterisks visible)
- Layout jump as content streams in
- Re-rendering entire doc on each token (slow)

For AI chat:
- Use streaming-markdown library
- Memoize message components (only re-render the actively-streaming message)
- Show cursor during stream
- Smooth scroll to bottom

Output:
1. Streaming-markdown library or DIY
2. Auto-close strategy
3. Debounce config
4. Visual cursor / state
5. Test cases (interrupted streams, network errors)

The 2026 standard for AI chat: streaming-markdown library + memoized message list + cursor indicator. The same pattern (not necessarily the same library) is visible in the ChatGPT, Claude, and Perplexity UIs.

7. Edit + preview — split or toggle

For markdown input (not just rendering), users want to see preview.

Implement markdown edit + preview.

Patterns:

Pattern A: Side-by-side (split)
- Edit left; preview right
- Real-time update
- Used by: HackMD, StackEdit, VS Code's markdown preview
- Best for: power users; wide screens

Pattern B: Tab toggle
- "Write" / "Preview" tabs
- One at a time
- Used by: GitHub and GitLab comment boxes, many mobile-first apps
- Best for: narrow screens; non-technical users

Pattern C: Inline rendering (WYSIWYG-feel)
- As-you-type styling without separate preview
- Used by: Notion, Linear; buildable with TipTap / Lexical
- Best for: end-users; highest polish
- Note: this is rich-text editor territory, not pure markdown

Implementation:

Pattern A (split):
- Two columns; sync scroll position
- Debounced render (50-100ms)
- Memoize render component

Pattern B (tab):
- Tab state in component
- Cache rendered HTML to avoid re-render on tab switch

Pattern C (inline):
- Use TipTap or Lexical (rich-text editors)
- See rich-text-editor-implementation-chat for details

For [USE CASE]:
- Power user / docs-heavy → split
- General user / mobile → tab toggle
- Non-technical end user → inline (rich-text editor)

Output:
1. Pattern recommendation
2. Component implementation
3. Performance considerations
4. Mobile fallback
5. Keyboard shortcuts (Cmd+Enter to submit; Tab for indent)

The "split panes look professional" trap: split panes are great for engineers. For non-technical users, side-by-side is intimidating. Tab toggle or inline rendering wins for mass adoption.

8. Markdown for AI chat — safe LLM output rendering

LLMs sometimes output unsafe markdown. Sanitize.

Render LLM markdown output safely.

Threats:
- LLM hallucinates malicious links (rare but possible)
- LLM outputs HTML that bypasses markdown
- User prompts LLM to output XSS payload as test

Defenses:
- Same sanitization as user content (DOMPurify / rehype-sanitize)
- Allowlist: standard markdown elements; no script / iframe
- Treat LLM output as untrusted input
- Block javascript: URLs

Trust levels:
- High-trust internal LLM (your own model): can be slightly looser
- Public-LLM-via-API (OpenAI, Anthropic): treat as untrusted
- User-provided LLM output: definitely untrusted

LLM-specific extensions:
- Tool calls / structured output: render as cards (not markdown)
- Citations / sources: render as links with allowlist
- Images: only allow if from your generation pipeline (DALL-E / Midjourney URLs allowed)

Performance:
- Render-on-stream (see above)
- Cache rendered output (by message id)

Output:
1. Sanitization same as user content
2. LLM-specific allowlist (citations, images)
3. Structured output rendering (tool calls)
4. Streaming integration
5. Test cases (prompt injection → markdown output)

The prompt-injection-via-markdown attack: user prompts LLM to "output the following HTML." LLM dutifully outputs <script>alert(1)</script>. Sanitize.

9. Storage format — markdown vs HTML vs AST

Where you store affects what flexibility you have.

Decide storage format for markdown content.

Option 1: Store markdown source
- Pros: human-readable; small; portable; can re-render with different config
- Cons: render at every read (CPU); inconsistency if config changes
- Best for: most B2B SaaS

Option 2: Store rendered HTML
- Pros: faster reads (no parse)
- Cons: stale if rules change; larger storage; harder to edit
- Best for: archival; static publishing

Option 3: Store both (markdown + cached HTML)
- Pros: fast reads + flexible re-render
- Cons: more storage; need invalidation
- Best for: high-traffic content
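
Option 3's invalidation problem can be handled by recording which pipeline version produced the cached HTML. The sketch below is one possible shape, not a prescribed schema: `StoredDoc`, `readHtml`, `updateMarkdown`, and the version string are all hypothetical, and `render` stands in for your sanitized rendering pipeline.

```typescript
interface StoredDoc {
  markdown: string;          // source of truth
  cachedHtml: string | null; // sanitized output of the last render
  cachedWith: string | null; // pipeline version that produced cachedHtml
}

// Hypothetical version tag; bump it whenever plugins or the sanitize schema change.
const PIPELINE_VERSION = "gfm-sanitize-v1";

// Returns HTML for a doc, re-rendering only when the cache is missing or stale.
function readHtml(
  doc: StoredDoc,
  render: (markdown: string) => string,
  version: string = PIPELINE_VERSION,
): { html: string; doc: StoredDoc } {
  if (doc.cachedHtml !== null && doc.cachedWith === version) {
    return { html: doc.cachedHtml, doc };
  }
  const html = render(doc.markdown);
  // The caller persists the returned doc so the next read is a cache hit.
  return { html, doc: { ...doc, cachedHtml: html, cachedWith: version } };
}

// Editing the source always drops the cached HTML.
function updateMarkdown(doc: StoredDoc, markdown: string): StoredDoc {
  return { ...doc, markdown, cachedHtml: null, cachedWith: null };
}
```

With this shape, a config change needs no bulk migration: stale rows are re-rendered lazily the next time they are read.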

Option 4: Store AST (mdast / hast JSON)
- Pros: programmatic transformation; easy traversal
- Cons: complex; not portable
- Best for: editors building custom transformations

Recommendation:
- B2B SaaS user content: store markdown source; render on read with cache
- High-traffic blog / docs: store HTML; rebuild on config change
- Rich-text editors (TipTap / Lexical): store editor JSON (AST-like)

Cache strategies:
- Redis / KV cache: key by content hash
- TanStack Query: client-side cache
- Server-render once + hydrate

Output:
1. Recommendation for [USE CASE]
2. Schema (column types, sizes)
3. Caching strategy
4. Invalidation rules
5. Migration path if format changes

The simplest pattern: store markdown; render on read; cache for hot paths. Don't over-engineer.

10. Test XSS — verify the pipeline

Test markdown rendering for XSS.

Test payloads (must NOT execute):

Basic:
- <script>alert(1)</script>
- <img src=x onerror=alert(1)>
- <svg onload=alert(1)>
- <iframe src=javascript:alert(1)>

Markdown-encoded:
- [click me](javascript:alert(1))
- ![image](javascript:alert(1))
- [link](data:text/html,<script>alert(1)</script>)

HTML in markdown:
- <details><summary>open</summary><script>alert(1)</script></details>
- <a href="javascript:alert(1)">click</a>

Mutation XSS:
- <noscript><p title="</noscript><img src=x onerror=alert(1)>">
- Polyglot payloads (XSS that survives multiple parsers)

Test approach:
- Unit tests: feed payload → assert sanitized output (no alert)
- Storybook: visual regression
- Manual: paste in dev environment; check console
- Bug bounty: incentivize external testing
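
For the unit-test layer, a coarse tripwire over the sanitized HTML string can catch regressions cheaply. The sketch below is an assumption-laden helper, not a sanitizer: `findUnsafeMarkup` and its pattern list are invented here, regexes over HTML will miss mutation-XSS cases, and a DOM-based assertion in a real browser is more reliable. It is meant to run on the output of your pipeline for each payload in the list above.

```typescript
// Labels and patterns for markup that should never survive sanitization.
const UNSAFE_PATTERNS: Array<[string, RegExp]> = [
  ["script element", /<\s*script\b/i],
  ["iframe element", /<\s*iframe\b/i],
  ["inline event handler", /<[^>]*\son\w+\s*=/i],
  ["javascript: URL", /(?:href|src)\s*=\s*["']?\s*javascript:/i],
];

// Returns the labels of every unsafe pattern found in the given HTML string.
function findUnsafeMarkup(html: string): string[] {
  return UNSAFE_PATTERNS
    .filter(([, pattern]) => pattern.test(html))
    .map(([label]) => label);
}
```

In a test, render each payload through the real pipeline and assert that `findUnsafeMarkup(output)` is empty; a non-empty result names which rule failed.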

Tools:
- DOMPurify test suite (mature; battle-tested)
- OWASP XSS cheat sheet (payload library)
- jest + @testing-library

Audit:
- Manual review every config change
- Automated tests in CI
- Quarterly security review

Output:
1. Test payload library (10-30 cases minimum)
2. Unit tests
3. Visual regression
4. Pen-test process (annually)
5. Bug bounty / report channel

The only-test-the-happy-path failure: tests that pass simple markdown but never test XSS. Always include malicious-input tests.

What Done Looks Like

A v1 markdown rendering system for B2B SaaS in 2026:

Add later when product is mature:

The mistake to avoid: using dangerouslySetInnerHTML without sanitization. Direct path to XSS.

The second mistake: trusting LLM output. Sanitize same as user input.

The third mistake: bundling all syntax-highlight languages. Lazy-load or server-render with Shiki.