
Run A/B Tests That Move Conversion and Retention


Product Experimentation for Your New SaaS

Goal: Stand up a real experimentation system — not vanity tests on button colors. Instrument experiments that change pricing-page conversion, onboarding completion, activation rate, and retention. Get to a cadence where you ship 2–4 well-powered tests per month and 25–35% of them move a real metric in the right direction.

Process: Follow this chat pattern with your AI coding tool, such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.

Timeframe: Test framework wired up in 1 day. First experiment running by end of week 1. First confident "ship it" decision by end of week 4. Sustained 2–4 tests per month from month 2 onward.


Why Most Founder A/B Tests Are Useless

Three failure modes hit early-stage teams in the same order every time:

  • Underpowered tests on micro-traffic. The founder runs a button-color test on 200 visitors, sees a 12% lift, ships it, and never sees the lift again. With 200 visitors split across two variants and a 5% baseline conversion, the minimum detectable effect is well over 100% relative lift; the variant would have to more than double conversion to register. Anything smaller is statistical noise dressed up as a result.
  • Testing things that don't matter. Hero copy, button color, social-proof badge order — these rarely move primary metrics by more than a percentage point. The tests that move SaaS revenue are pricing structure, paywall placement, onboarding sequence length, and activation milestone definition. Founders avoid those because they feel scary; they're scary because they actually matter.
  • No pre-registered hypothesis. Founders ship a variant, peek at the dashboard daily, declare victory the first time the line crosses, and then act surprised when the lift evaporates. Without a fixed sample size, a fixed metric, and a written "I will ship if X" rule, every test becomes a dressed-up coin flip.

The version that works is structured: pick experiments that target known-broken funnel steps, calculate sample size before launching, write a one-paragraph hypothesis with a ship rule, run the test to its planned end, and document every loss as carefully as every win.

This guide assumes you have already done PostHog Setup (you cannot run experiments without product analytics), have Feature Flags wired up (the test mechanism), and have completed Activation Funnel Diagnosis (so you know which step in the funnel is actually broken — that's where to test).


1. Pick the Right Test

Before writing any code, decide what to test. Most founder energy is wasted here.

I'm building [your product] at [your-domain.com]. The product does [one-sentence description]. My current funnel and conversion rates by step are:

- Visitor → trial signup: [X]%
- Trial signup → activation event: [X]%
- Activation → paid conversion: [X]%
- Paid → retained at 90 days: [X]%

Help me pick the right A/B test to run first. I want to test the step where:

1. The conversion rate is far below benchmark (use SaaS benchmarks from sources like OpenView, ChartMogul, or Lenny's Newsletter)
2. The traffic volume is large enough to power a test in 2-3 weeks
3. The change required is feasible to implement in 1-3 days
4. The downside if the test loses is small enough that I'm willing to ship it both ways

For each candidate test, output:
- Funnel step targeted
- Hypothesis in plain English ("If we change X, then Y will improve because Z")
- Variant A (control) and Variant B (treatment) descriptions
- Primary metric (one — must be downstream of the change)
- Guardrail metrics (2-3 metrics that must not get worse)
- Estimated traffic per variant per week
- Minimum detectable effect (MDE) at 80% power, 95% confidence

Reject any test where the MDE is larger than what's plausible. If my pricing-page traffic is 800 visits/month and baseline conversion is 4%, detecting even a 30% relative lift would take roughly a year of traffic, which is almost never workable. In that case, recommend a larger redesign with a before/after (pre-post) analysis instead, and tell me explicitly that we are not running an A/B test.

A few rules I've watched founders re-learn the hard way:

  • Test the steps with the largest absolute drop-off, not the most clicks. A 60% → 40% drop in one step is worth 10x more attention than a button on a page everybody already converts on.
  • Pre-paid changes have higher leverage than post-paid. Onboarding, pricing-page layout, paywall placement, and trial length all compound. Color tweaks on a settings page do not.
  • If your traffic can't power the test in 4 weeks, do not run an A/B test. Run a redesign with pre-post analysis and acknowledge the limit. Pretending a 200-sample test is rigorous is worse than not testing at all.
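To make that last rule concrete, here is a minimal TypeScript sketch of the feasibility check. It uses the common n ≈ 16·p(1−p)/δ² rule of thumb for 80% power at a two-sided 95% confidence level; it is an approximation, not a replacement for the full calculation in step 2.

```ts
// Quick feasibility check: the smallest relative lift a test could detect
// given the traffic you can send to each variant over the test window.
// Uses the n ≈ 16·p(1−p)/δ² rule of thumb (80% power, two-sided 95%).
function minimumDetectableRelativeLift(
  baselineRate: number,       // e.g. 0.04 for a 4% pricing-page conversion
  visitorsPerVariant: number, // visitors each variant will see in ~4 weeks
): number {
  const variance = baselineRate * (1 - baselineRate);
  const absoluteMde = Math.sqrt((16 * variance) / visitorsPerVariant);
  return absoluteMde / baselineRate; // relative lift, e.g. 0.8 means +80%
}

// Example: 800 pricing-page visits/month, 4% baseline, 50/50 split, 4 weeks
const mde = minimumDetectableRelativeLift(0.04, 400);
console.log(`MDE ≈ ${(mde * 100).toFixed(0)}% relative lift`); // ≈ 98%, so not an A/B test
```

If the number it prints is larger than any lift you could plausibly cause, skip the A/B test and run the pre-post redesign instead.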

2. Calculate Sample Size Before Touching Code

The single biggest determinant of whether your experiment is meaningful is whether you ran it long enough. Calculate before you build.

For my experiment [restate the test from step 1], calculate the required sample size and runtime.

Inputs:
- Baseline conversion rate on the primary metric: [X]%
- Minimum detectable effect I care about: [X]% relative lift (e.g., 10% relative — meaning 4% baseline becomes 4.4%)
- Statistical power: 80% (industry standard — accept 20% chance of missing a real effect)
- Significance level: 95% (5% false-positive rate)
- Two-sided test (we care if it goes up or down)
- Two variants (control and treatment, 50/50 split)

Output:
1. Required sample size per variant
2. Total sample size across both variants
3. Expected weekly traffic to the test surface (I'll provide this if needed)
4. Estimated days/weeks to reach the sample size
5. The pre-registered ship rule in plain text: "If treatment beats control by at least X with p < 0.05 at sample size N, we ship treatment. Otherwise we keep control."

Then sanity-check: if the runtime is over 6 weeks, recommend either:
(a) increasing the MDE I'm willing to accept (smaller effects are not worth detecting at the cost of multi-month tests), or
(b) splitting traffic 70/30 toward treatment if I have prior reason to believe treatment is better, or
(c) abandoning the A/B test in favor of a sequential redesign with pre-post analysis.

Use a standard frequentist sample-size calculation. If I want to use a Bayesian framework instead, output the Beta-Binomial prior assumptions and the stopping rule (typically: stop when probability(treatment > control) > 95% or < 5%, with a minimum sample of [N] to avoid early-stop bias).

Key numbers to internalize for SaaS experiments:

  • 5% baseline → 10% relative lift requires ~30,000 samples per variant (frequentist, 80/95)
  • 20% baseline → 10% relative lift requires ~7,000 samples per variant
  • 50% baseline → 10% relative lift requires ~2,000 samples per variant

Pricing-page conversion is usually low-baseline and high-traffic-needed. Onboarding-step completion is usually higher-baseline and easier to test. This is why early-stage founders should run more onboarding experiments than pricing experiments — the math works.
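The same math, run the usual way around, turns a baseline and a target lift into a required sample size and runtime. A sketch using the standard two-proportion formula with z = 1.96 (two-sided 5% significance) and z = 0.84 (80% power); any online calculator or AI tool should land within a few percent of these numbers.

```ts
// Frequentist sample size for a two-variant, 50/50 conversion test.
// zAlpha = 1.96 (two-sided α = 0.05), zBeta = 0.84 (80% power).
function sampleSizePerVariant(baselineRate: number, relativeLift: number): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const zAlpha = 1.96;
  const zBeta = 0.84;
  const pBar = (p1 + p2) / 2;
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(numerator ** 2 / (p2 - p1) ** 2);
}

// Runtime in weeks, assuming a 50/50 split of the traffic hitting the test surface.
function weeksToRun(perVariant: number, weeklyTrafficToSurface: number): number {
  return Math.ceil((2 * perVariant) / weeklyTrafficToSurface);
}

// Roughly reproduces the key numbers above:
sampleSizePerVariant(0.05, 0.10); // ≈ 31,000 per variant
sampleSizePerVariant(0.20, 0.10); // ≈ 6,500 per variant
sampleSizePerVariant(0.50, 0.10); // ≈ 1,600 per variant
weeksToRun(31000, 2500);          // ≈ 25 weeks at 2,500 weekly visitors: too long, raise the MDE
```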


3. Wire Up the Experiment in Code

Now build it. Use feature flags, not URL params or branches.

Help me implement the [test name] A/B test in my [Next.js / SvelteKit / Remix / your framework] app using PostHog as both the assignment service and the analytics service.

Requirements:
1. Use a PostHog feature flag with multivariate targeting (control / treatment, 50/50 split, sticky to user_id or distinct_id so the same user always sees the same variant).
2. Assignment must happen server-side on the first request (so we don't ship the wrong variant to a logged-in user, and so we don't have client-side flicker).
3. Bucket assignment must be persisted to the user record in our database the first time they hit the test surface — so we never re-bucket a user mid-experiment, and so we can join exposure to outcomes in our warehouse.
4. Fire a posthog.capture('experiment_viewed', { experiment: '[test_name]', variant: 'control'|'treatment' }) the first time the user sees the test surface — this is our exposure event.
5. Fire posthog.capture('experiment_converted', { experiment: '[test_name]', variant: 'control'|'treatment', metric: 'primary' | 'guardrail_1' | ... }) when the primary or guardrail metric event fires.
6. Implement an experimentExperience() helper so the rest of the codebase calls one function: const variant = await experimentExperience({ experimentKey, userId }); and gets back 'control' or 'treatment'. The helper handles flag fetch, persistence, and exposure tracking.

Edge cases to handle:
- Logged-out users who later log in mid-experiment — bucket by anonymous_id at view time, then identify-merge to user_id on signup so we don't double-count
- Users who hit the test surface before the experiment starts — exclude from the analysis (their bucket is undefined)
- Users in the holdout / excluded segment (e.g., enterprise customers, internal team) — return 'control' deterministically
- Caching layers (CDN, ISR) — use the framework's cache-busting for the test page or render server-side only

Deliver:
- The PostHog feature flag config (key, payload, rollout percentages, targeting rules)
- The experimentExperience() helper code
- The page-level integration showing how to read the variant and render the right UI
- The dashboard query for exposure-to-conversion rate by variant in PostHog Insights

Three traps to flag explicitly:

  • Client-side flag fetches cause flicker — the user sees control for 200ms, then treatment swaps in. Render server-side or pre-fetch the flag at session start.
  • Don't use URL params for assignment. ?variant=b is fragile, breaks SEO, doesn't survive refreshes, and corrupts your analytics.
  • Sticky bucketing is non-negotiable. A user who sees control on visit 1 must see control on visit 2. Otherwise you're testing two different experiences against each other every session and your data is junk.
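For reference, a minimal sketch of what the experimentExperience() helper from requirement 6 might look like using posthog-node. The db.* calls and the exclusion check are placeholders for whatever persistence layer and segmentation you already have; adapt both to your own schema.

```ts
import { PostHog } from "posthog-node";
import { db } from "./db"; // placeholder: your own persistence layer

const posthog = new PostHog(process.env.POSTHOG_API_KEY!, {
  host: "https://us.i.posthog.com", // or your EU / self-hosted instance
});

type Variant = "control" | "treatment";

export async function experimentExperience(opts: {
  experimentKey: string;
  userId: string; // distinct_id: user_id when logged in, anonymous_id otherwise
}): Promise<Variant> {
  const { experimentKey, userId } = opts;

  // Excluded segments (internal team, enterprise, holdouts) always see control
  // and never fire exposure events, so they don't pollute the analysis.
  if (await db.isExcludedFromExperiments(userId)) return "control";

  // Sticky bucketing: if this user was already assigned, reuse the bucket.
  const existing = await db.getExperimentBucket(userId, experimentKey);
  if (existing) return existing as Variant;

  // First exposure: read the multivariate flag value from PostHog.
  const flagValue = await posthog.getFeatureFlag(experimentKey, userId);
  const variant: Variant = flagValue === "treatment" ? "treatment" : "control";

  // Persist the bucket so the user is never re-bucketed mid-experiment,
  // and so exposure can be joined to outcomes in the warehouse.
  await db.saveExperimentBucket(userId, experimentKey, variant);

  // Exposure event: fired once, at first view of the test surface.
  posthog.capture({
    distinctId: userId,
    event: "experiment_viewed",
    properties: { experiment: experimentKey, variant },
  });

  return variant;
}
```

Page-level usage is then a single server-side call, const variant = await experimentExperience({ experimentKey: 'pricing_page_test', userId }) (experiment key hypothetical), followed by rendering the matching UI, which also sidesteps the flicker trap above.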

4. Pre-Register the Analysis Plan

Write down what you'll do before you see results. This is the single highest-leverage habit in experimentation.

Generate a pre-registration document for the [test name] experiment. The document must be written and committed to our experiments/ folder BEFORE the experiment starts. Use this template:

# Experiment: [test_name]
- Owner: [your name]
- Start date: [planned date]
- Planned end date: [planned date — ideally exact, not "when significant"]
- Hypothesis: [one sentence: if we change X then Y will improve by Z because reason]
- Variants:
  - Control: [exact description of what control sees]
  - Treatment: [exact description of what treatment sees]
- Allocation: 50/50 by user_id, sticky
- Primary metric: [exact event and the conversion definition — e.g., "user fires `subscription_started` event within 14 days of `experiment_viewed`"]
- Guardrail metrics:
  - [metric 1, e.g., trial_signup rate must not drop more than 5% relative]
  - [metric 2, e.g., support_ticket_opened rate must not increase more than 20% relative]
- Sample size required per variant: [from step 2]
- Expected runtime: [from step 2]
- Ship rule: [exact wording — "If treatment beats control on primary metric with p<0.05 at planned sample size, AND no guardrail violation, ship treatment. Otherwise keep control."]
- Stopping rule: [exact wording — usually "do not peek before sample size is reached; if a guardrail metric breaches by >2x its threshold, stop the experiment immediately"]
- Segment cuts we will report: [overall, plus any pre-specified segments — e.g., self-serve vs sales-led, mobile vs desktop. Specify segments BEFORE seeing results so we don't HARK]

Output the document. Save to experiments/[YYYY-MM-DD]-[test_name].md.

Two principles that prevent self-deception:

  • The decision rule has to be written before the data is in. Otherwise you'll find a slice of the data where treatment looks good and ship that. Pre-registration eliminates "p-hacking by accident."
  • No peeking before the planned sample size. If you check daily and stop the first time p < 0.05, your effective false-positive rate is ~25%, not 5%. The math is well-documented; the discipline is rare.
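If the ~25% figure sounds surprising, a quick A/A simulation makes it concrete: both variants share the same true conversion rate, you peek daily with a naive z-test, and you count how often you would have shipped a phantom winner. The traffic and runtime numbers below are made up for illustration.

```ts
// A/A peeking simulation: both variants have the same true conversion rate,
// yet stopping the first time p < 0.05 produces far more than 5% "winners".
function simulatePeeking(
  trueRate = 0.05,
  usersPerVariantPerDay = 200,
  days = 28,
  experiments = 2000,
): number {
  let falsePositives = 0;
  for (let e = 0; e < experiments; e++) {
    let convA = 0, convB = 0, n = 0;
    let stoppedEarly = false;
    for (let d = 0; d < days; d++) {
      for (let i = 0; i < usersPerVariantPerDay; i++) {
        if (Math.random() < trueRate) convA++;
        if (Math.random() < trueRate) convB++;
      }
      n += usersPerVariantPerDay;
      // Two-proportion z-test on the cumulative counts so far
      const pA = convA / n, pB = convB / n;
      const pPool = (convA + convB) / (2 * n);
      const se = Math.sqrt((2 * pPool * (1 - pPool)) / n);
      if (se > 0 && Math.abs(pA - pB) / se > 1.96) {
        stoppedEarly = true; // would have shipped a "winner" that doesn't exist
        break;
      }
    }
    if (stoppedEarly) falsePositives++;
  }
  return falsePositives / experiments; // typically ~0.2–0.3, not 0.05
}

console.log(simulatePeeking());
```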

5. Run the Experiment

Launch, monitor guardrails, otherwise leave it alone.

Help me write the experiment runbook. The runbook must answer:

1. What's the daily monitoring cadence?
   - Day 0: confirm exposure events are firing for both variants in PostHog. If counts are skewed >55/45, abort and debug — this is usually a bucketing bug.
   - Daily: check guardrail metrics ONLY. Do not look at primary metric until planned sample size is reached.
   - At 25% of planned sample size: sanity-check that volume projections match plan. If runtime is going to overshoot 1.5x, decide whether to extend or accept a slightly underpowered test.

2. What signals abort the experiment early?
   - Guardrail metric drops >2x worst-case bound (e.g., signup rate halves)
   - Bug discovered in either variant
   - Bucket-balance check fails (>55/45 split with no plausible explanation)
   - Existential issue (data leak, broken payment flow, regulatory concern)

3. What does the analysis look like at planned end?
   - Primary metric: report rate, lift, p-value, confidence interval
   - Guardrail metrics: same
   - Pre-specified segments: same
   - Sample-ratio mismatch test: confirm 50/50 split was actually 50/50 (chi-squared)
   - Apply pre-registered ship rule. Do not invent new rules at this stage.

4. What's the documentation output?
   - One-page experiment readout: hypothesis, result, decision, learnings
   - Saved to experiments/[YYYY-MM-DD]-[test_name]-results.md alongside the pre-registration
   - Linked from the team's experiment log

Generate the runbook as a checklist I can copy-paste into the experiment ticket.

The most valuable habit during a live experiment is to do nothing. Founders who sit on their hands for 14 days run honest tests. Founders who tweak the variant copy mid-flight, or kill the test "because it looks like treatment is winning," produce noise.
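One piece of the runbook worth automating is the Day 0 bucket-balance check. Here is a sketch of the standard sample-ratio-mismatch test against a planned 50/50 split; the exposure counts come from the experiment_viewed events in PostHog.

```ts
// Sample-ratio mismatch check: does the observed exposure split plausibly come
// from the planned 50/50 allocation? One-degree-of-freedom chi-squared test;
// 3.84 is the critical value at p = 0.05.
function sampleRatioMismatch(controlExposures: number, treatmentExposures: number): {
  chiSquared: number;
  mismatch: boolean;
} {
  const total = controlExposures + treatmentExposures;
  const expected = total / 2;
  const chiSquared =
    (controlExposures - expected) ** 2 / expected +
    (treatmentExposures - expected) ** 2 / expected;
  return { chiSquared, mismatch: chiSquared > 3.84 };
}

// 52/48 on 10,000 exposures is already a red flag:
sampleRatioMismatch(5200, 4800); // { chiSquared: 16, mismatch: true }
```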


6. Analyze and Decide

When the experiment hits the planned sample size, run the analysis and execute the decision you pre-registered.

Help me write the analysis script for the [test name] experiment. Use [PostHog SQL / Mode / Hex / Python pandas — pick one and stick with it].

Inputs:
- Exposure event: experiment_viewed where experiment='[test_name]'
- Conversion event: [exact event name and any property filters]
- Conversion window: [N days from exposure]
- Variant property on the exposure event: variant ∈ {control, treatment}

Output one table:

| Metric | Control rate | Treatment rate | Absolute diff | Relative diff | p-value | 95% CI | Conclusion |
|---|---|---|---|---|---|---|---|
| Primary: [metric] | x% | y% | +/-z pp | +/-w% | 0.0X | [low, high] | Significant Y/N |
| Guardrail 1: [metric] | ... | ... | ... | ... | ... | ... | ... |
| Guardrail 2: [metric] | ... | ... | ... | ... | ... | ... | ... |

Plus:
- Sample-ratio check: actual split vs planned (chi-squared p-value)
- Sample size achieved per variant vs planned
- Date range of analysis
- Number of users excluded and why (e.g., bucketed before experiment start, in holdout segment)

Then: apply the pre-registered ship rule from the registration doc. Output the decision in one sentence: "Ship treatment" or "Keep control" or "Inconclusive — extend N more days" — with the explicit reasoning.

Two patterns that come up almost every time:

  • A "winning" test with a sample-ratio mismatch is not winning. If your assignment was 52/48 instead of 50/50 with no benign explanation, your bucketing is biased and the result is not trustworthy. Re-run the test.
  • An "inconclusive" test is a real result. It means the effect, if any, is smaller than your MDE. Document it, ship the simpler variant, and move on. "We don't know" is a valid and frequent answer in honest experimentation.

7. Document the Test and the Learning

Every experiment, win or loss, becomes a documented entry. The compounding asset is the experiment log, not any individual test.

Generate the post-experiment readout for [test name]. Use this template:

# Experiment Result: [test_name]
- Ran: [start date] to [end date]
- Sample size achieved: [N per variant]
- Primary metric: [metric, control rate, treatment rate, lift, p-value]
- Guardrail metrics: [list with rates and verdicts]
- Decision: [Ship treatment | Keep control | Extend | Stop]
- Effect on annualized revenue: [back-of-envelope calculation — e.g., "+0.4 pp on trial→paid at $X average ARPA over Y trials/year = +$Z annual"]

## What We Believed Before
[Restate the hypothesis from the pre-registration]

## What We Believe Now
[Updated hypothesis based on results — what did we learn about user behavior, not just whether the metric moved?]

## What We're Doing Next
[The follow-up experiment or product change implied by this result]

Save to experiments/[YYYY-MM-DD]-[test_name]-results.md. Append a one-line summary to experiments/log.md so future-me can scan all past tests in 30 seconds.
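The "Effect on annualized revenue" line is deliberately back-of-envelope. A worked example with hypothetical numbers, just to show the shape of the calculation:

```ts
// Back-of-envelope revenue impact. All numbers are hypothetical placeholders.
const liftInPp = 0.004;        // +0.4 pp on trial→paid conversion
const trialsPerYear = 3000;
const averageAnnualArpa = 600; // $ per customer per year
const extraCustomers = trialsPerYear * liftInPp;                // 12
const annualRevenueImpact = extraCustomers * averageAnnualArpa; // $7,200
```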

The experiment log is more valuable than any single experiment. After 12–18 months of disciplined logging, founders can answer "what have we learned about how our users behave?" with evidence — and that evidence drives a better roadmap than either intuition or competitive analysis.


8. Build a Cadence

One test is a curiosity. A cadence is a system.

Help me design an experiment cadence I can sustain solo or with a small team. Goals:

- 2-4 well-powered experiments running per month
- 1 retrospective per quarter (what tests did we run, what did we learn, what surprised us, what's the experiment hypothesis backlog now)
- 1 written experiment hypothesis added per week (idea logging — ideas can sit in the backlog for months until traffic supports them)

Output:
1. A weekly cadence (which day = ship a test, which day = read results, which day = generate hypotheses)
2. A backlog template (hypothesis, target funnel step, expected MDE, traffic ready Y/N, estimated implementation cost)
3. A prioritization rule for which experiment to pull off the backlog next (rule of thumb: highest expected value × lowest implementation cost × highest traffic-readiness)
4. The "kill criteria" for the program itself — when should I stop doing experiments and just build product (e.g., "if 0/8 experiments in a quarter moved a metric, our hypothesis quality is the bottleneck, not our test infrastructure")

Output the cadence as a markdown doc I save to experiments/cadence.md.
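For point 3, a hypothetical scoring heuristic is sketched below; the 1–5 scales and the traffic-readiness gate are illustrative, and the exact weights matter far less than writing the rule down and applying it consistently.

```ts
// Hypothetical backlog scoring: higher is better. The scales and the
// traffic-readiness gate are illustrative, not a standard formula.
interface BacklogItem {
  hypothesis: string;
  expectedValue: 1 | 2 | 3 | 4 | 5;      // how much a win would plausibly be worth
  implementationCost: 1 | 2 | 3 | 4 | 5; // rough days of work
  trafficReady: boolean;                  // can current traffic power it in ≤4 weeks?
}

function backlogScore(item: BacklogItem): number {
  if (!item.trafficReady) return 0; // park it until traffic supports the MDE
  return item.expectedValue / item.implementationCost;
}
```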

The point of the cadence is not to run more tests; it's to build the muscle of asking "what would I have to believe for X to be the right move, and how could I test that?" — and to build a body of evidence that compounds.


What Done Looks Like

By the end of week 4, you should have:

  1. One completed experiment with pre-registration, runbook, analysis, and a published readout — even if the result was "no effect."
  2. A working experimentExperience() helper that any new test can plug into in under an hour.
  3. An experiments/ folder with a log, the cadence document, and a backlog of 5–10 hypotheses.
  4. One updated mental model about your users — something you didn't know before the experiment started.

By the end of month 3, you should be running 2–4 tests per month and have a 6–10 entry experiment log. By month 6, the log is your most valuable strategic asset: it tells you which interventions move metrics in your specific product, which is more useful than any blog post, conference talk, or competitor teardown.


Common Pitfalls

  • Running experiments without instrumented baselines. If you don't already know your funnel rates from PostHog Setup, every experiment is sand-castle math. Instrument first, test second.
  • HARKing — Hypothesizing After Results are Known. "Treatment didn't beat control overall, but on mobile it did, so let's ship to mobile." If mobile wasn't a pre-registered segment, you can't make that call without lying to yourself. File it as a hypothesis for the next test.
  • Treating launch-week traffic as steady-state. Launches inflate traffic temporarily. Experiments started during launch week often hit sample size in 3 days, then "decay" as traffic normalizes. Wait for steady-state before starting your first test.
  • Optimizing locally instead of structurally. A 5% lift on the pricing page is real money, but if your activation rate is broken, the structural fix (reworking onboarding, reducing churn, or redesigning pricing) outranks any single A/B test.



What's Next

Once your experimentation cadence is running, the limiting factor stops being test infrastructure and becomes hypothesis quality. Read every cancellation reason, every customer-success conversation, every churn-risk signal as a source of testable hypotheses. Cycle them through the cadence above. The compounding effect over a year is larger than any single feature you could ship instead.

Build the discipline now while traffic is small. The team that ships its first 8 honest experiments in year one has a permanent edge over the team that's still arguing about button colors in year three.

