What you'll learn
- How A/B testing measures causal impact of changes
- What Sample Ratio Mismatch (SRM) is and why it signals bugs
- How statistical power determines whether your sample size is sufficient
- The critical role of pre-experiment planning
A/B Testing Fundamentals
Your new pricing algorithm shows a 5% improvement in revenue per ride. But is it real? Could randomness alone explain this difference?
A/B Testing compares two variants:
- Control (A): Current algorithm, status quo
- Treatment (B): New algorithm, the change being tested
Statistical rigor ensures we don't chase false positives or ignore true effects.
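In practice, each user (or ride) is assigned to a variant deterministically, typically by hashing an ID together with an experiment-specific salt. Below is a minimal sketch of this idea; the function name, salt, and 50-50 split are illustrative, not taken from the chapter's system.

```python
import hashlib

def assign_variant(user_id: str, salt: str = "pricing_exp_v1") -> str:
    """Deterministically assign a user to control (A) or treatment (B)."""
    # Hash the salted user ID and map it to a bucket in [0, 100)
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"  # intended 50-50 split

print(assign_variant("rider_12345"))
```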
Revenue Distribution: Control vs Treatment
Below is the distribution of revenue per ride for control and treatment groups after running the A/B test on 10,000 rides total.
[Chart: distribution of revenue per ride for control vs. treatment]
Observations:
- Control mean: ~$12.52 (blue)
- Treatment mean: ~$13.07 (amber)
- Observed lift: ~4.4%
- The distributions overlap, but treatment is visibly shifted right
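The observed lift follows directly from the group means: (13.07 - 12.52) / 12.52 ≈ 4.4%.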
The question: Is this 4.4% lift statistically significant, or just noise?
Sample Ratio Mismatch (SRM) Check
Before interpreting results, always verify that your randomization worked correctly. Sample Ratio Mismatch occurs when your actual treatment/control split doesn't match your intended allocation (e.g., intended 50-50, actual 45-55); a quick way to check for it is sketched at the end of this section.
[Chart: SRM check of observed vs. expected group counts]
Why SRM matters:
- Often indicates a bug in experiment infrastructure (not the algorithm)
- Can invalidate statistical tests
- Check SRM BEFORE interpreting results
SRM causes:
- User segmentation bugs
- Regional traffic imbalances
- Bot traffic patterns
- Browser caching issues
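One common way to check for SRM is a chi-square goodness-of-fit test that compares observed group counts against the intended allocation. Here is a minimal sketch with hypothetical counts; replace them with your experiment's actual assignment numbers.

```python
from scipy.stats import chisquare

# Hypothetical observed assignment counts: [control, treatment]
observed = [4957, 5043]
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # intended 50-50 split

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)

# A very small p-value (commonly < 0.001) flags a likely SRM; investigate the
# experiment infrastructure before interpreting any results
if p_value < 0.001:
    print(f"Possible SRM (p={p_value:.4g}): check assignment logic")
else:
    print(f"No SRM detected (p={p_value:.4g})")
```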
Statistical Power & Sample Size
Statistical power is the probability of detecting a true effect when one exists:
- High power (80-90%): Likely to detect real differences
- Low power (<50%): May miss real improvements
- Type I error (α): False positive rate (typically 5%)
- Type II error (β): False negative rate (typically 10-20%); power = 1 - β
To detect a 5% revenue lift with 80% power, how many samples do we need per group?
[Chart: power curve as a function of sample size per group]
Interpretation:
- At ~464 samples per group, we reach 80% power
- We used 5,000 samples per group, so we're well above this threshold
- This experiment has very high power to detect the effect size
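The ~464 figure can be reproduced with a standard power calculation. For a two-sided two-sample test, a common approximation is n ≈ 2 (z_{1-α/2} + z_{1-β})² σ² / δ² per group, where σ is the metric's standard deviation and δ is the minimum detectable difference in absolute units. Below is a minimal sketch with statsmodels, assuming a per-ride revenue standard deviation of about $3.40 (a hypothetical value; substitute an estimate from your historical data).

```python
from statsmodels.stats.power import TTestIndPower

baseline_mean = 12.52  # control revenue per ride, from the chart above
lift = 0.05            # minimum detectable lift: 5%
std_dev = 3.40         # hypothetical standard deviation of revenue per ride

# Standardized effect size (Cohen's d)
effect_size = (baseline_mean * lift) / std_dev

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required samples per group: {n_per_group:.0f}")  # ~464 under these assumptions
```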
Pre-Experiment Checklist
Before launching any A/B test, ensure:
- Define hypothesis clearly: "New pricing increases revenue per ride by ≥ 5%"
- Calculate required sample size for desired power
- Check randomization (SRM check)
- Specify primary and secondary metrics upfront
- Set significance level (α) and power target
- Determine duration: Until reaching sample size, not based on p-values
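As a rough planning sketch (the daily traffic figure is hypothetical), the minimum duration follows from the required sample size and the expected traffic per arm, not from interim p-values:

```python
import math

required_per_group = 464       # from the power calculation above
rides_per_day_per_group = 150  # hypothetical expected daily rides assigned to each arm

days_needed = math.ceil(required_per_group / rides_per_day_per_group)
print(f"Run the experiment for at least {days_needed} days; "
      "round up to whole weeks to capture day-of-week effects")
```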
Common Pitfalls
❌ Peeking at results early: Increases false positive rates. Decide on sample size upfront.
❌ Multiple comparisons: Each additional metric tested inflates the false positive probability. Use corrections such as Bonferroni (see the sketch after this list).
❌ Unequal variance: Choose appropriate statistical tests (Welch's t-test if variances differ).
❌ Selection bias: Excluding users mid-experiment invalidates randomization.
❌ Correlation between treatments: Users seeing multiple experiments simultaneously can create interactions that bias results.
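For the multiple-comparisons pitfall above, here is a minimal sketch of a Bonferroni correction using statsmodels; the p-values are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from three metrics tested in the same experiment
p_values = [0.012, 0.034, 0.048]

reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for p_raw, p_adj, significant in zip(p_values, p_corrected, reject):
    print(f"raw p={p_raw:.3f}  corrected p={p_adj:.3f}  significant: {significant}")
```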
Statistical Tests
Continuous Metrics (Revenue, Time)
```python
from scipy.stats import ttest_ind

# Two-sample t-test compares the group means
# (pass equal_var=False for Welch's t-test if the group variances differ)
t_stat, p_value = ttest_ind(control_revenue, treatment_revenue)

# If p_value < 0.05, reject the null hypothesis (the difference is significant)
if p_value < 0.05:
    print(f"Lift is statistically significant (p={p_value:.4f})")
```
Categorical Metrics (Conversion Rate)
```python
from scipy.stats import chi2_contingency

# Chi-square test for a difference in proportions (conversion rates)
contingency_table = [
    [control_conversions, control_non_conversions],
    [treatment_conversions, treatment_non_conversions],
]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# As above, p_value < 0.05 indicates a statistically significant difference
```
Real-World Practice: Experiment Best Practices
| Company | Best Practice | Key Insight |
|---|---|---|
| Netflix | Pre-register experiments in XPGUARD before launching | Prevents p-hacking and researcher bias |
| Uber | Minimum sample size threshold; never peek at results | Fixed sample size prevents false positives from peeking |
| Airbnb | Analyze by segment (geography, user cohort, device) | Identify which groups benefit most from change |
| DoorDash | Run experiments for full week cycles (Mon-Sun) to capture day-of-week effects | Temporal patterns matter; control for them |
Key Takeaways
A/B Testing Checklist
- Define hypothesis upfront (before seeing data)
- Calculate required sample size for 80% power
- Check SRM before trusting results
- Never peek at results during experiment
- Use pre-registered primary metrics only
- Understand effect size vs statistical significance
- Replicate results on holdout data before shipping
Where This Connects
This chapter showed how to rigorously test changes in production. In Chapter 5: The Variance Reduction, we'll learn advanced techniques like CUPED to reduce variance and detect smaller effects with fewer samples — making experiments faster and more efficient.