The Great Experiment

A/B testing, sample ratio mismatch, and statistical power

What you'll learn

  • How A/B testing measures causal impact of changes
  • What Sample Ratio Mismatch (SRM) is and why it signals bugs
  • How statistical power determines if your sample size is sufficient
  • The critical role of pre-experiment planning

A/B Testing Fundamentals

Your new pricing algorithm shows a 5% improvement in revenue per ride. But is it real? Could randomness alone explain this difference?

A/B testing compares two variants:

  • Control (A): Current algorithm, status quo
  • Treatment (B): New algorithm, the change being tested

Statistical rigor ensures we don't chase false positives or ignore true effects.
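
In practice, the control/treatment split is often implemented by deterministically hashing a user identifier so each user always sees the same variant. Below is a minimal sketch under that assumption; the user_id field, salt, and function name are illustrative, not part of this experiment's infrastructure.

import hashlib

def assign_variant(user_id: str, salt: str = "pricing_exp_v1") -> str:
    # Hash the salted user id so each user gets a stable, pseudo-random bucket
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                       # a number from 0 to 99
    return "treatment" if bucket < 50 else "control"     # intended 50/50 split

# The same user always lands in the same group across sessions
print(assign_variant("rider_42"))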

Revenue Distribution: Control vs Treatment

Below is the distribution of revenue per ride for control and treatment groups after running the A/B test on 10,000 rides total.

[Chart: revenue-per-ride distributions for control (blue) and treatment (amber)]

Observations:

  • Control mean: ~$12.52 (blue)
  • Treatment mean: ~$13.07 (amber)
  • Observed lift: ~4.4%
  • The distributions overlap, but treatment is visibly shifted right
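
The observed lift is simply the relative difference in means: (13.07 − 12.52) / 12.52 ≈ 0.044, or about 4.4%.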

The question: Is this 4.4% lift statistically significant, or just noise?

Sample Ratio Mismatch (SRM) Check

Before interpreting results, always verify your randomization worked correctly. Sample Ratio Mismatch occurs when your actual treatment/control split doesn't match your intended allocation (e.g., intended 50-50, actual 45-55).

[Chart: SRM check comparing the observed treatment/control split to the intended allocation]

Why SRM matters:

  • Often indicates a bug in experiment infrastructure (not the algorithm)
  • Can invalidate statistical tests
  • Check SRM BEFORE interpreting results

SRM causes:

  • User segmentation bugs
  • Regional traffic imbalances
  • Bot traffic patterns
  • Browser caching issues
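
A common way to test for SRM is a chi-square goodness-of-fit test of the observed group counts against the intended allocation. Here is a minimal sketch using scipy; the counts are placeholders, not this experiment's actual numbers.

from scipy.stats import chisquare

# Observed group sizes from the experiment (placeholder values)
control_count, treatment_count = 5050, 4950
total = control_count + treatment_count

# Expected counts under the intended 50/50 allocation
expected = [total * 0.5, total * 0.5]

chi2, p_value = chisquare(f_obs=[control_count, treatment_count], f_exp=expected)

# A very small p-value (e.g., < 0.001) means a split this lopsided is unlikely
# under correct randomization -- investigate the infrastructure before
# trusting any downstream results.
if p_value < 0.001:
    print(f"Possible SRM detected (p={p_value:.6f})")
else:
    print(f"No SRM detected (p={p_value:.4f})")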

Statistical Power & Sample Size

Power is the probability of detecting a true effect when one exists (power = 1 − β):

  • High Power (80-90%): Likely to detect real differences
  • Low Power (<50%): May miss real improvements
  • Type I Error (α): False positive rate (typically 5%)
  • Type II Error (β): False negative rate (typically 10-20%)

To detect a 5% revenue lift with 80% power, how many samples do we need per group?

[Chart: statistical power as a function of samples per group]

Interpretation:

  • At ~464 samples per group, we reach 80% power
  • We used 5,000 samples per group, so we're well above this threshold
  • This experiment has very high power to detect the effect size
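
A sketch of how such a sample-size calculation can be done with statsmodels. The standard deviation below is an assumed placeholder (the text does not state it), so the exact answer depends on the variance you actually observe.

from statsmodels.stats.power import TTestIndPower

# Assumed inputs for illustration -- swap in your own estimates
baseline_mean = 12.52      # control revenue per ride ($), from the chart above
relative_lift = 0.05       # smallest lift we want to detect (5%)
assumed_std = 3.40         # assumed std. dev. of revenue per ride ($) -- not given in the text

# Standardized effect size (Cohen's d)
effect_size = (baseline_mean * relative_lift) / assumed_std

# Solve for the sample size per group at alpha = 0.05 and 80% power
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative='two-sided',
)
print(f"Required samples per group: {n_per_group:.0f}")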

Pre-Experiment Checklist

Before launching any A/B test, ensure:

  1. Define hypothesis clearly: "New pricing increases revenue per ride by ≥5%"
  2. Calculate required sample size for desired power
  3. Check randomization (SRM check)
  4. Specify primary and secondary metrics upfront
  5. Set significance level (α) and power target
  6. Determine duration: Until reaching sample size, not based on p-values

Common Pitfalls

❌ Peeking at results early: Increases false positive rates. Decide on sample size upfront.

❌ Multiple comparisons: Each additional metric tested inflates false positive probability. Use corrections like Bonferroni.

❌ Unequal variance: Choose appropriate statistical tests (Welch's t-test if variances differ).

❌ Selection bias: Excluding users mid-experiment invalidates randomization.

❌ Interacting experiments: Users exposed to multiple experiments simultaneously can confound each treatment's measured effect.
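
For the multiple-comparisons pitfall above, here is a minimal sketch of applying a Bonferroni correction with statsmodels; the p-values stand in for tests on several secondary metrics.

from statsmodels.stats.multitest import multipletests

# Raw p-values from tests on several metrics (placeholder values)
p_values = [0.012, 0.048, 0.240, 0.003]

# Bonferroni compares each raw p-value against alpha / number_of_tests,
# which multipletests expresses by inflating the p-values instead
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')

for raw, adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p={raw:.3f}  adjusted p={adj:.3f}  significant={significant}")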

Statistical Tests

Continuous Metrics (Revenue, Time)

from scipy.stats import ttest_ind

# T-test compares means (pass equal_var=False for Welch's t-test if variances differ)
t_stat, p_value = ttest_ind(control_revenue, treatment_revenue)

# If p_value < 0.05, reject null hypothesis (difference is significant)
if p_value < 0.05:
    print(f"Lift is statistically significant (p={p_value:.4f})")

Categorical Metrics (Conversion Rate)

from scipy.stats import chi2_contingency

# Chi-square test for proportions
contingency_table = [
    [control_conversions, control_non_conversions],
    [treatment_conversions, treatment_non_conversions]
]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
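
As with the t-test above, compare p_value to the chosen significance level (e.g., 0.05): a smaller value indicates that the conversion rates of control and treatment differ by more than chance alone would explain.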

Real-World Practice: Experiment Best Practices

How major tech companies structure A/B testing:

  • Netflix: Pre-register experiments in XPGUARD before launching. Key insight: prevents p-hacking and researcher bias.
  • Uber: Enforce a minimum sample size threshold and never peek at results. Key insight: a fixed sample size prevents false positives from peeking.
  • Airbnb: Analyze results by segment (geography, user cohort, device). Key insight: identifies which groups benefit most from the change.
  • DoorDash: Run experiments for full week cycles (Mon-Sun) to capture day-of-week effects. Key insight: temporal patterns matter; control for them.

Key Takeaways

A/B Testing Checklist

  • Define hypothesis upfront (before seeing data)
  • Calculate required sample size for 80% power
  • Check SRM before trusting results
  • Never peek at results during experiment
  • Use pre-registered primary metrics only
  • Understand effect size vs statistical significance
  • Replicate results on holdout data before shipping

Where This Connects

This chapter showed how to rigorously test changes in production. In Chapter 5: The Variance Reduction, we'll learn advanced techniques like CUPED to reduce variance and detect smaller effects with fewer samples — making experiments faster and more efficient.