What you'll learn
- How A/B testing measures causal impact of changes
- What Sample Ratio Mismatch (SRM) is and why it signals bugs
- How statistical power determines whether your sample size is sufficient
- The critical role of pre-experiment planning
A/B Testing Fundamentals
Your new pricing algorithm shows a 5% improvement in revenue per ride. But is it real? Could randomness alone explain this difference?
A/B Testing compares two variants:
- Control (A): Current algorithm, status quo
- Treatment (B): New algorithm, the change being tested
Statistical rigor ensures we don't chase false positives or ignore true effects.
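In practice, each user (or ride) is assigned to a variant deterministically, typically by hashing an ID together with an experiment-specific salt. Below is a minimal sketch of this idea; the function name, salt, and 50-50 split are illustrative, not taken from the chapter's system.

```python
import hashlib

def assign_variant(user_id: str, salt: str = "pricing_exp_v1") -> str:
    """Deterministically assign a user to control (A) or treatment (B)."""
    # Hash the salted user ID and map it to a bucket in [0, 100)
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"  # intended 50-50 split

print(assign_variant("rider_12345"))
```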
Revenue Distribution: Control vs Treatment
Below is the distribution of revenue per ride for control and treatment groups after running the A/B test on 10,000 rides total.
[Chart: distribution of revenue per ride for control vs. treatment]
Observations:
- Control mean: ~$12.52 (blue)
- Treatment mean: ~$13.07 (amber)
- Observed lift: ~4.4%
- The distributions overlap, but treatment is visibly shifted right
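The observed lift follows directly from the group means: (13.07 - 12.52) / 12.52 ≈ 4.4%.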
The question: Is this 4.4% lift statistically significant, or just noise?
Sample Ratio Mismatch (SRM) Check
Before interpreting results, always verify that your randomization worked correctly. Sample Ratio Mismatch occurs when your actual treatment/control split doesn't match your intended allocation (e.g., intended 50-50, actual 45-55); a quick way to check for it is sketched at the end of this section.
[Chart: SRM check of observed vs. expected group counts]
Why SRM matters:
- Often indicates a bug in experiment infrastructure (not the algorithm)
- Can invalidate statistical tests
- Check SRM BEFORE interpreting results
SRM causes:
- User segmentation bugs
- Regional traffic imbalances
- Bot traffic patterns
- Browser caching issues
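One common way to check for SRM is a chi-square goodness-of-fit test that compares observed group counts against the intended allocation. Here is a minimal sketch with hypothetical counts; replace them with your experiment's actual assignment numbers.

```python
from scipy.stats import chisquare

# Hypothetical observed assignment counts: [control, treatment]
observed = [4957, 5043]
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # intended 50-50 split

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)

# A very small p-value (commonly < 0.001) flags a likely SRM; investigate the
# experiment infrastructure before interpreting any results
if p_value < 0.001:
    print(f"Possible SRM (p={p_value:.4g}): check assignment logic")
else:
    print(f"No SRM detected (p={p_value:.4g})")
```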
Statistical Power & Sample Size
Statistical power is the probability of detecting a true effect when one exists:
- High power (80-90%): Likely to detect real differences
- Low power (<50%): May miss real improvements
- Type I error (α): False positive rate (typically 5%)
- Type II error (β): False negative rate (typically 10-20%); power = 1 - β
To detect a 5% revenue lift with 80% power, how many samples do we need per group?
[Chart: power curve as a function of sample size per group]
Interpretation:
- At ~464 samples per group, we reach 80% power
- We used 5,000 samples per group, so we're well above this threshold
- This experiment has very high power to detect the effect size
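The ~464 figure can be reproduced with a standard power calculation. For a two-sided two-sample test, a common approximation is n ≈ 2 (z_{1-α/2} + z_{1-β})² σ² / δ² per group, where σ is the metric's standard deviation and δ is the minimum detectable difference in absolute units. Below is a minimal sketch with statsmodels, assuming a per-ride revenue standard deviation of about $3.40 (a hypothetical value; substitute an estimate from your historical data).

```python
from statsmodels.stats.power import TTestIndPower

baseline_mean = 12.52  # control revenue per ride, from the chart above
lift = 0.05            # minimum detectable lift: 5%
std_dev = 3.40         # hypothetical standard deviation of revenue per ride

# Standardized effect size (Cohen's d)
effect_size = (baseline_mean * lift) / std_dev

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required samples per group: {n_per_group:.0f}")  # ~464 under these assumptions
```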
Pre-Experiment Checklist
Before launching any A/B test, ensure:
- Define hypothesis clearly: "New pricing increases revenue per ride by ≥ 5%"
- Calculate required sample size for desired power
- Check randomization (SRM check)
- Specify primary and secondary metrics upfront
- Set significance level (α) and power target
- Determine duration: Until reaching sample size, not based on p-values
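As a rough planning sketch (the daily traffic figure is hypothetical), the minimum duration follows from the required sample size and the expected traffic per arm, not from interim p-values:

```python
import math

required_per_group = 464       # from the power calculation above
rides_per_day_per_group = 150  # hypothetical expected daily rides assigned to each arm

days_needed = math.ceil(required_per_group / rides_per_day_per_group)
print(f"Run the experiment for at least {days_needed} days; "
      "round up to whole weeks to capture day-of-week effects")
```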
Common Pitfalls
❌ Peeking at results early: Increases false positive rates. Decide on sample size upfront.
❌ Multiple comparisons: Each additional metric tested inflates the false positive probability. Use corrections such as Bonferroni (see the sketch after this list).
❌ Unequal variance: Choose appropriate statistical tests (Welch's t-test if variances differ).
❌ Selection bias: Excluding users mid-experiment invalidates randomization.
❌ Correlation between treatments: Users seeing multiple experiments simultaneously can create interactions that bias results.
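For the multiple-comparisons pitfall above, here is a minimal sketch of a Bonferroni correction using statsmodels; the p-values are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from three metrics tested in the same experiment
p_values = [0.012, 0.034, 0.048]

reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for p_raw, p_adj, significant in zip(p_values, p_corrected, reject):
    print(f"raw p={p_raw:.3f}  corrected p={p_adj:.3f}  significant: {significant}")
```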
Statistical Tests
Continuous Metrics (Revenue, Time)
```python
from scipy.stats import ttest_ind

# Two-sample t-test compares the group means
# (pass equal_var=False for Welch's t-test if the group variances differ)
t_stat, p_value = ttest_ind(control_revenue, treatment_revenue)

# If p_value < 0.05, reject the null hypothesis (the difference is significant)
if p_value < 0.05:
    print(f"Lift is statistically significant (p={p_value:.4f})")
```
Categorical Metrics (Conversion Rate)
```python
from scipy.stats import chi2_contingency

# Chi-square test for a difference in proportions (conversion rates)
contingency_table = [
    [control_conversions, control_non_conversions],
    [treatment_conversions, treatment_non_conversions],
]
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# As above, p_value < 0.05 indicates a statistically significant difference
```
Real-World Practice: Experiment Best Practices
| Company | Best Practice | Key Insight |
|---|---|---|
| Netflix | Pre-register experiments in XPGUARD before launching | Prevents p-hacking and researcher bias |
| Uber | Minimum sample size threshold; never peek at results | Fixed sample size prevents false positives from peeking |
| Airbnb | Analyze by segment (geography, user cohort, device) | Identify which groups benefit most from change |
| DoorDash | Run experiments for full week cycles (Mon-Sun) to capture day-of-week effects | Temporal patterns matter; control for them |
Key Takeaways
A/B Testing Checklist
- Define hypothesis upfront (before seeing data)
- Calculate required sample size for 80% power
- Check SRM before trusting results
- Never peek at results during experiment
- Use pre-registered primary metrics only
- Understand effect size vs statistical significance
- Replicate results on holdout data before shipping
Where This Connects
This chapter showed how to rigorously test changes in production. In Chapter 5: The Variance Reduction, we'll learn advanced techniques like CUPED to reduce variance and detect smaller effects with fewer samples — making experiments faster and more efficient.