A/B Testing for Ecommerce: Using Data to Optimise Product Pages
Run statistically valid A/B tests on your ecommerce product pages. Covers experiment design, sample size calculation, significance testing, and common pitfalls with Python.
You redesign a product page. Conversion rate goes up. The team celebrates. A week later, conversion drops back to where it was — or lower. What happened?
Without proper A/B testing, you cannot tell the difference between a real improvement and random noise. Most ecommerce teams make changes based on gut feeling, then attribute any movement in metrics to whatever they changed last. That is not optimisation — it is superstition with a dashboard.
This guide builds a rigorous A/B testing workflow for ecommerce product pages: from experiment design to statistical analysis, all in Python.
Who This Is For
- Ecommerce managers making product page changes without knowing if they actually help
- Marketing teams who want to stop guessing and start measuring what converts
- Vibe coders building tools or dashboards that need data-backed decisions behind them
- Store owners spending money on redesigns without proof they generate more revenue
You do not need to be a statistician. This guide explains the maths in plain language and gives you Python code that handles the calculations. If you can read a chart, you can run an A/B test properly.
How A/B Testing Works
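Show the current page (variant A, the control) to a random half of your visitors and the changed page (variant B) to the other half. Measure the same metric for both groups over the same period. Then use statistics to determine whether the difference is real or random variation.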
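One common way to split traffic evenly is to hash a stable visitor identifier. The sketch below is a minimal illustration under stated assumptions: the function name `assign_variant`, the experiment label, and the visitor IDs are all hypothetical, and a real store would take the ID from a cookie or account.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str = "product-page-hero") -> str:
    """Deterministically bucket a visitor into variant "A" or "B".

    Hashing the experiment name together with the visitor ID keeps the
    assignment stable across visits and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # integer in 0..99
    return "A" if bucket < 50 else "B"


# Hypothetical visitor IDs, for illustration only
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}")] += 1

print(counts)  # roughly 5,000 in each group
```

Because the assignment depends only on the ID, a returning visitor always sees the same variant, which keeps the two groups cleanly separated.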
What You Will Need
pip install scipy numpy pandas statsmodels
- scipy — statistical tests (chi-squared, z-test)
- numpy — numerical calculations
- pandas — data manipulation
- statsmodels — proportion tests and confidence intervals
Step 1: Define the Experiment
Before writing any code, answer these four questions:
| Question | Example Answer |
|---|---|
| What are you testing? | New product image layout (larger hero, lifestyle shots) |
| What metric decides the winner? | Add-to-cart rate (primary), bounce rate (secondary) |
| How much lift is meaningful? | 5% relative improvement (e.g., 3.0% → 3.15%) |
| How long will you run it? | Until we hit the required sample size (minimum 2 weeks) |
Common Ecommerce Metrics
| Metric | Calculation | Typical Range |
|---|---|---|
| Add-to-cart rate | Carts / Product page views | 3–8% |
| Checkout initiation rate | Checkouts / Carts | 30–60% |
| Purchase conversion rate | Purchases / Sessions | 1–4% |
| Revenue per visitor | Total revenue / Visitors | £1–5 |
| Bounce rate | Single-page sessions / Total sessions | 30–60% |
Pick one primary metric before the test starts. Changing the metric after seeing results is p-hacking.
Step 2: Calculate Sample Size
Running a test without knowing the required sample size is the most common mistake. Too few visitors and you will never detect a real effect:
import math

from scipy.stats import norm


def calculate_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    significance_level: float = 0.05,
    power: float = 0.80,
) -> int:
    """Calculate the minimum sample size per variant.

    Args:
        baseline_rate: Current conversion rate (e.g., 0.03 for 3%)
        minimum_detectable_effect: Relative improvement to detect (e.g., 0.05 for 5%)
        significance_level: Probability of false positive (default 5%)
        power: Probability of detecting a true effect (default 80%)

    Returns:
        Required sample size per variant (not total)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)

    z_alpha = norm.ppf(1 - significance_level / 2)
    z_beta = norm.ppf(power)

    p_avg = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)
Example Calculations
# Scenario 1: Testing add-to-cart rate improvement
size = calculate_sample_size(
    baseline_rate=0.05,              # 5% current add-to-cart rate
    minimum_detectable_effect=0.10,  # detect a 10% relative lift (5% → 5.5%)
)
print(f"Need {size:,} visitors per variant")
# => Need 31,234 visitors per variant

# Scenario 2: Small effect on low-converting page
size = calculate_sample_size(
    baseline_rate=0.02,              # 2% conversion rate
    minimum_detectable_effect=0.05,  # detect 5% relative lift (2% → 2.1%)
)
print(f"Need {size:,} visitors per variant")
# => Need 315,206 visitors per variant
The numbers are often sobering. A small site with 1,000 visitors per day would need about two months to fill both variants in the first scenario, and far longer in the second. This is why you should:
- Test big, bold changes (not button colour tweaks)
- Focus on high-traffic pages first
- Use add-to-cart rate (higher base rate) over purchase rate when possible
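To turn a required sample size into a calendar estimate, divide the total visitors needed by daily traffic and apply the two-week minimum from Step 1. This is a rough sketch: the function name is hypothetical, the 30,000 figure is a round illustrative number, and it assumes every visitor to the page enters the test and is split evenly.

```python
import math


def estimate_test_duration_days(
    sample_size_per_variant: int,
    daily_visitors: int,
    variants: int = 2,
    minimum_days: int = 14,
) -> int:
    """Estimate how many days a test must run.

    Assumes all daily visitors are enrolled and split evenly across variants.
    The two-week floor covers weekday and weekend behaviour.
    """
    total_needed = sample_size_per_variant * variants
    days_for_sample = math.ceil(total_needed / daily_visitors)
    return max(days_for_sample, minimum_days)


print(estimate_test_duration_days(30_000, daily_visitors=1_000))   # 60
print(estimate_test_duration_days(30_000, daily_visitors=10_000))  # 14 (6 days of traffic, but the floor applies)
```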
Step 3: Collect and Structure Test Data
import pandas as pd


def load_experiment_data(filepath: str) -> pd.DataFrame:
    """Load A/B test event data.

    Expected columns: visitor_id, variant (A or B), converted (0 or 1), timestamp
    """
    df = pd.read_csv(filepath, parse_dates=["timestamp"])

    # Validate data quality
    assert set(df["variant"].unique()) <= {"A", "B"}, "Unexpected variants"
    assert df["converted"].isin([0, 1]).all(), "Converted must be 0 or 1"
    assert df["visitor_id"].is_unique, "Duplicate visitor IDs detected"

    return df


def summarise_experiment(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate conversion rate and sample size per variant."""
    summary = df.groupby("variant").agg(
        visitors=("visitor_id", "count"),
        conversions=("converted", "sum"),
    )
    summary["conversion_rate"] = summary["conversions"] / summary["visitors"]
    summary["conversion_rate_pct"] = summary["conversion_rate"] * 100
    return summary
Example Output
df = load_experiment_data("experiment_results.csv")
summary = summarise_experiment(df)
print(summary)

         visitors  conversions  conversion_rate  conversion_rate_pct
variant
A           15234          762           0.0500                 5.00
B           15198          836           0.0550                 5.50
Variant B shows 5.50% vs 5.00% — a 10% relative lift. But is it statistically significant?
Step 4: Statistical Significance Testing
Frequentist Approach (Z-Test for Proportions)
import numpy as np
from statsmodels.stats.proportion import proportions_ztest


def test_significance(
    conversions_a: int,
    visitors_a: int,
    conversions_b: int,
    visitors_b: int,
    significance_level: float = 0.05,
) -> dict:
    """Run a two-proportion z-test.

    Returns test results with rates, relative lift, p-value, and recommendation.
    """
    count = np.array([conversions_a, conversions_b])
    nobs = np.array([visitors_a, visitors_b])

    # z is computed as (rate_a - rate_b) / SE, so it is negative when B is ahead
    z_stat, p_value = proportions_ztest(count, nobs, alternative="two-sided")

    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    relative_lift = (rate_b - rate_a) / rate_a
    significant = bool(p_value < significance_level)

    # Report a significant loss explicitly instead of folding it into "no difference"
    if not significant:
        recommendation = f"No significant difference (p={p_value:.4f})"
    elif rate_b > rate_a:
        recommendation = f"Variant B wins with {relative_lift:.1%} lift (p={p_value:.4f})"
    else:
        recommendation = f"Variant B loses with {relative_lift:.1%} change (p={p_value:.4f})"

    return {
        "control_rate": round(rate_a, 4),
        "test_rate": round(rate_b, 4),
        "relative_lift": round(relative_lift, 4),
        "z_statistic": round(float(z_stat), 4),
        "p_value": round(float(p_value), 4),
        "significant": significant,
        "recommendation": recommendation,
    }
Running the Test
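result = test_significance(
    conversions_a=762,
    visitors_a=15234,
    conversions_b=836,
    visitors_b=15198,
)
print(result)

{
    "control_rate": 0.05,
    "test_rate": 0.055,
    "relative_lift": 0.0997,
    "z_statistic": -1.9503,
    "p_value": 0.0511,
    "significant": False,
    "recommendation": "No significant difference (p=0.0511)"
}

A p-value of 0.0511 means that if the two pages truly converted at the same rate, a gap at least this large would still show up in about 5.1% of experiments. That is just above our 5% threshold, so despite the roughly 10% observed lift, the result is not statistically significant. Note that the test has also collected only about half of the sample size calculated in Step 2, so the right call is to keep collecting data, not to declare a winner.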
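To see what the library call computes, here is a minimal standard-library sketch of the pooled two-proportion z-test applied to the same counts. The function name `two_proportion_z_test` is an illustrative assumption, not part of statsmodels. It uses the pooled-variance formula, which should agree with `proportions_ztest` under its default settings.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) using the pooled standard error."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one conversion rate
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se  # same sign convention as count=[A, B]
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p = two_proportion_z_test(762, 15234, 836, 15198)
print(f"z = {z:.3f}, p = {p:.3f}")
# => z = -1.950, p = 0.051
```

Recomputing the statistic by hand like this is a cheap sanity check on any reported p-value before a decision is made on it.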
Step 5: Confidence Intervals
P-values tell you whether there is a difference. Confidence intervals tell you how big it might be:
from statsmodels.stats.proportion import confint_proportions_2indep


def calculate_confidence_interval(
    conversions_a: int,
    visitors_a: int,
    conversions_b: int,
    visitors_b: int,
    confidence_level: float = 0.95,
) -> dict:
    """Calculate a Wald confidence interval for the difference in proportions (B minus A)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    diff = rate_b - rate_a

    ci_low, ci_high = confint_proportions_2indep(
        conversions_b, visitors_b,
        conversions_a, visitors_a,
        method="wald",
        alpha=1 - confidence_level,
    )

    return {
        "difference": round(diff, 4),
        "ci_lower": round(ci_low, 4),
        "ci_upper": round(ci_high, 4),
        "interpretation": (
            f"The true difference is between {ci_low:.3%} and {ci_high:.3%} "
            f"({confidence_level:.0%} confidence)"  # report the level actually used
        ),
    }


ci = calculate_confidence_interval(762, 15234, 836, 15198)
print(ci["interpretation"])
# => The true difference is between -0.002% and 1.000% (95% confidence)

If the confidence interval does not include zero, the result is significant. Here the interval just includes zero, so the data are consistent with anything from no effect to a gain of roughly one percentage point. The width tells you how precise your estimate is; narrowing it requires more data.
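How much more data a narrower interval needs can be sketched with the Wald formula directly. The helper below is illustrative (its name and the equal-group-size simplification are assumptions): the half-width shrinks with the square root of the sample size, so halving the interval takes four times the visitors.

```python
import math
from statistics import NormalDist


def wald_half_width(
    rate_a: float,
    rate_b: float,
    n_per_variant: int,
    confidence_level: float = 0.95,
) -> float:
    """Half-width of the Wald interval for rate_b - rate_a with equal group sizes."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    se = math.sqrt(
        rate_a * (1 - rate_a) / n_per_variant
        + rate_b * (1 - rate_b) / n_per_variant
    )
    return z * se


for n in (5_000, 20_000, 80_000):
    hw = wald_half_width(0.05, 0.055, n)
    print(f"n={n:>6,} per variant: ±{hw:.2%}")
# n= 5,000 per variant: ±0.87%
# n=20,000 per variant: ±0.44%
# n=80,000 per variant: ±0.22%
```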
Step 6: Common Pitfalls
Pitfall 1: Peeking at Results Early
def is_safe_to_check(df: pd.DataFrame, required_per_variant: int) -> bool:
    """Only analyse results after reaching the required sample size."""
    counts = df.groupby("variant")["visitor_id"].count()
    min_count = counts.min()

    if min_count < required_per_variant:
        remaining = required_per_variant - min_count
        print(f"Waiting — need {remaining:,} more visitors per variant. Do not check results yet.")
        return False

    print(f"Ready — reached {min_count:,} per variant (required: {required_per_variant:,}). Safe to analyse.")
    return True
Checking results daily and stopping as soon as they look good can inflate your false positive rate from 5% to 25% or more, because every extra look is another chance for noise to cross the threshold.
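The effect of peeking can be demonstrated with a small A/A simulation, where both variants share the same true rate and every "significant" result is therefore a false positive. This is a sketch under assumptions: the traffic level, the 14-day horizon, and the function names are all illustrative, and the exact percentages depend on the random seed.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(conv_a: int, conv_b: int, n_per_variant: int) -> float:
    """Two-proportion z-test p-value for equal-sized groups (pooled variance)."""
    pooled = (conv_a + conv_b) / (2 * n_per_variant)
    if pooled in (0.0, 1.0):  # no variation yet, nothing to test
        return 1.0
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_variant)
    z = (conv_b - conv_a) / n_per_variant / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(
    n_sims: int = 400,
    days: int = 14,
    daily_per_variant: int = 200,
    true_rate: float = 0.05,
    alpha: float = 0.05,
    seed: int = 42,
) -> tuple[float, float]:
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        significant_on_any_day = False
        p = 1.0
        for _day in range(days):
            # Both variants convert at the same true rate
            conv_a += sum(rng.random() < true_rate for _ in range(daily_per_variant))
            conv_b += sum(rng.random() < true_rate for _ in range(daily_per_variant))
            n += daily_per_variant
            p = two_sided_p_value(conv_a, conv_b, n)
            if p < alpha:
                significant_on_any_day = True
        peeking_hits += significant_on_any_day  # would have stopped on some day
        final_hits += p < alpha                 # significant at the planned end only
    return peeking_hits / n_sims, final_hits / n_sims


peeking_rate, final_rate = simulate_aa_tests()
print(f"Stop at first significant day: {peeking_rate:.1%} false positives")
print(f"Single look at the end:        {final_rate:.1%} false positives")
```

The single-look rate should sit near the nominal 5%, while the peeking rate is typically several times higher.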
Pitfall 2: Testing Too Many Variants
| Variants | Adjusted Significance Level | Required Sample (per variant) |
|---|---|---|
| 2 (A/B) | 0.050 | ~31,000 |
| 3 (A/B/C) | 0.025 (Bonferroni) | ~38,000 |
| 4 (A/B/C/D) | 0.017 (Bonferroni) | ~42,000 |
More variants means more traffic needed. For most ecommerce sites, stick to two variants.
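The cost of extra variants can be estimated by combining a Bonferroni adjustment with the Step 2 sample size formula. The sketch below assumes each challenger is compared only against the control (so the number of comparisons is variants minus one) and uses the 5% baseline with a 10% relative lift; the function name is hypothetical, and `statistics.NormalDist` stands in for `scipy.stats.norm` to keep the example dependency-free.

```python
import math
from statistics import NormalDist


def sample_size_with_bonferroni(
    baseline_rate: float,
    relative_lift: float,
    n_variants: int,
    alpha: float = 0.05,
    power: float = 0.80,
) -> tuple[float, int]:
    """Return (adjusted alpha, sample size per variant) for n_variants arms."""
    comparisons = n_variants - 1  # each challenger vs. the control
    adjusted_alpha = alpha / comparisons

    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_avg = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return adjusted_alpha, math.ceil(numerator / (p2 - p1) ** 2)


for k in (2, 3, 4):
    adj_alpha, n = sample_size_with_bonferroni(0.05, 0.10, n_variants=k)
    print(f"{k} variants: alpha={adj_alpha:.4f}, {n:,} visitors per variant")
```

Total traffic grows faster than the per-variant figure suggests, because each added variant needs its own full group of visitors.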
Pitfall 3: Ignoring Segment Effects
def check_segment_consistency(df: pd.DataFrame, segment_col: str) -> pd.DataFrame:
    """Verify the treatment effect is consistent across segments."""
    results = []

    for segment, group in df.groupby(segment_col):
        summary = summarise_experiment(group)
        if len(summary) == 2:
            rate_a = summary.loc["A", "conversion_rate"]
            rate_b = summary.loc["B", "conversion_rate"]
            results.append({
                "segment": segment,
                "rate_a": rate_a,
                "rate_b": rate_b,
                "lift": (rate_b - rate_a) / rate_a if rate_a > 0 else 0,
                "n_a": summary.loc["A", "visitors"],
                "n_b": summary.loc["B", "visitors"],
            })

    return pd.DataFrame(results)
A test that improves conversion on desktop but worsens it on mobile can show a net positive — while hurting your fastest-growing segment.
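A small worked example makes the risk concrete. The counts below are entirely hypothetical and chosen only to illustrate how a healthy overall lift can hide a losing segment.

```python
# Hypothetical (visitors, conversions) per variant and device segment
segments = {
    "desktop": {"A": (6_000, 360), "B": (6_000, 432)},
    "mobile": {"A": (4_000, 160), "B": (4_000, 140)},
}


def relative_lift(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Relative lift of B over A, each given as (visitors, conversions)."""
    rate_a = a[1] / a[0]
    rate_b = b[1] / b[0]
    return (rate_b - rate_a) / rate_a


lifts = {name: relative_lift(data["A"], data["B"]) for name, data in segments.items()}

# Pool the segments to get the headline number a dashboard would show
total_a = (
    sum(seg["A"][0] for seg in segments.values()),
    sum(seg["A"][1] for seg in segments.values()),
)
total_b = (
    sum(seg["B"][0] for seg in segments.values()),
    sum(seg["B"][1] for seg in segments.values()),
)
lifts["overall"] = relative_lift(total_a, total_b)

for name, lift in lifts.items():
    print(f"{name}: {lift:+.1%}")
# desktop: +20.0%
# mobile: -12.5%
# overall: +10.0%
```

Segment results rest on smaller samples than the headline figure, so treat a split like this as a prompt for a follow-up test, not as proof on its own.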
What This Replaces
| Before (Guesswork) | After (A/B Testing) |
|---|---|
| "The new design looks better" | "Variant B increased add-to-cart by 9.9% (p=0.04)" |
| Changes based on opinion | Changes backed by statistical evidence |
| No idea if a change helped or hurt | Clear pass/fail with confidence intervals |
| Season or trend mistaken for improvement | Controlled experiment isolates the change |
| Test everything at once | One change per test, measured precisely |
| Celebrate after one week | Wait for statistical significance |
Next Steps
A/B testing is one piece of the data-driven ecommerce stack:
- Feed test results into automated reporting pipelines for stakeholder visibility
- Use the same statistical approach to improve conversion funnels
- Fix site speed issues first — slow pages invalidate test results
- Store experiment data in reliable pipelines for long-term analysis
Start with your highest-traffic product page. Calculate the sample size. Run one clean experiment. The discipline of “prove it with data” will change how your team makes decisions.