A/B Testing for Ecommerce: Using Data to Optimise Product Pages
Run statistically valid A/B tests on your ecommerce product pages. Covers experiment design, sample size calculation, significance testing, and common pitfalls with Python.
You redesign a product page. Conversion rate goes up. The team celebrates. A week later, conversion drops back to where it was, or lower. What happened?
Without proper A/B testing, you cannot tell the difference between a real improvement and random noise. Most ecommerce teams make changes based on gut feeling, then attribute any movement in metrics to whatever they changed last. That is not optimisation — it is superstition with a dashboard.
This guide builds a rigorous A/B testing workflow for ecommerce product pages: from experiment design to statistical analysis, all in Python.
# Who This Is For
- Ecommerce managers making product page changes without knowing if they actually help
- Marketing teams who want to stop guessing and start measuring what converts
- Vibe coders building tools or dashboards that need data-backed decisions behind them
- Store owners spending money on redesigns without proof they generate more revenue
You do not need to be a statistician. This guide explains the maths in plain language and gives you Python code that handles the calculations. If you can read a chart, you can run an A/B test properly.
# How A/B Testing Works
flowchart LR
V["Visitor arrives"] --> S["Split\n(50/50)"]
S --> A["Variant A\n(Control)"]
S --> B["Variant B\n(Test)"]
A --> MA["Measure\nConversion"]
B --> MB["Measure\nConversion"]
MA --> AN["Analyse\n(Statistical test)"]
MB --> AN
AN --> D{"Significant?"}
D -- Yes --> R["Roll out winner"]
D -- No --> C["Continue or stop"]
Split traffic evenly. Measure the same metric for both groups. Use statistics to determine whether the difference is real or random variation.
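The split itself needs to be deterministic and unbiased: the same visitor should always see the same variant, and assignment should not depend on time of day or device. A minimal sketch of one common approach, hashing the visitor ID into a bucket (the helper name and the experiment key are illustrative, not any particular tool's API):

import hashlib

def assign_variant(visitor_id: str, experiment: str = "product-page-test") -> str:
    """Deterministically assign a visitor to A or B by hashing their ID.

    The same visitor_id always lands in the same bucket, and the split
    comes out roughly 50/50 across a large number of visitors.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"

print(assign_variant("visitor-123"))  # the same visitor always gets the same answer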
# What You Will Need
pip install scipy numpy pandas statsmodels
- scipy — statistical tests (chi-squared, z-test)
- numpy — numerical calculations
- pandas — data manipulation
- statsmodels — proportion tests and confidence intervals
# Step 1: Define the Experiment
Before writing any code, answer these four questions:
| Question | Example Answer |
|---|---|
| What are you testing? | New product image layout (larger hero, lifestyle shots) |
| What metric decides the winner? | Add-to-cart rate (primary), bounce rate (secondary) |
| How much lift is meaningful? | 5% relative improvement (e.g., 3.0% → 3.15%) |
| How long will you run it? | Until we hit the required sample size (minimum 2 weeks) |
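Writing those answers down as a small, frozen config keeps the test honest: the hypothesis, primary metric and minimum effect are fixed before any data arrives. A sketch, assuming an illustrative ExperimentPlan dataclass rather than any specific framework:

from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentPlan:
    """Pre-registered experiment definition, written before the test starts."""
    name: str
    hypothesis: str
    primary_metric: str
    minimum_detectable_effect: float  # relative lift, e.g. 0.05 for 5%
    significance_level: float = 0.05
    power: float = 0.80

plan = ExperimentPlan(
    name="product-image-layout",
    hypothesis="Larger hero image and lifestyle shots increase add-to-cart rate",
    primary_metric="add_to_cart_rate",
    minimum_detectable_effect=0.05,
)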
# Common Ecommerce Metrics
| Metric | Calculation | Typical Range |
|---|---|---|
| Add-to-cart rate | Carts / Product page views | 3–8% |
| Checkout initiation rate | Checkouts / Carts | 30–60% |
| Purchase conversion rate | Purchases / Sessions | 1–4% |
| Revenue per visitor | Total revenue / Visitors | £1–5 |
| Bounce rate | Single-page sessions / Total sessions | 30–60% |
Pick one primary metric before the test starts. Changing the metric after seeing results is p-hacking.
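If your analytics export gives you session-level data, each of these metrics is a simple ratio. A sketch in pandas, assuming hypothetical column names (viewed_product, added_to_cart, purchased, revenue, visitor_id); adapt them to whatever your tracking actually provides:

import pandas as pd

def ecommerce_metrics(sessions: pd.DataFrame) -> dict:
    """Compute the core metrics from one-row-per-session data with boolean flags."""
    product_views = sessions["viewed_product"].sum()
    return {
        "add_to_cart_rate": sessions["added_to_cart"].sum() / product_views,
        "purchase_conversion_rate": sessions["purchased"].mean(),  # purchases / sessions
        "revenue_per_visitor": sessions["revenue"].sum() / sessions["visitor_id"].nunique(),
    }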
# Step 2: Calculate Sample Size
Running a test without knowing the required sample size is the most common mistake. Too few visitors and you will never detect a real effect:
from scipy.stats import norm
import math
def calculate_sample_size(
baseline_rate: float,
minimum_detectable_effect: float,
significance_level: float = 0.05,
power: float = 0.80,
) -> int:
"""Calculate the minimum sample size per variant.
Args:
baseline_rate: Current conversion rate (e.g., 0.03 for 3%)
minimum_detectable_effect: Relative improvement to detect (e.g., 0.05 for 5%)
significance_level: Probability of false positive (default 5%)
power: Probability of detecting a true effect (default 80%)
Returns:
Required sample size per variant (not total)
"""
p1 = baseline_rate
p2 = baseline_rate * (1 + minimum_detectable_effect)
z_alpha = norm.ppf(1 - significance_level / 2)
z_beta = norm.ppf(power)
p_avg = (p1 + p2) / 2
numerator = (
z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
+ z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
) ** 2
denominator = (p2 - p1) ** 2
return math.ceil(numerator / denominator)
# Example Calculations
# Scenario 1: Testing add-to-cart rate improvement
size = calculate_sample_size(
baseline_rate=0.05, # 5% current add-to-cart rate
minimum_detectable_effect=0.10, # detect a 10% relative lift (5% → 5.5%)
)
print(f"Need {size:,} visitors per variant")
# => Need 31,234 visitors per variant
# Scenario 2: Small effect on low-converting page
size = calculate_sample_size(
baseline_rate=0.02, # 2% conversion rate
minimum_detectable_effect=0.05, # detect 5% relative lift (2% → 2.1%)
)
print(f"Need {size:,} visitors per variant")
# => Need 315,206 visitors per variant
The numbers are often sobering. Small sites with 1,000 visitors per day need weeks or months to reach significance. This is why you should:
- Test big, bold changes (not button colour tweaks)
- Focus on high-traffic pages first
- Use add-to-cart rate (higher base rate) over purchase rate when possible
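To see how sobering in concrete terms, convert the required sample size into a test duration for your own traffic. A rough sketch reusing calculate_sample_size from above; the 1,000 visitors per day and the assumption that all traffic enters the experiment are placeholders:

import math

def estimated_test_duration_days(
    required_per_variant: int,
    daily_visitors: int,
    traffic_in_test: float = 1.0,  # share of traffic routed into the experiment
) -> int:
    """Rough number of days needed to fill both variants of a 50/50 test."""
    visitors_needed = 2 * required_per_variant
    return math.ceil(visitors_needed / (daily_visitors * traffic_in_test))

size = calculate_sample_size(baseline_rate=0.05, minimum_detectable_effect=0.10)
days = estimated_test_duration_days(size, daily_visitors=1_000)
print(f"At 1,000 visitors/day: about {days} days")  # roughly two months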
# Step 3: Collect and Structure Test Data
import pandas as pd
def load_experiment_data(filepath: str) -> pd.DataFrame:
"""Load A/B test event data.
Expected columns: visitor_id, variant (A or B), converted (0 or 1), timestamp
"""
df = pd.read_csv(filepath, parse_dates=["timestamp"])
# Validate data quality
assert set(df["variant"].unique()) <= {"A", "B"}, "Unexpected variants"
assert df["converted"].isin([0, 1]).all(), "Converted must be 0 or 1"
assert df["visitor_id"].is_unique, "Duplicate visitor IDs detected"
return df
def summarise_experiment(df: pd.DataFrame) -> pd.DataFrame:
"""Calculate conversion rate and sample size per variant."""
summary = df.groupby("variant").agg(
visitors=("visitor_id", "count"),
conversions=("converted", "sum"),
)
summary["conversion_rate"] = summary["conversions"] / summary["visitors"]
summary["conversion_rate_pct"] = summary["conversion_rate"] * 100
return summary
# Example Output
df = load_experiment_data("experiment_results.csv")
summary = summarise_experiment(df)
print(summary)
visitors conversions conversion_rate conversion_rate_pct
variant
A 15234 762 0.0500 5.00
B 15198 836 0.0550 5.50
Variant B shows 5.50% vs 5.00% — a 10% relative lift. But is it statistically significant?
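Before running any significance test, it is also worth confirming that the traffic split actually came out close to 50/50. A badly skewed split usually means assignment or tracking is broken, and the results cannot be trusted. A minimal sketch using a binomial test; the 0.001 threshold is a common but arbitrary cut-off:

from scipy.stats import binomtest

def check_sample_ratio(visitors_a: int, visitors_b: int, expected_share: float = 0.5) -> bool:
    """Flag a suspicious traffic split between the two variants."""
    result = binomtest(visitors_a, visitors_a + visitors_b, expected_share)
    if result.pvalue < 0.001:
        print(f"Warning: split looks off ({visitors_a:,} vs {visitors_b:,}, p={result.pvalue:.4f})")
        return False
    return True

check_sample_ratio(15234, 15198)  # => True, consistent with a 50/50 split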
# Step 4: Statistical Significance Testing
# Frequentist Approach (Z-Test for Proportions)
from statsmodels.stats.proportion import proportions_ztest
import numpy as np
def test_significance(
conversions_a: int,
visitors_a: int,
conversions_b: int,
visitors_b: int,
significance_level: float = 0.05,
) -> dict:
"""Run a two-proportion z-test.
Returns test results with p-value, confidence interval, and recommendation.
"""
count = np.array([conversions_a, conversions_b])
nobs = np.array([visitors_a, visitors_b])
z_stat, p_value = proportions_ztest(count, nobs, alternative="two-sided")
rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b
relative_lift = (rate_b - rate_a) / rate_a
significant = p_value < significance_level
return {
"control_rate": round(rate_a, 4),
"test_rate": round(rate_b, 4),
"relative_lift": round(relative_lift, 4),
"z_statistic": round(z_stat, 4),
"p_value": round(p_value, 4),
"significant": significant,
"recommendation": (
f"Variant B wins with {relative_lift:.1%} lift (p={p_value:.4f})"
if significant and rate_b > rate_a
else f"No significant difference (p={p_value:.4f})"
),
}
# Running the Test
result = test_significance(
conversions_a=762,
visitors_a=15234,
conversions_b=836,
visitors_b=15198,
)
print(result)
{
"control_rate": 0.05,
"test_rate": 0.055,
"relative_lift": 0.0988,
"z_statistic": -2.0148,
"p_value": 0.0439,
"significant": True,
"recommendation": "Variant B wins with 9.9% lift (p=0.0439)"
}
A p-value of 0.0439 means that if there were truly no difference between the variants, a gap at least this large would show up by chance about 4.4% of the time. Since that is below our 5% threshold, the result is statistically significant.
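The package list earlier mentioned the chi-squared test. For a 2x2 table of converted versus not converted it is equivalent to the two-proportion z-test (the squared z-statistic equals the chi-squared statistic when Yates' continuity correction is switched off), so it makes a handy cross-check. A sketch:

import numpy as np
from scipy.stats import chi2_contingency

def chi_squared_check(conversions_a, visitors_a, conversions_b, visitors_b) -> float:
    """Cross-check the z-test with a chi-squared test on the 2x2 contingency table."""
    table = np.array([
        [conversions_a, visitors_a - conversions_a],
        [conversions_b, visitors_b - conversions_b],
    ])
    # correction=False makes the p-value match the two-proportion z-test
    _, p_value, _, _ = chi2_contingency(table, correction=False)
    return p_value

chi_squared_check(762, 15234, 836, 15198)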
# Step 5: Confidence Intervals
P-values tell you whether there is a difference. Confidence intervals tell you how big it might be:
from statsmodels.stats.proportion import confint_proportions_2indep
def calculate_confidence_interval(
conversions_a: int,
visitors_a: int,
conversions_b: int,
visitors_b: int,
confidence_level: float = 0.95,
) -> dict:
"""Calculate confidence interval for the difference in proportions."""
rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b
diff = rate_b - rate_a
ci_low, ci_high = confint_proportions_2indep(
conversions_b, visitors_b,
conversions_a, visitors_a,
method="wald",
alpha=1 - confidence_level,
)
return {
"difference": round(diff, 4),
"ci_lower": round(ci_low, 4),
"ci_upper": round(ci_high, 4),
"interpretation": (
f"The true difference is between {ci_low:.2%} and {ci_high:.2%} "
f"(95% confidence)"
),
}
ci = calculate_confidence_interval(762, 15234, 836, 15198)
print(ci["interpretation"])
# => The true difference is between 0.02% and 0.98% (95% confidence)
If the confidence interval does not include zero, the result is significant. The width tells you how precise your estimate is: a wide interval means you need more data to pin the effect down.
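The interval above is on the absolute difference in rates. If stakeholders think in relative lift, one quick way to get an interval on that scale is simulation: draw many plausible outcomes for each variant and look at the spread of the implied lift. A rough parametric-bootstrap sketch, not a formal method from any library; the 10,000 draws and the fixed seed are arbitrary choices:

import numpy as np

def simulated_relative_lift_interval(
    conversions_a: int, visitors_a: int,
    conversions_b: int, visitors_b: int,
    n_draws: int = 10_000,
    seed: int = 42,
) -> tuple:
    """Approximate 95% interval for relative lift via resampled conversion counts."""
    rng = np.random.default_rng(seed)
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    draws_a = rng.binomial(visitors_a, rate_a, n_draws) / visitors_a
    draws_b = rng.binomial(visitors_b, rate_b, n_draws) / visitors_b
    lift = (draws_b - draws_a) / draws_a
    return tuple(np.percentile(lift, [2.5, 97.5]))

low, high = simulated_relative_lift_interval(762, 15234, 836, 15198)
print(f"Relative lift is roughly between {low:.1%} and {high:.1%}")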
# Step 6: Common Pitfalls
# Pitfall 1: Peeking at Results Early
def is_safe_to_check(df: pd.DataFrame, required_per_variant: int) -> bool:
"""Only analyse results after reaching the required sample size."""
counts = df.groupby("variant")["visitor_id"].count()
min_count = counts.min()
if min_count < required_per_variant:
remaining = required_per_variant - min_count
print(f"Waiting — need {remaining:,} more visitors per variant. Do not check results yet.")
return False
print(f"Ready — reached {min_count:,} per variant (required: {required_per_variant:,}). Safe to analyse.")
return True
Checking results daily and stopping when they look good inflates your false positive rate from 5% to 25% or more.
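You can see the inflation yourself with a small simulation: run many A/A tests (identical variants, so any "win" is a false positive), peek every day, and stop the moment p drops below 0.05. The parameters below (20 daily peeks, 1,000 visitors per variant per day, a 5% base rate, 2,000 simulated experiments) are arbitrary choices for illustration:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def peeking_false_positive_rate(
    n_experiments: int = 2_000,
    days: int = 20,
    daily_visitors: int = 1_000,
    base_rate: float = 0.05,
    seed: int = 0,
) -> float:
    """Share of A/A tests wrongly declared significant when peeking daily."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            n_a += daily_visitors
            n_b += daily_visitors
            conv_a += rng.binomial(daily_visitors, base_rate)
            conv_b += rng.binomial(daily_visitors, base_rate)
            _, p = proportions_ztest(np.array([conv_a, conv_b]), np.array([n_a, n_b]))
            if p < 0.05:  # stop as soon as the result "looks good"
                false_positives += 1
                break
    return false_positives / n_experiments

print(f"False positive rate with daily peeking: {peeking_false_positive_rate():.0%}")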
# Pitfall 2: Testing Too Many Variants
| Variants | Adjusted Significance Level | Required Sample (per variant, 5% baseline, 10% MDE) |
|---|---|---|
| 2 (A/B) | 0.050 | ~31,000 |
| 3 (A/B/C) | 0.025 (Bonferroni) | ~38,000 |
| 4 (A/B/C/D) | 0.017 | ~42,000 |
More variants means more traffic needed. For most ecommerce sites, stick to two variants.
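If you do run more than two variants, the adjustment itself is mechanical: compare each challenger against the control and test at the corrected threshold. A sketch reusing test_significance from Step 4; the variant C counts are hypothetical:

def test_multiple_variants(control: tuple, challengers: dict, alpha: float = 0.05) -> None:
    """Compare each challenger to the control at a Bonferroni-corrected level."""
    adjusted_alpha = alpha / len(challengers)
    conversions_control, visitors_control = control
    for name, (conversions, visitors) in challengers.items():
        result = test_significance(
            conversions_control, visitors_control,
            conversions, visitors,
            significance_level=adjusted_alpha,
        )
        print(f"{name}: p={result['p_value']}, significant at {adjusted_alpha:.3f}: {result['significant']}")

test_multiple_variants(
    control=(762, 15234),
    challengers={"B": (836, 15198), "C": (801, 15260)},
)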
# Pitfall 3: Ignoring Segment Effects
def check_segment_consistency(df: pd.DataFrame, segment_col: str) -> pd.DataFrame:
"""Verify the treatment effect is consistent across segments."""
results = []
for segment, group in df.groupby(segment_col):
summary = summarise_experiment(group)
if len(summary) == 2:
rate_a = summary.loc["A", "conversion_rate"]
rate_b = summary.loc["B", "conversion_rate"]
results.append({
"segment": segment,
"rate_a": rate_a,
"rate_b": rate_b,
"lift": (rate_b - rate_a) / rate_a if rate_a > 0 else 0,
"n_a": summary.loc["A", "visitors"],
"n_b": summary.loc["B", "visitors"],
})
return pd.DataFrame(results)
A test that improves conversion on desktop but worsens it on mobile can show a net positive — while hurting your fastest-growing segment.
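Usage is straightforward if your event export carries a segment column; device_type here is a hypothetical column name:

# Hypothetical segment column; adapt to whatever your tracking provides
segment_report = check_segment_consistency(df, segment_col="device_type")
print(segment_report.sort_values("lift"))

# Segments where the test variant loses deserve a closer look before rollout
losing = segment_report[segment_report["lift"] < 0]
if not losing.empty:
    print("Variant B underperforms for:", ", ".join(losing["segment"].astype(str)))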
# What This Replaces
| Before (Guesswork) | After (A/B Testing) |
|---|---|
| "The new design looks better" | "Variant B increased add-to-cart by 9.9% (p=0.04)" |
| Changes based on opinion | Changes backed by statistical evidence |
| No idea if a change helped or hurt | Clear pass/fail with confidence intervals |
| Season or trend mistaken for improvement | Controlled experiment isolates the change |
| Test everything at once | One change per test, measured precisely |
| Celebrate after one week | Wait for statistical significance |
# Next Steps
A/B testing is one piece of the data-driven ecommerce stack:
- Feed test results into automated reporting pipelines for stakeholder visibility
- Use the same statistical approach to improve conversion funnels
- Fix site speed issues first — slow pages invalidate test results
- Store experiment data in reliable pipelines for long-term analysis
Start with your highest-traffic product page. Calculate the sample size. Run one clean experiment. The discipline of "prove it with data" will change how your team makes decisions.
Need help setting up A/B testing infrastructure for your ecommerce store? Get in touch or explore our ecommerce optimisation services.