A/B Testing for Ecommerce: Using Data to Optimise Product Pages


Run statistically valid A/B tests on your ecommerce product pages. Covers experiment design, sample size calculation, significance testing, and common pitfalls with Python.


You redesign a product page. Conversion rate goes up. The team celebrates. A week later, conversion drops back to where it was — or lower. What happened?

Without proper A/B testing, you cannot tell the difference between a real improvement and random noise. Most ecommerce teams make changes based on gut feeling, then attribute any movement in metrics to whatever they changed last. That is not optimisation — it is superstition with a dashboard.

This guide builds a rigorous A/B testing workflow for ecommerce product pages: from experiment design to statistical analysis, all in Python.

# Who This Is For

  • Ecommerce managers making product page changes without knowing if they actually help
  • Marketing teams who want to stop guessing and start measuring what converts
  • Vibe coders building tools or dashboards that need data-backed decisions behind them
  • Store owners spending money on redesigns without proof they generate more revenue

You do not need to be a statistician. This guide explains the maths in plain language and gives you Python code that handles the calculations. If you can read a chart, you can run an A/B test properly.

# How A/B Testing Works

```mermaid
flowchart LR
  V["Visitor arrives"] --> S["Split\n(50/50)"]
  S --> A["Variant A\n(Control)"]
  S --> B["Variant B\n(Test)"]
  A --> MA["Measure\nConversion"]
  B --> MB["Measure\nConversion"]
  MA --> AN["Analyse\n(Statistical test)"]
  MB --> AN
  AN --> D{"Significant?"}
  D -- Yes --> R["Roll out winner"]
  D -- No --> C["Continue or stop"]
```

Split traffic evenly. Measure the same metric for both groups. Use statistics to determine whether the difference is real or random variation.
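
That "split" step has one requirement people miss: assignment must be random but sticky, so a returning visitor always sees the same variant. A minimal sketch of the usual approach, hashing the visitor ID (the experiment name here is a made-up salt):

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str = "pdp-image-test") -> str:
    """Deterministically assign a visitor to A or B.

    Hashing (experiment name + visitor ID) gives a stable, roughly even
    50/50 split without storing assignments anywhere.
    """
    digest = hashlib.md5(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

print(assign_variant("visitor-123"))  # Same visitor always gets the same variant
```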

# What You Will Need

```bash
pip install scipy numpy pandas statsmodels
```

  • scipy — statistical distributions (used here for the normal quantiles in the sample size calculation)
  • numpy — numerical calculations
  • pandas — data manipulation
  • statsmodels — proportion z-tests and confidence intervals

# Step 1: Define the Experiment

Before writing any code, answer these four questions:

| Question | Example Answer |
| --- | --- |
| What are you testing? | New product image layout (larger hero, lifestyle shots) |
| What metric decides the winner? | Add-to-cart rate (primary), bounce rate (secondary) |
| How much lift is meaningful? | 5% relative improvement (e.g., 3.0% → 3.15%) |
| How long will you run it? | Until we hit the required sample size (minimum 2 weeks) |
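
It helps to pin those answers down in code before launch, so nobody quietly redefines success mid-test. A minimal sketch (the class and field names are our own convention, not a framework API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentPlan:
    """Written down before launch, so success cannot be redefined mid-test."""
    hypothesis: str
    primary_metric: str
    minimum_detectable_effect: float  # relative lift, e.g. 0.05 for 5%
    min_duration_days: int

plan = ExperimentPlan(
    hypothesis="Larger hero image increases add-to-cart rate",
    primary_metric="add_to_cart_rate",
    minimum_detectable_effect=0.05,
    min_duration_days=14,
)
```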

# Common Ecommerce Metrics

| Metric | Calculation | Typical Range |
| --- | --- | --- |
| Add-to-cart rate | Carts / Product page views | 3–8% |
| Checkout initiation rate | Checkouts / Carts | 30–60% |
| Purchase conversion rate | Purchases / Sessions | 1–4% |
| Revenue per visitor | Total revenue / Visitors | £1–5 |
| Bounce rate | Single-page sessions / Total sessions | 30–60% |

Pick one primary metric before the test starts. Changing the metric after seeing results is p-hacking.
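
For reference, here is how the table's ratios fall out of raw counts. A minimal sketch (function and argument names are illustrative, and the counts in the example call are invented):

```python
def funnel_metrics(page_views: int, carts: int, checkouts: int,
                   purchases: int, sessions: int, single_page_sessions: int,
                   visitors: int, revenue: float) -> dict:
    """The five metrics from the table above, computed from raw counts."""
    return {
        "add_to_cart_rate": carts / page_views,
        "checkout_initiation_rate": checkouts / carts,
        "purchase_conversion_rate": purchases / sessions,
        "revenue_per_visitor": revenue / visitors,
        "bounce_rate": single_page_sessions / sessions,
    }

# Invented counts that land inside the typical ranges above
print(funnel_metrics(page_views=10_000, carts=500, checkouts=200,
                     purchases=150, sessions=12_000,
                     single_page_sessions=5_000, visitors=9_000,
                     revenue=18_000.0))
```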

# Step 2: Calculate Sample Size

Running a test without knowing the required sample size is the most common mistake. Too few visitors and you will never detect a real effect:

```python
from scipy.stats import norm
import math

def calculate_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    significance_level: float = 0.05,
    power: float = 0.80,
) -> int:
    """Calculate the minimum sample size per variant.

    Args:
        baseline_rate: Current conversion rate (e.g., 0.03 for 3%)
        minimum_detectable_effect: Relative improvement to detect (e.g., 0.05 for 5%)
        significance_level: Probability of false positive (default 5%)
        power: Probability of detecting a true effect (default 80%)

    Returns:
        Required sample size per variant (not total)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)

    z_alpha = norm.ppf(1 - significance_level / 2)
    z_beta = norm.ppf(power)

    p_avg = (p1 + p2) / 2

    numerator = (
        z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)
```

# Example Calculations

```python
# Scenario 1: Testing add-to-cart rate improvement
size = calculate_sample_size(
    baseline_rate=0.05,       # 5% current add-to-cart rate
    minimum_detectable_effect=0.10,  # detect a 10% relative lift (5% → 5.5%)
)
print(f"Need {size:,} visitors per variant")
# => Need 31,234 visitors per variant

# Scenario 2: Small effect on low-converting page
size = calculate_sample_size(
    baseline_rate=0.02,       # 2% conversion rate
    minimum_detectable_effect=0.05,  # detect 5% relative lift (2% → 2.1%)
)
print(f"Need {size:,} visitors per variant")
# => Need 315,206 visitors per variant
```

The numbers are often sobering. Small sites with 1,000 visitors per day need weeks or months to reach significance (the sketch after this list turns sample size into days). This is why you should:

  • Test big, bold changes (not button colour tweaks)
  • Focus on high-traffic pages first
  • Use add-to-cart rate (higher base rate) over purchase rate when possible
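
The duration arithmetic is worth making explicit. A minimal sketch, assuming traffic splits evenly across variants:

```python
import math

def test_duration_days(sample_per_variant: int, daily_visitors: int,
                       n_variants: int = 2) -> int:
    """Days needed for every variant to reach the required sample size,
    assuming traffic is split evenly across variants."""
    visitors_per_variant_per_day = daily_visitors / n_variants
    return math.ceil(sample_per_variant / visitors_per_variant_per_day)

# Scenario 1 above (31,234 per variant) on a site with 1,000 daily visitors:
print(test_duration_days(31_234, daily_visitors=1_000))  # => 63 days
```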

# Step 3: Collect and Structure Test Data

```python
import pandas as pd

def load_experiment_data(filepath: str) -> pd.DataFrame:
    """Load A/B test event data.

    Expected columns: visitor_id, variant (A or B), converted (0 or 1), timestamp
    """
    df = pd.read_csv(filepath, parse_dates=["timestamp"])

    # Validate data quality
    assert set(df["variant"].unique()) <= {"A", "B"}, "Unexpected variants"
    assert df["converted"].isin([0, 1]).all(), "Converted must be 0 or 1"
    assert df["visitor_id"].is_unique, "Duplicate visitor IDs detected"

    return df

def summarise_experiment(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate conversion rate and sample size per variant."""
    summary = df.groupby("variant").agg(
        visitors=("visitor_id", "count"),
        conversions=("converted", "sum"),
    )
    summary["conversion_rate"] = summary["conversions"] / summary["visitors"]
    summary["conversion_rate_pct"] = summary["conversion_rate"] * 100
    return summary
```

# Example Output

```python
df = load_experiment_data("experiment_results.csv")
summary = summarise_experiment(df)
print(summary)
```

```text
         visitors  conversions  conversion_rate  conversion_rate_pct
variant
A           15234          762            0.0500                5.00
B           15198          836            0.0550                5.50
```

Variant B shows 5.50% vs 5.00% — a 10% relative lift. But is it statistically significant?

# Step 4: Statistical Significance Testing

# Frequentist Approach (Z-Test for Proportions)

```python
from statsmodels.stats.proportion import proportions_ztest
import numpy as np

def test_significance(
    conversions_a: int,
    visitors_a: int,
    conversions_b: int,
    visitors_b: int,
    significance_level: float = 0.05,
) -> dict:
    """Run a two-proportion z-test.

    Returns test results with p-value, confidence interval, and recommendation.
    """
    count = np.array([conversions_a, conversions_b])
    nobs = np.array([visitors_a, visitors_b])

    z_stat, p_value = proportions_ztest(count, nobs, alternative="two-sided")

    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    relative_lift = (rate_b - rate_a) / rate_a

    significant = p_value < significance_level

    return {
        "control_rate": round(rate_a, 4),
        "test_rate": round(rate_b, 4),
        "relative_lift": round(relative_lift, 4),
        "z_statistic": round(z_stat, 4),
        "p_value": round(p_value, 4),
        "significant": significant,
        "recommendation": (
            f"Variant B wins with {relative_lift:.1%} lift (p={p_value:.4f})"
            if significant and rate_b > rate_a
            else f"No significant difference (p={p_value:.4f})"
        ),
    }
```

# Running the Test

```python
result = test_significance(
    conversions_a=762,
    visitors_a=15234,
    conversions_b=836,
    visitors_b=15198,
)
print(result)
```

```python
{
    "control_rate": 0.05,
    "test_rate": 0.055,
    "relative_lift": 0.0997,
    "z_statistic": -1.9503,
    "p_value": 0.0511,
    "significant": False,
    "recommendation": "No significant difference (p=0.0511)"
}
```

A p-value of 0.0511 answers a specific question: if there were truly no difference between the variants, how often would random chance produce a gap this large? The answer is about 5.1% of the time. That sits just above our 5% threshold, so despite the healthy-looking 10% observed lift, the result narrowly misses significance. The right move is to keep the test running until you reach the planned sample size, not to declare a winner.

# Step 5: Confidence Intervals

P-values tell you whether there is a difference. Confidence intervals tell you how big it might be:

```python
from statsmodels.stats.proportion import confint_proportions_2indep

def calculate_confidence_interval(
    conversions_a: int,
    visitors_a: int,
    conversions_b: int,
    visitors_b: int,
    confidence_level: float = 0.95,
) -> dict:
    """Calculate confidence interval for the difference in proportions."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    diff = rate_b - rate_a

    ci_low, ci_high = confint_proportions_2indep(
        conversions_b, visitors_b,
        conversions_a, visitors_a,
        method="wald",
        alpha=1 - confidence_level,
    )

    return {
        "difference": round(diff, 4),
        "ci_lower": round(ci_low, 4),
        "ci_upper": round(ci_high, 4),
        "interpretation": (
            f"The true difference is between {ci_low:.2%} and {ci_high:.2%} "
            f"({confidence_level:.0%} confidence)"
        ),
    }
```

```python
ci = calculate_confidence_interval(762, 15234, 836, 15198)
print(ci["interpretation"])
# => The true difference is between -0.00% and 1.00% (95% confidence)
```

If the confidence interval does not include zero, the result is significant. Here the interval just barely includes zero, which agrees with the z-test above. The width tells you how precise the estimate is: a wide interval means you need more data to pin the effect down.

# Step 6: Common Pitfalls

# Pitfall 1: Peeking at Results Early

```python
def is_safe_to_check(df: pd.DataFrame, required_per_variant: int) -> bool:
    """Only analyse results after reaching the required sample size."""
    counts = df.groupby("variant")["visitor_id"].count()
    min_count = counts.min()

    if min_count < required_per_variant:
        remaining = required_per_variant - min_count
        print(f"Waiting — need {remaining:,} more visitors per variant. Do not check results yet.")
        return False

    print(f"Ready — reached {min_count:,} per variant (required: {required_per_variant:,}). Safe to analyse.")
    return True
```

Checking results daily and stopping when they look good inflates your false positive rate from 5% to 25% or more.
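
You can verify that inflation yourself with a quick Monte Carlo: simulate A/A tests (both variants identical, so any "win" is a false positive), peek daily, and count how often any peek crosses the threshold. A sketch with illustrative traffic numbers:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims: int = 500, days: int = 20,
                                daily_n: int = 500, rate: float = 0.05) -> float:
    """Fraction of A/A tests that look 'significant' at any daily peek."""
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            # Both variants draw from the SAME conversion rate
            conv_a += rng.binomial(daily_n, rate)
            conv_b += rng.binomial(daily_n, rate)
            n_a += daily_n
            n_b += daily_n
            _, p = proportions_ztest([conv_a, conv_b], [n_a, n_b])
            if p < 0.05:  # stop at the first peek that "looks significant"
                false_positives += 1
                break
    return false_positives / n_sims

# A single look at the end would be wrong ~5% of the time; daily peeking
# pushes this far higher.
print(f"False positive rate with daily peeking: {peeking_false_positive_rate():.0%}")
```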

# Pitfall 2: Testing Too Many Variants

| Variants | Adjusted Significance Level | Required Sample (per variant) |
| --- | --- | --- |
| 2 (A/B) | 0.050 | ~31,000 |
| 3 (A/B/C) | 0.025 (Bonferroni) | ~38,000 |
| 4 (A/B/C/D) | 0.017 | ~42,000 |

(Required samples assume the Step 2 scenario: 5% baseline, 10% relative MDE, 80% power.)

More variants means more traffic needed. For most ecommerce sites, stick to two variants.
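
The Bonferroni adjustment itself is one line: divide the significance level by the number of comparisons against control, then feed it back into `calculate_sample_size` from Step 2. A sketch:

```python
def bonferroni_alpha(significance_level: float, n_variants: int) -> float:
    """Adjusted alpha for (n_variants - 1) comparisons against control."""
    return significance_level / (n_variants - 1)

# Three variants means two comparisons against control
adjusted = bonferroni_alpha(0.05, n_variants=3)  # 0.025
size = calculate_sample_size(0.05, 0.10, significance_level=adjusted)
print(f"Need {size:,} per variant at alpha={adjusted}")  # ~38,000
```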

# Pitfall 3: Ignoring Segment Effects

```python
def check_segment_consistency(df: pd.DataFrame, segment_col: str) -> pd.DataFrame:
    """Verify the treatment effect is consistent across segments."""
    results = []

    for segment, group in df.groupby(segment_col):
        summary = summarise_experiment(group)

        if len(summary) == 2:
            rate_a = summary.loc["A", "conversion_rate"]
            rate_b = summary.loc["B", "conversion_rate"]
            results.append({
                "segment": segment,
                "rate_a": rate_a,
                "rate_b": rate_b,
                "lift": (rate_b - rate_a) / rate_a if rate_a > 0 else 0,
                "n_a": summary.loc["A", "visitors"],
                "n_b": summary.loc["B", "visitors"],
            })

    return pd.DataFrame(results)
```

A test that improves conversion on desktop but worsens it on mobile can show a net positive — while hurting your fastest-growing segment.
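
Usage is one call per dimension; the `device_type` column here is an assumption, so substitute whatever segment columns your analytics export provides:

```python
# Assumes df has a "device_type" column alongside variant/converted
segments = check_segment_consistency(df, segment_col="device_type")
print(segments)
# A positive overall lift with a negative lift on one segment is a red flag:
# investigate before rolling the winner out site-wide.
```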

# What This Replaces

| Before (Guesswork) | After (A/B Testing) |
| --- | --- |
| "The new design looks better" | "Variant B increased add-to-cart by 9.9% (p=0.04)" |
| Changes based on opinion | Changes backed by statistical evidence |
| No idea if a change helped or hurt | Clear pass/fail with confidence intervals |
| Season or trend mistaken for improvement | Controlled experiment isolates the change |
| Test everything at once | One change per test, measured precisely |
| Celebrate after one week | Wait for statistical significance |

# Next Steps

A/B testing is one piece of the data-driven ecommerce stack.

Start with your highest-traffic product page. Calculate the sample size. Run one clean experiment. The discipline of "prove it with data" will change how your team makes decisions.

Need help setting up A/B testing infrastructure for your ecommerce store? Get in touch or explore our ecommerce optimisation services.

