A/B Testing in Python: Power Analysis, Frequentist and Bayesian Methods with SciPy and PyMC

Learn how to run A/B tests in Python from start to finish — power analysis with statsmodels, frequentist hypothesis testing with SciPy 1.17, and Bayesian analysis with PyMC 5.28. Includes working code, decision frameworks, and common pitfalls to avoid.

Why A/B Testing Is a Core Data Science Skill

You've built a new recommendation engine, redesigned a checkout flow, or tweaked pricing. The question that actually matters isn't whether the change looks better on a dashboard — it's whether the difference is real or just noise. A/B testing gives you a principled framework for answering that question, and Python gives you every tool you need to run the analysis end to end.

This guide walks you through the full A/B testing workflow in Python — from calculating the sample size you actually need, through running frequentist tests with SciPy 1.17, to building a Bayesian model in PyMC 5.28 that gives you a direct probability of one variant beating another. Every code block is self-contained, uses 2026-current library versions, and can be dropped straight into a Jupyter notebook.

Setting Up Your Environment

All examples use Python 3.12+ with the following packages:

pip install numpy pandas scipy statsmodels pymc arviz matplotlib

Go ahead and import everything up front so you're not hunting for missing imports later:

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportions_ztest
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

np.random.seed(42)

Step 1: Define Your Hypothesis and Metrics

Every A/B test starts with a clear hypothesis. Vague goals lead to vague results — I've seen teams burn weeks of traffic on tests that couldn't answer anything because nobody wrote down what "better" meant. So, write it down before you touch any code.

Formulating Hypotheses

Use the standard null/alternative framework:

  • Null hypothesis (H₀): There's no difference between the control (A) and the variant (B). Any observed difference is due to random chance.
  • Alternative hypothesis (H₁): The variant produces a measurably different outcome — typically a higher conversion rate, lower churn, or increased revenue per user.

For example, if you're testing a new onboarding flow:

  • H₀: The new onboarding flow has the same signup completion rate as the current flow.
  • H₁: The new onboarding flow has a higher signup completion rate.

Choosing Your Primary Metric

Pick one primary metric before the test starts. Just one. Common choices include:

Metric Type         | Example                              | Statistical Test
--------------------|--------------------------------------|-------------------------------------
Binary (proportion) | Conversion rate, click-through rate  | Z-test for proportions
Continuous (mean)   | Revenue per user, session duration   | Two-sample t-test or Welch's t-test
Count data          | Number of page views per session     | Poisson or negative binomial test
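
For the count-data row, a quick large-sample option is a z-test on the two Poisson rates, using the fact that the variance of a Poisson sample mean is λ/n. The page-view numbers below are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Simulated count metric: page views per session
views_control = rng.poisson(4.0, size=5000)
views_variant = rng.poisson(4.2, size=5000)

# Large-sample z-test for the difference of two Poisson means:
# Var(sample mean) ~= lambda / n for Poisson data
rate_c = views_control.mean()
rate_v = views_variant.mean()
se = np.sqrt(rate_c / len(views_control) + rate_v / len(views_variant))
z = (rate_v - rate_c) / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"Control rate: {rate_c:.3f} views/session")
print(f"Variant rate: {rate_v:.3f} views/session")
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

If the counts are overdispersed (variance well above the mean), a negative binomial model is the safer choice.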

Adding secondary metrics is fine, but be aware of the multiple comparisons problem — the more metrics you test, the higher the chance of a false positive. Apply a Bonferroni correction or control the false discovery rate if you test more than one. (We'll cover how to do this in code later.)

Step 2: Power Analysis — Calculate the Sample Size You Need

Running a test without a power analysis is like flying blind. You need enough users to detect a meaningful difference, but not so many that you waste weeks waiting. Power analysis tells you exactly how many observations each group needs.

Honestly, this is the step most people skip — and it's the one that bites them hardest.

The Four Parameters

Every power calculation involves four numbers. Fix any three and solve for the fourth:

  1. Effect size: The minimum improvement you care about detecting. For proportions, this is often expressed as Cohen's h; for continuous outcomes, as Cohen's d.
  2. Significance level (α): The false positive rate — usually 0.05.
  3. Statistical power (1 − β): The probability of correctly detecting a real effect — usually 0.80 or 0.90.
  4. Sample size (n): The number of observations per group.

Sample Size for a Proportions Test

Suppose your current conversion rate is 12% and you want to detect an improvement to at least 14% (a 2 percentage-point lift). Here's how to calculate the required sample size:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline and expected conversion rates
p_control = 0.12
p_variant = 0.14

# Cohen's h effect size for two proportions (variant first, so h comes out positive)
effect_size = proportion_effectsize(p_variant, p_control)
print(f"Cohen's h effect size: {effect_size:.4f}")

# Solve for sample size
power_analysis = NormalIndPower()
sample_size = power_analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,           # equal group sizes
    alternative="two-sided"
)
print(f"Required sample size per group: {int(np.ceil(sample_size))}")
print(f"Total participants needed: {int(np.ceil(sample_size)) * 2}")

For this example, you'd need roughly 4,433 users per group (about 8,866 total). That number often surprises people — small effects require large samples. If your site gets 500 visitors a day, you're looking at about two and a half weeks of data collection.

Sample Size for a Continuous Metric

If your primary metric is continuous (for example, average order value), use TTestIndPower instead:

from statsmodels.stats.power import TTestIndPower

# Cohen's d: (mean_B - mean_A) / pooled_std
# Suppose you want to detect a $3 lift on a $50 average with std=$15
cohens_d = 3 / 15  # 0.20, a "small" effect

power_analysis = TTestIndPower()
sample_size = power_analysis.solve_power(
    effect_size=cohens_d,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided"
)
print(f"Required sample size per group: {int(np.ceil(sample_size))}")

Visualize How Sample Size Changes with Effect Size

Power curves are great for communicating trade-offs to stakeholders. The smaller the effect you want to detect, the more users you need — and this chart makes that relationship visceral:

effect_sizes = np.array([0.1, 0.2, 0.3, 0.5])
sample_range = np.arange(20, 2000)

fig, ax = plt.subplots(figsize=(9, 5))
power_obj = TTestIndPower()

for es in effect_sizes:
    powers = [power_obj.power(es, n, 0.05) for n in sample_range]
    ax.plot(sample_range, powers, label=f"d = {es}")

ax.axhline(0.8, color="gray", linestyle="--", label="80% power")
ax.set_xlabel("Sample size per group")
ax.set_ylabel("Statistical power")
ax.set_title("Power Curves for Two-Sample t-Test")
ax.legend()
plt.tight_layout()
plt.show()

Step 3: Simulate and Prepare Your Data

In a real project, your data comes from a logging pipeline or analytics platform. For this tutorial, we'll simulate a realistic dataset so you can run every code block locally without needing access to production data.

# Simulate an A/B test on signup conversion
n_control = 4000
n_variant = 4000

# True conversion rates (unknown to us in practice)
true_rate_control = 0.120
true_rate_variant = 0.138  # ~15% relative lift

control_conversions = np.random.binomial(1, true_rate_control, n_control)
variant_conversions = np.random.binomial(1, true_rate_variant, n_variant)

df = pd.DataFrame({
    "group": ["control"] * n_control + ["variant"] * n_variant,
    "converted": np.concatenate([control_conversions, variant_conversions])
})

summary = df.groupby("group")["converted"].agg(["mean", "sum", "count"])
print(summary)

variant_mean = df[df.group == "variant"]["converted"].mean()
control_mean = df[df.group == "control"]["converted"].mean()
print(f"\nObserved lift: {variant_mean / control_mean - 1:.1%}")

Step 4: Frequentist A/B Testing with SciPy

The frequentist approach asks: "If there were truly no difference between A and B, how likely is the data I observed?" If that probability (the p-value) is very low, you reject the null hypothesis.

It's the workhorse approach — fast, well-understood, and perfectly fine for most use cases.

Z-Test for Proportions (Conversion Rates)

For binary outcomes like conversion rates, the two-proportion z-test is the standard choice. Statsmodels gives you a clean implementation:

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Aggregate results
control = df[df.group == "control"]["converted"]
variant = df[df.group == "variant"]["converted"]

successes = np.array([control.sum(), variant.sum()])
nobs = np.array([len(control), len(variant)])

# Two-proportion z-test
z_stat, p_value = proportions_ztest(successes, nobs, alternative="two-sided")
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Confidence intervals for each group
ci_control = proportion_confint(successes[0], nobs[0], alpha=0.05, method="wilson")
ci_variant = proportion_confint(successes[1], nobs[1], alpha=0.05, method="wilson")
print(f"\nControl rate: {successes[0]/nobs[0]:.4f}  95% CI: [{ci_control[0]:.4f}, {ci_control[1]:.4f}]")
print(f"Variant rate: {successes[1]/nobs[1]:.4f}  95% CI: [{ci_variant[0]:.4f}, {ci_variant[1]:.4f}]")

if p_value < 0.05:
    print("\nResult is statistically significant at alpha = 0.05. Reject H0.")
else:
    print("\nResult is NOT statistically significant. Fail to reject H0.")

The Wilson confidence interval (specified by method="wilson") is preferred over the Wald interval for proportions because it provides better coverage, especially when the sample proportion is close to 0 or 1. It's one of those defaults you should just always use.
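
A quick way to see why: compare the two intervals in a small-sample edge case (the 1-conversion-out-of-20 numbers are purely illustrative):

```python
from statsmodels.stats.proportion import proportion_confint

# Edge case: 1 conversion out of 20 users
# method="normal" is the Wald interval; method="wilson" is the Wilson score interval
wald = proportion_confint(count=1, nobs=20, alpha=0.05, method="normal")
wilson = proportion_confint(count=1, nobs=20, alpha=0.05, method="wilson")

print(f"Wald:   [{wald[0]:.4f}, {wald[1]:.4f}]")
print(f"Wilson: [{wilson[0]:.4f}, {wilson[1]:.4f}]")
```

The Wald lower bound collapses toward zero (or below it), while the Wilson interval stays strictly inside (0, 1).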

Welch's t-Test for Continuous Metrics

When your primary metric is continuous — revenue per user, time on page, average order value — use Welch's t-test. Unlike the standard Student's t-test, it doesn't assume equal variances between groups, which makes it more robust in practice:

# Simulate continuous data: revenue per user
revenue_control = np.random.lognormal(mean=3.5, sigma=0.8, size=4000)
revenue_variant = np.random.lognormal(mean=3.55, sigma=0.8, size=4000)

t_stat, p_value = stats.ttest_ind(revenue_control, revenue_variant, equal_var=False)
print(f"Welch's t-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

mean_diff = revenue_variant.mean() - revenue_control.mean()
print(f"Mean difference: ${mean_diff:.2f}")
print(f"Control mean: ${revenue_control.mean():.2f}")
print(f"Variant mean: ${revenue_variant.mean():.2f}")

Confidence Interval for the Difference

A p-value tells you whether a difference exists, but the confidence interval tells you how large it might be — and that's usually what stakeholders actually care about:

from scipy.stats import sem

diff = revenue_variant.mean() - revenue_control.mean()
se_diff = np.sqrt(sem(revenue_control)**2 + sem(revenue_variant)**2)
ci_low = diff - 1.96 * se_diff
ci_high = diff + 1.96 * se_diff

print(f"Estimated lift: ${diff:.2f}")
print(f"95% CI for the difference: [${ci_low:.2f}, ${ci_high:.2f}]")

Step 5: Bayesian A/B Testing with PyMC

The Bayesian approach answers a fundamentally different question: "Given the data I observed, what is the probability that variant B is better than variant A?" Instead of a binary significant/not-significant verdict, you get a full probability distribution over the possible conversion rates — and a direct probability that one variant wins.

This is where things get really interesting, in my opinion.

Why Go Bayesian?

  • Direct probability statements: "There's a 94% probability that variant B converts higher" is far more actionable than "p = 0.03." Try explaining a p-value to a product manager — it's not fun.
  • No fixed sample size required: You can check results as data comes in without inflating your false positive rate (the so-called "peeking problem").
  • Incorporate prior knowledge: If you have historical data about typical conversion rates, you can encode that as a prior distribution.
  • Quantify uncertainty on the lift: You get a credible interval for the exact size of the improvement, not just a point estimate.

Building the Model in PyMC

We'll model each group's conversion rate as a Beta-distributed random variable, with Bernoulli-distributed observations:

import pymc as pm
import arviz as az

# Use the same simulated data from Step 3
control_data = df[df.group == "control"]["converted"].values
variant_data = df[df.group == "variant"]["converted"].values

with pm.Model() as ab_model:
    # Weakly informative priors — Beta(2, 18) centered around 10%
    p_control = pm.Beta("p_control", alpha=2, beta=18)
    p_variant = pm.Beta("p_variant", alpha=2, beta=18)

    # Likelihood
    obs_control = pm.Bernoulli("obs_control", p=p_control, observed=control_data)
    obs_variant = pm.Bernoulli("obs_variant", p=p_variant, observed=variant_data)

    # Derived quantities
    delta = pm.Deterministic("delta", p_variant - p_control)
    relative_lift = pm.Deterministic("relative_lift", delta / p_control)

    # Sample from the posterior
    trace = pm.sample(2000, tune=1000, random_seed=42, return_inferencedata=True)

The Beta(2, 18) prior is centered around 10%, which is a reasonable starting point for many web conversion rates. If your baseline is very different, adjust accordingly — but with 4,000+ observations, the prior gets overwhelmed by the data pretty quickly.

Interpreting the Posterior

ArviZ makes it straightforward to visualize and summarize the posterior distributions:

# Summary statistics
summary = az.summary(trace, var_names=["p_control", "p_variant", "delta", "relative_lift"])
print(summary)

# Posterior plots
az.plot_posterior(
    trace,
    var_names=["p_control", "p_variant"],
    hdi_prob=0.95,
    figsize=(12, 4)
)
plt.suptitle("Posterior Distributions of Conversion Rates", y=1.02)
plt.tight_layout()
plt.show()

Computing the Probability That B Beats A

This is the number your stakeholders actually want — and honestly, it's the most satisfying part of the whole analysis:

delta_samples = trace.posterior["delta"].values.flatten()

prob_b_better = (delta_samples > 0).mean()
print(f"Probability that variant B is better than A: {prob_b_better:.2%}")

# What is the expected lift?
mean_lift = delta_samples.mean()
hdi = az.hdi(trace, var_names=["delta"], hdi_prob=0.95)
print(f"Expected absolute lift: {mean_lift:.4f}")
print(f"95% HDI for the lift: [{hdi['delta'].values[0]:.4f}, {hdi['delta'].values[1]:.4f}]")

# Visualize the lift distribution
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(delta_samples, bins=80, density=True, alpha=0.7, color="steelblue")
ax.axvline(0, color="red", linestyle="--", label="No difference")
ax.set_xlabel("Conversion rate difference (B - A)")
ax.set_ylabel("Density")
ax.set_title(f"Posterior of delta — P(B > A) = {prob_b_better:.1%}")
ax.legend()
plt.tight_layout()
plt.show()

Decision Rules with Expected Loss

A probability alone doesn't account for the magnitude of a wrong decision. Expected loss gives you a risk-aware decision rule: choose the variant whose expected loss is lower. Think of it as asking "how bad could it be if I'm wrong?"

# Expected loss if we choose variant B but A is actually better
loss_choose_b = np.maximum(-delta_samples, 0).mean()
# Expected loss if we choose control A but B is actually better
loss_choose_a = np.maximum(delta_samples, 0).mean()

print(f"Expected loss from choosing variant B: {loss_choose_b:.5f}")
print(f"Expected loss from choosing control A: {loss_choose_a:.5f}")

if loss_choose_b < loss_choose_a:
    print("Recommend: Deploy variant B (lower expected loss)")
else:
    print("Recommend: Keep control A (lower expected loss)")

Step 6: Frequentist vs. Bayesian — When to Use Each

Neither approach is universally better. The right choice depends on your context:

Factor                | Frequentist                                    | Bayesian
----------------------|------------------------------------------------|------------------------------------------
Question answered     | "Is the difference statistically significant?" | "What is the probability B is better?"
Sample size           | Must be fixed in advance                       | Can monitor continuously
Output                | p-value + confidence interval                  | Full posterior distribution
Prior knowledge       | Not used                                       | Encoded as a prior distribution
Ease of explanation   | Harder (p-value is often misinterpreted)       | Easier ("94% chance B is better")
Computational cost    | Instantaneous                                  | Requires MCMC sampling (seconds to minutes)
Regulatory acceptance | Widely accepted in clinical trials, finance    | Growing acceptance, less established

In practice, many teams run both — and that's totally fine. The frequentist test gives you a quick pass/fail gate, while the Bayesian model helps you quantify how much better the winning variant actually is.

Common A/B Testing Pitfalls

Even experienced data scientists fall into these traps. I've personally made most of them at least once.

1. Peeking at Results Too Early

Checking your frequentist test every day and stopping as soon as p < 0.05 dramatically inflates your false positive rate. It's tempting — you're excited about the test — but don't do it. If you need to monitor results continuously, either use a sequential testing framework (such as a group-sequential design with an alpha-spending function) or switch to Bayesian methods, which handle continuous monitoring naturally.
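
A small simulation makes the inflation concrete. This sketch (the parameters are arbitrary) runs an A/A test with no true difference and compares a single final analysis against five interim looks with early stopping:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_sims = 2000
n_per_group = 5000
looks = [1000, 2000, 3000, 4000, 5000]  # interim sample sizes per group

fp_final = 0    # false positives from one analysis at the end
fp_peeking = 0  # false positives when stopping at the first significant look

for _ in range(n_sims):
    # A/A test: both groups share the same true 10% conversion rate
    a = rng.binomial(1, 0.10, n_per_group)
    b = rng.binomial(1, 0.10, n_per_group)

    # Analyze once at the end
    if stats.ttest_ind(a, b).pvalue < 0.05:
        fp_final += 1

    # Peek at each interim look and stop as soon as p < 0.05
    for n in looks:
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            fp_peeking += 1
            break

print(f"False positive rate, single final test: {fp_final / n_sims:.1%}")
print(f"False positive rate with peeking:       {fp_peeking / n_sims:.1%}")
```

The single-analysis rate lands near the nominal 5%, while the peeking rate comes out substantially higher even though there is no real effect.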

2. Underpowered Tests

Running a test with too few users means you probably won't detect a real effect even if it exists. Always run a power analysis first. If the required sample size is unrealistic for your traffic, increase the minimum detectable effect — don't just run the test anyway and hope for the best.
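
If traffic is the binding constraint, you can run the Step 2 calculation in reverse: fix the sample size you can afford and solve for the minimum detectable effect. The 1,500-per-group budget below is a made-up example:

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

n_per_group = 1500   # hypothetical traffic budget
baseline = 0.12

# Leave effect_size=None so solve_power returns it
mde_h = NormalIndPower().solve_power(
    effect_size=None,
    nobs1=n_per_group,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)

# Invert Cohen's h to get the smallest variant rate you could detect:
# h = 2*arcsin(sqrt(p2)) - 2*arcsin(sqrt(p1))
detectable_rate = np.sin(np.arcsin(np.sqrt(baseline)) + mde_h / 2) ** 2

print(f"Minimum detectable effect (Cohen's h): {mde_h:.4f}")
print(f"Smallest detectable variant rate: {detectable_rate:.4f}")
print(f"Relative lift required: {detectable_rate / baseline - 1:.1%}")
```

If the answer is a lift far larger than your change could plausibly produce, the test isn't worth running at that sample size.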

3. Multiple Comparisons Without Correction

Testing five metrics at alpha = 0.05 means roughly a 23% chance of at least one false positive. That's way too high. Use Bonferroni correction (divide alpha by the number of tests) or the Benjamini-Hochberg procedure to control the false discovery rate:

from statsmodels.stats.multitest import multipletests

# Suppose you tested 4 metrics and got these p-values
p_values = [0.03, 0.12, 0.007, 0.41]

rejected, corrected_p, _, _ = multipletests(p_values, method="fdr_bh")
for i, (orig, corr, rej) in enumerate(zip(p_values, corrected_p, rejected)):
    status = "Significant" if rej else "Not significant"
    print(f"Metric {i+1}: p={orig:.3f}, corrected p={corr:.3f}, {status}")

4. Simpson's Paradox and Confounders

If your randomization is broken — for example, more mobile users end up in the variant group — the observed difference might be due to the confounder, not your change. Always check that key user segments are balanced across groups before analyzing results. A quick crosstab by device type, geography, and new-vs-returning users can save you from a misleading conclusion.
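
A minimal sketch of that balance check, run on a simulated assignment log (the device mix and column names here are invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)

# Simulated assignment log with a device-type segment
log = pd.DataFrame({
    "group": rng.choice(["control", "variant"], size=8000),
    "device": rng.choice(["mobile", "desktop", "tablet"], size=8000, p=[0.6, 0.3, 0.1]),
})

# Crosstab of segment membership by group
counts = pd.crosstab(log["device"], log["group"])
print(counts)

# Chi-square test of independence: a tiny p-value flags broken randomization
chi2, p_value, dof, expected = stats.chi2_contingency(counts)
print(f"\nBalance check p-value: {p_value:.3f}")
if p_value < 0.01:
    print("Warning: segment mix differs between groups.")
else:
    print("No evidence of imbalance across device types.")
```

Repeat the same check for any segment that could plausibly affect your metric (geography, new vs. returning users, browser).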

5. Novelty and Primacy Effects

Users may interact differently with a new variant simply because it's new (novelty effect), or they may stick with what they know (primacy effect). Run your test for at least one to two full business cycles (typically two weeks) to wash out these effects.

Putting It All Together: A Reusable A/B Test Function

Here's a helper function that runs both frequentist and Bayesian analyses on a binary outcome dataset. Drop it into your project's utils and you've got a one-call solution:

def analyze_ab_test(control_conversions, variant_conversions, alpha=0.05):
    """Run frequentist + Bayesian analysis on binary A/B test data.

    Parameters
    ----------
    control_conversions : array-like of 0/1
    variant_conversions : array-like of 0/1
    alpha : significance level for frequentist test

    Returns
    -------
    dict with results from both approaches
    """
    # --- Frequentist ---
    n_c, n_v = len(control_conversions), len(variant_conversions)
    s_c, s_v = sum(control_conversions), sum(variant_conversions)
    successes = np.array([s_c, s_v])
    nobs = np.array([n_c, n_v])

    z_stat, p_val = proportions_ztest(successes, nobs, alternative="two-sided")

    # --- Bayesian ---
    with pm.Model():
        p_c = pm.Beta("p_control", alpha=2, beta=18)
        p_v = pm.Beta("p_variant", alpha=2, beta=18)
        pm.Bernoulli("obs_c", p=p_c, observed=control_conversions)
        pm.Bernoulli("obs_v", p=p_v, observed=variant_conversions)
        delta = pm.Deterministic("delta", p_v - p_c)
        trace = pm.sample(2000, tune=1000, random_seed=42,
                          return_inferencedata=True, progressbar=False)

    delta_samples = trace.posterior["delta"].values.flatten()
    prob_b_better = (delta_samples > 0).mean()

    return {
        "control_rate": s_c / n_c,
        "variant_rate": s_v / n_v,
        "z_statistic": z_stat,
        "p_value": p_val,
        "frequentist_significant": p_val < alpha,
        "prob_variant_better": prob_b_better,
        "expected_lift": delta_samples.mean(),
        "lift_95_hdi": az.hdi(delta_samples, hdi_prob=0.95),
    }

# Usage
results = analyze_ab_test(control_data, variant_data)
for k, v in results.items():
    print(f"{k}: {v}")

Frequently Asked Questions

How long should I run an A/B test?

Run the test until you reach the sample size determined by your power analysis. As a rule of thumb, never run for less than one full business cycle (usually one to two weeks) to capture day-of-week effects. Stopping early because results "look significant" inflates your false positive rate in a frequentist framework.

Can I use a t-test instead of a z-test for conversion rates?

Technically yes — for large samples the results are nearly identical. However, the two-proportion z-test is the standard choice for binary outcomes because it's purpose-built for comparing proportions, handles the variance estimation correctly under the null hypothesis, and avoids confusion about degrees of freedom.
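
You can verify how close the two tests get at scale. The block below regenerates binary data with the same rates as the Step 3 simulation so it stands alone:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
control = rng.binomial(1, 0.120, 4000)
variant = rng.binomial(1, 0.138, 4000)

# Two-proportion z-test on aggregated successes
z_stat, p_z = proportions_ztest(
    np.array([control.sum(), variant.sum()]),
    np.array([len(control), len(variant)]),
)

# Welch's t-test on the same 0/1 data
t_stat, p_t = stats.ttest_ind(control, variant, equal_var=False)

print(f"z-test p-value: {p_z:.4f}")
print(f"t-test p-value: {p_t:.4f}")
```

At 4,000 users per group, the two p-values are nearly indistinguishable.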

What sample size do I need for an A/B test?

It depends on three things: your baseline metric value, the minimum improvement you want to detect (effect size), and your chosen significance level and power. Use the power analysis code in Step 2 to calculate the exact number. As a rough benchmark, detecting a 1 percentage-point lift from a 10% baseline at 80% power requires roughly 14,700 users per group.
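
You can reproduce that benchmark with the same Step 2 calculation:

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Detecting a lift from a 10% baseline to 11% (1 percentage point)
effect_size = proportion_effectsize(0.11, 0.10)

n = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Required sample size per group: {int(np.ceil(n))}")
```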

Is Bayesian A/B testing better than frequentist?

Neither is inherently better — they answer different questions. Bayesian methods give you a direct probability that one variant wins and let you monitor results continuously without penalty. Frequentist methods are simpler to implement, computationally cheaper, and more widely accepted in regulated industries. Many teams use both: a frequentist test for the go/no-go decision, and a Bayesian model to quantify the magnitude of the lift.

How do I handle multiple variants (A/B/C/n testing)?

For frequentist analysis, use ANOVA or a chi-square test across all groups, then apply a post-hoc correction (like Tukey's HSD or Bonferroni) for pairwise comparisons. For Bayesian analysis, just add more group parameters to your PyMC model — the posterior will give you pairwise win probabilities for every combination automatically. It's one of the areas where the Bayesian approach really shines.
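
Here's what the frequentist side can look like for a three-variant test. The conversion counts are simulated, and the pairwise step uses a Bonferroni-corrected alpha:

```python
import numpy as np
from itertools import combinations
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(3)

# Simulated conversions out of n users in each arm
n = 5000
conversions = {
    "A": rng.binomial(n, 0.120),
    "B": rng.binomial(n, 0.128),
    "C": rng.binomial(n, 0.140),
}

# Omnibus chi-square on the K x 2 table of converted vs. not converted
table = np.array([[c, n - c] for c in conversions.values()])
chi2, p_value, dof, _ = stats.chi2_contingency(table)
print(f"Omnibus chi-square: chi2={chi2:.2f}, p={p_value:.4f}")

# Bonferroni-corrected pairwise z-tests
pairs = list(combinations(conversions, 2))
alpha_corrected = 0.05 / len(pairs)
for g1, g2 in pairs:
    _, p = proportions_ztest(
        np.array([conversions[g1], conversions[g2]]),
        np.array([n, n]),
    )
    verdict = "significant" if p < alpha_corrected else "not significant"
    print(f"{g1} vs {g2}: p={p:.4f} ({verdict} at alpha={alpha_corrected:.4f})")
```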
