Hypothesis Testing in Python: t-Tests, Chi-Square, ANOVA, and Non-Parametric Tests with SciPy

A practical, code-driven guide to hypothesis testing in Python using SciPy 1.17. Covers t-tests, chi-square, ANOVA, Mann-Whitney U, and Kruskal-Wallis with working examples, assumption checking, and a decision framework for choosing the right test.

Introduction: Why Hypothesis Testing Matters for Data Scientists

Hypothesis testing is, honestly, the backbone of statistical inference — and if you've been doing data work for any length of time, you've probably run into it more often than you'd like. Whether you're A/B testing a new feature, comparing treatment groups in a clinical study, or just sanity-checking assumptions before training a model, hypothesis testing gives you a rigorous way to make decisions backed by data instead of gut feeling.

Python's scipy.stats module (now at version 1.17 as of early 2026) has battle-tested implementations of every major test you'll need. And the best part? Most of them are one-liners.

In this guide, you'll learn how to pick the right test, check your assumptions, run the analysis, and interpret results — all with production-ready code you can drop straight into your own projects.

How Hypothesis Testing Works: A 60-Second Primer

Every hypothesis test follows the same four-step recipe:

  1. State your hypotheses. The null hypothesis (H₀) represents the status quo — no effect, no difference. The alternative hypothesis (H₁) is what you actually want evidence for.
  2. Choose a significance level (α). Typically 0.05, meaning you're okay with a 5% chance of a false positive (Type I error).
  3. Compute a test statistic and p-value. The test statistic measures how far your sample result strays from H₀. The p-value tells you the probability of seeing a result at least this extreme if H₀ were actually true.
  4. Make a decision. If p-value ≤ α, reject H₀. Otherwise, you fail to reject H₀ (and no, that doesn't mean H₀ is true — it just means you don't have enough evidence against it).

Simple enough on paper. The tricky part is knowing which test to use.
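The four steps above map directly onto a few lines of SciPy. A minimal sketch with made-up sample data (battery life measurements against a hypothesized 10-hour mean):

```python
import numpy as np
from scipy import stats

# Step 1: H0: mean battery life = 10 hours | H1: mean != 10 hours
sample = np.array([9.8, 10.4, 9.5, 10.1, 9.7, 9.9, 10.2, 9.6, 9.8, 10.0])

# Step 2: choose a significance level
alpha = 0.05

# Step 3: compute the test statistic and p-value
result = stats.ttest_1samp(sample, popmean=10)
print(f"t={result.statistic:.3f}, p={result.pvalue:.4f}")

# Step 4: make a decision
if result.pvalue <= alpha:
    print("Reject H0")
else:
    print("Fail to reject H0")
```

The sample mean here is 9.9, close enough to 10 that the test fails to reject H0.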

How to Choose the Right Statistical Test

The test you need depends on your data types and how many groups you're comparing. I've wasted more time than I'd like to admit picking the wrong test and having to redo an analysis, so here's a decision table to save you from that:

| Scenario | Data Types | Parametric Test | Non-Parametric Alternative |
| --- | --- | --- | --- |
| One sample mean vs. known value | Numeric | One-sample t-test | Wilcoxon signed-rank |
| Two independent group means | Numeric | Two-sample t-test | Mann-Whitney U |
| Two paired/related measurements | Numeric | Paired t-test | Wilcoxon signed-rank |
| Three or more independent group means | Numeric | One-way ANOVA | Kruskal-Wallis H |
| Association between two categorical variables | Categorical | Chi-square test of independence | Fisher's exact test |
| Observed vs. expected frequencies | Categorical | Chi-square goodness-of-fit | |

Bookmark this. Seriously.
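If you'd rather encode the table than memorize it, here's a small lookup helper. The dictionary and function names are my own, not part of SciPy:

```python
from scipy import stats

# Maps (data type, scenario) -> (parametric test, non-parametric alternative),
# mirroring the decision table above. Hypothetical helper, not a SciPy API.
TEST_TABLE = {
    ("numeric", "one_sample"):          (stats.ttest_1samp, stats.wilcoxon),
    ("numeric", "two_independent"):     (stats.ttest_ind, stats.mannwhitneyu),
    ("numeric", "two_paired"):          (stats.ttest_rel, stats.wilcoxon),
    ("numeric", "many_independent"):    (stats.f_oneway, stats.kruskal),
    ("categorical", "independence"):    (stats.chi2_contingency, stats.fisher_exact),
    ("categorical", "goodness_of_fit"): (stats.chisquare, None),
}

def pick_test(data_type, scenario, parametric=True):
    """Return the SciPy test function for a scenario from the decision table."""
    parametric_test, nonparametric_test = TEST_TABLE[(data_type, scenario)]
    return parametric_test if parametric else nonparametric_test

print(pick_test("numeric", "two_independent", parametric=False).__name__)
```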

Setting Up Your Environment

All examples here use SciPy 1.17 (released January 2026) and pandas 3.x. Install or upgrade with:

pip install --upgrade scipy pandas numpy statsmodels

Then import the essentials:

import numpy as np
import pandas as pd
from scipy import stats

That's it. You're ready to go.

Checking Assumptions Before You Test

Here's something that trips up a lot of people: parametric tests (t-tests, ANOVA) assume your data meets certain conditions. If you violate those assumptions, your p-values can be completely unreliable. Always check before you run the test — it takes thirty seconds and can save you from embarrassing results.

Normality: Shapiro-Wilk Test

The Shapiro-Wilk test checks whether a sample comes from a normal distribution. A p-value above 0.05 suggests normality is a reasonable assumption.

data = np.random.normal(loc=50, scale=10, size=100)

stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk statistic: {stat:.4f}, p-value: {p:.4f}")

if p > 0.05:
    print("Normality assumption holds (fail to reject H0)")
else:
    print("Data is not normally distributed -- consider a non-parametric test")

One caveat: for larger samples (n > 5000), Shapiro-Wilk can get overly sensitive — it'll flag tiny deviations from normality that don't actually matter in practice. In that case, use visual inspection (Q-Q plots) alongside the Kolmogorov-Smirnov test (stats.kstest).
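For a large sample, that alternative check might look like this. A sketch using simulated data; note that fitting the normal's parameters from the sample itself makes the KS test somewhat conservative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
big_sample = rng.normal(loc=50, scale=10, size=20_000)

# KS test against a normal with the sample's own mean and std
stat, p = stats.kstest(big_sample, "norm",
                       args=(big_sample.mean(), big_sample.std(ddof=1)))
print(f"KS statistic: {stat:.4f}, p-value: {p:.4f}")

# Q-Q plot data (pair with matplotlib to draw the actual plot)
(osm, osr), (slope, intercept, r) = stats.probplot(big_sample, dist="norm")
print(f"Q-Q correlation: {r:.4f}")  # close to 1 for normal data
```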

Equal Variances: Levene's Test

The standard two-sample t-test and one-way ANOVA assume all groups share similar variance. Levene's test is robust to non-normality, which makes it the go-to choice here.

group_a = np.random.normal(50, 10, 60)
group_b = np.random.normal(52, 15, 60)

stat, p = stats.levene(group_a, group_b)
print(f"Levene statistic: {stat:.4f}, p-value: {p:.4f}")

if p > 0.05:
    print("Equal variance assumption holds")
else:
    print("Variances differ -- use Welch's t-test (equal_var=False)")

One-Sample t-Test

Use this when you want to compare the mean of a single sample against a known or hypothesized value. A classic example: "Is the average response time of our API actually 200 ms, or are we kidding ourselves?"

# Simulated API response times (ms)
response_times = np.array([198, 205, 210, 195, 202, 207, 199, 203, 211, 197,
                           206, 200, 194, 208, 201, 204, 196, 209, 203, 198])

# H0: mean = 200ms  |  H1: mean != 200ms
result = stats.ttest_1samp(response_times, popmean=200)
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value:     {result.pvalue:.4f}")

# With confidence interval (SciPy 1.17)
ci = result.confidence_interval(confidence_level=0.95)
print(f"95% CI for mean: [{ci.low:.2f}, {ci.high:.2f}]")

If the p-value comes in below 0.05, you've got evidence that the true mean response time differs from 200 ms. Time to have a chat with the backend team.

Two-Sample (Independent) t-Test

This one's probably the most common test you'll run in practice. It compares the means of two independent groups — think A/B testing: "Did the new checkout flow actually increase average order value?"

# A/B test: order values ($) for control vs. treatment
control   = np.array([45, 52, 48, 51, 47, 50, 46, 53, 49, 44,
                       55, 43, 50, 48, 52, 47, 51, 46, 54, 49])
treatment = np.array([50, 55, 53, 58, 52, 56, 54, 57, 51, 59,
                       53, 56, 55, 52, 58, 54, 57, 53, 56, 55])

# Step 1: Check equal variance
_, lev_p = stats.levene(control, treatment)

# Step 2: Run the appropriate t-test
equal_var = lev_p > 0.05
result = stats.ttest_ind(control, treatment, equal_var=equal_var)

test_type = "Student's" if equal_var else "Welch's"
print(f"{test_type} t-test")
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value:     {result.pvalue:.4f}")

Tip: In SciPy 1.15+, ttest_ind also accepts a method parameter for resampling-based p-values via PermutationMethod or MonteCarloMethod. These are handy when you can't assume normality and don't want to reach for a separate non-parametric test.
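If you'd rather not rely on the method parameter, the standalone stats.permutation_test does the same job and has been around longer. A sketch reusing the A/B data above:

```python
import numpy as np
from scipy import stats

control   = np.array([45, 52, 48, 51, 47, 50, 46, 53, 49, 44,
                      55, 43, 50, 48, 52, 47, 51, 46, 54, 49])
treatment = np.array([50, 55, 53, 58, 52, 56, 54, 57, 51, 59,
                      53, 56, 55, 52, 58, 54, 57, 53, 56, 55])

def mean_diff(x, y, axis):
    # Accepting an `axis` argument lets SciPy vectorize the resampling
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

# Permutation test of the difference in means: no normality assumption.
# The p-value is stochastic (no seed set here), but stable to ~3 decimals.
res = stats.permutation_test((control, treatment), mean_diff,
                             permutation_type="independent",
                             alternative="two-sided")
print(f"Observed difference: {res.statistic:.2f}")
print(f"Permutation p-value: {res.pvalue:.4f}")
```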

Paired t-Test

Use this when you have two measurements on the same subjects — before-and-after scores from a training program, for instance. The key difference from the independent t-test is that paired data controls for individual variation, which often gives you more statistical power.

# Employee productivity scores: before and after training
before = np.array([72, 68, 75, 80, 65, 70, 78, 74, 69, 77])
after  = np.array([78, 74, 80, 85, 70, 76, 82, 79, 73, 83])

result = stats.ttest_rel(before, after)
print(f"Paired t-statistic: {result.statistic:.4f}")
print(f"p-value:            {result.pvalue:.4f}")

mean_diff = np.mean(after - before)
print(f"Mean improvement:   {mean_diff:.1f} points")

One-Way ANOVA

When you need to compare means across three or more independent groups, ANOVA is what you want. It tells you whether at least one group is different, but not which ones — you'll need a post-hoc test for that (more on this in a moment).

# Conversion rates (%) for three landing page designs
design_a = np.array([4.2, 3.8, 4.5, 4.1, 3.9, 4.3, 4.0, 4.4, 3.7, 4.2])
design_b = np.array([5.1, 5.3, 4.9, 5.5, 5.0, 5.2, 4.8, 5.4, 5.1, 5.3])
design_c = np.array([4.5, 4.8, 4.3, 4.7, 4.6, 4.4, 4.9, 4.5, 4.7, 4.6])

# Check equal variances first
_, lev_p = stats.levene(design_a, design_b, design_c)
print(f"Levene p-value: {lev_p:.4f}")

# One-way ANOVA
f_stat, p_value = stats.f_oneway(design_a, design_b, design_c)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.4f}")
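A significant F-statistic still doesn't tell you how large the differences are. Eta-squared (between-group sum of squares over total sum of squares) is the usual effect size for one-way ANOVA; the helper below is my own, since SciPy doesn't provide one:

```python
import numpy as np

design_a = np.array([4.2, 3.8, 4.5, 4.1, 3.9, 4.3, 4.0, 4.4, 3.7, 4.2])
design_b = np.array([5.1, 5.3, 4.9, 5.5, 5.0, 5.2, 4.8, 5.4, 5.1, 5.3])
design_c = np.array([4.5, 4.8, 4.3, 4.7, 4.6, 4.4, 4.9, 4.5, 4.7, 4.6])

def eta_squared(*groups):
    """Eta-squared for one-way ANOVA: SS_between / SS_total."""
    all_data = np.concatenate(groups)
    grand_mean = all_data.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_data - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Rough convention: 0.01 small, 0.06 medium, 0.14 large
print(f"eta-squared: {eta_squared(design_a, design_b, design_c):.3f}")
```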

Post-Hoc Analysis with Tukey's HSD

So ANOVA told you something's different. Great — but which groups? That's where Tukey's Honestly Significant Difference test comes in. SciPy provides it via stats.tukey_hsd:

tukey = stats.tukey_hsd(design_a, design_b, design_c)
print(tukey)

Alternatively, statsmodels offers pairwise_tukeyhsd with a DataFrame-friendly interface, which I personally prefer for anything going into a report:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine data into a single DataFrame
data = np.concatenate([design_a, design_b, design_c])
labels = ["A"]*10 + ["B"]*10 + ["C"]*10

tukey_result = pairwise_tukeyhsd(data, labels, alpha=0.05)
print(tukey_result)

Chi-Square Test of Independence

Use the chi-square test when both of your variables are categorical. For example: "Is there a relationship between marketing channel and whether someone converts?"

# Contingency table: rows = channel, cols = converted (Yes/No)
observed = np.array([
    [120, 80],   # Email
    [95,  105],  # Social
    [110, 90],   # Organic
])
row_labels = ["Email", "Social", "Organic"]
col_labels = ["Converted", "Not Converted"]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom:   {dof}")
print(f"p-value:              {p_value:.4f}")
print(f"\nExpected frequencies:\n{pd.DataFrame(expected, index=row_labels, columns=col_labels).round(1)}")

When to use Fisher's exact test instead: If any expected cell count drops below 5, the chi-square approximation starts to break down. Use stats.fisher_exact for 2x2 tables, or a permutation-based approach for larger ones.
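For a 2x2 table, Fisher's exact test is a one-liner. A sketch with hypothetical small counts:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table with small expected counts:
# rows = variant (A/B), cols = converted (Yes/No)
small_table = np.array([
    [8, 2],
    [3, 7],
])

odds_ratio, p_value = stats.fisher_exact(small_table, alternative="two-sided")
print(f"Odds ratio: {odds_ratio:.3f}")
print(f"p-value:    {p_value:.4f}")
```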

Chi-Square Goodness-of-Fit Test

This variant checks whether observed frequencies match an expected distribution. Think: "Are website visits actually evenly distributed across the days of the week, or is Friday really that much busier?"

# Observed visits per day of the week
observed_visits = np.array([180, 165, 190, 175, 210, 140, 140])
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# H0: visits are uniformly distributed
chi2, p_value = stats.chisquare(observed_visits)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value:              {p_value:.4f}")

# Custom expected distribution (16% per weekday, 10% per weekend day)
expected_pct = np.array([0.16, 0.16, 0.16, 0.16, 0.16, 0.10, 0.10])
expected_counts = expected_pct * observed_visits.sum()
chi2_custom, p_custom = stats.chisquare(observed_visits, f_exp=expected_counts)
print(f"\nWith custom expected distribution:")
print(f"Chi-square: {chi2_custom:.4f}, p-value: {p_custom:.4f}")

Non-Parametric Tests: When Assumptions Fail

Sometimes your data just doesn't cooperate. Maybe it's heavily skewed, full of outliers, or measured on an ordinal scale. That's where non-parametric tests come in — they make fewer assumptions about the underlying distribution, which makes them more robust (if slightly less powerful when parametric assumptions actually hold).

Mann-Whitney U Test (Two Independent Groups)

This is the non-parametric alternative to the two-sample t-test. Instead of comparing means directly, it compares the rank distributions of two independent samples.

# Customer satisfaction scores (1-10 scale) for two stores
store_a = np.array([7, 8, 6, 9, 5, 7, 8, 6, 7, 8])
store_b = np.array([8, 9, 7, 10, 8, 9, 7, 9, 8, 10])

u_stat, p_value = stats.mannwhitneyu(store_a, store_b, alternative="two-sided")
print(f"Mann-Whitney U statistic: {u_stat:.4f}")
print(f"p-value:                  {p_value:.4f}")

Wilcoxon Signed-Rank Test (Two Paired Groups)

The non-parametric cousin of the paired t-test. Use it when you've got paired observations but can't assume normality of the differences.

# Pain scores before and after medication
before_med = np.array([8, 7, 9, 6, 8, 7, 9, 8, 7, 6])
after_med  = np.array([5, 4, 6, 4, 5, 3, 6, 5, 4, 3])

w_stat, p_value = stats.wilcoxon(before_med, after_med)
print(f"Wilcoxon W statistic: {w_stat:.4f}")
print(f"p-value:              {p_value:.4f}")

Kruskal-Wallis H Test (Three or More Independent Groups)

Think of this as the non-parametric version of one-way ANOVA. You've got three or more groups and can't assume normality? Kruskal-Wallis is your friend.

# User engagement scores across three app versions
version_1 = np.array([65, 70, 68, 72, 66, 69, 71, 67])
version_2 = np.array([75, 78, 74, 80, 76, 77, 73, 79])
version_3 = np.array([70, 73, 71, 74, 69, 72, 70, 75])

h_stat, p_value = stats.kruskal(version_1, version_2, version_3)
print(f"Kruskal-Wallis H statistic: {h_stat:.4f}")
print(f"p-value:                    {p_value:.4f}")

If significant, follow up with pairwise Mann-Whitney U tests and apply a Bonferroni correction to control for multiple comparisons.
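A sketch of that follow-up, reusing the three version arrays above:

```python
import numpy as np
from itertools import combinations
from scipy import stats

version_1 = np.array([65, 70, 68, 72, 66, 69, 71, 67])
version_2 = np.array([75, 78, 74, 80, 76, 77, 73, 79])
version_3 = np.array([70, 73, 71, 74, 69, 72, 70, 75])

groups = {"v1": version_1, "v2": version_2, "v3": version_3}
pairs = list(combinations(groups, 2))
alpha_corrected = 0.05 / len(pairs)  # Bonferroni: 0.05 / 3 tests

for name_a, name_b in pairs:
    _, p = stats.mannwhitneyu(groups[name_a], groups[name_b],
                              alternative="two-sided")
    verdict = "significant" if p < alpha_corrected else "not significant"
    print(f"{name_a} vs {name_b}: p={p:.4f} ({verdict})")
```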

What's New in SciPy 1.17 for Hypothesis Testing

SciPy 1.17 (released January 2026) brought some genuinely useful improvements for anyone doing statistical work:

  • Vectorized hypothesis tests: Functions like levene, kruskal, friedmanchisquare, cramervonmises, and mood are now vectorized. If you're running batches of tests on multi-dimensional input, you'll notice a significant speedup.
  • Expanded Array API support: Tests like mannwhitneyu, wilcoxon, permutation_test, and bootstrap now support JAX, CuPy, and Dask arrays — so you can run GPU-accelerated or distributed hypothesis tests. Pretty cool if you're dealing with massive datasets.
  • Improved accuracy: Updated critical value tables for stats.anderson and reduced error rates for F-distribution and t-distribution functions.
  • New: stats.chatterjeexi (added in 1.15): Computes the Xi correlation coefficient that detects nonlinear dependence — a modern alternative to Pearson's r that doesn't assume a linear relationship.

Complete Workflow: End-to-End Example

Let's tie everything together with a realistic scenario. Say you work at an e-commerce company and want to know whether three different recommendation algorithms produce different click-through rates. Here's how you'd approach it from start to finish:

import numpy as np
import pandas as pd
from scipy import stats

# Simulated daily CTR data for three algorithms (30 days each)
np.random.seed(42)
algo_baseline  = np.random.normal(loc=3.2, scale=0.8, size=30)
algo_collab    = np.random.normal(loc=3.8, scale=0.9, size=30)
algo_hybrid    = np.random.normal(loc=3.5, scale=0.7, size=30)

# Step 1: Check normality for each group
for name, data in [("Baseline", algo_baseline),
                    ("Collaborative", algo_collab),
                    ("Hybrid", algo_hybrid)]:
    _, p = stats.shapiro(data)
    status = "OK" if p > 0.05 else "FAIL"
    print(f"  {name}: Shapiro p={p:.4f} [{status}]")

# Step 2: Check equal variances
_, lev_p = stats.levene(algo_baseline, algo_collab, algo_hybrid)
print(f"\nLevene p-value: {lev_p:.4f}")

# Step 3: Run one-way ANOVA
f_stat, anova_p = stats.f_oneway(algo_baseline, algo_collab, algo_hybrid)
print(f"ANOVA F={f_stat:.4f}, p={anova_p:.4f}")

# Step 4: Post-hoc Tukey HSD (if ANOVA is significant)
if anova_p < 0.05:
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    all_data = np.concatenate([algo_baseline, algo_collab, algo_hybrid])
    labels = ["Baseline"]*30 + ["Collaborative"]*30 + ["Hybrid"]*30

    tukey = pairwise_tukeyhsd(all_data, labels, alpha=0.05)
    print(f"\nTukey HSD Results:\n{tukey}")
else:
    print("No significant difference found -- no post-hoc needed.")

That's four steps, maybe two minutes of work. And now you've got a statistically rigorous answer for your stakeholders instead of just eyeballing a chart.

Common Pitfalls and How to Avoid Them

I've seen (and made) all of these mistakes. Learn from them:

  • Multiple comparisons problem: Running many tests inflates your overall false-positive rate. Apply corrections like Bonferroni (α / n_tests) or use stats.false_discovery_control for the Benjamini-Hochberg procedure.
  • Confusing statistical significance with practical significance: A p-value of 0.001 with a trivial effect size isn't actionable. Always report effect sizes (Cohen's d, eta-squared) alongside p-values. Your PM doesn't care that the result is "statistically significant" if the actual difference is negligible.
  • Ignoring assumptions: Running a t-test on heavily skewed data gives misleading results. Check normality and variance homogeneity first, or switch to non-parametric alternatives.
  • Small sample sizes: With very small samples (n < 10), even real effects may go undetected. Consider power analysis (statsmodels.stats.power) to figure out the sample size you actually need before collecting data.
  • P-hacking: Never run a bunch of tests and only report the significant ones. That's not science — it's data torture. Pre-register your hypotheses or use Bonferroni/BH corrections for exploratory analyses.
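On the effect-size point: Cohen's d for two independent groups takes a few lines to compute by hand. The helper below is my own (SciPy doesn't ship one), shown on simulated data where a real but tiny difference exists:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled std."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
a = rng.normal(50, 10, 500)
b = rng.normal(51, 10, 500)  # real but tiny difference

# Rule of thumb: |d| ~ 0.2 small, 0.5 medium, 0.8 large
print(f"Cohen's d: {cohens_d(a, b):.3f}")
```

With n = 500 per group, a difference this small can easily be "statistically significant" while the effect size stays firmly in the trivial range.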

Quick Reference: SciPy Functions for Hypothesis Testing

| Test | SciPy Function | Use Case |
| --- | --- | --- |
| One-sample t-test | stats.ttest_1samp() | Sample mean vs. known value |
| Two-sample t-test | stats.ttest_ind() | Two independent group means |
| Welch's t-test | stats.ttest_ind(equal_var=False) | Unequal variances |
| Paired t-test | stats.ttest_rel() | Before/after on same subjects |
| One-way ANOVA | stats.f_oneway() | 3+ independent group means |
| Tukey HSD | stats.tukey_hsd() | Post-hoc pairwise comparisons |
| Chi-square independence | stats.chi2_contingency() | Two categorical variables |
| Chi-square goodness-of-fit | stats.chisquare() | Observed vs. expected frequencies |
| Fisher's exact test | stats.fisher_exact() | 2x2 tables with small counts |
| Mann-Whitney U | stats.mannwhitneyu() | Non-parametric two-group comparison |
| Wilcoxon signed-rank | stats.wilcoxon() | Non-parametric paired comparison |
| Kruskal-Wallis H | stats.kruskal() | Non-parametric 3+ groups |
| Shapiro-Wilk | stats.shapiro() | Normality check |
| Levene's test | stats.levene() | Variance homogeneity |

Frequently Asked Questions

What is the difference between a p-value and a significance level?

The significance level (α) is a threshold you set before running the test — typically 0.05. The p-value is computed after the test and represents the probability of observing your results (or more extreme) if the null hypothesis were true. If the p-value is less than or equal to α, you reject the null hypothesis. Think of α as the bar you set, and the p-value as where the data lands relative to that bar.

When should I use a t-test vs. a z-test?

Use a z-test when you know the population standard deviation and have a large sample (n > 30). In practice, though, you almost never know the population standard deviation, so the t-test is the standard choice. For large samples the t-distribution converges to the normal distribution anyway, making the results virtually identical.

What do I do if my data is not normally distributed?

You've got two solid options. First, if your sample is large enough (n > 30), the Central Limit Theorem means t-tests and ANOVA are still reasonably robust — they can handle moderate departures from normality. Second, switch to a non-parametric test: Mann-Whitney U instead of a two-sample t-test, Wilcoxon signed-rank instead of a paired t-test, or Kruskal-Wallis instead of ANOVA. SciPy 1.15+ also lets you pass PermutationMethod to ttest_ind for a resampling-based p-value, which is a nice middle ground.

How do I handle multiple comparisons?

Running multiple hypothesis tests inflates the probability of at least one false positive. The simplest correction is Bonferroni: divide your α by the number of tests. For a less conservative approach, use the Benjamini-Hochberg procedure via scipy.stats.false_discovery_control, which controls the false discovery rate instead of the family-wise error rate. In most exploratory data analysis, Benjamini-Hochberg is what you want.

Can I use hypothesis testing for feature selection in machine learning?

Absolutely. For numeric features, you can use ANOVA F-tests (sklearn.feature_selection.f_classif) to rank features by their relationship with a categorical target. For categorical features, use chi-square tests (sklearn.feature_selection.chi2). These are fast, interpretable filters that work well as a first pass before more advanced methods like mutual information or model-based selection.

About the Author

Editorial Team: our team of expert writers and editors.