Introduction: Why Hypothesis Testing Matters for Data Scientists
Hypothesis testing is, honestly, the backbone of statistical inference — and if you've been doing data work for any length of time, you've probably run into it more often than you'd like. Whether you're A/B testing a new feature, comparing treatment groups in a clinical study, or just sanity-checking assumptions before training a model, hypothesis testing gives you a rigorous way to make decisions backed by data instead of gut feeling.
Python's scipy.stats module (now at version 1.17 as of early 2026) has battle-tested implementations of every major test you'll need. And the best part? Most of them are one-liners.
In this guide, you'll learn how to pick the right test, check your assumptions, run the analysis, and interpret results — all with production-ready code you can drop straight into your own projects.
How Hypothesis Testing Works: A 60-Second Primer
Every hypothesis test follows the same four-step recipe:
- State your hypotheses. The null hypothesis (H₀) represents the status quo — no effect, no difference. The alternative hypothesis (H₁) is what you actually want evidence for.
- Choose a significance level (α). Typically 0.05, meaning you're okay with a 5% chance of a false positive (Type I error).
- Compute a test statistic and p-value. The test statistic measures how far your sample result strays from H₀. The p-value tells you the probability of seeing a result at least this extreme if H₀ were actually true.
- Make a decision. If p-value ≤ α, reject H₀. Otherwise, you fail to reject H₀ (and no, that doesn't mean H₀ is true — it just means you don't have enough evidence against it).
Simple enough on paper. The tricky part is knowing which test to use.
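To make the recipe concrete, here's a minimal sketch mapping each of the four steps onto code, using stats.binomtest on a made-up coin-flip dataset (the counts are invented for illustration):

```python
from scipy import stats

# 120 heads in 200 flips -- is the coin fair?
# Step 1: state hypotheses. H0: p = 0.5, H1: p != 0.5
# Step 2: choose a significance level
alpha = 0.05
# Step 3: compute the test statistic and p-value
result = stats.binomtest(k=120, n=200, p=0.5)
# Step 4: make a decision
if result.pvalue <= alpha:
    print(f"p = {result.pvalue:.4f} <= {alpha}: reject H0")
else:
    print(f"p = {result.pvalue:.4f} > {alpha}: fail to reject H0")
```

Here 120 heads out of 200 is far enough from 100 that the test rejects the fair-coin hypothesis.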
How to Choose the Right Statistical Test
The test you need depends on your data types and how many groups you're comparing. I've wasted more time than I'd like to admit picking the wrong test and having to redo an analysis, so here's a decision table to save you from that:
| Scenario | Data Types | Parametric Test | Non-Parametric Alternative |
|---|---|---|---|
| One sample mean vs. known value | Numeric | One-sample t-test | Wilcoxon signed-rank |
| Two independent group means | Numeric | Two-sample t-test | Mann-Whitney U |
| Two paired/related measurements | Numeric | Paired t-test | Wilcoxon signed-rank |
| Three or more independent group means | Numeric | One-way ANOVA | Kruskal-Wallis H |
| Association between two categorical variables | Categorical | Chi-square test of independence | Fisher's exact test |
| Observed vs. expected frequencies | Categorical | Chi-square goodness-of-fit | — |
Bookmark this. Seriously.
Setting Up Your Environment
All examples here use SciPy 1.17 (released January 2026) and pandas 3.x. Install or upgrade with:
pip install --upgrade scipy pandas numpy statsmodels
Then import the essentials:
import numpy as np
import pandas as pd
from scipy import stats
That's it. You're ready to go.
Checking Assumptions Before You Test
Here's something that trips up a lot of people: parametric tests (t-tests, ANOVA) assume your data meets certain conditions. If you violate those assumptions, your p-values can be completely unreliable. Always check before you run the test — it takes thirty seconds and can save you from embarrassing results.
Normality: Shapiro-Wilk Test
The Shapiro-Wilk test checks whether a sample comes from a normal distribution. A p-value above 0.05 suggests normality is a reasonable assumption.
data = np.random.normal(loc=50, scale=10, size=100)
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk statistic: {stat:.4f}, p-value: {p:.4f}")
if p > 0.05:
    print("Normality assumption holds (fail to reject H0)")
else:
    print("Data is not normally distributed -- consider a non-parametric test")
One caveat: for larger samples (n > 5000), Shapiro-Wilk can get overly sensitive — it'll flag tiny deviations from normality that don't actually matter in practice. In that case, use visual inspection (Q-Q plots) alongside the Kolmogorov-Smirnov test (stats.kstest).
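For the large-sample case, here's a sketch of the Kolmogorov-Smirnov check mentioned above, with the distribution parameters estimated from the sample itself (note the caveat in the comments: strictly speaking, estimating parameters from the same data calls for the Lilliefors correction, but this works as a rough sanity check alongside a Q-Q plot):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=10_000)

# KS test against a normal with mean/std estimated from the sample.
# This makes the test conservative (p-values biased upward), so treat
# it as a sanity check, not a rigorous normality test.
stat, p = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))
print(f"KS statistic: {stat:.4f}, p-value: {p:.4f}")
```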
Equal Variances: Levene's Test
The standard two-sample t-test and one-way ANOVA assume all groups share similar variance. Levene's test is robust to non-normality, which makes it the go-to choice here.
group_a = np.random.normal(50, 10, 60)
group_b = np.random.normal(52, 15, 60)
stat, p = stats.levene(group_a, group_b)
print(f"Levene statistic: {stat:.4f}, p-value: {p:.4f}")
if p > 0.05:
    print("Equal variance assumption holds")
else:
    print("Variances differ -- use Welch's t-test (equal_var=False)")
One-Sample t-Test
Use this when you want to compare the mean of a single sample against a known or hypothesized value. A classic example: "Is the average response time of our API actually 200 ms, or are we kidding ourselves?"
# Simulated API response times (ms)
response_times = np.array([198, 205, 210, 195, 202, 207, 199, 203, 211, 197,
206, 200, 194, 208, 201, 204, 196, 209, 203, 198])
# H0: mean = 200ms | H1: mean != 200ms
result = stats.ttest_1samp(response_times, popmean=200)
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
# With confidence interval (SciPy 1.17)
ci = result.confidence_interval(confidence_level=0.95)
print(f"95% CI for mean: [{ci.low:.2f}, {ci.high:.2f}]")
If the p-value comes in below 0.05, you've got evidence that the true mean response time differs from 200 ms. Time to have a chat with the backend team.
Two-Sample (Independent) t-Test
This one's probably the most common test you'll run in practice. It compares the means of two independent groups — think A/B testing: "Did the new checkout flow actually increase average order value?"
# A/B test: order values ($) for control vs. treatment
control = np.array([45, 52, 48, 51, 47, 50, 46, 53, 49, 44,
55, 43, 50, 48, 52, 47, 51, 46, 54, 49])
treatment = np.array([50, 55, 53, 58, 52, 56, 54, 57, 51, 59,
53, 56, 55, 52, 58, 54, 57, 53, 56, 55])
# Step 1: Check equal variance
_, lev_p = stats.levene(control, treatment)
# Step 2: Run the appropriate t-test
equal_var = lev_p > 0.05
result = stats.ttest_ind(control, treatment, equal_var=equal_var)
test_type = "Student's" if equal_var else "Welch's"
print(f"{test_type} t-test")
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
Tip: In SciPy 1.15+, ttest_ind also accepts a method parameter for resampling-based p-values via PermutationMethod or MonteCarloMethod. These are handy when you can't assume normality and don't want to reach for a separate non-parametric test.
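Whichever t-test variant you run, it's worth pairing the p-value with an effect size. A common choice is Cohen's d; this sketch uses the pooled-standard-deviation formula on the same A/B data (arrays repeated here so the snippet runs standalone):

```python
import numpy as np

control = np.array([45, 52, 48, 51, 47, 50, 46, 53, 49, 44,
                    55, 43, 50, 48, 52, 47, 51, 46, 54, 49])
treatment = np.array([50, 55, 53, 58, 52, 56, 54, 57, 51, 59,
                      53, 56, 55, 52, 58, 54, 57, 53, 56, 55])

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

d = cohens_d(control, treatment)
# Rough benchmarks: ~0.2 small, ~0.5 medium, ~0.8 large
print(f"Cohen's d: {d:.2f}")
```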
Paired t-Test
Use this when you have two measurements on the same subjects — before-and-after scores from a training program, for instance. The key difference from the independent t-test is that paired data controls for individual variation, which often gives you more statistical power.
# Employee productivity scores: before and after training
before = np.array([72, 68, 75, 80, 65, 70, 78, 74, 69, 77])
after = np.array([78, 74, 80, 85, 70, 76, 82, 79, 73, 83])
result = stats.ttest_rel(before, after)
print(f"Paired t-statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
mean_diff = np.mean(after - before)
print(f"Mean improvement: {mean_diff:.1f} points")
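One detail worth checking first: the paired t-test's normality assumption applies to the differences, not to the raw before/after scores. A quick sketch using the same data:

```python
import numpy as np
from scipy import stats

before = np.array([72, 68, 75, 80, 65, 70, 78, 74, 69, 77])
after = np.array([78, 74, 80, 85, 70, 76, 82, 79, 73, 83])

# The paired t-test assumes the *differences* are roughly normal
diffs = after - before
stat, p = stats.shapiro(diffs)
print(f"Shapiro-Wilk on differences: p = {p:.4f}")
```

If this check fails, the Wilcoxon signed-rank test (covered below) is the usual fallback.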
One-Way ANOVA
When you need to compare means across three or more independent groups, ANOVA is what you want. It tells you whether at least one group is different, but not which ones — you'll need a post-hoc test for that (more on this in a moment).
# Conversion rates (%) for three landing page designs
design_a = np.array([4.2, 3.8, 4.5, 4.1, 3.9, 4.3, 4.0, 4.4, 3.7, 4.2])
design_b = np.array([5.1, 5.3, 4.9, 5.5, 5.0, 5.2, 4.8, 5.4, 5.1, 5.3])
design_c = np.array([4.5, 4.8, 4.3, 4.7, 4.6, 4.4, 4.9, 4.5, 4.7, 4.6])
# Check equal variances first
_, lev_p = stats.levene(design_a, design_b, design_c)
print(f"Levene p-value: {lev_p:.4f}")
# One-way ANOVA
f_stat, p_value = stats.f_oneway(design_a, design_b, design_c)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")
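ANOVA's p-value says nothing about how large the differences are. Eta-squared (SS_between / SS_total, the share of total variance explained by group membership) is the usual effect size here; a hand-computed sketch on the landing-page data, repeated so the block is self-contained:

```python
import numpy as np

design_a = np.array([4.2, 3.8, 4.5, 4.1, 3.9, 4.3, 4.0, 4.4, 3.7, 4.2])
design_b = np.array([5.1, 5.3, 4.9, 5.5, 5.0, 5.2, 4.8, 5.4, 5.1, 5.3])
design_c = np.array([4.5, 4.8, 4.3, 4.7, 4.6, 4.4, 4.9, 4.5, 4.7, 4.6])

groups = [design_a, design_b, design_c]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# eta^2 = SS_between / SS_total
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_data - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total
print(f"eta-squared: {eta_squared:.3f}")
```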
Post-Hoc Analysis with Tukey's HSD
So ANOVA told you something's different. Great — but which groups? That's where Tukey's Honestly Significant Difference test comes in. SciPy provides it via stats.tukey_hsd:
tukey = stats.tukey_hsd(design_a, design_b, design_c)
print(tukey)
Alternatively, statsmodels offers pairwise_tukeyhsd with a DataFrame-friendly interface, which I personally prefer for anything going into a report:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Combine data into a single DataFrame
data = np.concatenate([design_a, design_b, design_c])
labels = ["A"]*10 + ["B"]*10 + ["C"]*10
tukey_result = pairwise_tukeyhsd(data, labels, alpha=0.05)
print(tukey_result)
Chi-Square Test of Independence
Use the chi-square test when both of your variables are categorical. For example: "Is there a relationship between marketing channel and whether someone converts?"
# Contingency table: rows = channel, cols = converted (Yes/No)
observed = np.array([
[120, 80], # Email
[95, 105], # Social
[110, 90], # Organic
])
row_labels = ["Email", "Social", "Organic"]
col_labels = ["Converted", "Not Converted"]
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.4f}")
print(f"\nExpected frequencies:\n{pd.DataFrame(expected, index=row_labels, columns=col_labels).round(1)}")
When to use Fisher's exact test instead: If any expected cell count drops below 5, the chi-square approximation starts to break down. Use stats.fisher_exact for 2x2 tables, or a permutation-based approach for larger ones.
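As a quick sketch of the small-count case, here's stats.fisher_exact on a hypothetical 2x2 table (the counts are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table with small expected counts:
#            Converted  Not converted
# Variant A      3            9
# Variant B      8            4
table = np.array([[3, 9],
                  [8, 4]])
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(f"Odds ratio: {odds_ratio:.3f}, p-value: {p_value:.4f}")
```

With counts this small, the exact p-value can differ noticeably from what chi2_contingency would report.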
Chi-Square Goodness-of-Fit Test
This variant checks whether observed frequencies match an expected distribution. Think: "Are website visits actually evenly distributed across the days of the week, or is Friday really that much busier?"
# Observed visits per day of the week
observed_visits = np.array([180, 165, 190, 175, 210, 140, 140])
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# H0: visits are uniformly distributed
chi2, p_value = stats.chisquare(observed_visits)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
# Custom expected distribution (e.g., 16% per weekday, 10% per weekend day)
expected_pct = np.array([0.16, 0.16, 0.16, 0.16, 0.16, 0.10, 0.10])
expected_counts = expected_pct * observed_visits.sum()
chi2_custom, p_custom = stats.chisquare(observed_visits, f_exp=expected_counts)
print(f"\nWith custom expected distribution:")
print(f"Chi-square: {chi2_custom:.4f}, p-value: {p_custom:.4f}")
Non-Parametric Tests: When Assumptions Fail
Sometimes your data just doesn't cooperate. Maybe it's heavily skewed, full of outliers, or measured on an ordinal scale. That's where non-parametric tests come in — they make fewer assumptions about the underlying distribution, which makes them more robust (if slightly less powerful when parametric assumptions actually hold).
Mann-Whitney U Test (Two Independent Groups)
This is the non-parametric alternative to the two-sample t-test. Instead of comparing means directly, it compares the rank distributions of two independent samples.
# Customer satisfaction scores (1-10 scale) for two stores
store_a = np.array([7, 8, 6, 9, 5, 7, 8, 6, 7, 8])
store_b = np.array([8, 9, 7, 10, 8, 9, 7, 9, 8, 10])
u_stat, p_value = stats.mannwhitneyu(store_a, store_b, alternative="two-sided")
print(f"Mann-Whitney U statistic: {u_stat:.4f}")
print(f"p-value: {p_value:.4f}")
Wilcoxon Signed-Rank Test (Two Paired Groups)
The non-parametric cousin of the paired t-test. Use it when you've got paired observations but can't assume normality of the differences.
# Pain scores before and after medication
before_med = np.array([8, 7, 9, 6, 8, 7, 9, 8, 7, 6])
after_med = np.array([5, 4, 6, 4, 5, 3, 6, 5, 4, 3])
w_stat, p_value = stats.wilcoxon(before_med, after_med)
print(f"Wilcoxon W statistic: {w_stat:.4f}")
print(f"p-value: {p_value:.4f}")
Kruskal-Wallis H Test (Three or More Independent Groups)
Think of this as the non-parametric version of one-way ANOVA. You've got three or more groups and can't assume normality? Kruskal-Wallis is your friend.
# User engagement scores across three app versions
version_1 = np.array([65, 70, 68, 72, 66, 69, 71, 67])
version_2 = np.array([75, 78, 74, 80, 76, 77, 73, 79])
version_3 = np.array([70, 73, 71, 74, 69, 72, 70, 75])
h_stat, p_value = stats.kruskal(version_1, version_2, version_3)
print(f"Kruskal-Wallis H statistic: {h_stat:.4f}")
print(f"p-value: {p_value:.4f}")
If significant, follow up with pairwise Mann-Whitney U tests and apply a Bonferroni correction to control for multiple comparisons.
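That follow-up can be sketched as a loop over all pairs, with the Bonferroni-adjusted alpha dividing 0.05 by the number of comparisons (data repeated from above so the block runs on its own):

```python
import numpy as np
from itertools import combinations
from scipy import stats

version_1 = np.array([65, 70, 68, 72, 66, 69, 71, 67])
version_2 = np.array([75, 78, 74, 80, 76, 77, 73, 79])
version_3 = np.array([70, 73, 71, 74, 69, 72, 70, 75])

groups = {"v1": version_1, "v2": version_2, "v3": version_3}
pairs = list(combinations(groups, 2))
alpha_adj = 0.05 / len(pairs)  # Bonferroni: divide alpha by number of tests

results = {}
for a, b in pairs:
    _, p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    results[(a, b)] = p
    verdict = "significant" if p < alpha_adj else "not significant"
    print(f"{a} vs {b}: p = {p:.4f} ({verdict} at adjusted alpha = {alpha_adj:.4f})")
```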
What's New in SciPy 1.17 for Hypothesis Testing
SciPy 1.17 (released January 2026) brought some genuinely useful improvements for anyone doing statistical work:
- Vectorized hypothesis tests: Functions like levene, kruskal, friedmanchisquare, cramervonmises, and mood are now vectorized. If you're running batches of tests on multi-dimensional input, you'll notice a significant speedup.
- Expanded Array API support: Tests like mannwhitneyu, wilcoxon, permutation_test, and bootstrap now support JAX, CuPy, and Dask arrays — so you can run GPU-accelerated or distributed hypothesis tests. Pretty cool if you're dealing with massive datasets.
- Improved accuracy: Updated critical value tables for stats.anderson and reduced error rates for F-distribution and t-distribution functions.
- New: stats.chatterjeexi (added in 1.15): Computes the Xi correlation coefficient that detects nonlinear dependence — a modern alternative to Pearson's r that doesn't assume a linear relationship.
Complete Workflow: End-to-End Example
Let's tie everything together with a realistic scenario. Say you work at an e-commerce company and want to know whether three different recommendation algorithms produce different click-through rates. Here's how you'd approach it from start to finish:
import numpy as np
import pandas as pd
from scipy import stats
# Simulated daily CTR data for three algorithms (30 days each)
np.random.seed(42)
algo_baseline = np.random.normal(loc=3.2, scale=0.8, size=30)
algo_collab = np.random.normal(loc=3.8, scale=0.9, size=30)
algo_hybrid = np.random.normal(loc=3.5, scale=0.7, size=30)
# Step 1: Check normality for each group
for name, data in [("Baseline", algo_baseline),
                   ("Collaborative", algo_collab),
                   ("Hybrid", algo_hybrid)]:
    _, p = stats.shapiro(data)
    status = "OK" if p > 0.05 else "FAIL"
    print(f"  {name}: Shapiro p={p:.4f} [{status}]")
# Step 2: Check equal variances
_, lev_p = stats.levene(algo_baseline, algo_collab, algo_hybrid)
print(f"\nLevene p-value: {lev_p:.4f}")
# Step 3: Run one-way ANOVA
f_stat, anova_p = stats.f_oneway(algo_baseline, algo_collab, algo_hybrid)
print(f"ANOVA F={f_stat:.4f}, p={anova_p:.4f}")
# Step 4: Post-hoc Tukey HSD (if ANOVA is significant)
if anova_p < 0.05:
    from statsmodels.stats.multicomp import pairwise_tukeyhsd
    all_data = np.concatenate([algo_baseline, algo_collab, algo_hybrid])
    labels = ["Baseline"]*30 + ["Collaborative"]*30 + ["Hybrid"]*30
    tukey = pairwise_tukeyhsd(all_data, labels, alpha=0.05)
    print(f"\nTukey HSD Results:\n{tukey}")
else:
    print("No significant difference found -- no post-hoc needed.")
That's four steps, maybe two minutes of work. And now you've got a statistically rigorous answer for your stakeholders instead of just eyeballing a chart.
Common Pitfalls and How to Avoid Them
I've seen (and made) all of these mistakes. Learn from them:
- Multiple comparisons problem: Running many tests inflates your overall false-positive rate. Apply corrections like Bonferroni (α / n_tests) or use stats.false_discovery_control for the Benjamini-Hochberg procedure.
- Confusing statistical significance with practical significance: A p-value of 0.001 with a trivial effect size isn't actionable. Always report effect sizes (Cohen's d, eta-squared) alongside p-values. Your PM doesn't care that the result is "statistically significant" if the actual difference is negligible.
- Ignoring assumptions: Running a t-test on heavily skewed data gives misleading results. Check normality and variance homogeneity first, or switch to non-parametric alternatives.
- Small sample sizes: With very small samples (n < 10), even real effects may go undetected. Consider power analysis (statsmodels.stats.power) to figure out the sample size you actually need before collecting data.
- P-hacking: Never run a bunch of tests and only report the significant ones. That's not science — it's data torture. Pre-register your hypotheses or use Bonferroni/BH corrections for exploratory analyses.
Quick Reference: SciPy Functions for Hypothesis Testing
| Test | SciPy Function | Use Case |
|---|---|---|
| One-sample t-test | stats.ttest_1samp() | Sample mean vs. known value |
| Two-sample t-test | stats.ttest_ind() | Two independent group means |
| Welch's t-test | stats.ttest_ind(equal_var=False) | Unequal variances |
| Paired t-test | stats.ttest_rel() | Before/after on same subjects |
| One-way ANOVA | stats.f_oneway() | 3+ independent group means |
| Tukey HSD | stats.tukey_hsd() | Post-hoc pairwise comparisons |
| Chi-square independence | stats.chi2_contingency() | Two categorical variables |
| Chi-square goodness-of-fit | stats.chisquare() | Observed vs. expected frequencies |
| Fisher's exact test | stats.fisher_exact() | 2x2 tables with small counts |
| Mann-Whitney U | stats.mannwhitneyu() | Non-parametric two-group comparison |
| Wilcoxon signed-rank | stats.wilcoxon() | Non-parametric paired comparison |
| Kruskal-Wallis H | stats.kruskal() | Non-parametric 3+ groups |
| Shapiro-Wilk | stats.shapiro() | Normality check |
| Levene's test | stats.levene() | Variance homogeneity |
Frequently Asked Questions
What is the difference between a p-value and a significance level?
The significance level (α) is a threshold you set before running the test — typically 0.05. The p-value is computed after the test and represents the probability of observing your results (or more extreme) if the null hypothesis were true. If the p-value is less than or equal to α, you reject the null hypothesis. Think of α as the bar you set, and the p-value as where the data lands relative to that bar.
When should I use a t-test vs. a z-test?
Use a z-test when you know the population standard deviation and have a large sample (n > 30). In practice, though, you almost never know the population standard deviation, so the t-test is the standard choice. For large samples the t-distribution converges to the normal distribution anyway, making the results virtually identical.
What do I do if my data is not normally distributed?
You've got two solid options. First, if your sample is large enough (n > 30), the Central Limit Theorem means t-tests and ANOVA are still reasonably robust — they can handle moderate departures from normality. Second, switch to a non-parametric test: Mann-Whitney U instead of a two-sample t-test, Wilcoxon signed-rank instead of a paired t-test, or Kruskal-Wallis instead of ANOVA. SciPy 1.15+ also lets you pass PermutationMethod to ttest_ind for a resampling-based p-value, which is a nice middle ground.
How do I handle multiple comparisons?
Running multiple hypothesis tests inflates the probability of at least one false positive. The simplest correction is Bonferroni: divide your α by the number of tests. For a less conservative approach, use the Benjamini-Hochberg procedure via scipy.stats.false_discovery_control, which controls the false discovery rate instead of the family-wise error rate. In most exploratory data analysis, Benjamini-Hochberg is what you want.
Can I use hypothesis testing for feature selection in machine learning?
Absolutely. For numeric features, you can use ANOVA F-tests (sklearn.feature_selection.f_classif) to rank features by their relationship with a categorical target. For categorical features, use chi-square tests (sklearn.feature_selection.chi2). These are fast, interpretable filters that work well as a first pass before more advanced methods like mutual information or model-based selection.