Python Data Drift Detection Guide 2026

Your model passed every offline test, hit production, and quietly started losing money three weeks later. Nothing crashed. Accuracy just slipped from 0.91 to 0.78 while the input distributions slid out from under it. That is the failure mode data drift detection exists to catch — and in 2026, three open-source Python libraries pretty much dominate the conversation: Evidently, NannyML, and Alibi Detect.

So, let's compare them on the dimensions that actually matter when you have to ship: detection methods, API ergonomics, runtime cost, label requirements, and how cleanly each plugs into a production monitoring pipeline. You'll also see runnable code for each tool against the same dataset — so you can copy a pattern that fits your stack instead of reading yet another vendor matrix.

Quick disclaimer: I've shipped all three of these in anger, and the "right" pick is rarely the one with the prettiest landing page.

Data Drift vs Concept Drift vs Label Drift

Before picking a tool, get the vocabulary right. Mixing these up is the single biggest source of bad monitoring dashboards (I've seen this kill more pager rotations than the drift itself).

Data drift (covariate shift): P(X) changes, P(Y|X) stays the same. Inflation pushes transaction amounts up, but the fraud-vs-legit relationship is unchanged. Detect with KS test, PSI, Jensen-Shannon, Wasserstein.
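Concept drift: P(Y|X) changes, meaning the relationship itself shifts. Post-COVID, features that predicted churn in 2019 simply stopped working. Detect with rolling performance metrics on labeled data. When labels are delayed, estimators like CBPE (NannyML's confidence-based performance estimation) can fill the gap, with one caveat: CBPE assumes P(Y|X) is unchanged, so it tracks the performance impact of covariate shift rather than concept drift itself.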
Label drift (prior shift): P(Y) changes. Fraud rate doubles overnight. Detect with a chi-square test on the prediction or label distribution.
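
As a minimal sketch of the label-drift check just described, assuming SciPy is available and using made-up class counts (a fraud rate moving from 2% to 4% across two 10,000-row windows), a chi-square test on the counts might look like this:

```python
from scipy.stats import chi2_contingency

# Hypothetical class counts per window: [legit, fraud]
reference_counts = [9_800, 200]   # 2% fraud
current_counts = [9_600, 400]     # 4% fraud

# Rows are windows, columns are classes
chi2, p_value, dof, expected = chi2_contingency([reference_counts, current_counts])

label_drift = p_value < 0.05
print(f"chi2={chi2:.1f}, p={p_value:.2e}, label drift: {label_drift}")
```

The same call works on predicted-class counts when true labels have not arrived yet.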

A useful rule of thumb: if performance is degrading and features have drifted, you probably have data drift. If performance degrades but feature distributions look stable, suspect concept drift. Most production teams end up monitoring all three on the same dashboard anyway.

The Statistical Tests You'll See Everywhere

Every drift library is, under the hood, a wrapper around a small set of statistical tests. Knowing which test fires when will save you hours of debugging false positives.

Test	Best for	Threshold rule of thumb
Kolmogorov-Smirnov	Continuous features, small samples	p < 0.05
Population Stability Index (PSI)	Binned numeric or categorical, large samples	< 0.1 stable, 0.1-0.2 moderate, > 0.2 drift
Jensen-Shannon distance	Categorical, bounded [0,1]	> 0.1 worth investigating
Wasserstein distance	Continuous, captures magnitude not just shape	Domain-dependent, use rolling baseline
Chi-square	Categorical features and labels	p < 0.05
MMD (Maximum Mean Discrepancy)	High-dimensional, embeddings, images	permutation-based p-value

One thing to watch for: KS and chi-square become extremely sensitive at large sample sizes. A million-row reference window will flag statistically "significant" drift on what is essentially noise. PSI and Jensen-Shannon are more honest at scale because they measure effect size, not just statistical significance.
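
To make that concrete, here is a small sketch, assuming NumPy and SciPy, that compares the two on a deliberately trivial shift. The `psi` helper is my own illustrative implementation (quantile bins taken from the reference), not any library's, and the 0.02 shift is an arbitrary choice:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index using quantile bins taken from the reference."""
    # Interior bin edges at the reference deciles (for bins=10)
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1)[1:-1])
    ref_frac = np.bincount(np.searchsorted(inner_edges, reference), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.searchsorted(inner_edges, current), minlength=bins) / len(current)
    # Guard against empty bins before taking the log
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 1_000_000)
current = rng.normal(0.02, 1.0, 1_000_000)   # shift of 2% of a standard deviation

ks_p = ks_2samp(reference, current).pvalue
psi_value = psi(reference, current)

print(f"KS p-value: {ks_p:.2e}")    # far below 0.05
print(f"PSI:        {psi_value:.5f}")  # far below the 0.1 'stable' cutoff
```

KS reports a significant difference because a million rows make even a tiny shift detectable, while PSI reports how small the shift actually is.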

Setting Up a Shared Example

To make the comparison concrete, the rest of this article uses one synthetic credit-default dataset, split into a reference window (training-time data) and a current window (production data with injected drift). Run this once and reuse it across each library's example.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)

X, y = make_classification(
    n_samples=20_000, n_features=8, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)
cols = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=cols)
df["target"] = y

reference = df.iloc[:10_000].copy()
current = df.iloc[10_000:].copy()

# Inject realistic drift: shift two features, rescale a third
current["feature_0"] = current["feature_0"] + 0.7
current["feature_3"] = current["feature_3"] * 1.4
current["feature_5"] = current["feature_5"] + rng.normal(0.3, 0.1, len(current))

reference.to_parquet("reference.parquet")
current.to_parquet("current.parquet")
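
Before pointing any library at the data, it can help to know roughly what magnitudes to expect from these injections. The sketch below is a standalone illustration on standard-normal samples (an assumption: the make_classification features above are not exactly standard normal), using a hand-rolled 1-D Wasserstein distance that only works for equal-sized samples:

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance for equal-sized samples: mean gap between sorted values."""
    assert len(a) == len(b), "this shortcut assumes equal sample sizes"
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(42)
n = 10_000
reference = rng.normal(size=n)

w_none = wasserstein_1d(reference, rng.normal(size=n))          # no drift
w_shift = wasserstein_1d(reference, rng.normal(size=n) + 0.7)   # additive shift
w_scale = wasserstein_1d(reference, rng.normal(size=n) * 1.4)   # rescale

print(f"no drift: {w_none:.3f}, shift +0.7: {w_shift:.3f}, scale x1.4: {w_scale:.3f}")
```

The additive shift shows up at roughly its own size, and the rescale registers as a smaller but clearly nonzero distance, which is the "captures magnitude" property from the table.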

Evidently: The Batteries-Included Default

Evidently is the most popular open-source choice in 2026, and honestly, it's not hard to see why — it covers the broadest surface area: tabular data drift, target drift, prediction drift, model performance, text and embedding drift, plus an LLM evaluation suite. It generates HTML reports out of the box and exposes JSON for ingestion into Grafana or any alerting stack.

from evidently import Report
from evidently.presets import DataDriftPreset
import pandas as pd

reference = pd.read_parquet("reference.parquet").drop(columns=["target"])
current = pd.read_parquet("current.parquet").drop(columns=["target"])

report = Report(metrics=[DataDriftPreset()])
snapshot = report.run(reference_data=reference, current_data=current)

# Inspect programmatically. The summary metric comes first; its exact keys
# depend on the Evidently version, so print the whole value instead of indexing into it.
result = snapshot.dict()
summary = result["metrics"][0]
print(f"{summary.get('metric_id', 'summary')}: {summary['value']}")

# Export an HTML report for stakeholders
snapshot.save_html("drift_report.html")

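Evidently picks a test per column type and sample size: KS for numeric and chi-square for categorical columns on small samples, switching to distance measures (Wasserstein for numeric, Jensen-Shannon for categorical) once a window passes roughly 1,000 rows. You can override the method and threshold per feature. For monitoring at scale, pair it with Evidently Cloud or push the JSON snapshots into a Postgres-backed dashboard.

The library's biggest strength is also its weakness: defaults built for broad coverage can be too sensitive on noisy features. Sanity-check by switching those features to PSI with a 0.2 threshold (stattest="psi" with stattest_threshold=0.2 in the legacy API; recent releases, as far as I can tell, name these method and threshold). Otherwise you'll be fighting alert fatigue inside a week.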

Catching Prediction Drift Without Labels

When ground truth is delayed, prediction drift is your earliest warning signal. Evidently supports this directly:

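from evidently import Report
from evidently.metrics import ValueDrift

# reference_with_preds / current_with_preds: the feature frames from above
# plus a "prediction" column holding the model's scores (not built here).
# Recent Evidently releases expose single-column drift as ValueDrift;
# older releases shipped a TargetDriftPreset for the same job.
report = Report(metrics=[ValueDrift(column="prediction")])
snapshot = report.run(reference_data=reference_with_preds,
                      current_data=current_with_preds)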
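
If you want to see what a prediction-drift score boils down to without any library in the way, here is a rough sketch using NumPy only. The Beta-distributed scores are invented stand-ins for model outputs, and the 20-bin histogram is an arbitrary choice:

```python
import numpy as np

def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance (base 2, so bounded in [0, 1]) between two histograms."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0   # 0 * log(0) is treated as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

def score_histogram(scores, bins=20):
    """Bin predicted probabilities on a fixed [0, 1] grid so windows are comparable."""
    counts, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    return counts

rng = np.random.default_rng(7)
reference_scores = rng.beta(2, 8, 10_000)   # hypothetical training-time scores
current_scores = rng.beta(3, 6, 10_000)     # hypothetical production scores, shifted upward

ref_hist = score_histogram(reference_scores)
cur_hist = score_histogram(current_scores)
drift_score = js_distance(ref_hist, cur_hist)
print(f"JS distance on prediction scores: {drift_score:.3f}")
```

Nothing here needs labels, which is why this check is usually the cheapest one to schedule daily.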

NannyML: Performance Estimation Without Labels

NannyML's killer feature is CBPE (Confidence-Based Performance Estimation) for classifiers, alongside its regression counterpart DLE (Direct Loss Estimation). Together they estimate model accuracy, AUC, or RMSE in production before labels arrive, which is exactly the problem most teams solve badly with vague "drift score" alerts.

import nannyml as nml
import pandas as pd

reference = pd.read_parquet("reference.parquet")
current = pd.read_parquet("current.parquet")

# Add timestamp + predictions (simulate scoring)
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier().fit(
    reference.drop(columns=["target"]), reference["target"]
)
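for df_ in (reference, current):
    df_["y_pred_proba"] = model.predict_proba(df_.drop(columns=["target"]))[:, 1]
    df_["y_pred"] = (df_["y_pred_proba"] > 0.5).astype(int)

# Consecutive timestamps, so the current window starts where the reference ends
ts = pd.date_range("2026-01-01", periods=len(reference) + len(current), freq="min")
reference["timestamp"] = ts[: len(reference)]
current["timestamp"] = ts[len(reference):]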

feature_cols = [c for c in reference.columns if c.startswith("feature_")]

# Multivariate drift via PCA reconstruction error
calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_cols,
    timestamp_column_name="timestamp",
    chunk_size=1000,
).fit(reference)
mv_results = calc.calculate(current)
mv_results.plot().show()

# Estimate AUC without labels
estimator = nml.CBPE(
    y_pred_proba="y_pred_proba",
    y_pred="y_pred",
    y_true="target",
    metrics=["roc_auc"],
    chunk_size=1000,
    problem_type="classification_binary",
).fit(reference)
perf = estimator.estimate(current)
# to_df() returns two-level columns of the form (metric, field)
print(perf.to_df()[[("chunk", "key"), ("roc_auc", "value"), ("roc_auc", "alert")]].head())

Two NannyML patterns are worth memorizing. First, the multivariate DataReconstructionDriftCalculator catches changes in feature correlations that univariate tests miss entirely — a model can have every feature stable in isolation while their joint distribution has quietly shifted under you. Second, CBPE's alert column gives you a deployable signal: trigger retraining workflows on it, not on raw drift scores.
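
The reconstruction-error idea is easier to trust once you have seen it catch a correlation flip. The following is a simplified NumPy sketch of the general technique, not NannyML's implementation: two features keep identical marginals while their correlation goes from +0.9 to -0.9, and all function names are illustrative.

```python
import numpy as np

def fit_pca(reference, n_components):
    """Standardize on the reference and keep the top principal directions."""
    mean, std = reference.mean(axis=0), reference.std(axis=0)
    z = (reference - mean) / std
    _, _, vt = np.linalg.svd(z, full_matrices=False)   # rows of vt are principal directions
    return mean, std, vt[:n_components]

def reconstruction_error(data, mean, std, components):
    """Mean Euclidean distance between rows and their low-rank reconstruction."""
    z = (data - mean) / std
    reconstructed = z @ components.T @ components
    return float(np.mean(np.linalg.norm(z - reconstructed, axis=1)))

def correlated_pair(rng, n, rho):
    """Two unit-variance features with correlation rho."""
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    return np.column_stack([x1, x2])

rng = np.random.default_rng(0)
reference = correlated_pair(rng, 5_000, rho=0.9)
stable = correlated_pair(rng, 5_000, rho=0.9)
flipped = correlated_pair(rng, 5_000, rho=-0.9)   # same marginals, opposite correlation

mean, std, components = fit_pca(reference, n_components=1)
err_stable = reconstruction_error(stable, mean, std, components)
err_flipped = reconstruction_error(flipped, mean, std, components)

print(f"stable: {err_stable:.2f}, flipped correlation: {err_flipped:.2f}")
```

A per-feature KS test sees nothing here because each column is still a unit-variance normal; only the joint structure moved.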

NannyML is the right tool when (a) ground-truth labels arrive days or weeks late, and (b) you need a number you can put in front of stakeholders that says "your model accuracy is now estimated at 0.74 ± 0.03." That second part is genuinely underrated — try defending a retraining ticket with "the KS p-value dropped" sometime.
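
For intuition on where a label-free accuracy number can come from, here is a stripped-down sketch of the confidence-based idea. It assumes perfectly calibrated scores and skips everything that makes the real estimator production-grade (calibration, chunking, confidence bands), so read it as an illustration rather than NannyML's algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Assumption: scores are perfectly calibrated, i.e. P(y = 1 | score = p) = p
y_pred_proba = rng.beta(2, 5, n)
y_pred = y_pred_proba > 0.5

# Hidden ground truth, simulated here only so the estimate can be checked
y_true = rng.random(n) < y_pred_proba

# A positive prediction is correct with probability p, a negative one with 1 - p
expected_correct = np.where(y_pred, y_pred_proba, 1 - y_pred_proba)
estimated_accuracy = float(expected_correct.mean())
realized_accuracy = float((y_pred == y_true).mean())

print(f"estimated: {estimated_accuracy:.3f}, realized: {realized_accuracy:.3f}")
```

The estimate only holds while the scores stay calibrated, which is why it breaks down when the input-to-label relationship itself changes.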

Alibi Detect: Advanced Detectors and Multi-Modal Data

Alibi Detect, maintained by Seldon, is the heavyweight option. It ships dozens of detectors — KS, MMD, Chi-square, Fisher's Exact Test, Learned Kernel MMD, Classifier Drift, Spot-the-Diff, plus dedicated detectors for images, text, and time series. If you're monitoring embeddings, model logits, or anything beyond a tabular DataFrame, this is usually the right answer.

import numpy as np
import pandas as pd
from alibi_detect.cd import KSDrift, MMDDrift, ClassifierDrift

reference = pd.read_parquet("reference.parquet").drop(columns=["target"]).to_numpy()
current = pd.read_parquet("current.parquet").drop(columns=["target"]).to_numpy()

# Univariate KS test per feature with Bonferroni correction
ks = KSDrift(reference, p_val=0.05, correction="bonferroni")
print(ks.predict(current))

# Multivariate MMD with permutation test
mmd = MMDDrift(reference, p_val=0.05, n_permutations=100, backend="pytorch")
print(mmd.predict(current))

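# Classifier-based drift: train a model to distinguish ref from current.
# The sklearn backend needs an actual estimator; a random forest is one reasonable choice.
from sklearn.ensemble import RandomForestClassifier

clf = ClassifierDrift(
    reference, model=RandomForestClassifier(n_estimators=100, random_state=0),
    backend="sklearn", p_val=0.05, n_folds=5, preprocess_at_init=True,
)
print(clf.predict(current))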

The ClassifierDrift approach is, in my opinion, one of the most underrated detectors in the entire ecosystem. The intuition is dead simple: if a binary classifier can reliably distinguish reference rows from current rows, the distributions have drifted. The cross-validated AUC is the drift magnitude, and feature importances tell you exactly which columns moved. Beautiful.
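
The intuition fits in a few lines of scikit-learn. This is a generic domain-classifier sketch on synthetic data, not Alibi Detect's internals (which apply their own statistical test to the classifier's outputs); the logistic regression and the 0.7 shift are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def stack_windows(reference, current):
    """Label reference rows 0 and current rows 1."""
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    return X, y

def domain_classifier_auc(reference, current, seed=0):
    """Cross-validated AUC of a classifier asked to tell the two windows apart."""
    X, y = stack_windows(reference, current)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    return float(cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())

rng = np.random.default_rng(0)
reference = rng.normal(size=(2_000, 5))
same = rng.normal(size=(2_000, 5))
drifted = rng.normal(size=(2_000, 5))
drifted[:, 0] += 0.7   # shift only the first column

auc_same = domain_classifier_auc(reference, same)
auc_drift = domain_classifier_auc(reference, drifted)

# Coefficients of a model fit on everything hint at which column moved
X, y = stack_windows(reference, drifted)
coef = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
most_shifted = int(np.argmax(np.abs(coef)))

print(f"AUC no drift: {auc_same:.2f}, AUC drift: {auc_drift:.2f}, column: {most_shifted}")
```

An AUC near 0.5 means the windows are indistinguishable. A linear model only picks up mean shifts, so a tree-based classifier is the safer choice when you also expect scale or interaction changes.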

The downside, confirmed in the 2024 D3Bench microbenchmark, is runtime. Alibi Detect's MMD and learned-kernel detectors can run two orders of magnitude slower than NannyML or Evidently on the same data. For batch monitoring this is fine; for low-latency online detection, stick to KS or Chi-square detectors.
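
The runtime cost is easy to see if you write the permutation test out by hand. Below is a bare-bones NumPy sketch of an MMD permutation test with an RBF kernel and a median-heuristic bandwidth. It is my own simplified version for illustration, not Alibi Detect's code, and the sample sizes are kept small on purpose:

```python
import numpy as np

def rbf_kernel_matrix(z):
    """RBF kernel on all pairs of rows, bandwidth set by the median heuristic."""
    sq_norms = np.sum(z**2, axis=1)
    sq_dists = np.clip(sq_norms[:, None] + sq_norms[None, :] - 2 * z @ z.T, 0, None)
    gamma = 1.0 / np.median(sq_dists[sq_dists > 0])
    return np.exp(-gamma * sq_dists)

def mmd2(kernel, idx_x, idx_y):
    """Biased (V-statistic) estimate of squared MMD from a precomputed kernel matrix."""
    k_xx = kernel[np.ix_(idx_x, idx_x)].mean()
    k_yy = kernel[np.ix_(idx_y, idx_y)].mean()
    k_xy = kernel[np.ix_(idx_x, idx_y)].mean()
    return k_xx + k_yy - 2 * k_xy

def mmd_permutation_test(x, y, n_permutations=200, seed=0):
    rng = np.random.default_rng(seed)
    kernel = rbf_kernel_matrix(np.vstack([x, y]))   # quadratic in the total row count
    n, total = len(x), len(x) + len(y)
    observed = mmd2(kernel, np.arange(n), np.arange(n, total))
    exceed = 0
    for _ in range(n_permutations):                 # every permutation re-averages the matrix
        perm = rng.permutation(total)
        exceed += mmd2(kernel, perm[:n], perm[n:]) >= observed
    p_value = (exceed + 1) / (n_permutations + 1)
    return float(observed), float(p_value)

rng = np.random.default_rng(0)
reference = rng.normal(size=(300, 5))
drifted = rng.normal(size=(300, 5))
drifted[:, 0] += 0.7

mmd_value, p_drift = mmd_permutation_test(reference, drifted)
print(f"MMD^2 = {mmd_value:.4f}, permutation p-value = {p_drift:.3f}")
```

The kernel matrix grows with the square of the combined row count and is re-averaged once per permutation, which is where the slowdown comes from as windows get larger.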

Head-to-Head Comparison

Capability	Evidently	NannyML	Alibi Detect
Univariate drift tests	20+ built-in	KS, Chi-square, JS, Wasserstein, L-infinity, Hellinger	KS, Chi-square, Fisher, Cramér-von Mises
Multivariate drift	Domain classifier	PCA reconstruction error, domain classifier	MMD, LSDD, Learned Kernel MMD, Classifier Drift, Spot-the-Diff
Performance estimation without labels	No	Yes (CBPE, DLE, M-CBPE)	No
Image / text / embedding drift	Embeddings only	Tabular only	Native support
Online / streaming detectors	No	No	Yes (Online MMD, Online LSDD)
HTML reports / dashboards	Excellent	Plotly plots	Minimal — bring your own
Runtime on tabular data	Fast	Fastest	Slowest (10-100×)
License	Apache 2.0	Apache 2.0	Business Source License 1.1 (Apache 2.0 before the 2024 relicensing)
Best for	General-purpose monitoring, stakeholder reports	Pinpointing drift timing, label-free performance	Multi-modal data, advanced detectors, online streams

Which One Should You Actually Use?

Use Evidently if you need one tool to cover dashboards, ad-hoc analysis, and integration with the rest of the MLOps stack. It's the lowest-friction starting point, full stop.
Use NannyML if labels arrive late and you need defensible performance estimates for retraining decisions. Pair its CBPE with Evidently's reports — that combo is hard to beat.
Use Alibi Detect if you're monitoring images, embeddings from an LLM, or need an online detector for a streaming pipeline. Also the right pick if you want classifier-based drift with full statistical rigor.

In practice, most mature teams I've worked with run two of the three together: Evidently for visual reporting, and a second library — usually NannyML — for the alerting signal that actually triggers retraining. Don't feel bad about doubling up; the libraries solve overlapping but genuinely different problems.

Production Patterns That Reduce Pager Noise

Use a rolling reference window, not the training set. A frozen training reference flags every seasonal pattern as drift. A 30-day rolling window gives a realistic baseline.
Alert on effect size, not p-values. KS p-values become meaningless at sample sizes above ~50,000. Use PSI > 0.2 or Wasserstein above a domain-calibrated threshold instead.
Distinguish "drift detected" from "act on it." Drift is necessary but not sufficient for retraining. Confirm with NannyML's CBPE or with downstream business metrics before kicking off a pipeline.
Monitor the prediction distribution daily. Prediction drift is cheap to compute, label-free, and (in my experience) catches the majority of real failures earlier than feature drift does.
Version both reference data and detector configs. A drift alert without the exact reference window and threshold used is unreproducible — store them with the same rigor as model artifacts.
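
Two of these patterns, the rolling reference window and versioned detector configs, are simple enough to sketch with pandas and the standard library. The column names, dates, and config fields below are hypothetical placeholders, not a schema any of the three libraries requires:

```python
import hashlib
import json

import pandas as pd

def rolling_reference(df, as_of, days=30, ts_col="timestamp"):
    """Rows from the `days` before `as_of` (exclusive), used as the drift baseline."""
    start = as_of - pd.Timedelta(days=days)
    return df[(df[ts_col] >= start) & (df[ts_col] < as_of)]

def detector_fingerprint(config):
    """Stable short hash of the detector settings, to store alongside each alert."""
    payload = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

# Hypothetical daily log of one feature
log = pd.DataFrame({
    "timestamp": pd.date_range("2026-01-01", periods=90, freq="D"),
    "feature_0": range(90),
})

as_of = pd.Timestamp("2026-03-15")
reference = rolling_reference(log, as_of)

config = {
    "reference_start": str(reference["timestamp"].min().date()),
    "reference_end": str(reference["timestamp"].max().date()),
    "method": "psi",
    "threshold": 0.2,
}
fingerprint = detector_fingerprint(config)
print(len(reference), fingerprint)
```

Logging the fingerprint with every alert lets you rebuild the exact reference window and threshold later, which is the reproducibility point from the last item in the list.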

Frequently Asked Questions

What is the difference between data drift and concept drift?

Data drift means the distribution of input features P(X) has shifted while the relationship between inputs and target P(Y|X) is unchanged. Concept drift means that relationship itself has changed: the same input now maps to a different output. Data drift is detectable without labels. Confirming concept drift requires labels, because label-free estimators such as NannyML's CBPE assume P(Y|X) has not changed.

Is the KS test enough for production drift detection?

Not on its own. The Kolmogorov-Smirnov test is sensitive to any distribution change, including statistically real but operationally meaningless ones. At sample sizes above 50,000 rows it'll flag drift on essentially every batch. Pair KS p-values with an effect-size measure such as PSI or Wasserstein distance, and apply a Bonferroni or false-discovery-rate correction when running it across many features.
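
For the multiple-testing part, here is a short NumPy sketch comparing Bonferroni with the Benjamini-Hochberg false-discovery-rate procedure. The ten p-values are invented, standing in for one KS test per feature:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected under Benjamini-Hochberg FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m for rank k
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank that passes
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject

# Hypothetical per-feature KS p-values, in feature order
p_values = [0.041, 0.001, 0.216, 0.008, 0.039, 0.06, 0.212, 0.042, 0.074, 0.205]
alpha = 0.05

reject_bonferroni = np.asarray(p_values) <= alpha / len(p_values)
reject_bh = benjamini_hochberg(p_values, alpha)

print("Uncorrected flags:", int(np.sum(np.asarray(p_values) < alpha)))
print("Bonferroni flags: ", np.nonzero(reject_bonferroni)[0].tolist())
print("BH (FDR) flags:   ", np.nonzero(reject_bh)[0].tolist())
```

Five features pass a naive 0.05 cutoff here, while the corrected procedures keep only the strongest one or two, which is usually closer to what deserves a page.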

Can I detect drift without ground-truth labels?

Yes. Univariate and multivariate input-distribution tests need no labels. For label-free performance estimation, NannyML's CBPE infers expected accuracy or AUC from the model's confidence calibration on the reference set, and DLE handles regression by directly estimating loss. Combine both with prediction-distribution monitoring for full coverage.

How often should I run drift detection?

Match the cadence to your retraining cycle. Daily batch checks suit most tabular models. High-volume online systems benefit from sliding-window detectors (Alibi Detect's MMDDriftOnline) updated every few minutes. Anything more frequent than your decision-making cycle is just noise.

Does drift always mean I need to retrain?

No — and treating every alert as a retraining trigger is the fastest way to get drift monitoring deprecated. Confirm impact first: estimate performance with NannyML's CBPE, compare against a business KPI threshold, and only retrain if the estimated metric breaches your service level objective. Many "drift" events are seasonal patterns the model already handles just fine.

Which drift library is best for monitoring LLM embeddings?

Alibi Detect, by a wide margin. Its MMDDrift and LearnedKernelDrift detectors with a PyTorch backend handle high-dimensional embedding spaces natively, and the preprocess_fn hook lets you reduce dimensionality with a UAE (untrained autoencoder) or BBSDs preprocessing step before testing. Evidently added embedding drift in 2024, but with a narrower set of statistical methods.
