Your model passed every offline test, hit production, and quietly started losing money three weeks later. Nothing crashed. Accuracy just slipped from 0.91 to 0.78 while the input distributions slid out from under it. That is the failure mode data drift detection exists to catch — and in 2026, three open-source Python libraries pretty much dominate the conversation: Evidently, NannyML, and Alibi Detect.
So, let's compare them on the dimensions that actually matter when you have to ship: detection methods, API ergonomics, runtime cost, label requirements, and how cleanly each plugs into a production monitoring pipeline. You'll also see runnable code for each tool against the same dataset — so you can copy a pattern that fits your stack instead of reading yet another vendor matrix.
Quick disclaimer: I've shipped all three of these in anger, and the "right" pick is rarely the one with the prettiest landing page.
Data Drift vs Concept Drift vs Label Drift
Before picking a tool, get the vocabulary right. Mixing these up is the single biggest source of bad monitoring dashboards (I've seen this kill more pager rotations than the drift itself).
- Data drift (covariate shift): P(X) changes, P(Y|X) stays the same. Inflation pushes transaction amounts up, but the fraud-vs-legit relationship is unchanged. Detect with KS test, PSI, Jensen-Shannon, Wasserstein.
- Concept drift: P(Y|X) changes, so the relationship itself shifts. Post-COVID, features that predicted churn in 2019 simply stopped working. Detect with rolling performance metrics on labeled data. Label-free estimators such as CBPE (NannyML's confidence-based performance estimation) assume P(Y|X) is stable, so when labels are delayed they can quantify the impact of data drift but cannot reveal concept drift on their own.
- Label drift (prior shift): P(Y) changes. Fraud rate doubles overnight. Detect with a chi-square test on the prediction or label distribution.
A useful rule of thumb: if performance is degrading and features have drifted, you probably have data drift. If performance degrades but feature distributions look stable, suspect concept drift. Most production teams end up monitoring all three on the same dashboard anyway.
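To make the label-drift check concrete, here is a minimal sketch of a chi-square test on class counts from two windows. The 2% and 4% fraud rates and the `label_drift_test` helper are invented for illustration; `scipy.stats.chi2_contingency` is the only library call doing real work.

```python
import numpy as np
from scipy.stats import chi2_contingency


def label_drift_test(reference_labels, current_labels):
    """Chi-square test of homogeneity on class counts from two windows."""
    classes = np.union1d(reference_labels, current_labels)
    # 2 x n_classes contingency table: one row per window
    table = np.array([
        [np.sum(reference_labels == c) for c in classes],
        [np.sum(current_labels == c) for c in classes],
    ])
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value


rng = np.random.default_rng(0)
ref_labels = rng.binomial(1, 0.02, size=20_000)      # hypothetical 2% fraud rate
doubled_labels = rng.binomial(1, 0.04, size=20_000)  # fraud rate doubles

chi2, p_value = label_drift_test(ref_labels, doubled_labels)
print(f"chi2={chi2:.1f}, p={p_value:.2e}")
```

The same function works on predicted classes instead of true labels, which is usually what you have on day one in production.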
The Statistical Tests You'll See Everywhere
Every drift library is, under the hood, a wrapper around a small set of statistical tests. Knowing which test fires when will save you hours of debugging false positives.
| Test | Best for | Threshold rule of thumb |
|---|---|---|
| Kolmogorov-Smirnov | Continuous features, small samples | p < 0.05 |
| Population Stability Index (PSI) | Binned numeric or categorical, large samples | < 0.1 stable, 0.1-0.2 moderate, > 0.2 drift |
| Jensen-Shannon distance | Categorical, bounded [0,1] | > 0.1 worth investigating |
| Wasserstein distance | Continuous, captures magnitude not just shape | Domain-dependent, use rolling baseline |
| Chi-square | Categorical features and labels | p < 0.05 |
| MMD (Maximum Mean Discrepancy) | High-dimensional, embeddings, images | permutation-based p-value |
One thing to watch for: KS and chi-square become extremely sensitive at large sample sizes. With a million-row reference window, they will flag statistically "significant" drift on shifts that are essentially noise. PSI and Jensen-Shannon are more honest at scale because they measure effect size, not just statistical significance.
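The large-sample effect is easy to reproduce. The sketch below compares a KS p-value with a hand-rolled PSI on two big samples whose means differ by 3% of a standard deviation. The `psi` helper, the ten quantile bins, and the sample sizes are choices made for this example, not any library's implementation.

```python
import numpy as np
from scipy.stats import ks_2samp


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index using quantile bins taken from the reference."""
    # Interior cut points; values beyond them fall into the first or last bin
    cuts = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_pct = np.bincount(np.searchsorted(cuts, reference), minlength=bins) / len(reference)
    cur_pct = np.bincount(np.searchsorted(cuts, current), minlength=bins) / len(current)
    # Avoid log(0) for empty bins
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=200_000)
current = rng.normal(0.03, 1.0, size=200_000)  # shift of 3% of a standard deviation

ks_p_value = ks_2samp(reference, current).pvalue
psi_value = psi(reference, current)
print(f"KS p-value: {ks_p_value:.2e}  PSI: {psi_value:.5f}")
```

KS rejects the null comfortably here, while PSI stays far below the 0.1 "stable" line from the table. Whether a shift that small matters is a business question, which is the argument for alerting on effect size.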
Setting Up a Shared Example
To make the comparison concrete, the rest of this article uses one synthetic credit-default dataset, split into a reference window (training-time data) and a current window (production data with injected drift). Run this once and reuse it across each library's example.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

rng = np.random.default_rng(42)

X, y = make_classification(
    n_samples=20_000, n_features=8, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)
cols = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=cols)
df["target"] = y

reference = df.iloc[:10_000].copy()
current = df.iloc[10_000:].copy()

# Inject realistic drift: shift two features, rescale a third
current["feature_0"] = current["feature_0"] + 0.7
current["feature_3"] = current["feature_3"] * 1.4
current["feature_5"] = current["feature_5"] + rng.normal(0.3, 0.1, len(current))

# Writing parquet needs a parquet engine such as pyarrow installed
reference.to_parquet("reference.parquet")
current.to_parquet("current.parquet")
Evidently: The Batteries-Included Default
Evidently is the most popular open-source choice in 2026, and honestly, it's not hard to see why — it covers the broadest surface area: tabular data drift, target drift, prediction drift, model performance, text and embedding drift, plus an LLM evaluation suite. It generates HTML reports out of the box and exposes JSON for ingestion into Grafana or any alerting stack.
from evidently import Report
from evidently.presets import DataDriftPreset
import pandas as pd

reference = pd.read_parquet("reference.parquet").drop(columns=["target"])
current = pd.read_parquet("current.parquet").drop(columns=["target"])

report = Report(metrics=[DataDriftPreset()])
snapshot = report.run(reference_data=reference, current_data=current)

# Inspect programmatically. The first entry summarises how many columns drifted.
# Its exact keys depend on the Evidently version (dataset_drift and
# number_of_drifted_columns belong to the legacy report format), so print the
# entry before indexing into it.
result = snapshot.dict()
print(f"Drift summary: {result['metrics'][0]['value']}")

# Export an HTML report for stakeholders
snapshot.save_html("drift_report.html")
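Evidently picks a default method per column type and reference size: with up to 1,000 reference rows it uses KS for numeric and chi-square for categorical columns, and above that it switches to distance measures (Wasserstein for numeric, Jensen-Shannon for categorical). You can override the method and threshold per feature. For monitoring at scale, pair it with Evidently Cloud or push the JSON snapshots into a Postgres-backed dashboard.
The library's biggest strength is also its weakness: the defaults are one-size-fits-all, and on noisy features they can fire far too often. Sanity-check those features by switching the method to PSI with a threshold of 0.2 (stattest="psi" with stattest_threshold=0.2 in the legacy API; check the parameter names for your installed version). Otherwise you'll be fighting alert fatigue inside a week.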
Catching Prediction Drift Without Labels
When ground truth is delayed, prediction drift is your earliest warning signal. Evidently supports this directly:
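from evidently import Report
from evidently.metrics import ValueDrift

# reference_with_preds / current_with_preds: DataFrames that include the model's
# output in a "prediction" column (assumed here; the setup script does not create them).
# TargetDriftPreset belongs to the legacy evidently.metric_preset module; with the
# API used above, a per-column ValueDrift metric covers the same check.
report = Report(metrics=[ValueDrift(column="prediction")])
snapshot = report.run(reference_data=reference_with_preds,
                      current_data=current_with_preds)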
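If you want the same signal without any monitoring library, a library-agnostic sketch is to compare histograms of predicted probabilities with the Jensen-Shannon distance. The `prediction_drift` helper, the 20 bins, and the Beta-distributed scores are assumptions made for this illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def prediction_drift(ref_scores, cur_scores, bins=20):
    """Jensen-Shannon distance (base 2, range [0, 1]) between score histograms."""
    edges = np.linspace(0.0, 1.0, bins + 1)  # scores are probabilities
    ref_hist, _ = np.histogram(ref_scores, bins=edges)
    cur_hist, _ = np.histogram(cur_scores, bins=edges)
    # jensenshannon normalises the count vectors to probabilities itself
    return float(jensenshannon(ref_hist, cur_hist, base=2))


rng = np.random.default_rng(0)
ref_scores = rng.beta(2, 5, size=10_000)      # simulated reference scores
stable_scores = rng.beta(2, 5, size=10_000)   # same distribution, new sample
shifted_scores = rng.beta(5, 2, size=10_000)  # model suddenly far more confident

stable_distance = prediction_drift(ref_scores, stable_scores)
shifted_distance = prediction_drift(ref_scores, shifted_scores)
print(f"stable: {stable_distance:.3f}  shifted: {shifted_distance:.3f}")
```

The 0.1 "worth investigating" threshold from the test table is a reasonable starting point for the alert.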
NannyML: Performance Estimation Without Labels
NannyML's killer feature is CBPE (Confidence-Based Performance Estimation) for classifiers, along with its regression counterpart DLE (Direct Loss Estimation). These estimate model accuracy, AUC, or RMSE in production before labels arrive, which is exactly the problem most teams solve badly with vague "drift score" alerts.
import nannyml as nml
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

reference = pd.read_parquet("reference.parquet")
current = pd.read_parquet("current.parquet")

# Simulate scoring. For brevity the model is trained on the reference window;
# in practice NannyML's reference set should be held-out data the model has not seen.
model = GradientBoostingClassifier().fit(
    reference.drop(columns=["target"]), reference["target"]
)

# Add predictions and timestamps; the current window starts after the reference ends
for df_, start in ((reference, "2026-01-01"), (current, "2026-01-08")):
    df_["y_pred_proba"] = model.predict_proba(df_.drop(columns=["target"]))[:, 1]
    df_["y_pred"] = (df_["y_pred_proba"] > 0.5).astype(int)
    df_["timestamp"] = pd.date_range(start, periods=len(df_), freq="min")

feature_cols = [c for c in reference.columns if c.startswith("feature_")]

# Multivariate drift via PCA reconstruction error
calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_cols,
    timestamp_column_name="timestamp",
    chunk_size=1000,
).fit(reference)
mv_results = calc.calculate(current)
mv_results.plot().show()

# Estimate AUC without labels
estimator = nml.CBPE(
    y_pred_proba="y_pred_proba",
    y_pred="y_pred",
    y_true="target",
    metrics=["roc_auc"],
    chunk_size=1000,
    problem_type="classification_binary",
).fit(reference)
perf = estimator.estimate(current)

# Result columns are two-level: (metric, field)
perf_df = perf.filter(period="analysis").to_df()
print(perf_df[[("chunk", "key"), ("roc_auc", "value"), ("roc_auc", "alert")]].head())
Two NannyML patterns are worth memorizing. First, the multivariate DataReconstructionDriftCalculator catches changes in feature correlations that univariate tests miss entirely — a model can have every feature stable in isolation while their joint distribution has quietly shifted under you. Second, CBPE's alert column gives you a deployable signal: trigger retraining workflows on it, not on raw drift scores.
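The first pattern is easy to see on toy data. The sketch below is a simplified illustration of the reconstruction-error idea using scikit-learn, not NannyML's implementation: two features keep (approximately) the same marginals while their correlation disappears. The data and the `reconstruction_error` helper are made up for the example.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5_000

# Reference: two strongly correlated features
x1 = rng.normal(size=n)
reference = np.column_stack([x1, x1 + rng.normal(scale=0.1, size=n)])

# Current: same marginal distributions, but the correlation is gone
current = np.column_stack([
    rng.normal(size=n),
    rng.normal(scale=np.sqrt(1.01), size=n),
])

# Learn the reference structure: scale, then compress to one component
scaler = StandardScaler().fit(reference)
pca = PCA(n_components=1).fit(scaler.transform(reference))


def reconstruction_error(X):
    """Mean Euclidean distance between rows and their PCA reconstruction."""
    Z = scaler.transform(X)
    Z_hat = pca.inverse_transform(pca.transform(Z))
    return float(np.mean(np.linalg.norm(Z - Z_hat, axis=1)))


ref_error = reconstruction_error(reference)
cur_error = reconstruction_error(current)
ks_p_values = [ks_2samp(reference[:, j], current[:, j]).pvalue for j in range(2)]

print(f"reconstruction error: reference={ref_error:.3f}, current={cur_error:.3f}")
print(f"per-feature KS p-values: {ks_p_values}")
```

The reconstruction error jumps because the compression learned on the reference no longer fits, even though each column on its own was drawn from the same distribution as before.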
NannyML is the right tool when (a) ground-truth labels arrive days or weeks late, and (b) you need a number you can put in front of stakeholders that says "your model accuracy is now estimated at 0.74 ± 0.03." That second part is genuinely underrated — try defending a retraining ticket with "the KS p-value dropped" sometime.
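The intuition behind a confidence-based estimate fits in a few lines. This is a deliberately simplified sketch of the idea for accuracy, not NannyML's actual CBPE code, and it assumes a perfectly calibrated model: if the model says 0.9, it is right 90% of the time, so the expected accuracy is the average of max(p, 1 - p).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a perfectly calibrated binary classifier (an assumption of this sketch)
y_pred_proba = rng.uniform(0.0, 1.0, size=50_000)
y_true = rng.binomial(1, y_pred_proba)  # labels we pretend not to have yet
y_pred = (y_pred_proba > 0.5).astype(int)

# Each prediction is correct with probability max(p, 1 - p)
estimated_accuracy = float(np.mean(np.maximum(y_pred_proba, 1.0 - y_pred_proba)))

# Once labels arrive, compare against what actually happened
realized_accuracy = float(np.mean(y_pred == y_true))

print(f"estimated={estimated_accuracy:.3f}  realized={realized_accuracy:.3f}")
```

The estimate only holds while calibration holds, which is why this family of methods tracks the impact of input drift but goes blind when the input-to-target relationship itself changes.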
Alibi Detect: Advanced Detectors and Multi-Modal Data
Alibi Detect, maintained by Seldon, is the heavyweight option. It ships dozens of detectors — KS, MMD, Chi-square, Fisher's Exact Test, Learned Kernel MMD, Classifier Drift, Spot-the-Diff, plus dedicated detectors for images, text, and time series. If you're monitoring embeddings, model logits, or anything beyond a tabular DataFrame, this is usually the right answer.
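import numpy as np
import pandas as pd
from alibi_detect.cd import KSDrift, MMDDrift, ClassifierDrift
from sklearn.linear_model import LogisticRegression

reference = pd.read_parquet("reference.parquet").drop(columns=["target"]).to_numpy(dtype=np.float32)
current = pd.read_parquet("current.parquet").drop(columns=["target"]).to_numpy(dtype=np.float32)

# Univariate KS test per feature with Bonferroni correction
ks = KSDrift(reference, p_val=0.05, correction="bonferroni")
print(ks.predict(current))

# Multivariate MMD with permutation test. MMD builds a pairwise kernel matrix,
# so subsample large windows (2,000 rows each here) to keep memory in check.
rng = np.random.default_rng(0)
ref_sample = reference[rng.choice(len(reference), 2_000, replace=False)]
cur_sample = current[rng.choice(len(current), 2_000, replace=False)]
mmd = MMDDrift(ref_sample, p_val=0.05, n_permutations=100, backend="pytorch")
print(mmd.predict(cur_sample))

# Classifier-based drift: train a model to distinguish ref from current.
# The detector needs an actual model; with the sklearn backend, any scikit-learn
# classifier that exposes predict_proba will do.
clf = ClassifierDrift(
    reference, model=LogisticRegression(max_iter=1000), backend="sklearn",
    p_val=0.05, n_folds=5, preprocess_at_init=True,
)
print(clf.predict(current))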
The ClassifierDrift approach is, in my opinion, one of the most underrated detectors in the entire ecosystem. The intuition is dead simple: if a binary classifier can reliably distinguish reference rows from current rows, the distributions have drifted. How well the classifier separates the two sets on held-out rows is the drift magnitude, and if you plug in a model that exposes feature importances, those point to the columns that moved. Beautiful.
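The idea is small enough to sketch with scikit-learn alone. This is an illustration of the domain-classifier technique, not Alibi Detect's internals: it scores drift with cross-validated ROC AUC, where 0.5 means the windows are indistinguishable. The `domain_classifier_auc` helper and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def domain_classifier_auc(reference, current, n_folds=5):
    """Out-of-fold AUC of a classifier trained to tell reference rows from current rows."""
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    proba = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=n_folds, method="predict_proba"
    )[:, 1]
    return float(roc_auc_score(y, proba))


rng = np.random.default_rng(0)
reference = rng.normal(size=(2_000, 8))
stable = rng.normal(size=(2_000, 8))
drifted = rng.normal(size=(2_000, 8))
drifted[:, 0] += 0.7  # shift one feature, as in the shared example

stable_auc = domain_classifier_auc(reference, stable)
drifted_auc = domain_classifier_auc(reference, drifted)
print(f"stable AUC: {stable_auc:.3f}  drifted AUC: {drifted_auc:.3f}")
```

A linear model only sees mean shifts; swapping in a tree ensemble would also pick up the rescaling and correlation changes, at a higher runtime cost.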
The downside, confirmed in the 2024 D3Bench microbenchmark, is runtime. Alibi Detect's MMD and learned-kernel detectors can run two orders of magnitude slower than NannyML or Evidently on the same data. For batch monitoring this is fine; for low-latency online detection, stick to KS or Chi-square detectors.
Head-to-Head Comparison
| Capability | Evidently | NannyML | Alibi Detect |
|---|---|---|---|
| Univariate drift tests | 20+ built-in | KS, Chi-square, JS, Wasserstein, L-infinity, Hellinger | KS, Chi-square, Fisher's exact, Cramér-von Mises |
| Multivariate drift | Domain classifier | PCA reconstruction error, domain classifier | MMD, LSDD, Learned Kernel MMD, Classifier Drift, Spot-the-Diff |
| Performance estimation without labels | No | Yes (CBPE, DLE, M-CBPE) | No |
| Image / text / embedding drift | Embeddings only | Tabular only | Native support |
| Online / streaming detectors | No | No | Yes (Online MMD, Online LSDD) |
| HTML reports / dashboards | Excellent | Plotly plots | Minimal — bring your own |
| Runtime on tabular data | Fast | Fastest | Slowest (10-100×) |
| License | Apache 2.0 | Apache 2.0 | Business Source License 1.1 (older releases Apache 2.0) |
| Best for | General-purpose monitoring, stakeholder reports | Pinpointing drift timing, label-free performance | Multi-modal data, advanced detectors, online streams |
Which One Should You Actually Use?
- Use Evidently if you need one tool to cover dashboards, ad-hoc analysis, and integration with the rest of the MLOps stack. It's the lowest-friction starting point, full stop.
- Use NannyML if labels arrive late and you need defensible performance estimates for retraining decisions. Pair its CBPE with Evidently's reports — that combo is hard to beat.
- Use Alibi Detect if you're monitoring images, embeddings from an LLM, or need an online detector for a streaming pipeline. Also the right pick if you want classifier-based drift with full statistical rigor.
In practice, most mature teams I've worked with run two of the three together: Evidently for visual reporting, and a second library — usually NannyML — for the alerting signal that actually triggers retraining. Don't feel bad about doubling up; the libraries solve overlapping but genuinely different problems.
Production Patterns That Reduce Pager Noise
- Use a rolling reference window, not the training set. A frozen training reference flags every seasonal pattern as drift. A 30-day rolling window gives a realistic baseline.
- Alert on effect size, not p-values. KS p-values become meaningless at sample sizes above ~50,000. Use PSI > 0.2 or Wasserstein above a domain-calibrated threshold instead.
- Distinguish "drift detected" from "act on it." Drift is necessary but not sufficient for retraining. Confirm with NannyML's CBPE or with downstream business metrics before kicking off a pipeline.
- Monitor the prediction distribution daily. Prediction drift is cheap to compute, label-free, and (in my experience) catches the majority of real failures earlier than feature drift does.
- Version both reference data and detector configs. A drift alert without the exact reference window and threshold used is unreproducible — store them with the same rigor as model artifacts.
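The first two patterns combine naturally. Below is a minimal sketch, assuming a DataFrame with a `timestamp` column and a numeric feature, that compares the latest day against the 30 days before it using Wasserstein distance. The `rolling_drift` helper, the `amount` column, and the simulated incident are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance


def rolling_drift(df, column, as_of, ref_days=30, cur_days=1):
    """Wasserstein distance between the latest window and the rolling reference before it."""
    as_of = pd.Timestamp(as_of)
    cur_start = as_of - pd.Timedelta(days=cur_days)
    ref_start = cur_start - pd.Timedelta(days=ref_days)
    ts = df["timestamp"]
    reference = df.loc[(ts >= ref_start) & (ts < cur_start), column]
    current = df.loc[(ts >= cur_start) & (ts < as_of), column]
    return float(wasserstein_distance(reference, current))


rng = np.random.default_rng(0)
timestamps = pd.date_range("2026-01-01", periods=40 * 96, freq="15min")  # 40 days
df = pd.DataFrame({
    "timestamp": timestamps,
    "amount": rng.normal(100, 10, len(timestamps)),
})

# Hypothetical incident: values on the final day jump by 15
df.loc[df["timestamp"] >= pd.Timestamp("2026-02-09"), "amount"] += 15

quiet_day = rolling_drift(df, "amount", as_of="2026-02-09")
incident_day = rolling_drift(df, "amount", as_of="2026-02-10")
print(f"quiet day: {quiet_day:.2f}  incident day: {incident_day:.2f}")
```

Wasserstein distance is in the feature's own units, so the alert threshold has to be calibrated per feature, for example from the spread of daily distances over a quiet period.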
Frequently Asked Questions
What is the difference between data drift and concept drift?
Data drift means the distribution of input features P(X) has shifted while the relationship between inputs and target P(Y|X) is unchanged. Concept drift means that relationship itself has changed, so the same input now maps to a different output. Data drift is detectable without labels; confirming concept drift requires labels, because label-free estimators such as NannyML's CBPE assume the input-to-target relationship has not changed.
Is the KS test enough for production drift detection?
Not on its own. The Kolmogorov-Smirnov test is sensitive to any distribution change, including statistically real but operationally meaningless ones. At sample sizes above 50,000 rows it'll flag drift on essentially every batch. Pair KS p-values with an effect-size measure such as PSI or Wasserstein distance, and apply a Bonferroni or false-discovery-rate correction when running it across many features.
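As a rough sketch of that advice, the helper below runs KS per feature, applies a Bonferroni-corrected threshold, and keeps the KS statistic alongside the p-value so you can judge the size of the shift. The `drifted_features` name and the synthetic data are assumptions for this example.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(reference, current, alpha=0.05):
    """Per-feature KS test with a Bonferroni-corrected significance threshold."""
    n_features = reference.shape[1]
    threshold = alpha / n_features  # Bonferroni: split alpha across all tests
    report = {}
    for j in range(n_features):
        result = ks_2samp(reference[:, j], current[:, j])
        # Keep the statistic as an effect-size proxy next to the significance flag
        report[j] = (float(result.statistic), float(result.pvalue), bool(result.pvalue < threshold))
    return report


rng = np.random.default_rng(0)
reference = rng.normal(size=(5_000, 8))
current = rng.normal(size=(5_000, 8))
current[:, 0] += 0.7  # mean shift
current[:, 3] *= 1.4  # rescaling

report = drifted_features(reference, current)
for j, (statistic, p_value, flagged) in report.items():
    print(f"feature_{j}: D={statistic:.3f} p={p_value:.2e} drift={flagged}")
```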
Can I detect drift without ground-truth labels?
Yes. Univariate and multivariate input-distribution tests need no labels. For label-free performance estimation, NannyML's CBPE infers expected accuracy or AUC from the model's confidence calibration on the reference set, and DLE handles regression by directly estimating loss. Combine both with prediction-distribution monitoring for full coverage.
How often should I run drift detection?
Match the cadence to your retraining cycle. Daily batch checks suit most tabular models. High-volume online systems benefit from sliding-window detectors (Alibi Detect's MMDDriftOnline) updated every few minutes. Anything more frequent than your decision-making cycle is just noise.
Does drift always mean I need to retrain?
No — and treating every alert as a retraining trigger is the fastest way to get drift monitoring deprecated. Confirm impact first: estimate performance with NannyML's CBPE, compare against a business KPI threshold, and only retrain if the estimated metric breaches your service level objective. Many "drift" events are seasonal patterns the model already handles just fine.
Which drift library is best for monitoring LLM embeddings?
Alibi Detect, by a wide margin. Its MMDDrift and LearnedKernelDrift detectors with a PyTorch backend handle high-dimensional embedding spaces natively, and the preprocess_fn hook lets you reduce dimensionality with a UAE (untrained autoencoder) or BBSDs preprocessing step before testing. Evidently added embedding drift in 2024, but with a narrower set of statistical methods.