Conformal Prediction in Python with MAPIE 1.0: Distribution-Free Uncertainty Quantification (2026)
A practical 2026 guide to conformal prediction in Python with MAPIE 1.0: split conformal, CQR, APS/RAPS classification, alpha selection, and how to keep coverage under drift.
Conformal prediction is a distribution-free framework that wraps any trained model to produce prediction intervals or sets with a mathematically guaranteed coverage rate: for example, an interval that contains the true label at least 90% of the time, regardless of the underlying data distribution. In Python, the MAPIE library (Model Agnostic Prediction Interval Estimator) is the de-facto reference implementation, and version 1.0 (released late 2025) ships with a redesigned scikit-learn–compatible API, conformalized quantile regression, and risk-controlling prediction sets out of the box.
Conformal prediction adds finite-sample, distribution-free coverage guarantees to any predictor. No assumption of normality, calibration, or correctly specified likelihood is required.
MAPIE 1.0 exposes regressors, classifiers, and risk controllers through a unified conformalize() / predict_interval() / predict_set() API that plugs into existing scikit-learn pipelines.
Split (inductive) conformal is the fastest method and the right default. CV+, Jackknife+ and CQR (conformalized quantile regression) trade compute for tighter, locally adaptive intervals.
Coverage is marginal by construction. Getting closer to conditional coverage across subgroups takes adaptive scores such as APS and RAPS, or Mondrian conformal predictors.
Conformal guarantees rely on data exchangeability. Under covariate shift or concept drift, you must either reweight residuals or re-calibrate on fresh data.
What is conformal prediction in machine learning?
Conformal prediction (CP) is a framework, introduced by Vovk, Gammerman and Shafer in the early 2000s, that converts the point prediction of any black-box model into a prediction set (for classification) or prediction interval (for regression) with a finite-sample coverage guarantee. Formally, given a target miscoverage rate α ∈ (0, 1), a conformal predictor outputs a set C(X) such that
P( Y ∈ C(X) ) ≥ 1 − α
holds marginally, that is, with the probability taken over the random draw of both the calibration set and the test point, provided the calibration and test samples are exchangeable. The guarantee is distribution-free. It does not require the residuals to be Gaussian, the classifier to be well-calibrated, or the likelihood to be correctly specified. Angelopoulos and Bates' tutorial paper "A Gentle Introduction to Conformal Prediction" is the standard reference if you want the proofs.
The mechanism is disarmingly simple. You split your training data into a proper training fold and a held-out calibration fold. Fit the model on the training fold, compute a non-conformity score on the calibration fold (for regression, the absolute residual; for classification, one minus the softmax probability of the true class), and take a slightly inflated empirical quantile of those scores: the ⌈(n_cal + 1)(1 − α)⌉-th smallest of the n_cal calibration scores. The "+1" is the finite-sample correction that makes the guarantee exact rather than asymptotic. Use that quantile as a threshold around the test prediction. That threshold is what gives you the interval or set, and the only ingredient you ever need from your model is its predict (or predict_proba) method.
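To make the recipe concrete, here is a minimal from-scratch sketch in plain NumPy. It is not MAPIE code: the synthetic data, the straight-line model, and names such as q_hat are my own assumptions for illustration. The threshold is the ceil((n_cal + 1)(1 − α))-th smallest calibration score, which is the usual finite-sample-corrected quantile.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative assumption): y = 2x + Gaussian noise
n = 3000
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + rng.normal(0, 1.5, size=n)

# Three-way split: train / calibrate / test
x_train, y_train = x[:1000], y[:1000]
x_cal, y_cal = x[1000:2000], y[1000:2000]
x_test, y_test = x[2000:], y[2000:]

# 1. Fit any point predictor on the training fold (here: a straight line)
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def predict(x_new):
    return slope * x_new + intercept


# 2. Non-conformity scores on the calibration fold: absolute residuals
scores = np.abs(y_cal - predict(x_cal))

# 3. Threshold: the ceil((n_cal + 1)(1 - alpha))-th smallest calibration score
alpha = 0.10
n_cal = len(scores)
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q_hat = np.sort(scores)[k - 1]

# 4. Interval = point prediction +/- q_hat; check coverage on the test fold
lower = predict(x_test) - q_hat
upper = predict(x_test) + q_hat
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"q_hat={q_hat:.3f}  empirical coverage={coverage:.3f}")
```

Swapping the straight line for any other predictor changes only step 1; steps 2 to 4 never look inside the model.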
Why MAPIE in 2026?
Several libraries implement conformal prediction in Python (crepes, nonconformist, puncc, and torchcp are alternatives), but MAPIE remains the most actively maintained and the only one with a fully scikit-learn–compatible interface across regression, classification, time series, multi-label, and risk control. The 1.0 release in November 2025 was a breaking redesign with three concrete wins:
Unified API. Every estimator now follows the same three-step pattern: fit() on training data, conformalize() on calibration data, predict_interval() or predict_set() on test data. Previous versions overloaded fit with a cv parameter that was easy to misuse.
Native CQR and locally adaptive scores. Conformalized quantile regression (Romano, Patterson, Candès 2019) is a first-class regressor; you wrap any quantile estimator (e.g. GradientBoostingRegressor(loss="quantile")) and get heteroscedastic intervals for free.
Risk-controlling prediction sets. The new mapie.risk_control module implements RCPS and LTT, which are useful when "coverage" is the wrong objective and you actually want to bound false-negative or precision risk at a chosen level.
If you're already using scikit-learn pipelines, MAPIE drops in with no refactor. Honestly, that matters more than benchmarks: in my experience, the lifecycle cost of an uncertainty library is dominated by how cleanly it composes with the rest of your stack.
Installing MAPIE 1.0 and preparing data
MAPIE 1.0 requires Python 3.10+ and scikit-learn ≥ 1.4. Install it with pip (pip install "mapie>=1.0") or with uv (uv pip install "mapie>=1.0").
Throughout this article I'll use the California Housing dataset for regression and the Covertype subset for classification. Both are available through scikit-learn's dataset fetchers, which download and cache the data on first use, so the snippets need no manual data preparation. The critical setup step is the three-way split: training data for the model, calibration data for the conformal scores, and test data for the coverage check.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
X, y = fetch_california_housing(return_X_y=True)
# 60 / 20 / 20 -> train / calibrate / test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.40, random_state=42
)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)
print(X_train.shape, X_cal.shape, X_test.shape)
# (12384, 8) (4128, 8) (4128, 8)
A 60/20/20 split is a reasonable default. For split conformal, the marginal coverage variance scales with the inverse of the calibration set size, so I'd push the calibration fold to 30%+ on datasets with fewer than 5,000 rows. For more on principled splitting strategies, see our deep dive on cross-validation strategies in scikit-learn.
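To put numbers on that scaling, the sketch below uses a standard result for split conformal with continuous scores: conditional on the calibration set, test coverage follows a Beta(k, n_cal + 1 − k) distribution with k = ceil((n_cal + 1)(1 − α)). The helper name coverage_mean_std is mine, not part of any library.

```python
import math


def coverage_mean_std(n_cal, alpha):
    """Mean and std of test coverage, conditional on the calibration set.

    Assumes split conformal with continuous (tie-free) scores, where coverage
    is Beta(k, n_cal + 1 - k) with k = ceil((n_cal + 1) * (1 - alpha)).
    """
    k = math.ceil((n_cal + 1) * (1 - alpha))
    if k > n_cal:
        raise ValueError("alpha is too small for this calibration size")
    a, b = k, n_cal + 1 - k
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


for n_cal in (100, 500, 1000, 4128):
    mean, std = coverage_mean_std(n_cal, alpha=0.10)
    print(f"n_cal={n_cal:>5}  mean coverage={mean:.4f}  std={std:.4f}")
```

The variance falls like 1/n_cal, so the standard deviation falls like 1/√n_cal: quadrupling the calibration fold halves the run-to-run wobble in realized coverage.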
Split conformal regression: the minimal example
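The simplest conformal regressor is "split" (also called "inductive") conformal with absolute-residual non-conformity scores. It costs one model fit and one quantile computation. In MAPIE 1.0 it follows the three-step pattern described earlier: fit() on the training fold, conformalize() on the calibration fold, and predict_interval() on the test fold.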
Two things to notice. First, the empirical coverage on the test set lands right at the nominal 90%. That's the theorem doing its job. Second, the interval width is constant across test points, because the absolute-residual score has no input dependence. Heteroscedasticity is the most common reason practitioners say "conformal intervals look too wide". Split conformal averages residual magnitude across the entire calibration set, so a model that is precise in dense regions and noisy in sparse regions still gets one global width.
CV+, Jackknife+, and conformalized quantile regression
To get tighter or locally adaptive intervals without giving up a coverage guarantee, you have three good options. CV+ and Jackknife+ (Barber, Candès, Ramdas, Tibshirani 2021) aggregate out-of-fold residuals instead of holding out a calibration set: CV+ refits the model K times on cross-validation folds, and Jackknife+ is the leave-one-out limit with one fit per training point. They recover the data lost to a held-out calibration set at the cost of those extra fits, and their worst-case guarantee is 1 − 2α, although empirical coverage is typically close to 1 − α. Conformalized quantile regression (CQR) instead asks the base model to predict the α/2 and 1−α/2 conditional quantiles, then uses the calibration set only to adjust those quantile predictions by a single additive offset. CQR is my default whenever the residuals are heteroscedastic and the base learner can produce quantile predictions.
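The CQR adjustment itself is only a few lines. The sketch below is a from-scratch NumPy illustration, not MAPIE's implementation: the heteroscedastic synthetic data and the crude per-bin quantile "model" are stand-ins I made up so the example runs without any quantile-regression library. In practice you would plug in a real quantile estimator.

```python
import numpy as np

rng = np.random.default_rng(1)


def sample(n):
    # Heteroscedastic data (illustrative assumption): noise grows with x
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1 + 0.9 * x, size=n)
    return x, y


x_train, y_train = sample(4000)
x_cal, y_cal = sample(2000)
x_test, y_test = sample(2000)

alpha = 0.10
n_bins = 10
edges = np.linspace(0, 1, n_bins + 1)


def bin_index(x):
    # Map each x in [0, 1] to one of n_bins equal-width bins
    return np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)


# Stand-in quantile model: per-bin empirical alpha/2 and 1 - alpha/2 quantiles
train_bins = bin_index(x_train)
q_lo_table = np.array(
    [np.quantile(y_train[train_bins == b], alpha / 2) for b in range(n_bins)]
)
q_hi_table = np.array(
    [np.quantile(y_train[train_bins == b], 1 - alpha / 2) for b in range(n_bins)]
)


def predict_quantiles(x):
    b = bin_index(x)
    return q_lo_table[b], q_hi_table[b]


# CQR score: how far y falls outside [q_lo, q_hi] (negative when inside)
lo_cal, hi_cal = predict_quantiles(x_cal)
scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)

# Single additive offset: the ceil((n_cal + 1)(1 - alpha))-th smallest score
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
offset = np.sort(scores)[k - 1]

lo_test, hi_test = predict_quantiles(x_test)
lower, upper = lo_test - offset, hi_test + offset
coverage = np.mean((y_test >= lower) & (y_test <= upper))
width = upper - lower
print(f"offset={offset:.3f}  coverage={coverage:.3f}")
print(f"mean width, x<0.2: {width[x_test < 0.2].mean():.2f}")
print(f"mean width, x>0.8: {width[x_test > 0.8].mean():.2f}")
```

The offset can come out negative when the raw quantile band is already too wide, in which case CQR shrinks the band instead of growing it.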
Coverage still sits right at 90%, the mean width is roughly 16% narrower than split conformal, and (crucially) the width varies across the input space: wide where the model is uncertain, tight where it's confident. That's the locally adaptive behavior most users expect from "uncertainty quantification" in the first place.
When choosing between CV+ and CQR, I lean on a rule from the original CQR paper: if your base learner has good quantile estimates (gradient boosting, quantile forests, NGBoost), use CQR; otherwise fall back to CV+ or Jackknife+, which work with any point predictor. CV+ is also the right choice if you can't afford to hold out a calibration set. See the related discussion in our piece on hyperparameter tuning with Optuna and GridSearchCV, which has overlapping splitting concerns.
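For readers who want to see what CV+ actually computes, here is a small from-scratch sketch following the construction in Barber et al. It is an illustration under my own assumptions (synthetic data, a straight-line base model, K = 5), not MAPIE's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

n, n_test, K, alpha = 500, 2000, 5, 0.10
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + rng.normal(0, 1.5, size=n)
x_test = rng.uniform(0, 10, size=n_test)
y_test = 2.0 * x_test + rng.normal(0, 1.5, size=n_test)

# Assign every training point to one of K folds
fold_of = rng.permutation(n) % K

residuals = np.empty(n)
# Row i holds the test predictions of the fold model that never saw point i
oof_test_preds = np.empty((n, n_test))
for fold in range(K):
    in_fold = fold_of == fold
    slope, intercept = np.polyfit(x[~in_fold], y[~in_fold], deg=1)
    residuals[in_fold] = np.abs(y[in_fold] - (slope * x[in_fold] + intercept))
    oof_test_preds[in_fold] = slope * x_test + intercept

# CV+ bounds are order statistics over the n values mu_{-k(i)}(x) -/+ R_i
k_lo = int(np.floor(alpha * (n + 1)))        # 1-based rank for the lower bound
k_hi = int(np.ceil((1 - alpha) * (n + 1)))   # 1-based rank for the upper bound
lower = np.sort(oof_test_preds - residuals[:, None], axis=0)[k_lo - 1]
upper = np.sort(oof_test_preds + residuals[:, None], axis=0)[k_hi - 1]

coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"CV+ empirical coverage={coverage:.3f}  mean width={np.mean(upper - lower):.3f}")
```

All 500 points contribute both to fitting and to the residual distribution, which is the point of the method when data is scarce.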
Classification with APS and RAPS prediction sets
For classification, conformal prediction returns a set of plausible labels rather than a single prediction. Easy inputs return singleton sets; hard inputs return sets containing several candidate labels. The two scores that matter in 2026 are APS (Adaptive Prediction Sets, Romano et al. 2020) and its regularized variant RAPS (Angelopoulos et al. 2021), which fights the tendency of APS to include long tails on poorly calibrated softmax outputs.
The size of the prediction set is itself a useful uncertainty signal. A medical triage workflow, for example, can route singleton predictions to automated handling and escalate larger sets to human review. It's a pattern that pairs naturally with the model-monitoring approach we covered in data drift detection with Evidently and NannyML.
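Here is a from-scratch sketch of the deterministic (non-randomized) APS score and the set-size routing idea, in plain NumPy. It covers APS only, not the RAPS penalty, and it is not MAPIE's code: the Dirichlet-sampled probabilities stand in for a classifier's predict_proba output, and forcing the top label into every set is my own convenience choice to avoid empty sets.

```python
import numpy as np

rng = np.random.default_rng(3)
n_classes, alpha = 5, 0.10


def sample(n):
    # Stand-in for predict_proba output; labels are drawn from those probabilities
    probs = rng.dirichlet(np.full(n_classes, 0.3), size=n)
    labels = np.array([rng.choice(n_classes, p=p) for p in probs])
    return probs, labels


def aps_scores_all(probs):
    """APS score of every candidate label: total mass of labels ranked at or above it."""
    order = np.argsort(-probs, axis=1)  # most likely label first
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    scores = np.empty_like(probs)
    np.put_along_axis(scores, order, cum, axis=1)  # undo the sort
    return scores


probs_cal, y_cal = sample(2000)
probs_test, y_test = sample(2000)

# Calibrate on the score of the true label
cal_scores = aps_scores_all(probs_cal)[np.arange(len(y_cal)), y_cal]
k = int(np.ceil((len(cal_scores) + 1) * (1 - alpha)))
q_hat = np.sort(cal_scores)[k - 1]

# Prediction set: every label whose APS score is within the threshold
pred_sets = aps_scores_all(probs_test) <= q_hat  # boolean, shape (n_test, n_classes)
# Always keep the top label so no set is empty (only ever adds coverage)
pred_sets[np.arange(len(y_test)), probs_test.argmax(axis=1)] = True

coverage = pred_sets[np.arange(len(y_test)), y_test].mean()
set_sizes = pred_sets.sum(axis=1)

# Triage-style routing: singletons are automated, larger sets are escalated
automated = set_sizes == 1
print(f"coverage={coverage:.3f}  mean set size={set_sizes.mean():.2f}")
print(f"share routed to automation={automated.mean():.2f}")
```

The deterministic variant tends to over-cover slightly; the randomized version of APS trades that slack for smaller sets.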
How do you choose the miscoverage level alpha?
α is a business decision, not a statistical one. A 90% interval (α = 0.10) is the most common default in academic benchmarks because it gives readable plots; production systems often pick α based on the asymmetric cost of missing the true value versus the cost of a wider interval. A few practical anchors:
α = 0.01 (99% coverage): safety-critical loops (clinical, industrial control). Intervals will be wide; that's the point.
One subtlety worth flagging: requesting α smaller than 1 / (n_cal + 1) is impossible. Split conformal cannot give you 99% coverage with 50 calibration points, because the quantile estimate has no support up there. MAPIE will warn if you cross that boundary, and if you see the warning, the fix is more calibration data, not a different library. (I hit this exact bug shipping a small-sample healthcare model in 2024 and ended up holding out 40% of the labels just to keep the guarantee meaningful.)
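The arithmetic behind that boundary is worth spelling out. Split conformal uses the ceil((n_cal + 1)(1 − α))-th smallest calibration score; when that rank exceeds n_cal there is no such score, and the only valid interval is infinite. The helper names below are mine, written just for this illustration.

```python
import math


def conformal_rank(n_cal, alpha):
    """1-based rank of the calibration score used as the threshold.

    Returns None when the rank exceeds n_cal, i.e. when no finite
    threshold exists for this (n_cal, alpha) pair.
    """
    k = math.ceil((n_cal + 1) * (1 - alpha))
    return k if k <= n_cal else None


def min_calibration_size(alpha):
    """Smallest n_cal with a finite threshold: n_cal >= 1 / alpha - 1."""
    return math.ceil(1 / alpha - 1)


print(conformal_rank(50, 0.10))    # rank 46 of 50: fine
print(conformal_rank(50, 0.01))    # None: would need the 51st of 50 scores
print(min_calibration_size(0.01))  # 99 calibration points at the very least
```

Clearing this minimum only makes the interval finite. With 99 points the 99% threshold is simply the largest calibration residual, so it is still a very noisy estimate.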
Diagnostics, conditional coverage, and common pitfalls
The coverage guarantee from split conformal is marginal: averaged across all test points, you get 1 − α. It does not guarantee that coverage holds in every subgroup. Intervals can reach 90% coverage overall while covering only 60% of a minority subpopulation and 95% of the majority, and nothing in the marginal guarantee will flag that imbalance.
The diagnostic I always run after fitting any conformal predictor is a coverage-by-bin plot, stratified by the variable I suspect drives heterogeneity (age, geography, model confidence). If conditional coverage collapses in a subgroup, the remedies are well known:
Mondrian conformal prediction: partition the calibration set by the protected variable and compute a separate quantile per bin. MAPIE exposes this via the groups argument to conformalize().
Locally adaptive scores: swap the absolute-residual score for one normalized by a learned variance estimate (the "residual_normalized" score in MAPIE).
CQR or APS: both give better conditional coverage than their non-adaptive predecessors essentially for free.
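The first remedy and the per-group diagnostic fit in one short sketch. This is a from-scratch NumPy illustration of the Mondrian idea, not a MAPIE call: the two synthetic groups, the noise levels, and the fixed stand-in model are all assumptions I chose to make the subgroup gap visible.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.10


def sample(n):
    # Two groups (illustrative): group 1 is a ~15% minority with 3x noisier labels
    group = (rng.random(n) < 0.15).astype(int)
    x = rng.uniform(0, 10, size=n)
    noise_sd = np.where(group == 1, 3.0, 1.0)
    y = 2.0 * x + rng.normal(0, noise_sd)
    return x, y, group


def predict(x):
    return 2.0 * x  # stand-in for an already fitted model


def conformal_quantile(scores, alpha):
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    return np.sort(scores)[k - 1]


x_cal, y_cal, g_cal = sample(4000)
x_test, y_test, g_test = sample(4000)
cal_scores = np.abs(y_cal - predict(x_cal))
test_scores = np.abs(y_test - predict(x_test))

# Marginal split conformal: one global threshold
q_global = conformal_quantile(cal_scores, alpha)
# Mondrian: a separate threshold per group
q_group = {g: conformal_quantile(cal_scores[g_cal == g], alpha) for g in (0, 1)}

cov = {}
for g in (0, 1):
    mask = g_test == g
    cov[g] = (
        np.mean(test_scores[mask] <= q_global),     # coverage with global threshold
        np.mean(test_scores[mask] <= q_group[g]),   # coverage with per-group threshold
    )
    print(f"group {g}: global={cov[g][0]:.3f}  mondrian={cov[g][1]:.3f}")
```

The price of the per-group threshold is that each group is calibrated on fewer points, so small groups get noisier thresholds.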
Other footguns. Never re-use training data as the calibration set (the guarantee dissolves silently). Never keep using conformal scores that were computed before a preprocessing change (e.g. a refit imputer); recalibrate instead. And never trust the guarantee under non-exchangeable data without explicit weighting. The next section covers that case.
Conformal prediction under distribution shift
The exchangeability assumption is the load-bearing piece of the conformal guarantee. Real production data violates it constantly: seasonality, covariate shift, label drift. The literature has three pragmatic responses:
Weighted conformal prediction (Tibshirani et al. 2019) reweights calibration residuals by an estimated likelihood ratio between calibration and test distributions. It works well for known covariate shift, though you do need a density ratio estimator.
Adaptive Conformal Inference (Gibbs & Candès 2021) treats α as a state variable and updates it online based on observed coverage, recovering long-run coverage even under arbitrary distribution shift. MAPIE supports it through its time-series regressor (method="aci").
Periodic recalibration: the unromantic but most reliable approach. Re-run conformalize() on a recent window of labeled data every N predictions. This is what most production teams I've worked with actually do.
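The ACI update is small enough to sketch in full. The example below is a from-scratch NumPy illustration, not MAPIE's implementation. To keep it short I simulate non-conformity scores directly instead of fitting a model, keep the calibration scores fixed, and let the noise level of the test stream drift; the step size gamma = 0.01 is an arbitrary choice of mine.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, gamma, T = 0.10, 0.01, 5000

# Calibration scores from the pre-drift regime (|residual| with noise sd = 1)
cal_scores = np.sort(np.abs(rng.normal(0, 1.0, size=1000)))


def threshold(level_alpha):
    """Conformal threshold at a given miscoverage level, also outside (0, 1)."""
    if level_alpha <= 0:
        return np.inf    # cover everything
    if level_alpha >= 1:
        return -np.inf   # empty interval
    k = int(np.ceil((len(cal_scores) + 1) * (1 - level_alpha)))
    return cal_scores[k - 1] if k <= len(cal_scores) else np.inf


# Test stream whose noise sd drifts from 1 to 3: exchangeability is violated
noise_sd = np.linspace(1.0, 3.0, T)
stream_scores = np.abs(rng.normal(0, noise_sd))

alpha_t = alpha
miss_fixed = np.empty(T)
miss_aci = np.empty(T)
for t in range(T):
    miss_fixed[t] = stream_scores[t] > threshold(alpha)    # static split conformal
    miss_aci[t] = stream_scores[t] > threshold(alpha_t)    # ACI with the current alpha_t
    alpha_t += gamma * (alpha - miss_aci[t])               # online update of alpha_t

cov_fixed = 1 - miss_fixed.mean()
cov_aci = 1 - miss_aci.mean()
print(f"static threshold coverage={cov_fixed:.3f}")
print(f"ACI coverage={cov_aci:.3f}")
```

Each miss lowers alpha_t and widens the next interval; each hit raises it slightly. A larger gamma reacts faster to drift but makes interval widths jumpier.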
What is the difference between conformal prediction and a confidence interval?
A classical confidence interval is a statement about an unknown parameter (e.g. a mean) under a parametric model; a conformal prediction interval is a statement about an unobserved future label under no parametric assumptions. Conformal intervals have a finite-sample, distribution-free coverage guarantee; classical confidence intervals are exact only when the parametric model is correctly specified and otherwise rely on asymptotic approximations.
Does conformal prediction work with deep learning models?
Yes. Conformal prediction is model-agnostic. Any function that exposes predict() or predict_proba() is wrappable, including PyTorch and TensorFlow models. MAPIE works with deep models via thin scikit-learn estimator wrappers; for native PyTorch users, torchcp is a popular alternative with the same theoretical backbone.
How much calibration data do I need for conformal prediction?
The coverage guarantee holds for any n_cal ≥ 1, but the run-to-run standard deviation of the realized coverage shrinks only like 1/√n_cal (its variance like 1/n_cal). As a rule of thumb, use at least 500 calibration points for α = 0.10, 1000 for α = 0.05, and 5000+ for α = 0.01. Below those thresholds, prefer Jackknife+ or CV+ to avoid wasting data.
Can conformal prediction replace model calibration techniques like Platt scaling?
They solve different problems. Platt scaling and isotonic regression rescale predicted probabilities to better match observed frequencies; conformal prediction builds sets with a coverage guarantee regardless of probability calibration. The two are complementary: a well-calibrated classifier produces tighter conformal sets, but conformal prediction will still give you the coverage guarantee even when probabilities are miscalibrated.
Is MAPIE production-ready in 2026?
Yes. MAPIE 1.0 reached stable status in late 2025, follows semantic versioning, has scikit-learn estimator compliance, and is used in regulated settings (insurance, healthcare, energy forecasting). The main operational consideration is recalibration cadence under drift, not library maturity.