Handling Missing Data in Python: A Practical Guide to Pandas and Scikit-Learn Imputation

Learn how to handle missing data in Python using pandas 3.0 and scikit-learn 1.8. Covers detection, SimpleImputer, KNNImputer, IterativeImputer (MICE), pipeline integration, and a practical decision framework for choosing the right strategy.

Introduction: Why Missing Data Deserves More Than a Quick fillna()

Every data scientist has done it. You load a dataset, spot some NaN values, fire off a df.fillna(0) or df.dropna(), and move on to the "interesting" part of the analysis. It works — until it doesn't. Until your model quietly learns that zero is a meaningful value in a column where it never should have appeared, or you throw away 40% of your rows and wonder why your classifier can't generalize.

Missing data isn't just a preprocessing nuisance. How you handle it directly shapes your model's accuracy, fairness, and reliability. Get it wrong, and you introduce bias that no amount of hyperparameter tuning can fix.

Get it right, and you preserve the statistical properties of your data while giving your model every possible signal to learn from.

This guide walks through every practical approach to handling missing data in Python, from basic pandas operations through scikit-learn's full imputer toolkit. We're using pandas 3.0 and scikit-learn 1.8 throughout, so every code example reflects the APIs you'll actually encounter in 2026. If you've already worked through a data cleaning or feature engineering workflow, this slots in right between those steps — it's the deep dive into imputation that often gets glossed over.

Understanding Why Data Goes Missing: MCAR, MAR, and MNAR

Before reaching for any imputation tool, you need to understand why your data is missing. The mechanism behind the missingness determines which strategies are valid and which will quietly introduce bias. Statisticians Donald Rubin and Roderick Little formalized three categories that remain the foundation of modern missing data theory.

Missing Completely At Random (MCAR)

Data is MCAR when the probability that a value is missing has nothing to do with the value itself or any other variable in the dataset. A lab sample gets dropped on the floor. A sensor temporarily loses power. A survey respondent accidentally skips a question. The missingness is purely random — it's essentially noise in your data collection process.

MCAR is the easiest scenario to deal with because any imputation method works without introducing systematic bias. Even simple deletion is valid, since the remaining data is still a random sample of the full population. The catch? True MCAR is actually pretty rare in practice.

Missing At Random (MAR)

Data is MAR when the probability of missingness depends on other observed variables, but not on the missing value itself. For example, younger patients in a medical study might be less likely to have cholesterol measurements — the missingness correlates with age (observed), not with the cholesterol level itself (unobserved).

MAR is the most common real-world scenario, and it's where model-based imputation methods like KNN and iterative imputation really shine. These methods can leverage the relationships between observed variables to produce unbiased estimates of the missing values.

Missing Not At Random (MNAR)

Data is MNAR when the probability of missingness depends on the missing value itself. People with very high incomes are less likely to report their income on surveys. Patients with severe symptoms are more likely to drop out of clinical trials. The missingness is the signal.

MNAR is the hardest case, honestly. No imputation method can fully recover MNAR data without additional information or domain-specific modeling. The best strategies involve sensitivity analysis (testing how your conclusions change under different assumptions) and collecting auxiliary data that helps explain the missingness.

One important note: you can't statistically distinguish MAR from MNAR using the observed data alone — you have to rely on domain knowledge to make that judgment call.
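The three mechanisms are easier to internalize with a tiny simulation. This is a sketch on synthetic data (made-up columns and parameters, not from any real study): under MCAR every value has the same chance of going missing, while under MAR the missingness probability depends on the observed age, never on the cholesterol value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
age = rng.normal(50, 15, n)
cholesterol = 150 + 0.8 * age + rng.normal(0, 20, n)
df = pd.DataFrame({"age": age, "cholesterol": cholesterol})

# MCAR: every value has the same 20% chance of going missing
mcar_mask = rng.random(n) < 0.2
df["chol_mcar"] = df["cholesterol"].mask(mcar_mask)

# MAR: younger patients are less likely to be measured, so the
# missingness probability depends on observed age, not on cholesterol
p_missing = np.clip(0.6 - 0.01 * df["age"], 0.05, 0.95)
mar_mask = rng.random(n) < p_missing
df["chol_mar"] = df["cholesterol"].mask(mar_mask)

print(df[["chol_mcar", "chol_mar"]].isna().mean())
```

If you compare the mean age of rows with and without a missing cholesterol value, the MAR column skews noticeably younger — exactly the kind of pattern a model-based imputer can exploit.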

Detecting and Quantifying Missing Data with Pandas 3.0

Before choosing an imputation strategy, you need a clear picture of what's actually missing, how much, and whether there are patterns. Pandas 3.0 gives you everything you need for this.

Basic Detection

import pandas as pd
import numpy as np

# Load your dataset
df = pd.read_csv("customer_data.csv")

# Count missing values per column
missing_counts = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df)) * 100

# Combine into a summary
missing_summary = pd.DataFrame({
    "missing_count": missing_counts,
    "missing_pct": missing_pct.round(2)
}).query("missing_count > 0").sort_values("missing_pct", ascending=False)

print(missing_summary)

This gives you a quick overview of which columns have missing values and how severe the problem is. As a rough guideline: columns with less than 5% missing are usually safe for simple imputation, 5–20% calls for more sophisticated approaches, and above 20–30% warrants careful consideration of whether the column is even worth keeping.

Pandas 3.0: pd.NA vs np.nan

Pandas 3.0 (released January 2026) enforces a clearer distinction between np.nan — the legacy float-based sentinel — and pd.NA, the modern type-agnostic missing value indicator. If you're working with nullable dtypes like Int64, Float64, or string[pyarrow], your missing values are now pd.NA, not np.nan.

Why does this matter for imputation? Because pd.NA propagates differently in arithmetic and comparisons. Where np.nan == "a" returns False, pd.NA == "a" returns pd.NA — it follows three-valued (Kleene) logic. If you're piping data into scikit-learn, you'll typically want to convert to np.nan for compatibility:

# Convert the nullable numeric columns to a float ndarray before sklearn
df_sklearn = df.select_dtypes("number").to_numpy(dtype="float64", na_value=np.nan)
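Here's a quick way to see the three-valued logic for yourself. The results below follow pandas' documented Kleene semantics: the outcome is NA only when the unknown value could change the answer.

```python
import numpy as np
import pandas as pd

# np.nan comparisons just return False; pd.NA propagates
print(np.nan == "a")   # False
print(pd.NA == "a")    # <NA>

# Kleene logic: NA only when the unknown value could matter
print(True | pd.NA)    # True  (True OR anything is True)
print(True & pd.NA)    # <NA>  (depends on the unknown value)
print(False & pd.NA)   # False (False AND anything is False)
```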

Spotting Missingness Patterns

Random missing values are one thing, but patterns of missingness often reveal systematic issues. A quick correlation matrix of the missingness indicators can be surprisingly revealing:

# Create a missingness indicator matrix
missing_matrix = df.isnull().astype(int)

# Check correlations between missing indicators
missing_corr = missing_matrix.corr()
print(missing_corr)

If two columns tend to be missing together (high correlation), that's often a sign of a shared data collection issue — and it probably indicates MAR rather than MCAR. For visual exploration, libraries like missingno can generate heatmaps and dendrograms that make these patterns immediately obvious.

Simple Imputation with Pandas

For straightforward cases — MCAR data with low missingness rates — pandas' built-in methods are often all you need. Let's walk through the main options.

Dropping Missing Values

# Drop rows where ANY column is NaN
df_clean = df.dropna()

# Drop rows where specific columns are NaN
df_clean = df.dropna(subset=["age", "income"])

# Drop columns where more than 50% of values are missing
threshold = int(len(df) * 0.5)  # thresh = minimum non-null values required to keep
df_clean = df.dropna(axis=1, thresh=threshold)

Dropping works when the missing data is MCAR and the amount is small (under 5%). With larger amounts, you lose statistical power and risk biasing your sample toward complete cases.

Constant and Statistical Fill

# Fill with a constant
df["category"] = df["category"].fillna("Unknown")

# Fill with column mean (numerical)
df["age"] = df["age"].fillna(df["age"].mean())

# Fill with median (more robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())

# Fill with mode (categorical data)
df["city"] = df["city"].fillna(df["city"].mode()[0])

Mean imputation is the most common approach, but it has a well-known flaw: it shrinks the variance of the imputed column. If 20% of your income column is missing and you fill it all with the mean, your income distribution now has an artificial spike right at the center — reducing the apparent variability and potentially weakening relationships with other variables.

Median imputation is more robust when the distribution is skewed, since outliers don't pull the imputed value away from where most of the data actually lives.
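You can see the variance shrinkage directly on synthetic data. This sketch uses a log-normal series standing in for income (arbitrary parameters), knocks out 20% of values at random, and compares the standard deviation before and after mean imputation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.6, size=10_000))

# Knock out 20% of values completely at random, then mean-impute
missing = income.mask(rng.random(len(income)) < 0.2)
mean_filled = missing.fillna(missing.mean())

print(f"original std:     {income.std():,.0f}")
print(f"mean-imputed std: {mean_filled.std():,.0f}")  # noticeably smaller
```

With 20% of values replaced by a constant, the standard deviation drops by roughly 10% — the mean stays put, but the spread is artificially compressed.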

Forward Fill and Backward Fill

# Forward fill — carry the last valid observation forward
df["temperature"] = df["temperature"].ffill()

# Backward fill — carry the next valid observation backward
df["temperature"] = df["temperature"].bfill()

# Limit the fill to avoid propagating over long gaps
df["temperature"] = df["temperature"].ffill(limit=3)

Forward fill (ffill) and backward fill (bfill) are natural choices for time series and sequential data where nearby observations are more relevant than global statistics. The limit parameter is important here — without it, a single valid observation could fill an arbitrarily long gap, which is almost never what you want.

Interpolation

# Linear interpolation (treats values as equally spaced)
df["sensor_reading"] = df["sensor_reading"].interpolate(method="linear")

# Time-based interpolation (uses the datetime index)
df["sensor_reading"] = df["sensor_reading"].interpolate(method="time")

# Polynomial interpolation (fits a polynomial curve)
df["sensor_reading"] = df["sensor_reading"].interpolate(method="polynomial", order=2)

Interpolation estimates missing values from surrounding data points, making it ideal for continuous measurements like sensor data or stock prices. Linear interpolation is the default and works well for smooth signals. For more complex patterns, polynomial or spline interpolation can capture curvature — but watch out for overfitting at the edges of gaps.

Group-Specific Imputation

One of the most underused pandas techniques (in my experience, at least) is filling missing values with group-specific statistics rather than global ones:

# Fill missing salary with the median salary for each department
df["salary"] = df.groupby("department")["salary"].transform(
    lambda x: x.fillna(x.median())
)

This is significantly more accurate than a global fill because it respects the structure of your data. An engineer's missing salary is better estimated by the median engineering salary than by the company-wide median. Makes sense, right? This approach works well under MAR when the grouping variable explains the missingness.

Scikit-Learn SimpleImputer: The Production-Ready Baseline

While pandas methods work fine in notebooks, scikit-learn's SimpleImputer is built for production ML pipelines. It implements the fit/transform API, which means it learns imputation parameters from training data and applies them consistently to new data — preventing data leakage during cross-validation.

from sklearn.impute import SimpleImputer

# numeric_cols / cat_cols: lists of your numeric and categorical column names

# Numerical: impute with median
num_imputer = SimpleImputer(strategy="median")
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

# Categorical: impute with most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

# Constant fill
const_imputer = SimpleImputer(strategy="constant", fill_value=-1)
df[numeric_cols] = const_imputer.fit_transform(df[numeric_cols])

The key advantage over raw pandas is that SimpleImputer remembers what it learned during fit(). If the median income in your training data is $52,000, it'll use that same value when transforming test data or new production data — even if the test set has a different median. This prevents the subtle data leakage that happens when you compute statistics on data that includes your validation or test sets.
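A tiny example (toy numbers, hypothetical income column) makes the point concrete: the fitted imputer exposes what it learned through its statistics_ attribute, and applies exactly that value to new data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"income": [40_000, 52_000, 60_000, np.nan]})
test = pd.DataFrame({"income": [np.nan, 75_000]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)

# The learned statistic lives on the fitted object
print(imputer.statistics_)  # [52000.]

# Test-set NaNs get the TRAINING median, not the test median
print(imputer.transform(test).ravel())  # [52000. 75000.]
```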

Preserving Missingness Information

Sometimes the fact that a value was missing is itself a useful feature. A customer who didn't provide their age might behave differently from one who did. You can capture this signal with the add_indicator parameter:

imputer = SimpleImputer(strategy="median", add_indicator=True)
result = imputer.fit_transform(df[["age", "income"]])

# Result is a NumPy array with 4 columns: imputed age and income,
# followed by binary missing-indicator columns for age and income

The MissingIndicator appends binary columns that flag which values were originally missing. This is especially valuable when the missingness mechanism is MAR or MNAR — the indicator gives your model direct access to the missingness pattern as a predictive feature.

KNNImputer: Leveraging Similarity Between Observations

When relationships between features matter, K-Nearest Neighbors imputation uses similar observations to estimate missing values. Instead of filling with a global statistic, KNNImputer finds the k most similar rows (based on non-missing features) and averages their values.

from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5, weights="distance")
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index  # preserve the original row index
)

How It Works Under the Hood

KNNImputer uses a special nan_euclidean distance metric that computes distances using only the features that both samples have present. For a missing value in row i, it finds the k nearest neighbors among rows that do have that value, then computes a weighted average.
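You can inspect this metric directly via sklearn.metrics.pairwise.nan_euclidean_distances. Only coordinates present in both rows contribute, and the result is rescaled by n_features / n_present so distances stay comparable across rows with different amounts of missingness:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [3.0, np.nan, 5.0],
    [1.0, 4.0, 5.0],
])

# Features 0 and 2 are present in both rows; feature 1 is skipped:
# dist = sqrt(3/2 * ((3-1)**2 + (5-5)**2)) = sqrt(6) ≈ 2.449
d = nan_euclidean_distances(X)
print(round(d[0, 1], 3))  # 2.449
```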

Key Parameters

  • n_neighbors (default 5): Higher values produce smoother imputations but can over-smooth local patterns. Lower values are more sensitive to individual neighbors but noisier.
  • weights: "uniform" treats all neighbors equally. "distance" weights closer neighbors more heavily — usually a better default in my experience.
  • add_indicator: Same as SimpleImputer — appends binary columns flagging originally missing values.

When KNNImputer Works Best

KNN imputation excels when your data has local structure — when similar observations genuinely tend to have similar values for the missing feature. It performs well on small-to-medium datasets (under ~50k rows) with moderate missingness. On larger datasets, the distance computation gets expensive, and you might want to look at IterativeImputer instead.

One important caveat: scale your features before using KNNImputer. Since it relies on distance calculations, features with larger ranges will dominate the neighbor search. Use StandardScaler or MinMaxScaler as a preprocessing step, or (better yet) wrap them together in a pipeline.
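One way to wire that up, sketched on toy data: scale first so both features contribute comparably to the neighbor search. StandardScaler computes its statistics while ignoring NaNs and passes them through, so the imputer downstream still sees them.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

X = np.array([
    [25.0, 40_000.0],
    [30.0, np.nan],
    [28.0, 52_000.0],
    [55.0, 120_000.0],
    [np.nan, 48_000.0],
])

# Scale first so age and income are on comparable scales,
# then let KNN find neighbors in the scaled space
pipe = make_pipeline(
    StandardScaler(),
    KNNImputer(n_neighbors=2, weights="distance"),
)
X_imputed = pipe.fit_transform(X)
print(np.isnan(X_imputed).any())  # False: all gaps filled (in scaled units)
```

Note that the imputed values come out in scaled units, which is fine when the next pipeline step is a model rather than a human.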

IterativeImputer (MICE): Multivariate Imputation by Chained Equations

IterativeImputer is scikit-learn's most sophisticated imputation tool. Inspired by the R MICE package, it models each feature with missing values as a function of all other features, iterating through round-robin regression until the imputed values converge.

# IterativeImputer is still experimental in sklearn 1.8
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

iterative_imputer = IterativeImputer(
    max_iter=10,
    random_state=42,
    initial_strategy="median"
)
df_imputed = pd.DataFrame(
    iterative_imputer.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index  # preserve the original row index
)

How the Iterative Process Works

  1. All missing values are initially filled using a simple strategy (mean, median, or most frequent).
  2. For each feature with missing values, the algorithm fits a regression model using all other features as predictors.
  3. The regressor predicts the missing values for that feature.
  4. Steps 2–3 repeat for every feature with missing values — one full pass through all features is called a "round."
  5. Multiple rounds run (controlled by max_iter) until the imputed values stabilize.
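To watch the loop in action, check how many rounds actually ran via the fitted imputer's n_iter_ attribute. This sketch uses synthetic data with one strongly correlated feature; early stopping kicks in once the change between rounds drops below tol:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated feature
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing, scattered at random

imp = IterativeImputer(max_iter=10, tol=1e-3, random_state=42)
X_filled = imp.fit_transform(X)

print(imp.n_iter_)  # rounds completed before convergence (at most max_iter)
```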

Choosing the Estimator

The default estimator is BayesianRidge, which works well for most cases. But you can plug in any scikit-learn regressor:

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Use RandomForest — mimics the popular R "missForest" approach
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    max_iter=10,
    random_state=42
)

# Use BayesianRidge (default) — faster, good for linear relationships
br_imputer = IterativeImputer(
    max_iter=10,
    random_state=42
)

The RandomForestRegressor variant is particularly powerful because it naturally captures non-linear relationships and interactions between features. In practice, KNN and Random Forest-based imputation consistently produce the best results across a variety of datasets and missingness patterns.

Important: Experimental Status

As of scikit-learn 1.8, IterativeImputer still carries the experimental label. You need to explicitly enable it with from sklearn.experimental import enable_iterative_imputer. The API has been stable in practice for a while, but the team is still refining convergence criteria and default estimator selection. Don't let the "experimental" tag scare you off — it's been used in production for years at this point.

Building Imputation Into Scikit-Learn Pipelines

Here's where things get really powerful. The full strength of scikit-learn's imputers emerges when you embed them in pipelines alongside other preprocessing steps. This ensures imputation parameters are learned exclusively from training data, preventing data leakage.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import GradientBoostingClassifier

# Define column groups
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["city", "employment_type", "education"]

# Numeric pipeline: impute with KNN, then scale
numeric_pipeline = Pipeline(steps=[
    ("imputer", KNNImputer(n_neighbors=5, weights="distance")),
    ("scaler", StandardScaler())
])

# Categorical pipeline: impute with most frequent, then encode
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine into a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])

# Full pipeline: preprocess then classify
full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(random_state=42))
])

# Fit on training data — imputation parameters learned here
full_pipeline.fit(X_train, y_train)

# Predict on test data — same imputation applied consistently
predictions = full_pipeline.predict(X_test)

This pipeline does everything right: the KNN imputer learns neighbor distances from training data only, the scaler learns mean and standard deviation from training data only, and the one-hot encoder learns categories from training data only. When you call predict(), the exact same transformations get applied to new data — no leakage, no inconsistencies.

Hyperparameter Tuning Across Imputation Strategies

Since the imputer is part of the pipeline, you can tune its parameters alongside the model's hyperparameters using GridSearchCV:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocessor__num__imputer__n_neighbors": [3, 5, 7, 11],
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [3, 5, 7]
}

grid_search = GridSearchCV(
    full_pipeline,
    param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")

The double-underscore notation (preprocessor__num__imputer__n_neighbors) lets you reach into nested pipeline components. This means you can systematically figure out whether 3 or 11 neighbors produces better downstream model performance — treating imputation as a tunable part of the modeling pipeline rather than a fixed preprocessing step. I've found this approach surprisingly impactful in competitions and production systems alike.

A Practical Decision Framework

With all these tools at your disposal, how do you actually choose? Here's a decision framework based on your data characteristics.

Based on Missingness Rate

  • Under 5%: SimpleImputer with mean or median is usually sufficient. The choice of method barely matters at this level.
  • 5–20%: Use KNNImputer or IterativeImputer. The added complexity is justified because simple methods start introducing noticeable bias.
  • 20–50%: IterativeImputer with a RandomForestRegressor estimator, combined with add_indicator=True. Seriously consider whether the column is even worth keeping.
  • Over 50%: Strongly consider dropping the column entirely. If you must keep it, use IterativeImputer and always add a missing indicator feature.
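If you want these rules of thumb as code, a hypothetical helper (the names and thresholds just restate the guidelines above — this is not a library API) might look like:

```python
def suggest_imputation(missing_pct: float) -> str:
    """Map a column's missingness rate to the rough guideline above."""
    if missing_pct < 5:
        return "SimpleImputer (mean/median) is fine"
    if missing_pct < 20:
        return "KNNImputer or IterativeImputer"
    if missing_pct < 50:
        return "IterativeImputer + add_indicator=True; reassess the column"
    return "Consider dropping the column; else impute + missing indicator"

for pct in (2, 12, 35, 70):
    print(f"{pct:>2}% -> {suggest_imputation(pct)}")
```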

Based on Data Type

  • Continuous numerical: KNNImputer or IterativeImputer for accuracy; SimpleImputer(strategy="median") for speed.
  • Categorical: SimpleImputer(strategy="most_frequent") or fill with a dedicated "Missing" category.
  • Time series: Pandas interpolate(method="time") or ffill(limit=n).
  • Mixed types: Use ColumnTransformer to apply different strategies to different column groups.

Based on Missingness Mechanism

  • MCAR: Any method works. Simple imputation or even deletion is fine.
  • MAR: Use model-based methods (KNNImputer, IterativeImputer) that leverage relationships between variables.
  • MNAR: Add missing indicators (add_indicator=True), consider domain-specific models, and always run sensitivity analyses.

Common Pitfalls and How to Avoid Them

1. Imputing Before Splitting Your Data

This is the single most common — and most damaging — imputation mistake I see. If you compute the mean of a column before splitting into train and test sets, information from the test set leaks into the training process. Always split first, then fit imputers on training data only:

from sklearn.model_selection import train_test_split

# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then fit imputer on training data only
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # transform only, no fit

2. Using Mean Imputation on Skewed Distributions

Mean imputation pulls imputed values toward the center of the distribution. On skewed data (think income, property values, insurance claims), the mean is far from the typical value. Use the median instead, or better yet, use a model-based imputer that respects the actual distribution shape.
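The gap is easy to quantify on synthetic data. For a log-normal "income" series (arbitrary parameters chosen for illustration), the mean lands well above the median:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Log-normal incomes: heavily right-skewed, like real income data
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.9, size=100_000))

print(f"mean:   {income.mean():>10,.0f}")   # pulled up by the long right tail
print(f"median: {income.median():>10,.0f}")  # closer to the typical earner
```

With this level of skew, the mean runs roughly 50% higher than the median, so mean-imputing would systematically plant implausibly high values.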

3. Forgetting to Scale Before KNNImputer

This one bites people all the time. KNN relies on distance calculations, and if your features have wildly different scales (age in years vs income in dollars), the distance computation will be completely dominated by the high-magnitude feature. Always scale first, or use a pipeline that handles scaling before imputation.

4. Ignoring the Missingness Pattern

Blindly imputing without understanding why data is missing can introduce bias. Spend five minutes visualizing missingness patterns and checking for correlations between missing indicators. It's a small time investment that can save you from fundamentally flawed models.

5. Treating All Columns the Same Way

Different columns often have different missingness mechanisms and different optimal imputation strategies. A categorical column with 2% missing values and a numerical column with 30% missing values shouldn't receive the same treatment. Use ColumnTransformer to customize per-column or per-group strategies — that's literally what it's designed for.

Full Working Example: End-to-End Imputation Pipeline

Let's tie everything together with a complete, runnable example using the Titanic dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Load data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Select features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
target = "Survived"

X = df[features]
y = df[target]

# Check missing data
print("Missing values:")
print(X.isnull().sum())
print()

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define column groups
numeric_features = ["Age", "SibSp", "Parch", "Fare"]
categorical_features = ["Pclass", "Sex", "Embarked"]

# Strategy 1: Simple imputation baseline
simple_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), numeric_features),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore"))
        ]), categorical_features)
    ])),
    ("classifier", GradientBoostingClassifier(random_state=42))
])

# Strategy 2: KNN imputation
knn_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", Pipeline([
            ("imputer", KNNImputer(n_neighbors=5, weights="distance")),
            ("scaler", StandardScaler())
        ]), numeric_features),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore"))
        ]), categorical_features)
    ])),
    ("classifier", GradientBoostingClassifier(random_state=42))
])

# Strategy 3: Iterative imputation with missing indicators
iterative_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", Pipeline([
            ("imputer", IterativeImputer(
                max_iter=10, random_state=42, add_indicator=True
            )),
            ("scaler", StandardScaler())
        ]), numeric_features),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore"))
        ]), categorical_features)
    ])),
    ("classifier", GradientBoostingClassifier(random_state=42))
])

# Compare strategies with cross-validation
strategies = {
    "Simple (median)": simple_pipeline,
    "KNN (k=5)": knn_pipeline,
    "Iterative (MICE)": iterative_pipeline
}

for name, pipeline in strategies.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

This example shows a realistic workflow: detect missing values, split before any preprocessing, build competing pipelines with different imputation strategies, and compare them through cross-validation. The best-performing strategy becomes your production pipeline.

Frequently Asked Questions

What is the best way to handle missing data in Python for machine learning?

There's no single best method — it depends on the missingness mechanism, the percentage of missing values, and your data type. For most practical cases, embedding a KNNImputer or IterativeImputer inside a scikit-learn Pipeline with ColumnTransformer gives you robust, leak-free imputation that generalizes well. Start with SimpleImputer(strategy="median") as a baseline, then try more sophisticated methods to see if they improve your specific metric.

Should I drop rows with missing values or impute them?

Dropping rows is only appropriate when the data is missing completely at random (MCAR) and the amount is very small (under 5%). Otherwise, dropping rows reduces your sample size and can introduce bias — the complete cases may not be representative of the full population. In most real-world scenarios, imputation preserves more information and produces better models.

How does pandas 3.0 handle missing values differently from pandas 2.x?

Pandas 3.0 enforces a stricter separation between np.nan (float-based) and pd.NA (type-agnostic). Nullable dtypes like Int64 and the new str dtype now consistently use pd.NA, which follows three-valued (Kleene) logic — meaning pd.NA == "a" returns pd.NA rather than False. The Copy-on-Write behavior also means imputation operations return new objects with lazy copies rather than modifying data in place. When piping data to scikit-learn, convert to NumPy arrays with np.nan for compatibility.

What is the difference between KNNImputer and IterativeImputer in scikit-learn?

KNNImputer estimates missing values by averaging the values from the k nearest neighbors in the feature space. It's intuitive and works well for small-to-medium datasets. IterativeImputer models each feature with missing values as a function of all other features using round-robin regression — it's more flexible, can capture complex feature interactions, and handles larger datasets better. KNNImputer is stable; IterativeImputer is still marked experimental but is widely used in practice.

How do I prevent data leakage when imputing missing values?

Always split your data into train and test sets before imputing. Fit the imputer on the training data only (fit_transform on train), then apply the same learned parameters to the test data (transform on test). The cleanest way to enforce this is by using scikit-learn Pipeline objects, which automatically handle the fit/transform separation during cross-validation and prediction.
