Feature Engineering in Python: A Practical Guide to Scikit-Learn Pipelines

A hands-on guide to feature engineering in Python using scikit-learn pipelines. From scaling and encoding to custom transformers and feature selection, learn how to build robust, leak-free ML preprocessing workflows.

Feature engineering is often called the most important step in the machine learning pipeline — and honestly, that's not an exaggeration. Raw data almost never arrives in a shape that algorithms can work with effectively. The difference between a model that barely beats a baseline and one that actually delivers in production? It usually comes down to how well you've engineered your features.

In this guide, we'll walk through the full range of feature engineering techniques available in scikit-learn. More importantly, we'll show you how to compose them into robust, reproducible pipelines using Pipeline and ColumnTransformer. If you've already read our guide to Scikit-Learn 1.8, consider this the practical companion — focused on the feature engineering workflow that comes before model training. And if you've been through our Data Cleaning Pipelines guide, you'll see how these techniques pick up exactly where data cleaning leaves off.

We'll be using scikit-learn 1.8 APIs throughout, taking advantage of the latest stable features including pandas output support, improved TargetEncoder, and streamlined pipeline construction. Every code example is designed to be runnable as-is with standard datasets.

Numerical Feature Transformations

Most ML algorithms are sensitive to the scale and distribution of numerical features. Gradient-based models, distance-based models like k-NN and SVM, and regularized linear models — they all benefit substantially from proper numerical preprocessing. Let's explore the key transformations scikit-learn gives us.

Scaling: StandardScaler, MinMaxScaler, and RobustScaler

Scaling is the most common numerical transformation. Scikit-learn provides three primary scalers, each suited to different data characteristics:

  • StandardScaler — Centers data to zero mean and unit variance. Best for normally distributed features and when using algorithms that assume Gaussian inputs (e.g., logistic regression, SVM).
  • MinMaxScaler — Scales features to a fixed range, typically [0, 1]. Useful when you need bounded values or when working with neural networks.
  • RobustScaler — Uses the median and interquartile range instead of mean and standard deviation. This is your go-to choice when your data has outliers.

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.datasets import fetch_california_housing
import pandas as pd

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

# Compare the three scalers on a feature with outliers
scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
}

for name, scaler in scalers.items():
    scaled = scaler.fit_transform(X[["MedInc"]])
    print(f"{name:>18s} — mean: {scaled.mean():.3f}, std: {scaled.std():.3f}, "
          f"min: {scaled.min():.3f}, max: {scaled.max():.3f}")

A practical rule of thumb: start with StandardScaler as your default. Switch to RobustScaler when you spot outliers during exploratory analysis, and reach for MinMaxScaler when your downstream model requires bounded inputs.

Log and Power Transforms with PowerTransformer

Many real-world features follow skewed distributions — income, population counts, prices. You know the type. Algorithms that assume normality will struggle with these. The PowerTransformer applies either a Yeo-Johnson or Box-Cox transformation to make features more Gaussian-like:

from sklearn.preprocessing import PowerTransformer
import numpy as np

# Yeo-Johnson works with both positive and negative values
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = pt.fit_transform(X[["Population", "MedInc"]])

print(f"Original skewness — Population: {X['Population'].skew():.2f}, "
      f"MedInc: {X['MedInc'].skew():.2f}")
print(f"Transformed skewness — Population: "
      f"{pd.Series(X_transformed[:, 0]).skew():.2f}, "
      f"MedInc: {pd.Series(X_transformed[:, 1]).skew():.2f}")

Use method="box-cox" when all values are strictly positive, and method="yeo-johnson" (the default) when your data might include zeros or negative values. The standardize=True parameter applies zero-mean, unit-variance standardization after the power transform.
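A quick way to see that constraint in action is to feed both methods a tiny synthetic array containing a zero:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([[0.0], [1.0], [5.0], [20.0]])  # includes a zero

# Box-Cox requires strictly positive inputs and raises otherwise
try:
    PowerTransformer(method="box-cox").fit(data)
except ValueError as exc:
    print(f"box-cox rejected the data: {exc}")

# Yeo-Johnson handles zeros (and negatives) without complaint
transformed = PowerTransformer(method="yeo-johnson").fit_transform(data)
print(transformed.ravel().round(2))
```

Box-Cox rejects the zero outright, while Yeo-Johnson transforms and standardizes it just fine.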

Binning with KBinsDiscretizer

Sometimes converting a continuous feature into discrete bins captures non-linear relationships that a linear model would otherwise miss entirely. KBinsDiscretizer handles this cleanly:

from sklearn.preprocessing import KBinsDiscretizer

# Create 5 bins using quantile strategy (equal-frequency bins)
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
X_binned = binner.fit_transform(X[["MedInc"]])
print(f"Binned shape: {X_binned.shape}")  # (n_samples, 5)
print(f"Bin edges: {binner.bin_edges_[0].round(2)}")

The strategy parameter controls how bin edges are computed: "uniform" for equal-width bins, "quantile" for equal-frequency bins, and "kmeans" for bins based on k-means clustering. The encode parameter determines the output format — "onehot-dense", "onehot" (sparse), or "ordinal".
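The three strategies produce noticeably different bin edges on skewed data. Here's a small comparison on synthetic exponential samples:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed data makes the strategies diverge visibly
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(1000, 1))

for strategy in ["uniform", "quantile", "kmeans"]:
    binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    binner.fit(x)
    print(f"{strategy:>9s} edges: {binner.bin_edges_[0].round(2)}")
```

With "uniform" the edges are evenly spaced across the range; "quantile" packs more bins into the dense region near zero so that each bin holds roughly 250 samples.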

SplineTransformer for Non-Linear Relationships

The SplineTransformer (introduced in scikit-learn 1.0) generates B-spline basis features, letting linear models capture non-linear patterns without the combinatorial explosion you'd get from polynomial features:

from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# SplineTransformer + Ridge = a flexible non-linear model
model = make_pipeline(
    SplineTransformer(n_knots=5, degree=3, include_bias=False),
    Ridge(alpha=1.0)
)
model.fit(X[["Latitude"]], y)
print(f"R² score: {model.score(X[['Latitude']], y):.4f}")

Splines are particularly effective for features with known non-linear relationships — geographic coordinates, time-of-day effects, or any feature where the relationship changes direction at certain values. They produce far fewer features than high-degree polynomials while being more numerically stable. I've found them especially useful for geographic data where latitude/longitude have complex, non-monotonic effects on the target.

Categorical Feature Encoding

Categorical features require encoding before most algorithms can use them. And the choice of encoding strategy? It can make or break your model's performance, especially with high-cardinality features.

OrdinalEncoder

Use OrdinalEncoder when your categorical feature has a natural order — education levels, size categories, or ratings:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "medium", "large", "medium", "small", "large"],
    "priority": ["low", "medium", "high", "low", "high", "medium"]
})

encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"], ["low", "medium", "high"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoded = encoder.fit_transform(df)
print(encoded)
# [[0. 0.]
#  [1. 1.]
#  [2. 2.]
#  ...]

Always specify the categories parameter explicitly for ordinal features to ensure the correct ordering. The handle_unknown="use_encoded_value" setting prevents errors when the test set contains categories not seen during training — a situation that's more common than you might expect.
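Here's what that safety net looks like when an unseen category shows up at transform time (a small sketch with a made-up size column):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
enc.fit(pd.DataFrame({"size": ["small", "large", "medium"]}))

# "xl" never appeared during fit, so it maps to -1 instead of raising
out = enc.transform(pd.DataFrame({"size": ["medium", "xl"]}))
print(out)  # [[ 1.] [-1.]]
```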

OneHotEncoder

For nominal features without any inherent ordering, one-hot encoding creates binary columns for each category:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    sparse_output=False,
    drop="if_binary",      # drop one column for binary features
    handle_unknown="infrequent_if_exist",
    min_frequency=0.05,    # group rare categories as "infrequent"
)

df_cat = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "blue", "yellow"],
    "is_active": [True, False, True, True, False, True, False],
})

encoded = encoder.fit_transform(df_cat)
print(f"Feature names: {encoder.get_feature_names_out()}")

Two parameters deserve special attention here. The min_frequency parameter (or alternatively max_categories) groups rare categories into an "infrequent" bucket — essential for controlling dimensionality with high-cardinality features. And handle_unknown="infrequent_if_exist" maps unseen test categories to this same infrequent bucket rather than raising an error.

TargetEncoder for High-Cardinality Features

TargetEncoder, introduced in scikit-learn 1.3 and refined in subsequent releases, replaces each category with a smoothed estimate of the target variable's mean for that category. It's particularly valuable for high-cardinality features like zip codes, product IDs, or user agents — cases where one-hot encoding would create thousands of columns:

from sklearn.preprocessing import TargetEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
import numpy as np

# Simulate high-cardinality data
rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    "zip_code": rng.choice([f"{z:05d}" for z in range(500)], size=n),
    "city": rng.choice(["NYC", "LA", "Chicago", "Houston", "Phoenix",
                         "Denver", "Seattle", "Boston", "Miami"], size=n),
})
y = rng.normal(100, 15, size=n)

# TargetEncoder with built-in cross-fitting to prevent leakage
te = TargetEncoder(smooth="auto", cv=5, target_type="continuous")
pipe = make_pipeline(te, Ridge())
scores = cross_val_score(pipe, df, y, cv=5, scoring="r2")
print(f"Mean R²: {scores.mean():.4f} (+/- {scores.std():.4f})")

The smooth="auto" parameter controls regularization — it balances the category-specific mean against the global mean, which is critical for categories with few observations. The internal cv parameter performs cross-fitting during fit_transform, reducing the risk of target leakage on the training set.

When to Use Each Encoder

  • OrdinalEncoder — Best for ordered categories and tree-based models. Watch out: it implies an ordering even for nominal data.
  • OneHotEncoder — Best for nominal categories with low cardinality. Watch out: dimensionality explodes with many categories.
  • TargetEncoder — Best for high-cardinality features, with any model type. Watch out: risk of target leakage if not using CV.

Datetime and Cyclical Feature Engineering

Datetime columns are packed with useful information, but you can't feed them directly to ML models. The standard approach is to decompose them into meaningful numerical components — but naive extraction misses something important: time is cyclical.

Extracting Datetime Components

from sklearn.preprocessing import FunctionTransformer
import pandas as pd
import numpy as np

# Sample datetime data
dates = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1000, freq="h")
})

def extract_datetime_features(df):
    """Extract multiple datetime components from a timestamp column."""
    df = df.copy()
    ts = df["timestamp"]
    return pd.DataFrame({
        "hour": ts.dt.hour,
        "day_of_week": ts.dt.dayofweek,
        "month": ts.dt.month,
        "day_of_year": ts.dt.dayofyear,
        "is_weekend": ts.dt.dayofweek.isin([5, 6]).astype(int),
    })

dt_extractor = FunctionTransformer(extract_datetime_features)
features = dt_extractor.fit_transform(dates)
print(features.head())

Cyclical Encoding with Sine/Cosine Transforms

Here's the problem with raw datetime components: hour 23 and hour 0 are adjacent in time but numerically distant. Same thing with December (12) and January (1). Cyclical encoding using sine and cosine transforms solves this elegantly:

def cyclical_datetime_features(df):
    """Extract cyclical datetime features using sin/cos encoding."""
    df = df.copy()
    ts = df["timestamp"]

    hour = ts.dt.hour
    dow = ts.dt.dayofweek
    month = ts.dt.month

    return pd.DataFrame({
        "hour_sin": np.sin(2 * np.pi * hour / 24),
        "hour_cos": np.cos(2 * np.pi * hour / 24),
        "dow_sin": np.sin(2 * np.pi * dow / 7),
        "dow_cos": np.cos(2 * np.pi * dow / 7),
        "month_sin": np.sin(2 * np.pi * month / 12),
        "month_cos": np.cos(2 * np.pi * month / 12),
        "is_weekend": ts.dt.dayofweek.isin([5, 6]).astype(int),
    })

cyclical_extractor = FunctionTransformer(cyclical_datetime_features)
cyclical_features = cyclical_extractor.fit_transform(dates)
print(cyclical_features.head())

Each cyclical feature is encoded as a pair of sine and cosine values. You need both — sine alone is ambiguous, because it takes the same value at mirror-image points in the cycle (hour 3 and hour 9 map to the same sine value, for instance). Together, the sine/cosine pair creates a unique position on the unit circle for each time value, and distances between adjacent times stay small regardless of where they fall in the cycle.

Non-cyclical features like is_weekend or year should be left as plain numerical values since they don't wrap around.
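A quick numeric check of both properties of the sine/cosine pair:

```python
import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Sine alone is ambiguous: hours 3 and 9 map to the same value
print(f"sin(3h) = {hour_sin[3]:.3f}, sin(9h) = {hour_sin[9]:.3f}")

# The cosine component breaks the tie
print(f"cos(3h) = {hour_cos[3]:.3f}, cos(9h) = {hour_cos[9]:.3f}")

# And adjacent hours stay close even across the midnight boundary
p23 = np.array([hour_sin[23], hour_cos[23]])
p0 = np.array([hour_sin[0], hour_cos[0]])
print(f"distance(23h, 0h) = {np.linalg.norm(p23 - p0):.3f}")
```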

Interaction and Polynomial Features

Sometimes the relationship between features and the target isn't additive — the combination of two features carries information that neither provides alone. Interaction and polynomial features capture these relationships explicitly.

PolynomialFeatures

from sklearn.preprocessing import PolynomialFeatures

X_small = X[["MedInc", "AveRooms"]].head(5)

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
X_poly = poly.fit_transform(X_small)
print(f"Original features: {X_small.shape[1]}")
print(f"Polynomial features: {X_poly.shape[1]}")
print(f"Feature names: {poly.get_feature_names_out()}")

With degree=2 and two input features, you get five output features: the two originals, their squares, and their interaction term. Set interaction_only=True to skip the squared terms and keep only the cross-product interactions.

Manual Interaction Features

Domain knowledge often points to specific interactions that matter. Rather than generating all possible polynomial combinations (which can get out of hand fast), you can engineer targeted interactions using FunctionTransformer:

def create_housing_interactions(X):
    """Create domain-specific interaction features for housing data."""
    df = pd.DataFrame(X, columns=["MedInc", "AveRooms", "AveOccup", "Latitude"])
    df["income_per_room"] = df["MedInc"] / df["AveRooms"].clip(lower=0.1)
    df["income_x_occupancy"] = df["MedInc"] * df["AveOccup"]
    df["rooms_per_occupant"] = df["AveRooms"] / df["AveOccup"].clip(lower=0.1)
    return df

interaction_creator = FunctionTransformer(create_housing_interactions)

When Polynomial Features Help vs. Hurt

Polynomial features are powerful but come with real costs:

  • They help when the true relationship is non-linear and you're using a linear model, when you have few features and plenty of data, and when you combine them with regularization (Ridge, Lasso) to control overfitting.
  • They hurt when you already have many features (the feature count grows combinatorially), when using tree-based models that naturally capture interactions, and when your dataset is small relative to the number of generated features.

A degree of 2 with d features produces O(d²) features. A degree of 3? O(d³). With 50 input features, degree-2 polynomials generate over 1,300 features. Always combine polynomial features with feature selection or regularization.
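You can check those counts directly: PolynomialFeatures exposes n_output_features_ after fitting, and the data values themselves don't matter for the count:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.zeros((2, 50))  # 50 input features; only the shape matters here

counts = {}
for degree in [2, 3]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    poly.fit(X_demo)
    counts[degree] = poly.n_output_features_
    print(f"degree={degree}: {counts[degree]} output features")
```

Degree 2 yields 1,325 features from 50 inputs; degree 3 pushes that past 23,000.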

Text Feature Basics

While deep NLP is beyond our scope here, scikit-learn offers two solid vectorizers that convert text into numerical features suitable for traditional ML models.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "Python is great for data science",
    "Scikit-learn makes machine learning easy",
    "Feature engineering improves model performance",
    "Python and scikit-learn for machine learning pipelines",
]

# CountVectorizer: raw term counts
count_vec = CountVectorizer(max_features=20, stop_words="english")
X_count = count_vec.fit_transform(documents)
print(f"Vocabulary: {count_vec.get_feature_names_out()}")

# TfidfVectorizer: term frequency-inverse document frequency
tfidf_vec = TfidfVectorizer(
    max_features=50,
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=1,
    max_df=0.95,
    stop_words="english",
)
X_tfidf = tfidf_vec.fit_transform(documents)
print(f"TF-IDF shape: {X_tfidf.shape}")

TfidfVectorizer is generally preferred over CountVectorizer because TF-IDF down-weights terms that appear across many documents, giving more importance to discriminative terms. Key parameters to tune: max_features (vocabulary size cap), ngram_range (unigrams, bigrams, or trigrams), and min_df / max_df (document frequency thresholds).

These vectorizers integrate seamlessly into ColumnTransformer, as we'll see next, making it straightforward to combine text features with numerical and categorical features in a single pipeline.

Building Complete Pipelines with ColumnTransformer

This is where everything comes together. ColumnTransformer lets you apply different transformations to different subsets of columns, and Pipeline chains these transformations with a final estimator. Together, they create a single object that wraps your entire preprocessing and modeling workflow.

The Core Pattern

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, PowerTransformer, SplineTransformer
)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_openml
import pandas as pd

# Load the Ames Housing dataset — a rich dataset with mixed types
ames = fetch_openml(name="house_prices", as_frame=True, parser="auto")
X = ames.data
y = ames.target

# Identify column types
numerical_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

print(f"Numerical columns: {len(numerical_cols)}")
print(f"Categorical columns: {len(categorical_cols)}")

Using make_column_selector

Rather than manually listing columns, make_column_selector selects columns by dtype at pipeline fit time. This makes your pipeline resilient to column ordering changes:

num_selector = make_column_selector(dtype_include=["int64", "float64"])
cat_selector = make_column_selector(dtype_include=["object", "category"])

The remainder Parameter

The remainder parameter on ColumnTransformer controls what happens to columns you haven't explicitly assigned to any transformer:

  • "drop" (default) — silently discards unassigned columns
  • "passthrough" — passes unassigned columns through unchanged
  • A transformer instance — applies that transformer to all remaining columns
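A small sketch of the difference, using hypothetical age, income, and user_id columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40_000, 85_000, 62_000],
    "user_id": [101, 102, 103],  # not assigned to any transformer
})

ct_drop = ColumnTransformer([("scale", StandardScaler(), ["age", "income"])])
ct_pass = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "income"])],
    remainder="passthrough",
)

out_drop = ct_drop.fit_transform(df)  # user_id silently discarded
out_pass = ct_pass.fit_transform(df)  # user_id kept, unchanged
print(out_drop.shape, out_pass.shape)  # (3, 2) (3, 3)
```

The silent default is worth remembering: forgetting a column group with remainder="drop" loses those features without any warning.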

Pandas Output with set_output

Since scikit-learn 1.2, you can configure transformers to output pandas DataFrames instead of NumPy arrays. This is genuinely invaluable for debugging and interpretability:

from sklearn import set_config
set_config(transform_output="pandas")  # Global setting

# Or per-transformer:
# scaler = StandardScaler().set_output(transform="pandas")

Full Pipeline + ColumnTransformer Example

So, let's put it all together. Here's a complete, production-style pipeline that combines numerical scaling, power transforms, categorical encoding, and a final estimator:

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, PowerTransformer, TargetEncoder
)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.datasets import fetch_openml
import numpy as np

# Load dataset
ames = fetch_openml(name="house_prices", as_frame=True, parser="auto")
X = ames.data
y = ames.target

# Split before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Identify column groups by dtype
num_cols = X_train.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_low_card = [
    col for col in X_train.select_dtypes(include=["object", "category"]).columns
    if X_train[col].nunique() <= 10
]
cat_high_card = [
    col for col in X_train.select_dtypes(include=["object", "category"]).columns
    if X_train[col].nunique() > 10
]

# Define sub-pipelines for each column group
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("power", PowerTransformer(method="yeo-johnson", standardize=True)),
])

categorical_low_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(
        handle_unknown="infrequent_if_exist",
        min_frequency=0.02,
        sparse_output=False,
    )),
])

categorical_high_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("target_enc", TargetEncoder(smooth="auto", cv=5)),
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_pipeline, num_cols),
        ("cat_low", categorical_low_pipeline, cat_low_card),
        ("cat_high", categorical_high_pipeline, cat_high_card),
    ],
    remainder="drop",
    verbose_feature_names_out=True,
)

# Full pipeline: preprocessing + model
full_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", Ridge(alpha=10.0)),
])

# Cross-validate on training data
scores = cross_val_score(
    full_pipeline, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)
print(f"MAE: {-scores.mean():.2f} (+/- {scores.std():.2f})")

# Fit and evaluate on held-out test set
full_pipeline.fit(X_train, y_train)
test_score = full_pipeline.score(X_test, y_test)
print(f"Test R²: {test_score:.4f}")

This pipeline handles missing values, applies appropriate transformations based on feature type, and ensures no data leakage — fit is only ever called on training data. The entire pipeline can be serialized, versioned, and deployed as a single object.

Custom Transformers with FunctionTransformer and BaseEstimator

Scikit-learn's built-in transformers cover the most common cases, but real-world projects almost always need custom transformations. There are two approaches: FunctionTransformer for stateless transforms and custom classes inheriting from BaseEstimator and TransformerMixin for stateful ones.

Stateless Transforms with FunctionTransformer

If your transformation doesn't need to learn anything from the training data (no fit step), FunctionTransformer is the simplest approach:

from sklearn.preprocessing import FunctionTransformer
import numpy as np
import pandas as pd

# Log transform with handling for zeros
log_transformer = FunctionTransformer(
    func=lambda X: np.log1p(np.maximum(X, 0)),
    inverse_func=lambda X: np.expm1(X),
    feature_names_out="one-to-one",
)

# Ratio features
def create_ratios(X):
    df = pd.DataFrame(X, columns=["rooms", "bedrooms", "population", "households"])
    df["rooms_per_household"] = df["rooms"] / df["households"].clip(lower=1)
    df["bedrooms_ratio"] = df["bedrooms"] / df["rooms"].clip(lower=1)
    df["people_per_household"] = df["population"] / df["households"].clip(lower=1)
    return df

ratio_transformer = FunctionTransformer(create_ratios)

The feature_names_out parameter tells scikit-learn how to generate output feature names: "one-to-one" keeps the input names, or you can pass a callable that returns custom names.

Stateful Transformers with BaseEstimator

When your transformation needs to learn parameters from training data — computing thresholds, mappings, or statistics — you need a full custom transformer:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd

class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clips outliers based on IQR computed from training data."""

    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        X_arr = np.asarray(X, dtype=float)
        q1 = np.nanpercentile(X_arr, 25, axis=0)
        q3 = np.nanpercentile(X_arr, 75, axis=0)
        iqr = q3 - q1
        self.lower_bound_ = q1 - self.factor * iqr
        self.upper_bound_ = q3 + self.factor * iqr
        self.n_features_in_ = X_arr.shape[1]
        return self

    def transform(self, X):
        X_arr = np.array(X, dtype=float)
        return np.clip(X_arr, self.lower_bound_, self.upper_bound_)

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return np.array([f"x{i}" for i in range(self.n_features_in_)])
        return np.asarray(input_features)

# Usage in a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

robust_pipeline = Pipeline([
    ("clip_outliers", OutlierClipper(factor=1.5)),
    ("scale", StandardScaler()),
])

Key rules for custom transformers:

  • Inherit from both BaseEstimator and TransformerMixin.
  • All constructor parameters must be stored as attributes with the exact same names — this enables get_params() and set_params() for grid search.
  • Learned attributes must end with an underscore (e.g., lower_bound_) — this is a scikit-learn convention that tools like check_is_fitted rely on.
  • The fit method must return self.
  • Implement get_feature_names_out for compatibility with pandas output and pipeline introspection.
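To see why these rules matter, here's a deliberately minimal transformer (CenterByMedian is a toy example, not a library class) that follows all of them — which is exactly what makes clone and get_params work for grid search:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class CenterByMedian(BaseEstimator, TransformerMixin):
    """Minimal stateful transformer following the rules above."""

    def __init__(self, with_scaling=False):
        self.with_scaling = with_scaling      # stored under the exact same name

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.median_ = np.median(X, axis=0)   # learned attribute ends with "_"
        self.iqr_ = np.subtract(*np.percentile(X, [75, 25], axis=0))
        return self                           # fit returns self

    def transform(self, X):
        X = np.asarray(X, dtype=float) - self.median_
        return X / self.iqr_ if self.with_scaling else X

t = CenterByMedian(with_scaling=True)
print(t.get_params())                 # introspected from __init__ by BaseEstimator
fresh = clone(t)                      # unfitted copy with identical params
X_demo = [[1.0], [3.0], [5.0]]
print(fresh.fit_transform(X_demo).ravel())  # [-1.  0.  1.]
```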

A More Complex Example: Frequency Encoder

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Encodes categorical features by their frequency in the training set."""

    def __init__(self, normalize=True):
        self.normalize = normalize

    def fit(self, X, y=None):
        X_df = pd.DataFrame(X)
        self.frequency_maps_ = {}
        for col in X_df.columns:
            counts = X_df[col].value_counts(normalize=self.normalize)
            self.frequency_maps_[col] = counts.to_dict()
        self.n_features_in_ = X_df.shape[1]
        self.feature_names_in_ = np.array(X_df.columns)
        return self

    def transform(self, X):
        X_df = pd.DataFrame(X)
        result = pd.DataFrame()
        for col in X_df.columns:
            freq_map = self.frequency_maps_.get(col, {})
            default = 0.0 if self.normalize else 0
            result[col] = X_df[col].map(freq_map).fillna(default)
        return result.values

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return self.feature_names_in_
        return np.asarray(input_features)

Feature Selection After Engineering

Feature engineering can dramatically increase your feature count. After generating polynomial interactions, one-hot encoded categories, and spline basis functions, you might end up with hundreds or thousands of features — many of them redundant or irrelevant. Feature selection trims the fat.

VarianceThreshold

The simplest filter: remove features with variance below a threshold. This catches constant or near-constant features that carry no useful information:

from sklearn.feature_selection import VarianceThreshold

# Remove features with near-zero variance
# threshold=0.01 means the feature must have variance > 0.01
selector = VarianceThreshold(threshold=0.01)
# X_engineered stands in for the feature matrix produced by your earlier steps
X_selected = selector.fit_transform(X_engineered)
print(f"Features before: {X_engineered.shape[1]}, after: {X_selected.shape[1]}")

SelectKBest with Statistical Tests

SelectKBest uses univariate statistical tests to select the top k features most associated with the target:

from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# For regression: f_regression (linear) or mutual_info_regression (non-linear)
selector_f = SelectKBest(score_func=f_regression, k=20)
selector_mi = SelectKBest(score_func=mutual_info_regression, k=20)

# For classification: f_classif or mutual_info_classif
# selector = SelectKBest(score_func=f_classif, k=20)

Use f_regression or f_classif when you expect linear relationships. Use mutual_info_regression or mutual_info_classif when relationships may be non-linear. Mutual information captures any kind of dependency but is computationally more expensive and has higher variance with small samples.
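The difference is easy to demonstrate on synthetic data where one feature affects the target only through its square; functools.partial pins the random_state of the mutual information estimator so the sketch is reproducible:

```python
import numpy as np
from functools import partial
from sklearn.feature_selection import (
    SelectKBest, f_regression, mutual_info_regression
)

rng = np.random.default_rng(42)
x_lin = rng.normal(size=500)
x_quad = rng.normal(size=500)
noise = rng.normal(size=(500, 3))
X_demo = np.column_stack([x_lin, x_quad, noise])
y_demo = 2 * x_lin + x_quad**2 + rng.normal(scale=0.1, size=500)

# Feature 1 drives y through a square, so its linear correlation is near zero
f_sel = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
mi_score = partial(mutual_info_regression, random_state=0)
mi_sel = SelectKBest(mi_score, k=2).fit(X_demo, y_demo)

print(f"f_regression keeps: {np.flatnonzero(f_sel.get_support())}")
print(f"mutual_info keeps:  {np.flatnonzero(mi_sel.get_support())}")
```

The F-test reliably finds the linear feature but misses the quadratic one; mutual information picks up both.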

Model-Based Feature Selection

Feature importance scores from tree-based models or L1-regularized models provide another powerful selection mechanism:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

# Using a tree model's feature importances
tree_selector = SelectFromModel(
    GradientBoostingRegressor(n_estimators=100, random_state=42),
    threshold="median",  # keep features above median importance
)

# Using Lasso's coefficient magnitudes
lasso_selector = SelectFromModel(
    LassoCV(cv=5, random_state=42),
    threshold=1e-5,  # features with non-zero coefficients
)

Integrating Feature Selection into the Pipeline

Feature selection belongs inside the pipeline, after preprocessing and before the final model. This ensures that selection is based only on training data during cross-validation:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import Ridge

full_pipeline_with_selection = Pipeline([
    ("preprocess", preprocessor),                          # ColumnTransformer
    ("select", SelectKBest(mutual_info_regression, k=30)), # Feature selection
    ("model", Ridge(alpha=10.0)),                          # Final model
])

scores = cross_val_score(
    full_pipeline_with_selection, X_train, y_train,
    cv=5, scoring="neg_mean_absolute_error"
)
print(f"MAE with selection: {-scores.mean():.2f}")

The k parameter in SelectKBest (or the threshold in SelectFromModel) becomes a tunable hyperparameter you can optimize with GridSearchCV or RandomizedSearchCV:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "select__k": [10, 20, 30, 50],
    "model__alpha": [0.1, 1.0, 10.0, 100.0],
}

search = GridSearchCV(
    full_pipeline_with_selection,
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best MAE: {-search.best_score_:.2f}")

Best Practices and Common Pitfalls

Building feature engineering pipelines isn't just about knowing the transformers — it's about avoiding subtle bugs that can silently invalidate your results. Here are the practices that separate reliable pipelines from leaky ones.

Data Leakage: The Silent Killer

Data leakage happens when information from the test set (or the future, in time-series contexts) bleeds into your training process. Feature engineering is one of the most common sources, and I've seen it trip up even experienced practitioners:

  • Scaling before splitting — If you fit StandardScaler on the entire dataset before splitting into train and test, the scaler learns the test set's mean and variance. The test set is no longer truly unseen.
  • Target encoding without cross-fitting — Computing target means on the full training set and using those values on the same training samples creates optimistic estimates.
  • Feature selection on the full dataset — Selecting features based on the entire dataset's correlation with the target leaks test set information into the feature set.

The fix is always the same: put everything inside the pipeline. When you call cross_val_score(pipeline, X, y), scikit-learn ensures that fit is only called on the training folds.

# WRONG: leakage
X_scaled = StandardScaler().fit_transform(X)  # fit on ALL data
X_train, X_test = train_test_split(X_scaled, ...)

# RIGHT: no leakage
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])
pipeline.fit(X_train, y_train)      # scaler fit only on X_train
pipeline.predict(X_test)            # scaler transforms X_test using train stats
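To make the leak-free pattern concrete with cross-validation (synthetic data stands in for a real dataset here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# The scaler is re-fit on each fold's training split only, so the
# held-out fold never influences the learned mean and variance
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_absolute_error")
print(f"Leak-free CV MAE: {-scores.mean():.2f}")
```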

Always Fit on Train, Transform on Everything

This is the cardinal rule of scikit-learn preprocessing. Every transformer has two modes:

  • fit_transform(X_train) — Learn parameters from training data and transform it.
  • transform(X_test) — Apply the already-learned parameters to new data.

Pipelines enforce this automatically. When you call pipeline.fit(X_train, y_train), each transformer step calls fit_transform in sequence and the final estimator calls fit. When you call pipeline.predict(X_test), each transformer calls only transform. That's why pipelines are so much safer than manual step-by-step preprocessing.
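A tiny demonstration of the two modes — note that the test point is standardized using the training statistics, not its own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean=2.5, std≈1.118
X_test_scaled = scaler.transform(X_test)        # applies the *train* statistics

print(scaler.mean_)    # [2.5]
print(X_test_scaled)   # roughly [[6.71]] — far outside the train range
```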

Pipeline Serialization and Deployment

A major advantage of pipelines is that the entire feature engineering and modeling workflow serializes as a single object:

import joblib

# Save the complete pipeline
joblib.dump(full_pipeline, "housing_model_pipeline.joblib")

# Load and predict — no separate preprocessing needed
loaded_pipeline = joblib.load("housing_model_pipeline.joblib")
predictions = loaded_pipeline.predict(new_data)

When deploying models, this eliminates the fragile (and honestly, terrifying) pattern of maintaining separate preprocessing code in your serving layer. The pipeline is the preprocessing code.

If your pipeline uses custom transformers like our OutlierClipper, make sure the custom class is importable in the deployment environment. A common pattern is to package your custom transformers in a small Python module that ships alongside the serialized pipeline.
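A minimal sketch of such a module — a compact stand-in for the OutlierClipper used earlier (the module name my_transformers is illustrative, and the exact parameters in your version may differ):

```python
# my_transformers.py — ships alongside the serialized pipeline
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clip each feature to quantile bounds learned during fit."""

    def __init__(self, quantile=0.01):
        self.quantile = quantile

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Underscore suffix marks attributes learned from data
        self.lower_ = np.quantile(X, self.quantile, axis=0)
        self.upper_ = np.quantile(X, 1 - self.quantile, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)
```

In the serving environment, `from my_transformers import OutlierClipper` must succeed before joblib.load can unpickle the pipeline, because pickle stores the class by its import path rather than its source code.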

Testing Your Pipelines

Pipelines are code, and code should be tested. Here are the key checks to include:

from sklearn.utils.validation import check_is_fitted

# 1. Verify the pipeline fits without error
full_pipeline.fit(X_train, y_train)

# 2. Verify it is fitted
check_is_fitted(full_pipeline)

# 3. Verify output shape is consistent
X_transformed_train = full_pipeline[:-1].transform(X_train)
X_transformed_test = full_pipeline[:-1].transform(X_test)
assert X_transformed_train.shape[1] == X_transformed_test.shape[1], \
    "Train and test feature counts differ!"

# 4. Verify no NaN values in output
import numpy as np
X_out = full_pipeline[:-1].transform(X_test)
assert not np.any(np.isnan(X_out)), "Pipeline output contains NaN values!"

# 5. Verify predictions are reasonable
predictions = full_pipeline.predict(X_test)
assert predictions.shape[0] == X_test.shape[0]
assert np.all(np.isfinite(predictions)), "Predictions contain non-finite values!"

For custom transformers, scikit-learn provides parametrize_with_checks which runs a battery of standard estimator tests. These catch common issues like mutating input data, failing on edge cases, or not handling sparse inputs correctly.
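A sketch of how that battery is typically wired up. The pytest decorator shown in the comments assumes a hypothetical my_transformers module; check_estimator runs the same checks outside pytest, demonstrated here on a built-in transformer for concreteness:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator, parametrize_with_checks

# In a pytest suite, parametrize_with_checks generates one test per check:
#
#   @parametrize_with_checks([OutlierClipper()])
#   def test_sklearn_compatibility(estimator, check):
#       check(estimator)
#
# Outside pytest, check_estimator runs the same battery directly and
# raises on the first failed check:
check_estimator(StandardScaler())
print("All estimator checks passed")
```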

Debugging with Pipeline Introspection

When things go wrong (and they will), you need to inspect intermediate pipeline stages. Scikit-learn makes this possible via indexing and named step access:

# Access a specific step by name
preprocessor_step = full_pipeline.named_steps["preprocess"]

# Access a specific transformer within ColumnTransformer
num_transformer = preprocessor_step.named_transformers_["num"]

# Transform data up to a specific step (exclude the model)
X_preprocessed = full_pipeline[:-1].transform(X_test)

# Get feature names from the preprocessor
feature_names = preprocessor_step.get_feature_names_out()
print(f"Total engineered features: {len(feature_names)}")

Performance Considerations

A few final tips for keeping your pipelines efficient:

  • Use sparse output when possible. If your one-hot encoder produces many columns, keeping sparse_output=True can save significant memory. However, not all downstream models handle sparse matrices — check compatibility first.
  • Cache expensive steps. Pipeline accepts a memory parameter for caching transformer outputs. If you're running grid search over model parameters with a fixed preprocessing pipeline, caching avoids redundant preprocessing: Pipeline([...], memory="cache_dir").
  • Profile before optimizing. Use %%timeit (or cProfile) to identify bottlenecks. The slowest step often isn't where you expect — TargetEncoder with a high cv count or mutual_info_regression over many features are common culprits.
  • Consider n_jobs=-1 on components that support parallelism, such as GridSearchCV and certain feature selectors.

Summary Checklist

Before shipping your feature engineering pipeline, run through these items:

  1. All preprocessing is inside the pipeline — no fit on test data.
  2. Categorical encoders handle unknown categories gracefully (handle_unknown parameter).
  3. Missing values are handled before transformations that can't accept them.
  4. High-cardinality categoricals use TargetEncoder or frequency encoding instead of one-hot.
  5. Feature selection is inside the pipeline and tuned as a hyperparameter.
  6. The pipeline is tested on held-out data with shape and NaN assertions.
  7. Custom transformers follow scikit-learn conventions (underscore attributes, return self).
  8. The pipeline serializes and deserializes correctly with joblib.

Feature engineering is both an art and a science. The science lies in understanding the mathematical transformations — scaling, encoding, interaction terms, feature selection. The art? Knowing which transformations to apply for your specific dataset and domain. Scikit-learn's pipeline infrastructure gives you the tools to express both systematically, reproducibly, and safely. Start with the patterns in this guide, iterate based on cross-validation results, and let the data tell you what works.
