SMOTE & Imbalanced Classification Pytho…

Class imbalance is the silent killer of production machine learning models. A fraud classifier that hits 99.3% accuracy on a dataset where only 0.7% of transactions are fraudulent has likely learned a pretty useless trick: predict "not fraud" every single time. And honestly, in 2026 — with fraud detection, churn prediction, rare disease screening, and anomaly detection driving more ML use cases than ever — knowing how to handle skewed labels is just non-negotiable.

So, this guide covers the complete imbalanced classification toolkit in Python: SMOTE and its variants, ADASYN, BorderlineSMOTE, class weighting, threshold tuning, and the right evaluation metrics. Every technique includes runnable code using imbalanced-learn 0.14, scikit-learn 1.8, and numpy 2.x.

I've shipped a few imbalanced models into production over the years (one for transaction fraud, one for early-warning churn), and the lessons below are the ones I keep coming back to.

What Is Class Imbalance and Why Standard Models Fail

A dataset is imbalanced when one class dominates. The ratio matters here: a 60/40 split is mildly skewed and most algorithms tolerate it just fine. A 99/1 split for credit card fraud or rare diseases? That's severe, and it breaks the default training assumptions of most classifiers.

Models optimize a global loss function, and when 99% of the loss can be eliminated by ignoring the minority class, that's exactly what they learn to do. (You can almost picture the optimizer shrugging and taking the easy win.)

Three failure modes show up in production:

Misleading accuracy. A model predicting only the majority class scores 99% on a 99/1 dataset but has zero recall on the class you actually care about.
Decision boundary bias. Classifiers like logistic regression and SVMs shift the boundary into the minority class's region, because misclassifying a handful of minority points adds very little to the total loss.
Threshold mismatch. The default 0.5 probability cutoff implicitly assumes balanced classes and equal misclassification costs. On skewed data, the optimal cutoff is often well below 0.5, sometimes shockingly so.
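
To make the first failure mode concrete, here is a minimal sketch using made-up labels (990 negatives, 10 positives) and a "model" that always predicts the majority class. The numbers are illustrative, not taken from a real dataset.

```python
# Toy labels: 990 negatives and 10 positives (a 99/1 split).
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: true positives / all actual positives.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
# prints: accuracy=0.99, minority recall=0.00
```

The accuracy looks excellent while the model never finds a single positive, which is exactly why accuracy alone can't be trusted here.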

Setting Up the Environment

Install the latest stack. imbalanced-learn 0.14 ships in 2026 with full scikit-learn 1.8 compatibility and free-threaded Python support, which is honestly a huge quality-of-life win.

pip install "scikit-learn>=1.8" "imbalanced-learn>=0.14" "numpy>=2.0" pandas matplotlib

Now, let's create a synthetic imbalanced dataset for the rest of this guide:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    weights=[0.98, 0.02],   # 98/2 imbalance
    flip_y=0.01,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print("Train class distribution:", Counter(y_train))
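# Expect roughly 2.5% positives rather than exactly 2%: flip_y=0.01 randomly
# reassigns about 1% of labels, so the minority count lands a bit above 150.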

Note the stratify=y argument — don't skip this. Always stratify splits on the label column for imbalanced data, since random splits can leave the test set with too few minority examples to evaluate reliably (and you'll spend an afternoon wondering why your metrics jump around between runs).
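
A small sketch of why this matters, assuming a toy label vector with 20 positives out of 1,000 (the names X_toy and y_toy are made up for this illustration). It compares the number of positives that land in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 980 negatives, 20 positives; features are just row indices.
y_toy = np.array([0] * 980 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratified split: the test set keeps the 2% positive rate.
_, _, _, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
strat_pos = int(y_te_strat.sum())

# Plain random splits: the positive count drifts from seed to seed.
plain_counts = []
for seed in range(20):
    _, _, _, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=seed)
    plain_counts.append(int(y_te.sum()))

print("stratified positives in test:", strat_pos)
print("unstratified positives in test (min, max):", min(plain_counts), max(plain_counts))
```

With only a handful of positives in the test set, a swing of two or three examples between seeds is enough to move recall noticeably.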

Approach 1: SMOTE — Synthetic Minority Oversampling

SMOTE generates new minority-class examples by interpolating between existing minority points and their k-nearest neighbors. Unlike naive duplication, the synthesized samples occupy new regions of feature space, which gives classifiers more signal to learn from.

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

smote = SMOTE(sampling_strategy="auto", k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print("After SMOTE:", Counter(y_res))
# Both classes now have the same count (the minority is raised to the majority count)

model = LogisticRegression(max_iter=1000)
model.fit(X_res, y_res)
print(classification_report(y_test, model.predict(X_test), digits=3))
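
To show what "interpolating between a point and one of its neighbors" means, here is a simplified NumPy sketch of the core step. It is not imbalanced-learn's implementation; the function name smote_like_samples and its parameters are assumptions made for this illustration, and it uses a brute-force neighbor search that only suits small arrays.

```python
import numpy as np

def smote_like_samples(X_min, k=5, n_new=10, seed=0):
    """Create n_new synthetic points by interpolating minority points
    toward one of their k nearest minority neighbors (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Brute-force pairwise distances; exclude each point from its own neighbors.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k neighbors for each new sample.
    base_idx = rng.integers(0, n, size=n_new)
    picked = neighbors[base_idx, rng.integers(0, k, size=n_new)]

    # New point = base + lambda * (neighbor - base), with lambda in [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base_idx] + lam * (X_min[picked] - X_min[base_idx])

rng = np.random.default_rng(42)
X_minority = rng.normal(size=(20, 2))   # made-up minority points
X_synth = smote_like_samples(X_minority, k=5, n_new=10)
print(X_synth.shape)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE never extrapolates beyond the region the minority class already occupies.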

Critical Rule: Resample Inside Cross-Validation

Okay, this is the single most common SMOTE mistake I see, and I've made it myself: fitting SMOTE on the entire training set before cross-validation. That leaks synthetic examples into validation folds and inflates scores in a way that looks great on your local machine and terrible in production.

The correct pattern is an imblearn.pipeline.Pipeline (not sklearn.pipeline.Pipeline — they're different) that applies SMOTE only to the training fold:

from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="f1")
print(f"F1 (mean ± std): {scores.mean():.3f} ± {scores.std():.3f}")

SMOTENC for Mixed Categorical Features

Vanilla SMOTE produces fractional values, which makes exactly zero sense for one-hot encoded categories (what does "0.4 of category Red" even mean?). Use SMOTENC when your dataset mixes numerical and categorical columns:

from imblearn.over_sampling import SMOTENC

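# Indices of categorical columns. These are hypothetical: the synthetic dataset
# above is fully numeric, so substitute the categorical columns of your own data.
cat_features = [3, 7, 11]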
smote_nc = SMOTENC(categorical_features=cat_features, random_state=42)
X_res, y_res = smote_nc.fit_resample(X_train, y_train)

Approach 2: ADASYN — Adaptive Synthetic Sampling

ADASYN generates more synthetic samples for minority points that sit near the decision boundary — basically, the ones the classifier finds hardest to learn. Where SMOTE treats every minority example equally, ADASYN concentrates effort where it matters.

from imblearn.over_sampling import ADASYN

adasyn = ADASYN(sampling_strategy="auto", n_neighbors=5, random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y_train)
print("After ADASYN:", Counter(y_res))

When ADASYN backfires: if your minority class contains noise or outliers, ADASYN amplifies them — generating more synthetic points around the worst examples. Run outlier removal (Isolation Forest, LOF) before ADASYN if your data is dirty. I learned this the hard way on a churn dataset where a handful of mislabeled rows tanked recall by about 8 points after "improving" the sampling.
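
The weighting idea can be sketched in a few lines of NumPy. This follows the allocation rule described in the ADASYN paper (share of majority points among each minority point's k neighbors, normalized to sum to 1) and is not imbalanced-learn's internal code; the function name adasyn_allocation and the toy 1-D data are assumptions made for this illustration.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=3, n_to_generate=10):
    """Return minority indices and how many synthetic samples each would get."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # k nearest neighbors of every point in the full dataset (self excluded).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    min_idx = np.flatnonzero(y == minority_label)
    # r_i: fraction of majority points among the k neighbors of minority point i.
    r = (y[neighbors[min_idx]] != minority_label).mean(axis=1)
    if r.sum() == 0:
        return min_idx, np.zeros(len(min_idx), dtype=int)

    # Normalize to a distribution and scale by the total number to generate.
    r_hat = r / r.sum()
    return min_idx, np.rint(r_hat * n_to_generate).astype(int)

# Toy 1-D data: majority at 0..5, two minority points at the edge of the
# majority region (5.6, 6.4), and a clean minority cluster far away.
X_toy = np.array([[0], [1], [2], [3], [4], [5], [5.6], [6.4], [20], [21], [22]])
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

idx, alloc = adasyn_allocation(X_toy, y_toy, k=3, n_to_generate=10)
print(dict(zip(idx.tolist(), alloc.tolist())))
```

In this toy setup the whole budget goes to the two points sitting next to the majority class and none to the clean cluster. If those two points were mislabeled rows, they would be the only ones being amplified.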

Approach 3: BorderlineSMOTE — Focus on the Boundary

BorderlineSMOTE classifies each minority sample by looking at its nearest neighbors across both classes: safe (most neighbors are minority), in danger (at least half of the neighbors, but not all, are majority), or noise (all neighbors are majority). Synthetic samples are generated only from the "in danger" set, which sharpens the decision boundary without bloating safe regions.

from imblearn.over_sampling import BorderlineSMOTE

borderline = BorderlineSMOTE(kind="borderline-1", random_state=42)
X_res, y_res = borderline.fit_resample(X_train, y_train)

BorderlineSMOTE often outperforms vanilla SMOTE when class overlap is significant, for example in customer churn problems where churners and non-churners share many feature values. It's my default starting point for messy real-world tabular data.
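
A rough sketch of the labeling step, assuming the rule from the original Borderline-SMOTE paper: with m neighbors, a minority point is noise if all m are majority, in danger if at least m/2 (but not all) are majority, and safe otherwise. The function name borderline_categories and the toy 1-D data are made up for this illustration; this is not the library's internal code.

```python
import numpy as np

def borderline_categories(X, y, minority_label=1, m=3):
    """Label each minority point as 'safe', 'danger', or 'noise'."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # m nearest neighbors of every point in the full dataset (self excluded).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :m]

    min_idx = np.flatnonzero(y == minority_label)
    # Count majority-class points among each minority point's neighbors.
    n_majority = (y[neighbors[min_idx]] != minority_label).sum(axis=1)

    labels = np.where(
        n_majority == m, "noise",
        np.where(n_majority >= m / 2, "danger", "safe"),
    )
    return min_idx, labels.tolist()

# Toy 1-D data: majority at 0..5; one minority point buried inside the
# majority (2.5), two at the border (5.6, 6.4), and a clean cluster at 20..23.
X_toy = np.array([[0], [1], [2], [3], [4], [5],
                  [2.5], [5.6], [6.4], [20], [21], [22], [23]])
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

idx, labels = borderline_categories(X_toy, y_toy, m=3)
print(list(zip(X_toy[idx, 0].tolist(), labels)))
```

Only the two "danger" points would seed new samples. The buried point at 2.5 is treated as noise and ignored, which is the main practical difference from ADASYN.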

Approach 4: class_weight — No Resampling Needed

Resampling changes your data; class_weight changes the loss function instead. Many scikit-learn classifiers accept class_weight="balanced", which weights each class inversely to its frequency so that errors on rare classes cost more. It's about as low-effort as it gets.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rf = RandomForestClassifier(class_weight="balanced", n_estimators=300, random_state=42)
lr = LogisticRegression(class_weight="balanced", max_iter=1000)

# Or pass custom weights
custom = LogisticRegression(class_weight={0: 1, 1: 50}, max_iter=1000)
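
For reference, the "balanced" heuristic documented by scikit-learn is n_samples / (n_classes * np.bincount(y)). The sketch below reproduces it by hand on a hypothetical 7,350 / 150 label vector so you can see what weights the classifier ends up using.

```python
import numpy as np

# Hypothetical training labels: 7,350 negatives and 150 positives.
y_toy = np.array([0] * 7350 + [1] * 150)

counts = np.bincount(y_toy)                      # samples per class
weights = len(y_toy) / (len(counts) * counts)    # the "balanced" formula

for label, (n, w) in enumerate(zip(counts, weights)):
    # Each class contributes the same total weight: n * w is identical.
    print(f"class {label}: n={n}, weight={w:.4f}, total weight={n * w:.1f}")
```

The minority class gets a weight of 25.0 here, and both classes end up with the same total weight, which is the whole point of the heuristic.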

Use class_weight when:

Your dataset is large enough that resampling is expensive
You want a deterministic pipeline without synthetic data
You're using tree-based models — XGBoost (scale_pos_weight), LightGBM (is_unbalance), and CatBoost (auto_class_weights) all expose equivalent knobs

from xgboost import XGBClassifier

# scale_pos_weight = (negatives / positives)
neg, pos = Counter(y_train)[0], Counter(y_train)[1]
xgb = XGBClassifier(
    scale_pos_weight=neg / pos,
    n_estimators=400,
    learning_rate=0.05,
    eval_metric="aucpr",
    random_state=42,
)
xgb.fit(X_train, y_train)

Approach 5: Threshold Tuning — The Underused Lever

Even after balancing, the default 0.5 probability cutoff is rarely optimal on imbalanced data. Threshold tuning is essentially free — it doesn't retrain anything — and frequently buys 5 to 15 points of F1. Honestly, this is the single highest-ROI thing you can do, and most practitioners skip it.

import numpy as np
from sklearn.metrics import precision_recall_curve

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_probs = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)

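# F1 for each threshold (avoid div-by-zero).
# Note: thresholds are scanned on the test set here for brevity. In practice,
# choose the threshold on a validation fold so the test score stays honest.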
f1_scores = 2 * precision * recall / (precision + recall + 1e-12)
best_idx = np.argmax(f1_scores[:-1])  # last point has no threshold
best_threshold = thresholds[best_idx]

print(f"Optimal threshold: {best_threshold:.3f}")
print(f"F1 at default 0.5: {f1_scores[np.argmin(np.abs(thresholds - 0.5))]:.3f}")
print(f"F1 at optimal:     {f1_scores[best_idx]:.3f}")

TunedThresholdClassifierCV (scikit-learn 1.5+)

Scikit-learn now ships a built-in cross-validated threshold tuner that wraps any classifier. No more hand-rolling the loop above:

from sklearn.model_selection import TunedThresholdClassifierCV

base = LogisticRegression(class_weight="balanced", max_iter=1000)
tuned = TunedThresholdClassifierCV(base, scoring="f1", cv=5, random_state=42)
tuned.fit(X_train, y_train)

print(f"Selected threshold: {tuned.best_threshold_:.3f}")
print(classification_report(y_test, tuned.predict(X_test), digits=3))

Choosing the Right Evaluation Metric

Stop reporting accuracy on imbalanced problems. Just stop. The metric you optimize defines what your model learns, and the wrong metric guarantees the wrong model.

Metric	Use When	sklearn Function
F1 score	You need a balance of precision and recall on the minority class	`f1_score`
PR-AUC	Severe imbalance, ranking quality matters more than a single threshold	`average_precision_score`
ROC-AUC	Mild imbalance or when both classes matter equally	`roc_auc_score`
Balanced accuracy	Multi-class with skew, want a simple symmetric metric	`balanced_accuracy_score`
MCC	You want a single number that penalizes all four confusion-matrix cells	`matthews_corrcoef`

For severe imbalance (1% or less), prefer PR-AUC over ROC-AUC. ROC curves look optimistic on imbalanced data because the false positive rate stays low when negatives dominate, even when the model is weak on the minority class. (This is one of those quirks that catches people off guard the first time.)

from sklearn.metrics import (
    average_precision_score, roc_auc_score,
    balanced_accuracy_score, matthews_corrcoef
)

y_probs = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

print(f"PR-AUC:           {average_precision_score(y_test, y_probs):.3f}")
print(f"ROC-AUC:          {roc_auc_score(y_test, y_probs):.3f}")
print(f"Balanced acc:     {balanced_accuracy_score(y_test, y_pred):.3f}")
print(f"MCC:              {matthews_corrcoef(y_test, y_pred):.3f}")
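
To see how differently these metrics react to the same predictions, here is a hand-computed example from a made-up confusion matrix (TP=5, FN=5, FP=10, TN=980). The counts are invented for illustration; the formulas are the standard definitions.

```python
import math

# Made-up confusion matrix on 1,000 samples with 10 actual positives.
tp, fn, fp, tn = 5, 5, 10, 980

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)            # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_acc = (recall + specificity) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy:     {accuracy:.3f}")
print(f"precision:    {precision:.3f}")
print(f"recall:       {recall:.3f}")
print(f"F1:           {f1:.3f}")
print(f"balanced acc: {balanced_acc:.3f}")
print(f"MCC:          {mcc:.3f}")
```

Accuracy comes out at 0.985 while F1 and MCC sit around 0.40: the model misses half the positives and two out of three alerts are false, and only the minority-aware metrics show it.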

Combining Techniques: Resample + Clean + Weight

The strongest results usually come from combining oversampling with undersampling to clean the resulting dataset. SMOTEENN applies SMOTE then removes ambiguous points using Edited Nearest Neighbors. SMOTETomek follows SMOTE with Tomek-link removal to clean class boundaries.

from imblearn.combine import SMOTEENN, SMOTETomek

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("resample", SMOTEENN(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
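
For intuition about the cleaning step, a Tomek link is a pair of points from opposite classes that are each other's nearest neighbor. The sketch below finds such pairs with brute-force NumPy on toy 1-D data; the function name tomek_links is made up for this illustration and this is not the library's implementation.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors
    and belong to different classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)      # nearest neighbor of each point

    links = []
    for i, j in enumerate(nearest):
        j = int(j)
        # Mutual nearest neighbors with different labels; i < j avoids duplicates.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links

# Toy 1-D data: the points at 2.0 (class 0) and 2.3 (class 1) sit right
# next to each other across the class boundary.
X_toy = np.array([[0.0], [1.2], [2.0], [2.3], [5.0], [5.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(tomek_links(X_toy, y_toy))
# prints: [(2, 3)]
```

Removing one or both members of each link pulls the two classes slightly apart, which is what "cleaning the class boundary" refers to.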

Decision Framework: Which Approach Should You Pick?

There's no universal best technique here — anyone who tells you otherwise is selling something. Use this decision tree:

Try class_weight="balanced" first. It's the simplest baseline and often within 2-3% F1 of any oversampling method. If it works, you're done — go take a walk.
If using XGBoost / LightGBM / CatBoost, tune scale_pos_weight or equivalent. These models handle imbalance natively better than you might expect.
If you have moderate imbalance (10:1 to 50:1) and clean data, use SMOTE inside an imblearn Pipeline.
If class overlap is the main problem, use BorderlineSMOTE.
If you have noisy minority points, avoid ADASYN; try SMOTEENN or clean outliers first.
If you have categorical features, use SMOTENC.
Always tune the decision threshold on a validation fold, regardless of which approach you picked above.
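
If it helps to keep the checklist next to your code, the list above can be written as a small helper. This is only a sketch: the function name suggest_strategy, its flags, and the priority order among the resampling branches are assumptions made for this illustration, not part of any library.

```python
def suggest_strategy(
    uses_gbm=False,
    imbalance_ratio=None,     # majority:minority, e.g. 20 for 20:1
    clean_data=True,
    class_overlap=False,
    noisy_minority=False,
    has_categorical=False,
):
    """Return an ordered list of things to try (hypothetical helper)."""
    steps = ['class_weight="balanced" baseline']

    if uses_gbm:
        steps.append("tune scale_pos_weight or equivalent")

    # At most one resampling suggestion; the order of these checks is an assumption.
    if has_categorical:
        steps.append("SMOTENC")
    elif noisy_minority:
        steps.append("SMOTEENN or outlier cleaning (avoid ADASYN)")
    elif class_overlap:
        steps.append("BorderlineSMOTE")
    elif imbalance_ratio is not None and 10 <= imbalance_ratio <= 50 and clean_data:
        steps.append("SMOTE inside an imblearn Pipeline")

    # Threshold tuning applies regardless of the branch taken above.
    steps.append("tune the decision threshold on a validation fold")
    return steps

for step in suggest_strategy(uses_gbm=True, imbalance_ratio=20):
    print("-", step)
```

Treat the output as a starting order for experiments rather than a verdict; the measured validation scores should decide what you keep.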

Anti-Patterns to Avoid

Resampling the test set. Never. Not even once. The test set must reflect production distribution.
Resampling before train/test split. Synthetic points leak into the test set, and your metrics become a fairy tale.
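Using sklearn.pipeline.Pipeline with SMOTE. scikit-learn's Pipeline expects every intermediate step to implement transform, which samplers don't, so it raises an error instead of resampling. Use imblearn.pipeline.Pipeline, which calls fit_resample during fit only and skips the sampler at predict time.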
Reporting accuracy on a 99/1 dataset. A model that always predicts the majority class will look like 99% genius.
Applying SMOTE to time-series classification. Synthetic interpolation breaks temporal order. Use cost-sensitive learning instead.
Stacking SMOTE with high-dimensional sparse features. Distance-based interpolation degrades in high dimensions. Reduce dimensionality first or skip resampling altogether.

Frequently Asked Questions

When should I use SMOTE vs class_weight?

Start with class_weight="balanced" — it's one line of code and adds no risk of data leakage. Switch to SMOTE when class_weight underperforms and your dataset is small enough (under ~1M rows) that synthesizing samples is computationally cheap. For large datasets, class_weight or built-in classifier parameters (scale_pos_weight) are usually the better call.

Does SMOTE work for multi-class problems?

Yes. imbalanced-learn handles multi-class targets by oversampling each minority class separately against the rest of the data (a one-vs-rest scheme). Set sampling_strategy="auto" to bring every class except the majority up to the majority count, or pass a dict to set custom targets per class: sampling_strategy={1: 500, 2: 500}.

Should I apply SMOTE before or after train/test split?

Always after — and only on the training set. Applying SMOTE before splitting leaks synthetic points into the test set, which wildly inflates evaluation scores. Inside cross-validation, use imblearn.pipeline.Pipeline so SMOTE is refit on each training fold.

Why is my SMOTE model performing worse than the baseline?

Three common reasons: (1) an earlier evaluation fit SMOTE before splitting, so those scores were inflated by leakage and the correctly built pipeline looks worse by comparison; (2) your minority class contains outliers that SMOTE amplified, so try BorderlineSMOTE or remove outliers first; (3) you're evaluating with accuracy instead of F1 or PR-AUC, so the metric just doesn't reflect actual gains on the minority class.

Is SMOTE still relevant in 2026 with modern gradient boosting?

Less than it used to be, honestly. XGBoost, LightGBM, and CatBoost all handle moderate imbalance well via scale_pos_weight or focal loss, often matching or beating SMOTE without the data manipulation. SMOTE still helps for severe imbalance (1% or less), small datasets where every signal counts, and models that lack built-in class weighting.

Key Takeaways

Class imbalance breaks default training assumptions — fix the metric first, then the data, then the threshold.
class_weight="balanced" is the cheapest baseline and often the strongest. Always try it first.
SMOTE, ADASYN, and BorderlineSMOTE each shine in different scenarios — match the technique to your data shape.
Always resample inside cross-validation using imblearn.pipeline.Pipeline, never before the split.
Threshold tuning on the predicted probabilities is the highest-leverage, lowest-cost lever you have.
Use F1, PR-AUC, or MCC for evaluation. Accuracy is a trap on imbalanced data — and it'll bite you in production.
