Cross-Validation in Python: Every Scikit-Learn Strategy Explained with Code

Master every cross-validation strategy in scikit-learn 1.8 — KFold, StratifiedKFold, GroupKFold, TimeSeriesSplit, and more. Includes working code, a decision flowchart, and tips for avoiding data leakage.

Choosing the right cross-validation strategy can make or break your model evaluation. I've seen plenty of cases where a model looked stellar on paper — great accuracy, impressive F1 — only to fall apart in production. And more often than not, the culprit was a poorly chosen validation approach.

Standard K-Fold works fine for a lot of problems. But the moment your data has class imbalance, grouped observations, or a time dimension, you need to be more deliberate. Otherwise, you're basically lying to yourself about how well your model generalizes.

This guide walks through every cross-validation splitter available in scikit-learn 1.8, explains when each one makes sense, and shows you how to avoid the most common pitfalls — data leakage, random-state traps, and those sneaky preprocessing mistakes that silently inflate your scores.

Why Cross-Validation Matters

Training a model and evaluating it on the same data is a rookie mistake. A model that memorizes its training labels can score 100% on training data while completely failing on unseen samples. That's overfitting, and cross-validation is the standard defense against it.

Instead of a single train/test split — which is painfully sensitive to how the data happens to be divided — cross-validation repeats the evaluation across multiple splits. Each data point gets a turn in the test set, and the final score is an average across all folds.

The result? A more reliable, less noisy estimate of real-world performance.

The Split-Apply-Combine Pattern

Every cross-validation strategy in scikit-learn follows the same basic pattern:

  1. Split the dataset into training and validation folds
  2. Apply the model — fit on training, predict on validation
  3. Combine the per-fold scores into a single performance estimate

The strategies only differ in how they create those splits. And that decision depends entirely on the structure of your data.
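To make the pattern concrete, here's a minimal manual sketch of those three steps using a splitter's split() method. The loop below is exactly what cross_val_score automates for you:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):                    # 1. Split
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # 2. Apply: fit on train...
    scores.append(model.score(X[test_idx], y[test_idx]))   #    ...score on validation

print(f"Mean accuracy: {np.mean(scores):.4f}")             # 3. Combine
```

Every splitter in this article plugs into that same loop; only the index pairs that split() yields change.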

K-Fold Cross-Validation: The Baseline

KFold splits the dataset into k consecutive, equal-sized folds. Each fold takes a turn as the test set while the remaining k-1 folds form the training set. With shuffle=True, the data gets randomly shuffled before splitting — and honestly, that's almost always what you want for non-time-series data.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

When to use it: Regression tasks or balanced classification problems where observations are independent and identically distributed (i.i.d.).

When to avoid it: Imbalanced classification, grouped data, or anything with a time component.

Stratified K-Fold: Preserving Class Distribution

When your target variable is imbalanced — say, 90% negative and 10% positive — a random K-Fold split can produce folds with wildly different class ratios. One fold might have 15% positive samples while another has only 5%. That makes fold-to-fold scores unreliable, and you end up chasing noise instead of signal.

StratifiedKFold solves this by ensuring each fold contains approximately the same percentage of each class as the full dataset.

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20,
    weights=[0.9, 0.1], random_state=42
)

model = LogisticRegression(max_iter=1000)

# Compare KFold vs StratifiedKFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

kf_scores = cross_val_score(model, X, y, cv=kf, scoring="f1")
skf_scores = cross_val_score(model, X, y, cv=skf, scoring="f1")

print(f"KFold F1:           {kf_scores.mean():.4f} (+/- {kf_scores.std():.4f})")
print(f"StratifiedKFold F1: {skf_scores.mean():.4f} (+/- {skf_scores.std():.4f})")

Pay attention to the standard deviation. StratifiedKFold typically produces tighter, more consistent scores because each fold actually reflects the true class balance. It's one of those things that seems minor until you're comparing models and realize the variance was hiding in the folds, not the model.

When to use it: Any classification task, especially with imbalanced classes. Good news — scikit-learn uses it automatically when you pass an integer to cv= with a classifier.
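You can sanity-check that behavior yourself: an integer cv= with a classifier resolves to an unshuffled StratifiedKFold, so the scores come out identical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# cv=5 with a classifier uses StratifiedKFold(n_splits=5) under the hood
int_scores = cross_val_score(model, X, y, cv=5)
skf_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))

print(np.allclose(int_scores, skf_scores))
```

Note that the automatic version doesn't shuffle; pass an explicit StratifiedKFold(shuffle=True, random_state=...) if you want shuffling.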

Repeated K-Fold and Repeated Stratified K-Fold

Running K-Fold once gives you k scores. Running it multiple times with different random shuffles gives you k × n scores, producing a more stable estimate. This is particularly useful for smaller datasets where the variance between splits can be surprisingly high.

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

model = SVC(kernel="rbf")

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rskf, scoring="f1")

print("50 fold scores (5 folds x 10 repeats)")
print(f"Mean F1: {scores.mean():.4f} (+/- {scores.std():.4f})")

When to use it: Small to medium datasets where you want higher confidence in your estimate. For large datasets, the extra computation usually isn't worth the marginal improvement.

Group K-Fold: Respecting Data Dependencies

This is where things get interesting — and where I've personally seen the most inflated evaluation scores in real projects.

Many real-world datasets contain groups of related observations. Medical records from the same patient, multiple images from the same camera, or transactions from the same user — these aren't independent. If samples from the same group appear in both training and test sets, the model can exploit within-group patterns, leading to scores that look great but don't reflect actual generalization.

GroupKFold ensures that all samples from a given group stay together — either entirely in the training set or entirely in the test set.

import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Simulate 30 patients, each with 10 observations
groups = np.repeat(np.arange(30), 10)

model = GradientBoostingClassifier(n_estimators=100, random_state=42)

gkf = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=gkf, groups=groups, scoring="accuracy")

print(f"GroupKFold accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

When to use it: Whenever your data has a grouping structure — patients, users, devices, geographic locations, sessions. If you're not sure whether groups matter, try both KFold and GroupKFold. A large drop in score with GroupKFold usually means your original evaluation was leaking information.

Stratified Group K-Fold: Groups + Class Balance

So what if your data has both groups and class imbalance? That's where StratifiedGroupKFold comes in. It tries to keep groups intact while also preserving the target distribution across folds. It was added in scikit-learn 1.0, solving a problem that previously required writing custom splitting logic (and I can tell you, that was no fun).

from sklearn.model_selection import StratifiedGroupKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Imbalanced dataset with groups
X, y = make_classification(
    n_samples=600, n_features=15,
    weights=[0.8, 0.2], random_state=42
)
groups = np.repeat(np.arange(60), 10)

model = RandomForestClassifier(n_estimators=100, random_state=42)

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    model, X, y, cv=sgkf, groups=groups, scoring="f1"
)

print(f"StratifiedGroupKFold F1: {scores.mean():.4f} (+/- {scores.std():.4f})")

When to use it: Medical data (patients with multiple visits), user analytics (multiple sessions per user), or really any grouped dataset with imbalanced targets.

Time Series Split: Respecting Temporal Order

Standard K-Fold shuffles data randomly, which means your model might train on Monday's data and predict Sunday's. That's literally data from the future leaking into training. For time-ordered data, this is a critical — and unfortunately very common — mistake.

TimeSeriesSplit enforces a strict temporal order: each test fold comes after its training fold in time. The training set grows with each split, mimicking how you'd actually deploy a model in a walk-forward scenario.

import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import Ridge

# Simulate daily sales data
np.random.seed(42)
dates = pd.date_range("2024-01-01", periods=365, freq="D")
X = np.column_stack([
    np.arange(365),                          # trend
    np.sin(2 * np.pi * np.arange(365) / 7),  # weekly seasonality
    np.random.randn(365)                     # noise feature
])
y = 100 + 0.5 * X[:, 0] + 10 * X[:, 1] + np.random.randn(365) * 5

model = Ridge()

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv, scoring="r2")

print("TimeSeriesSplit R² scores:")
for i, score in enumerate(scores):
    print(f"  Fold {i+1}: {score:.4f}")
print(f"Mean R²: {scores.mean():.4f}")

You'll notice the first fold typically has the worst score — it has the least training data to learn from. Later folds benefit from more history. This growing training set reflects how time series models actually work in practice.
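You can see the growing window directly by printing the split sizes. On 365 samples with five splits, each test fold holds 365 // 6 = 60 observations (scikit-learn's default test size of n_samples // (n_splits + 1)), and the training window grows by one fold each time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(365).reshape(-1, 1)  # one year of daily observations
tscv = TimeSeriesSplit(n_splits=5)

for i, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    print(f"Fold {i}: train={len(train_idx):3d}  test={len(test_idx)}")
```

The training set starts at 65 samples and grows to 305, which is why the first fold is usually the weakest.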

When to use it: Any data with a time dimension — stock prices, sensor readings, web traffic, sales forecasting. No exceptions.

Leave-One-Out and Leave-P-Out

LeaveOneOut (LOO) creates n folds where each fold uses exactly one sample as the test set. It's the most thorough validation possible, but also the most expensive — for a dataset with 10,000 samples, you're training 10,000 separate models. Yeah.

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring="accuracy")

print(f"LOO accuracy: {scores.mean():.4f} (over {len(scores)} folds)")

LeavePOut generalizes this to leaving p samples out, but the number of combinations grows rapidly (C(n, p)) and is rarely practical beyond very small datasets.

When to use it: Small datasets (under ~500 samples) where you need to squeeze every last bit of training data out of each fold.
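A quick way to feel the combinatorial cost is to count the splits each strategy generates on a toy dataset of ten samples:

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(10).reshape(-1, 1)

print(LeaveOneOut().get_n_splits(X))   # n folds: 10
print(LeavePOut(p=2).get_n_splits(X))  # C(10, 2) folds: 45
print(comb(30, 5))                     # C(30, 5) = 142506 models for p=5 on just 30 samples
```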

Shuffle Split: Random Subsampling

ShuffleSplit generates independent train/test splits with a fixed test proportion. Unlike K-Fold, the same sample can appear in the test set of multiple splits, and not every sample is guaranteed to be tested. Think of it as repeated random subsampling — less structured than K-Fold, but sometimes that's exactly what you need.

from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)
model = RandomForestRegressor(n_estimators=50, random_state=42)

ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
scores = cross_val_score(model, X, y, cv=ss, scoring="r2")

print(f"ShuffleSplit R²: {scores.mean():.4f} (+/- {scores.std():.4f})")

When to use it: Large datasets where K-Fold is computationally expensive, or when you want fine-grained control over the train/test ratio.

cross_val_score vs cross_validate: Which One Do You Need?

Scikit-learn gives you two main cross-validation functions, and picking between them is straightforward. cross_val_score returns an array of scores for a single metric — simple, clean, gets the job done. cross_validate returns a dictionary with fit times, score times, test scores, and optionally training scores. It also supports multiple metrics at once.

from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    model, X, y, cv=cv,
    scoring=["accuracy", "f1", "roc_auc"],
    return_train_score=True,
    n_jobs=-1
)

print(f"Test Accuracy:  {results['test_accuracy'].mean():.4f}")
print(f"Test F1:        {results['test_f1'].mean():.4f}")
print(f"Test ROC-AUC:   {results['test_roc_auc'].mean():.4f}")
print(f"Train Accuracy: {results['train_accuracy'].mean():.4f}")
print(f"Mean fit time:  {results['fit_time'].mean():.2f}s")

My general rule: use cross_validate when you need multiple metrics or want to compare training vs. test scores (a big gap between them is a strong signal of overfitting). Use cross_val_score for quick, single-metric sanity checks.

GPU Acceleration in scikit-learn 1.8

Here's something worth knowing: scikit-learn 1.8 introduced Array API support for cross_validate and cross_val_predict. If your pipeline uses compatible transformers and estimators, you can run cross-validation on GPU-backed arrays (via PyTorch or CuPy) for potentially significant speedups:

import sklearn
import torch
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=5000, n_features=50, random_state=42)

# Requires a CUDA-capable GPU, a CUDA build of PyTorch, and an
# estimator with Array API support (e.g., Ridge with solver="svd")
with sklearn.config_context(array_api_dispatch=True):
    X_gpu = torch.tensor(X, dtype=torch.float32, device="cuda")
    y_gpu = torch.tensor(y, dtype=torch.float32, device="cuda")
    results = cross_validate(Ridge(solver="svd"), X_gpu, y_gpu, cv=5)

The Data Leakage Trap: Why Pipelines Are Non-Negotiable

Alright, this might be the most important section in the entire article.

The single most common cross-validation mistake is preprocessing data before splitting. If you scale your features on the entire dataset and then cross-validate, information from the test fold has already leaked into the training fold through the scaler's mean and standard deviation. Your scores will be optimistically biased — sometimes subtly, sometimes dramatically.

The fix is simple: put all preprocessing inside a scikit-learn Pipeline. The pipeline ensures that each fold is independently fit and transformed.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# WRONG: scaling before cross-validation (data leakage!)
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)
# scores = cross_val_score(SVC(), X_scaled, y, cv=5)

# CORRECT: scaling inside a pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("svm", SVC(kernel="rbf"))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print(f"Pipeline CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

This applies to every preprocessing step — scaling, imputation, encoding, feature selection, PCA, SMOTE. If it learns parameters from data, it goes inside the pipeline. No exceptions.

Choosing the Right Strategy: A Decision Flowchart

Not sure which splitter to reach for? Walk through these questions:

  1. Is your data time-ordered? → Use TimeSeriesSplit
  2. Does your data have groups? (patients, users, devices)
    • Balanced classes or regression → GroupKFold
    • Imbalanced classification → StratifiedGroupKFold
  3. Is it a classification task?
    • Use StratifiedKFold (scikit-learn's default for classifiers)
  4. Is it a regression task? → Use KFold with shuffle=True
  5. Very small dataset? → Consider RepeatedStratifiedKFold or LeaveOneOut
  6. Very large dataset? → Consider ShuffleSplit with fewer splits

When in doubt, start with StratifiedKFold for classification and KFold for regression. You can always upgrade to a more specialized strategy later.

Complete Real-World Example: Predicting Wine Quality

Let's put it all together — a full pipeline with proper cross-validation, multiple metrics, and a comparison across strategies:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    cross_validate,
    KFold,
    StratifiedKFold,
    RepeatedStratifiedKFold,
)

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])

strategies = {
    "KFold(5)": KFold(n_splits=5, shuffle=True, random_state=42),
    "StratifiedKFold(5)": StratifiedKFold(
        n_splits=5, shuffle=True, random_state=42
    ),
    "RepeatedStratified(5x3)": RepeatedStratifiedKFold(
        n_splits=5, n_repeats=3, random_state=42
    ),
}

scoring = ["accuracy", "f1_weighted"]

for name, cv in strategies.items():
    results = cross_validate(pipe, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    acc = results["test_accuracy"]
    f1 = results["test_f1_weighted"]
    print(f"{name:<30} Accuracy: {acc.mean():.4f} (+/- {acc.std():.4f})  "
          f"F1: {f1.mean():.4f} (+/- {f1.std():.4f})")

Common Pitfalls and How to Avoid Them

1. Using a Fixed Random State in the Estimator

Passing random_state=42 to your estimator means it uses the same internal RNG on every fold. If your results look great, it might just be luck with that particular seed. Consider passing random_state=None (or a RandomState instance) so each fold gets a different RNG, producing more honest variance estimates.
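One way to surface that hidden variance is to hold the splits fixed and vary only the estimator's seed. The spread across seeds is noise that a single fixed seed would have concealed (a rough sketch, using a small synthetic dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Same folds every time; only the model's internal RNG changes
means = [
    cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=seed),
        X, y, cv=cv,
    ).mean()
    for seed in range(5)
]
print(f"Across 5 seeds: min={min(means):.4f}  max={max(means):.4f}")
```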

2. Treating cross_val_predict Scores as Performance Estimates

This one trips up a lot of people. cross_val_predict returns predictions, not scores. Computing metrics on its output is not equivalent to cross_val_score — the predictions come from models trained on different subsets, so they aren't directly comparable. Stick with cross_val_score or cross_validate for performance estimation.
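What cross_val_predict is good for is diagnostics on out-of-fold predictions, such as a confusion matrix built entirely from predictions the model never trained on:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each prediction comes from a fold where that sample was held out
y_pred = cross_val_predict(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(confusion_matrix(y, y_pred))
```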

3. Ignoring the Variance

Reporting only the mean CV score hides crucial information. A mean accuracy of 0.85 with a standard deviation of 0.02 is far more trustworthy than 0.85 with a standard deviation of 0.10. Always report both — your future self (and your reviewers) will thank you.

4. Too Many or Too Few Folds

5 or 10 folds work well for the vast majority of problems. Using 2 or 3 folds wastes training data and can produce high-bias estimates. Going to 50 folds on a 500-sample dataset makes each test fold tiny and adds computational cost with minimal benefit. Find the sweet spot.

Integrating Cross-Validation with Hyperparameter Tuning

Cross-validation and hyperparameter tuning go hand in hand. GridSearchCV and RandomizedSearchCV use cross-validation internally to evaluate each hyperparameter combination:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC())
])

param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["rbf", "linear"],
    "svm__gamma": ["scale", "auto"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    pipe, param_grid, cv=cv,
    scoring="f1", n_jobs=-1, refit=True
)
search.fit(X, y)

print(f"Best params: {search.best_params_}")
print(f"Best F1:     {search.best_score_:.4f}")

Notice that the Pipeline handles scaling inside the cross-validation loop — no data leakage, even during hyperparameter search. That's the whole point of making pipelines non-negotiable.

Summary Table: Every Scikit-Learn Splitter at a Glance

Splitter                   Preserves Classes   Respects Groups   Respects Time   Best For
KFold                      No                  No                No              Regression, balanced classification
StratifiedKFold            Yes                 No                No              Classification (default)
RepeatedKFold              No                  No                No              Small datasets, regression
RepeatedStratifiedKFold    Yes                 No                No              Small datasets, classification
GroupKFold                 No                  Yes               No              Grouped data, regression
StratifiedGroupKFold       Yes                 Yes               No              Grouped data, imbalanced classes
TimeSeriesSplit            No                  No                Yes             Time-ordered data
LeaveOneOut                No                  No                No              Very small datasets
ShuffleSplit               No                  No                No              Large datasets, custom ratios

FAQ

What is the difference between KFold and StratifiedKFold in scikit-learn?

KFold splits data into equal-sized folds without considering the target variable. StratifiedKFold ensures each fold maintains the same class distribution as the full dataset. For classification tasks — especially imbalanced ones — StratifiedKFold produces more reliable scores. Scikit-learn uses StratifiedKFold automatically for classifiers when you pass an integer to cv=.

How do I prevent data leakage during cross-validation?

Wrap all preprocessing steps (scaling, imputation, encoding, feature selection) inside a scikit-learn Pipeline. The pipeline ensures each transformation is fit only on training data and applied independently to the test fold. Never call fit_transform on your full dataset before cross-validation — it's the most common source of leakage I come across.

How many folds should I use for cross-validation?

5-fold and 10-fold are the most common choices and work well for most problems. Fewer folds (like 3) risk high bias because training sets are smaller. More folds (20+) increase computation with diminishing returns. For very small datasets, consider RepeatedStratifiedKFold or LeaveOneOut instead.

Can I use regular KFold for time series data?

No — and this is a surprisingly common mistake. Standard K-Fold shuffles data randomly, which means future observations can leak into the training set. For time-ordered data, use TimeSeriesSplit, which ensures the training set always precedes the test set chronologically. This mimics real-world deployment where you only have historical data available for training.

What is the difference between cross_val_score and cross_validate?

cross_val_score returns an array of scores for a single metric. cross_validate returns a dictionary with test scores, fit times, score times, and optionally training scores — and it supports evaluating multiple metrics simultaneously. Use cross_validate when you need more than one metric or want to detect overfitting by comparing train vs. test scores.
