Dimensionality Reduction in Python: PCA, t-SNE, and UMAP Compared

Learn when to use PCA, t-SNE, or UMAP for dimensionality reduction in Python. Runnable scikit-learn 1.8 code examples, side-by-side benchmarks, parameter tuning tips, and a practical decision framework for choosing the right technique.

High-dimensional datasets are everywhere — gene expression matrices, text embeddings, image features, sensor readings. If you've ever tried building a model on hundreds or thousands of features, you already know the pain. The curse of dimensionality kicks in fast: models overfit, distance metrics stop being useful, and good luck trying to visualize anything. Dimensionality reduction fixes this by projecting your data into a lower-dimensional space while (hopefully) keeping the structure that actually matters.

In this guide, I'll walk you through the three most popular dimensionality reduction techniques in Python — PCA, t-SNE, and UMAP — with complete, runnable code on a real dataset. You'll learn how each algorithm works under the hood, when to reach for one over another, and how to combine them in a practical pipeline using scikit-learn 1.8 and umap-learn 0.5.

What Is Dimensionality Reduction?

Dimensionality reduction transforms a dataset with many features into one with fewer features, while retaining as much useful information as possible. There are two primary use cases:

  • Visualization — reducing to 2 or 3 dimensions so humans can actually see clusters and patterns.
  • Feature compression — reducing to 10–50 dimensions so downstream models train faster and generalize better.

The techniques split into two families. Linear methods like PCA find straight-line projections that maximize variance. Non-linear methods like t-SNE and UMAP learn curved manifolds and can uncover clusters that linear methods miss entirely.

That distinction matters more than you might think. I've seen teams spend hours tweaking PCA on data that had clear non-linear structure — switching to UMAP gave them usable clusters in minutes.
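A quick way to see what "linear" means in practice: PCA's transform is nothing more than mean-centering followed by a single matrix multiplication. Here's a minimal sketch (on random stand-in data, not the Digits set used below) that verifies this against scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in high-dimensional data

pca = PCA(n_components=2).fit(X)

# PCA's transform is just: center the data, then project onto the components
manual = (X - pca.mean_) @ pca.components_.T
assert np.allclose(manual, pca.transform(X))
```

t-SNE and UMAP have no such closed-form projection; that is exactly what makes them non-linear, and also harder to interpret.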

Setting Up the Environment

All examples below use the same dataset and imports. Install the required packages first:

pip install scikit-learn umap-learn matplotlib numpy

We'll use the Digits dataset from scikit-learn — 1,797 samples of 8×8 handwritten digit images (64 features). It's large enough to show meaningful structure yet small enough to run every technique in seconds.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Load and scale the data
digits = load_digits()
X, y = digits.data, digits.target  # X shape: (1797, 64)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {np.unique(y)}")  # 0-9 digits

Scaling is important here: PCA is sensitive to feature magnitudes, and standardizing to zero mean and unit variance ensures no single feature dominates the projection. Don't skip this step — I've seen it ruin results more than once.

PCA: Linear Dimensionality Reduction

How PCA Works

Principal Component Analysis finds orthogonal axes — called principal components — that capture the maximum variance in the data. The first component explains the most variance, the second explains the next-most (while being orthogonal to the first), and so on.

Because PCA uses a linear transformation, it's fast and deterministic. Same data in, same result out. Every time. That predictability is honestly one of its best features.
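Both properties are easy to verify: the components form an orthonormal basis, and the explained variance comes out sorted in decreasing order. A small sanity check on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))

pca = PCA(n_components=4).fit(X)

# Components form an orthonormal basis: C @ C.T equals the identity matrix
gram = pca.components_ @ pca.components_.T
assert np.allclose(gram, np.eye(4), atol=1e-8)

# Explained variance is non-increasing: PC1 >= PC2 >= ...
ev = pca.explained_variance_
assert np.all(ev[:-1] >= ev[1:])
```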

PCA in scikit-learn

from sklearn.decomposition import PCA

# Reduce to 2 components for visualization
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.1%}")

# Plot
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="tab10", s=8, alpha=0.7)
plt.colorbar(scatter, label="Digit")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA — Digits Dataset (2 Components)")
plt.tight_layout()
plt.savefig("pca_digits.png", dpi=150)
plt.show()

With only two components, PCA retains only about a fifth of the total variance on the standardized data (the print statement above shows the exact figure for your run). Some digit groups will overlap because — well — a linear projection just can't fully separate 10 classes that live in 64-dimensional space.
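To quantify what's lost, you can round-trip the data through inverse_transform and compare reconstruction error at different component counts. A short sketch (the 30-component figure is an arbitrary choice for contrast):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

def reconstruction_error(n_components):
    """Mean squared error after projecting down and reconstructing back."""
    pca = PCA(n_components=n_components).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_rec) ** 2)

err_2 = reconstruction_error(2)
err_30 = reconstruction_error(30)
print(f"MSE with 2 components: {err_2:.3f}, with 30 components: {err_30:.3f}")
```

More components always means lower reconstruction error; the interesting question is where the marginal gain flattens out, which is what the scree plot in the next section shows.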

Choosing the Number of Components

When using PCA for feature compression rather than visualization, you need to decide how many components to keep. Here are three practical approaches:

# Method 1: Variance threshold — keep 95% of variance
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_95 = pca_95.fit_transform(X_scaled)
print(f"Components for 95% variance: {pca_95.n_components_}")  # ~28 components

# Method 2: Elbow / scree plot
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

plt.figure(figsize=(8, 4))
plt.plot(range(1, len(cumulative) + 1), cumulative, "o-", markersize=3)
plt.axhline(y=0.95, color="r", linestyle="--", label="95% threshold")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Scree Plot — Digits Dataset")
plt.legend()
plt.tight_layout()
plt.savefig("pca_scree.png", dpi=150)
plt.show()

# Method 3: Minka's MLE — automatic selection
pca_mle = PCA(n_components="mle", svd_solver="full")
X_mle = pca_mle.fit_transform(X_scaled)
print(f"MLE selected components: {pca_mle.n_components_}")

For most ML pipelines, retaining 95% of the variance is a solid default. Just pass n_components=0.95 and scikit-learn handles the rest. Easy.

t-SNE: Non-Linear Visualization

How t-SNE Works

t-Distributed Stochastic Neighbor Embedding (quite a mouthful, right?) converts pairwise similarities in the high-dimensional space into probability distributions, then finds a low-dimensional arrangement that minimizes the divergence between the original and embedded distributions. It excels at preserving local structure — nearby points in the original space stay nearby in the embedding — making clusters visually pop.

There's a catch, though. t-SNE is stochastic: different random seeds produce different layouts. The axes have no inherent meaning, and distances between well-separated clusters aren't reliable. Use t-SNE for exploration, not for downstream modeling.
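That stochasticity is easy to demonstrate. The sketch below deliberately uses init="random" (rather than the more stable "pca" initialization recommended later) so the seed dependence is visible; a 200-sample subset keeps it fast:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)[:200]

# Same data, same parameters, different seeds
emb_a = TSNE(n_components=2, init="random", perplexity=20, random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, init="random", perplexity=20, random_state=1).fit_transform(X)

# The layouts differ: absolute coordinates carry no meaning
assert emb_a.shape == emb_b.shape == (200, 2)
assert not np.allclose(emb_a, emb_b)
```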

t-SNE in scikit-learn 1.8

from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,         # balance between local and global structure
    learning_rate="auto",  # defaults to max(N/12/4, 50)
    max_iter=1000,
    init="pca",            # scikit-learn 1.8 supports sparse input with PCA init
    random_state=42,
)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=8, alpha=0.7)
plt.colorbar(scatter, label="Digit")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("t-SNE — Digits Dataset")
plt.tight_layout()
plt.savefig("tsne_digits.png", dpi=150)
plt.show()

The result typically shows 10 tight, well-separated clusters — far clearer than the PCA plot. That's the power of non-linear embedding for visualization. The first time I saw my messy PCA blob turn into clean t-SNE clusters, it was genuinely satisfying.

Key t-SNE Parameters

  • perplexity (default 30): Controls how many neighbors each point considers. Lower values emphasize local structure; higher values capture more global structure. Try values between 5 and 50.
  • learning_rate: Set to "auto" in scikit-learn 1.8, which usually works well. Manual tuning between 10 and 1000 is sometimes needed for very large or very small datasets.
  • max_iter: Number of optimization iterations. The default of 1000 is sufficient for most datasets.
  • init: Use "pca" for more reproducible results. scikit-learn 1.8 now supports PCA initialization even with sparse input matrices.
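A quick way to build intuition for perplexity is to sweep a few values on a subset and compare the plots. A sketch (the values 5, 30, 50 and the 400-sample subset are arbitrary choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)[:400]
y = digits.target[:400]

# One embedding per perplexity value
embeddings = {
    perp: TSNE(n_components=2, perplexity=perp, init="pca", random_state=42).fit_transform(X)
    for perp in (5, 30, 50)
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (perp, emb) in zip(axes, embeddings.items()):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
    ax.set_title(f"perplexity={perp}")
plt.tight_layout()
plt.savefig("tsne_perplexity_sweep.png", dpi=150)
```

At perplexity 5 you'll typically see fragmented micro-clusters; at 50 the digit groups consolidate but fine structure within each class fades.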

Speeding Up t-SNE with PCA Preprocessing

t-SNE scales as O(N²) with the exact method and O(N log N) with Barnes-Hut. For large datasets, reducing to 50 dimensions with PCA first makes a huge difference:

# PCA to 50 dims, then t-SNE to 2 dims
pca_50 = PCA(n_components=50, random_state=42)
X_pca50 = pca_50.fit_transform(X_scaled)

tsne_fast = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_tsne_fast = tsne_fast.fit_transform(X_pca50)
# Significantly faster on high-dimensional data (1000+ features)

UMAP: Fast Non-Linear Reduction

How UMAP Works

Uniform Manifold Approximation and Projection builds a topological graph of the high-dimensional data based on nearest neighbors, then optimizes a low-dimensional layout to match that graph's structure. What makes UMAP special is that it preserves both local and global structure, runs much faster than t-SNE, and can project new unseen data — something t-SNE simply can't do.

UMAP in Python

from umap import UMAP

reducer = UMAP(
    n_components=2,
    n_neighbors=15,    # controls local vs global balance
    min_dist=0.1,      # how tightly points cluster
    metric="euclidean",
    random_state=42,
)
X_umap = reducer.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap="tab10", s=8, alpha=0.7)
plt.colorbar(scatter, label="Digit")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("UMAP — Digits Dataset")
plt.tight_layout()
plt.savefig("umap_digits.png", dpi=150)
plt.show()

UMAP typically produces clusters as visually distinct as t-SNE, while maintaining more meaningful inter-cluster distances. Digits that are visually similar (like 3 and 8, or 1 and 7) often appear near each other, reflecting real global relationships. That's a nice bonus you don't get with t-SNE.

Key UMAP Parameters

  • n_neighbors (default 15): Larger values capture more global structure; smaller values emphasize local detail. Typical range: 5–50.
  • min_dist (default 0.1): Controls the minimum distance between points in the embedding. Lower values produce tighter clusters; higher values spread points out more evenly.
  • metric: The distance metric used to compute neighbors. Use "euclidean" for numeric data, "cosine" for text embeddings, or "manhattan" for sparse count data.
  • n_components: Unlike t-SNE, UMAP works well with more than 2 dimensions. Use 10–50 components for downstream ML tasks.

Transforming New Data

This is a key advantage of UMAP over t-SNE: once fitted, you can project unseen data without refitting. In production settings, that capability is often the deciding factor.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Fit on training data
reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
reducer.fit(X_train)

# Transform both sets
X_train_umap = reducer.transform(X_train)
X_test_umap = reducer.transform(X_test)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
axes[0].scatter(X_train_umap[:, 0], X_train_umap[:, 1], c=y_train, cmap="tab10", s=8, alpha=0.7)
axes[0].set_title("UMAP — Training Set")
axes[1].scatter(X_test_umap[:, 0], X_test_umap[:, 1], c=y_test, cmap="tab10", s=8, alpha=0.7)
axes[1].set_title("UMAP — Test Set (Transformed)")
plt.tight_layout()
plt.savefig("umap_train_test.png", dpi=150)
plt.show()

Side-by-Side Comparison

So, let's put all three techniques next to each other on the same dataset and see the differences at a glance:

import time

results = {}
methods = {
    "PCA": lambda X: PCA(n_components=2, random_state=42).fit_transform(X),
    "t-SNE": lambda X: TSNE(n_components=2, perplexity=30, init="pca", random_state=42).fit_transform(X),
    "UMAP": lambda X: UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X),
}

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for ax, (name, func) in zip(axes, methods.items()):
    start = time.perf_counter()
    X_reduced = func(X_scaled)
    elapsed = time.perf_counter() - start
    results[name] = elapsed

    scatter = ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="tab10", s=8, alpha=0.7)
    ax.set_title(f"{name} ({elapsed:.2f}s)")
    ax.set_xlabel(f"{name} 1")
    ax.set_ylabel(f"{name} 2")

plt.tight_layout()
plt.savefig("comparison_all.png", dpi=150)
plt.show()

for name, t in results.items():
    print(f"{name}: {t:.3f} seconds")

On the Digits dataset (1,797 samples, 64 features), typical runtimes are:

  • PCA: ~0.005 seconds
  • UMAP: ~2–5 seconds
  • t-SNE: ~3–8 seconds

The speed gap widens dramatically on larger datasets. At 100K+ samples, PCA still runs in under a second, UMAP finishes in tens of seconds, while t-SNE can take minutes to hours. If you're working with big data, that difference alone might make your decision for you.
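Speed isn't the only axis worth measuring. scikit-learn's trustworthiness metric scores how well an embedding preserves local neighborhoods (1.0 is perfect). A sketch comparing PCA and t-SNE on a 500-sample subset; the same call works unchanged on a UMAP embedding:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)[:500]

X_pca = PCA(n_components=2, random_state=42).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42).fit_transform(X)

# Fraction of each point's low-dimensional neighbors that were also
# neighbors in the original 64-dimensional space
t_pca = trustworthiness(X, X_pca, n_neighbors=10)
t_tsne = trustworthiness(X, X_tsne, n_neighbors=10)
print(f"Trustworthiness — PCA: {t_pca:.3f}, t-SNE: {t_tsne:.3f}")
```

On this dataset t-SNE scores noticeably higher, which matches the table below: it trades global fidelity for excellent local neighborhood preservation.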

When to Use Which Technique

Here's a decision framework to help you pick the right tool:

Criterion                    PCA        t-SNE       UMAP
Type                         Linear     Non-linear  Non-linear
Speed                        Very fast  Slow        Fast
Preserves global structure   Yes        No          Partially
Preserves local structure    Poorly     Excellent   Excellent
Deterministic                Yes        No          No
Transform new data           Yes        No          Yes
Use in ML pipelines          Yes        Rarely      Yes
Interpretable components     Yes        No          No
Best for >100K samples       Yes        No          Yes

Practical Guidelines

  • Preprocessing for ML models: Use PCA to reduce features before training classifiers or regressors. Set n_components=0.95 for an automatic threshold.
  • Exploring cluster structure: Use UMAP or t-SNE for 2D visualization. UMAP is the better choice for datasets with more than 10K samples.
  • Production pipelines: UMAP supports transform() on new data, making it suitable for embedding incoming data without refitting. PCA is even more lightweight for this purpose.
  • Combining techniques: Apply PCA first (to ~50 components) to remove noise and speed up UMAP or t-SNE on very high-dimensional data. This is honestly what I'd recommend as a default starting point for most projects.

Building a PCA + UMAP Pipeline with scikit-learn

Because umap-learn follows the scikit-learn transformer API, you can chain PCA and UMAP into a single pipeline. This is where things get really convenient:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Pipeline: Scale → PCA (50 dims) → UMAP (10 dims) → Classify
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=50, random_state=42)),
    ("umap", UMAP(n_components=10, n_neighbors=15, min_dist=0.1, random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

This pipeline scales the data, reduces 64 features to 50 with PCA, compresses further to 10 with UMAP, and classifies with a Random Forest. The PCA step removes noise and speeds up UMAP, while UMAP captures the non-linear structure that PCA misses. It's a clean setup that works well in practice.

Supervised UMAP for Better Separation

UMAP has a trick that PCA and t-SNE lack: supervised dimensionality reduction. By passing labels to the fit method, UMAP adjusts the embedding to improve class separation:

# Unsupervised vs supervised UMAP
umap_unsup = UMAP(n_components=2, random_state=42)
X_unsup = umap_unsup.fit_transform(X_scaled)

umap_sup = UMAP(n_components=2, random_state=42)
X_sup = umap_sup.fit_transform(X_scaled, y=y)  # pass labels

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
axes[0].scatter(X_unsup[:, 0], X_unsup[:, 1], c=y, cmap="tab10", s=8, alpha=0.7)
axes[0].set_title("UMAP — Unsupervised")
axes[1].scatter(X_sup[:, 0], X_sup[:, 1], c=y, cmap="tab10", s=8, alpha=0.7)
axes[1].set_title("UMAP — Supervised")
plt.tight_layout()
plt.savefig("umap_supervised.png", dpi=150)
plt.show()

Supervised UMAP produces noticeably cleaner class boundaries. If you have labels available (even partial ones), it's worth trying this as a preprocessing step before classification. The improvement can be surprisingly large.

Frequently Asked Questions

Do I need to scale my data before applying PCA, t-SNE, or UMAP?

Yes — especially for PCA. Since PCA finds directions of maximum variance, features with larger scales will dominate the result. Use StandardScaler to standardize features to zero mean and unit variance before applying any of these techniques. t-SNE and UMAP are somewhat less sensitive to scaling, but it's still best practice to standardize first for consistent results.
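To see the effect concretely, here's a tiny synthetic example (made up for illustration): two independent features, one measured on a 1,000× larger scale. Without scaling, PCA assigns essentially all variance to the large-scale feature; after StandardScaler, the split is roughly even.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 0] *= 1000  # feature 0 now dominates in raw magnitude

ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_
ratio_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(f"Unscaled: {ratio_raw}")    # PC1 captures essentially all variance
print(f"Scaled:   {ratio_scaled}")  # roughly a 50/50 split
```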

Can I use t-SNE or UMAP embeddings as features for a machine learning model?

UMAP — yes, and it works well. Because UMAP supports transform(), you can fit on training data and apply the same transformation to test data. t-SNE — generally no. It has no transform() method, so you can't apply it to unseen data. It also distorts global distances, making the embedding unreliable as features. For ML pipelines, stick with PCA or UMAP for feature reduction.

How do I choose the right perplexity for t-SNE or n_neighbors for UMAP?

Both parameters control the balance between local and global structure. Start with the defaults — perplexity=30 for t-SNE and n_neighbors=15 for UMAP. If your clusters look too fragmented, increase the value to capture more global context. If clusters are too diffuse and merged, decrease it. For t-SNE perplexity, try values in the range 5–50. For UMAP n_neighbors, try 5–50. Always visualize results at multiple settings before drawing conclusions.

Why does t-SNE give different results each time I run it?

t-SNE is stochastic — it uses random initialization and gradient descent, so different runs produce different layouts. Set random_state=42 (or any fixed integer) for reproducibility. Also use init="pca" in scikit-learn 1.8 for a more stable starting point. Even with fixed seeds, keep in mind that the absolute positions and axis scales in a t-SNE plot are meaningless — only the relative distances between nearby points carry information.

How do PCA, t-SNE, and UMAP handle very large datasets (100K+ samples)?

PCA handles large datasets easily — it runs in seconds even on millions of samples. Use IncrementalPCA from scikit-learn if the data doesn't fit in memory. UMAP scales well to hundreds of thousands of samples and typically finishes in under a minute. t-SNE struggles above 10K–50K samples. For large datasets, the go-to approach is PCA → UMAP: reduce to 50 components with PCA first, then apply UMAP for final reduction or visualization.
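For the out-of-memory case, IncrementalPCA accepts the data in batches via partial_fit. A sketch on the Digits data, processed in chunks of 300 rows (the chunk size is an arbitrary choice; in practice each chunk would be read from disk):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

ipca = IncrementalPCA(n_components=20)
for start in range(0, X.shape[0], 300):
    ipca.partial_fit(X[start:start + 300])  # one chunk at a time

X_reduced = ipca.transform(X)
print(f"Reduced shape: {X_reduced.shape}")
```

Each chunk must contain at least n_components samples, so pick the batch size accordingly.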
