Clustering in Python: KMeans, DBSCAN, and HDBSCAN Compared

A hands-on comparison of KMeans, DBSCAN, and HDBSCAN clustering in scikit-learn 1.8. Working Python code, evaluation metrics, parameter tuning tips, and a decision framework to pick the right algorithm.

Clustering is one of those unsupervised learning techniques that sounds straightforward until you actually sit down and try to pick the right algorithm. I've been there — staring at a scatter plot, wondering whether my data wants KMeans, DBSCAN, or something fancier. The good news? Scikit-learn 1.8 now ships all three major options natively, including HDBSCAN, so you don't even need an extra package anymore.

This guide walks you through KMeans, DBSCAN, and HDBSCAN with working Python code. We'll look at how to evaluate results when you don't have ground-truth labels (which, let's be honest, is most of the time), and I'll give you a practical decision framework for choosing the right one.

Prerequisites

You'll need Python 3.10+ and a few packages. All three clustering algorithms live inside sklearn.cluster, so one install covers everything.

pip install scikit-learn==1.8.0 matplotlib numpy

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, HDBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

Preparing Synthetic Datasets

We'll use two synthetic datasets that really highlight where each algorithm shines (and where it falls apart). The first contains well-separated spherical blobs — KMeans territory. The second has crescent-shaped clusters with noise — exactly the kind of thing that trips up centroid-based methods.

# Dataset 1: Spherical blobs
X_blobs, y_blobs = make_blobs(
    n_samples=800, centers=4, cluster_std=0.6, random_state=42
)

# Dataset 2: Crescent moons with noise
X_moons, y_moons = make_moons(n_samples=500, noise=0.08, random_state=42)

# Always scale before clustering
scaler = StandardScaler()
X_blobs_scaled = scaler.fit_transform(X_blobs)
X_moons_scaled = scaler.fit_transform(X_moons)

Quick note on scaling: it's not optional. All three algorithms rely on distance calculations, so if one feature ranges from 0 to 10,000 and another from 0 to 1, the big one will completely dominate your results.
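To make that concrete, here's a tiny sketch with made-up values showing how the larger-range feature swamps the Euclidean distance:

```python
import numpy as np

# Two points: 100 apart in a 0-10,000 feature, but with a big *relative*
# difference (0.1 vs 0.9) in a 0-1 feature
a = np.array([5000.0, 0.1])
b = np.array([5100.0, 0.9])

diff = a - b
print("Squared contribution per feature:", diff ** 2)  # [10000.0, 0.64]
print(f"Euclidean distance: {np.linalg.norm(diff):.2f}")  # ~100.00 — the small feature barely registers
```

After standard scaling, both features contribute on the same order of magnitude, so the 0-1 feature's 80%-of-range difference actually counts.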

KMeans Clustering

How KMeans Works

KMeans is the classic. It partitions data into k clusters by repeating two steps: assign each point to the nearest centroid, then recalculate the centroid as the mean of its assigned points. Rinse and repeat until things stabilize.

It's fast, it's intuitive, and for the right kind of data, it works beautifully.
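If you want to see those two steps spelled out, here's a minimal from-scratch sketch (kmeans_sketch is our own illustrative function, not part of scikit-learn — in practice you'd use sklearn.cluster.KMeans):

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_sketch(X, k, n_iters=100, seed=42):
    """Bare-bones KMeans: assign points to nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged — stop early
            break
        centroids = new_centroids
    return labels, centroids

X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=3)
print("Cluster sizes:", np.bincount(labels_demo))
```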

Running KMeans in Scikit-Learn

kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels_km = kmeans.fit_predict(X_blobs_scaled)

print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X_blobs_scaled, labels_km):.3f}")

The init="k-means++" strategy picks initial centroids that are spread apart, which speeds up convergence and helps avoid those frustrating poor local minima. Setting n_init=10 runs the algorithm ten times with different starting points and keeps the best result — a small cost for much better reliability.
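You can see the effect by comparing a single random initialization against the best of ten k-means++ runs. On easy data like our blobs the gap is often small, but on harder data it can be dramatic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=800, centers=4, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

single = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)
best_of_10 = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

# The best-of-10 result typically matches or beats a single run's inertia
print(f"random init, 1 run:    inertia = {single.inertia_:.2f}")
print(f"k-means++, best of 10: inertia = {best_of_10.inertia_:.2f}")
```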

Finding the Optimal Number of Clusters

Here's the catch with KMeans: you have to tell it how many clusters you want. Two popular methods can help you figure that out.

The Elbow Method

Plot the within-cluster sum of squares (inertia) for a range of k values. The "elbow" — where adding more clusters stops helping much — is your sweet spot.

inertias = []
K = range(2, 10)
for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_blobs_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K, inertias, "bo-")
plt.xlabel("Number of Clusters k")
plt.ylabel("Inertia (WCSS)")
plt.title("Elbow Method")
plt.tight_layout()
plt.show()

Silhouette Analysis

The silhouette score measures how similar each point is to its own cluster versus the nearest neighboring cluster. Scores range from -1 (definitely in the wrong cluster) to +1 (perfect fit). Just pick the k with the highest mean silhouette score.

sil_scores = []
for k in K:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_blobs_scaled)
    sil_scores.append(silhouette_score(X_blobs_scaled, labels))

plt.figure(figsize=(8, 4))
plt.plot(K, sil_scores, "go-")
plt.xlabel("Number of Clusters k")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Analysis")
plt.tight_layout()
plt.show()

When KMeans Falls Short

KMeans assumes your clusters are spherical and roughly the same size. When they're not — non-convex shapes, varying densities, noisy outliers — it forces every single point into a cluster and often gets it wrong. Try it on the crescent moons and you'll see what I mean.

labels_km_moons = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons_scaled)

plt.figure(figsize=(8, 4))
plt.scatter(X_moons_scaled[:, 0], X_moons_scaled[:, 1], c=labels_km_moons, cmap="viridis", s=20)
plt.title("KMeans on Crescent Moons — Fails to Capture Shape")
plt.tight_layout()
plt.show()

DBSCAN Clustering

How DBSCAN Works

DBSCAN takes a completely different approach. Instead of centroids, it thinks in terms of density — clusters are dense regions separated by sparse ones. Every point gets classified as one of three things:

  • Core points — have at least min_samples neighbors within radius eps. These are the backbone of each cluster.
  • Border points — within eps of a core point but don't have enough neighbors to be core points themselves.
  • Noise points — don't belong to any cluster. They get labeled -1.

The best part? You don't need to specify how many clusters you want. DBSCAN figures that out on its own.
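You can inspect all three categories directly on a fitted estimator: core points are listed in core_sample_indices_, noise points carry the label -1, and border points are whatever is left:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)

is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True   # indices of core points

n_core = int(is_core.sum())
n_noise = int((db.labels_ == -1).sum())
n_border = len(X) - n_core - n_noise      # clustered, but not core
print(f"Core: {n_core}, Border: {n_border}, Noise: {n_noise}")
```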

Choosing eps with a k-Distance Plot

Getting eps right is honestly the trickiest part of DBSCAN. The most reliable approach: compute the distance to the kth nearest neighbor for every point (where k = min_samples), sort those distances, and look for a sharp bend in the curve.

from sklearn.neighbors import NearestNeighbors

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples)
nn.fit(X_moons_scaled)
distances, _ = nn.kneighbors(X_moons_scaled)

# Distance to the k-th nearest neighbor. The query set includes each point
# itself (distance 0), which matches scikit-learn's convention that
# min_samples counts the point itself.
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel("Points (sorted)")
plt.ylabel(f"Distance to {min_samples}th Nearest Neighbor")
plt.title("k-Distance Plot — Look for the Elbow")
plt.tight_layout()
plt.show()

Running DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=5)
labels_db = dbscan.fit_predict(X_moons_scaled)

n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise = list(labels_db).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

plt.figure(figsize=(8, 4))
plt.scatter(X_moons_scaled[:, 0], X_moons_scaled[:, 1], c=labels_db, cmap="viridis", s=20)
plt.title("DBSCAN on Crescent Moons — Captures Arbitrary Shapes")
plt.tight_layout()
plt.show()

And just like that, DBSCAN nails the two crescent clusters and flags stray points as noise. KMeans can't do that.

DBSCAN Limitations

There's a catch, though. DBSCAN uses one global eps for everything. If your clusters have very different densities — one tight, one sparse — it struggles. Tight clusters might merge, or sparse ones get treated entirely as noise.

It also doesn't love high-dimensional data (curse of dimensionality and all that). If you've got more than about 20 features, run PCA or UMAP first to bring the dimensions down.

HDBSCAN: The Best of Both Worlds

How HDBSCAN Works

So what if you could have DBSCAN's ability to find arbitrary shapes without the headache of picking the right eps? That's essentially what HDBSCAN gives you.

Instead of using a single density threshold, HDBSCAN scans across all possible density levels, builds a hierarchy of clusters, and then uses a stability metric to extract the most persistent ones. The result is an algorithm that adapts to varying densities automatically.

Since scikit-learn 1.3, HDBSCAN lives right inside sklearn.cluster.HDBSCAN — no external package required. Honestly, this was a game-changer for simplifying dependencies in production pipelines.

Key Parameters

  • min_cluster_size (default 5) — the smallest group that counts as a cluster. Set this based on what makes sense for your domain.
  • min_samples (defaults to min_cluster_size) — controls how conservative the clustering is. Higher values = fewer, denser clusters and more noise points.
  • cluster_selection_method — "eom" (Excess of Mass, the default) finds the most persistent clusters. "leaf" gives you the most fine-grained, homogeneous clusters.

Running HDBSCAN in Scikit-Learn 1.8

hdbscan = HDBSCAN(min_cluster_size=15, min_samples=5, cluster_selection_method="eom")
labels_hdb = hdbscan.fit_predict(X_moons_scaled)

n_clusters = len(set(labels_hdb)) - (1 if -1 in labels_hdb else 0)
n_noise = list(labels_hdb).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

# HDBSCAN provides membership probabilities
print(f"Mean membership probability: {hdbscan.probabilities_.mean():.3f}")

plt.figure(figsize=(8, 4))
plt.scatter(
    X_moons_scaled[:, 0], X_moons_scaled[:, 1],
    c=labels_hdb, cmap="viridis", s=20, alpha=0.7
)
plt.title("HDBSCAN on Crescent Moons — No eps Needed")
plt.tight_layout()
plt.show()

Soft Clustering with Membership Probabilities

Here's something I really like about HDBSCAN: the probabilities_ attribute. Each point gets a score from 0.0 (basically noise) to 1.0 (solidly inside a cluster). This is incredibly useful when you need confidence levels rather than just hard "you're in cluster 3" assignments — think customer segmentation where borderline cases actually matter.

# Visualize membership strength
plt.figure(figsize=(8, 4))
scatter = plt.scatter(
    X_moons_scaled[:, 0], X_moons_scaled[:, 1],
    c=hdbscan.probabilities_, cmap="YlOrRd", s=20
)
plt.colorbar(scatter, label="Membership Probability")
plt.title("HDBSCAN Membership Probabilities")
plt.tight_layout()
plt.show()

Tuning min_cluster_size

The main knob you'll turn with HDBSCAN is min_cluster_size. Sweeping over a range and checking evaluation metrics is the easiest way to find a good value.

sizes = range(5, 50, 5)
sil_scores = []
n_clusters_list = []

for size in sizes:
    hdb = HDBSCAN(min_cluster_size=size, min_samples=5)
    labels = hdb.fit_predict(X_blobs_scaled)
    mask = labels != -1  # Exclude noise for metrics
    if len(set(labels[mask])) >= 2:
        sil_scores.append(silhouette_score(X_blobs_scaled[mask], labels[mask]))
    else:
        sil_scores.append(0)
    n_clusters_list.append(len(set(labels)) - (1 if -1 in labels else 0))

fig, ax1 = plt.subplots(figsize=(8, 4))
ax1.plot(list(sizes), sil_scores, "go-", label="Silhouette")
ax1.set_xlabel("min_cluster_size")
ax1.set_ylabel("Silhouette Score", color="g")
ax2 = ax1.twinx()
ax2.plot(list(sizes), n_clusters_list, "bs--", label="Clusters")
ax2.set_ylabel("Number of Clusters", color="b")
plt.title("Tuning HDBSCAN min_cluster_size")
fig.tight_layout()
plt.show()

Evaluating Clusters Without Ground Truth

In the real world, you almost never have labels telling you which cluster each point "should" belong to. That's fine — scikit-learn has three solid internal metrics that work without labeled data.

Three Key Metrics

  • Silhouette Score — measures cohesion vs. separation. Range: -1 to +1. Higher is better.
  • Calinski-Harabasz Index — ratio of between-cluster dispersion to within-cluster dispersion. Higher is better, but it's biased toward convex (spherical) clusters.
  • Davies-Bouldin Index — average similarity of each cluster with its most similar neighbor. Lower is better.

def evaluate_clustering(X, labels, algorithm_name):
    """Compute and print internal clustering metrics, excluding noise."""
    mask = labels != -1
    X_clean = X[mask]
    labels_clean = labels[mask]
    n_clusters = len(set(labels_clean))

    if n_clusters < 2:
        print(f"{algorithm_name}: fewer than 2 clusters — metrics undefined")
        return

    sil = silhouette_score(X_clean, labels_clean)
    ch = calinski_harabasz_score(X_clean, labels_clean)
    db = davies_bouldin_score(X_clean, labels_clean)
    noise_pct = (labels == -1).sum() / len(labels) * 100

    print(f"\n{algorithm_name}")
    print(f"  Clusters:             {n_clusters}")
    print(f"  Noise points:         {noise_pct:.1f}%")
    print(f"  Silhouette Score:     {sil:.3f}")
    print(f"  Calinski-Harabasz:    {ch:.1f}")
    print(f"  Davies-Bouldin:       {db:.3f}")

# Evaluate all three on the blobs dataset
evaluate_clustering(X_blobs_scaled, labels_km, "KMeans")
evaluate_clustering(
    X_blobs_scaled,
    DBSCAN(eps=0.5, min_samples=5).fit_predict(X_blobs_scaled),
    "DBSCAN"
)
evaluate_clustering(
    X_blobs_scaled,
    HDBSCAN(min_cluster_size=15).fit_predict(X_blobs_scaled),
    "HDBSCAN"
)

One thing to keep in mind: Calinski-Harabasz naturally favors convex clusters, so KMeans might score higher on that metric even when a density-based method is actually the better fit for your data's shape. Don't rely on a single number.

Head-to-Head Comparison

Here's a quick-reference table to help you decide at a glance.

Feature                              KMeans                DBSCAN              HDBSCAN
Must specify number of clusters      Yes                   No                  No
Handles non-spherical shapes         No                    Yes                 Yes
Handles varying densities            No                    No                  Yes
Labels noise / outliers              No                    Yes                 Yes
Provides membership probabilities    No                    No                  Yes
Key parameter(s)                     n_clusters            eps, min_samples    min_cluster_size
Scalability                          Excellent (large n)   Good                Good
Native in scikit-learn 1.8           Yes                   Yes                 Yes

Decision Framework: Which Algorithm Should You Use?

When you're staring at a new dataset and aren't sure where to start, these rules of thumb have served me well:

  1. Data is spherical and well-separated, and you know (or can estimate) the cluster count → Start with KMeans. It's the fastest option and the easiest to explain to non-technical stakeholders.
  2. Data has arbitrary shapes and you need outlier detection → Use DBSCAN if clusters share similar density. Pick eps with a k-distance plot.
  3. Clusters have varying densities, or you want minimal tuning → Go with HDBSCAN. It drops the eps parameter entirely and adapts to density differences on its own.
  4. High-dimensional data (>20 features) → Reduce dimensions first with PCA or UMAP, then apply DBSCAN or HDBSCAN.
  5. Not sure at all? → Just start with HDBSCAN. It works well out of the box, handles noise gracefully, and only needs one intuitive parameter (min_cluster_size). You can always compare against KMeans afterward.
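Those rules of thumb are simple enough to encode directly. Here's a hypothetical helper (pick_clustering_algorithm is our own name, not an sklearn API) that makes the dispatch explicit:

```python
def pick_clustering_algorithm(knows_k: bool, arbitrary_shapes: bool,
                              varying_density: bool, n_features: int) -> str:
    """Encode the decision framework above as a simple dispatch."""
    if n_features > 20:
        return "Reduce dimensions first (PCA/UMAP), then DBSCAN or HDBSCAN"
    if varying_density:
        return "HDBSCAN"
    if arbitrary_shapes:
        return "DBSCAN"      # assumes similar densities; tune eps via k-distance plot
    if knows_k:
        return "KMeans"
    return "HDBSCAN"         # safe default when unsure

print(pick_clustering_algorithm(knows_k=True, arbitrary_shapes=False,
                                varying_density=False, n_features=2))   # → KMeans
```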

Real-World Tips

  • Always scale your features. Use StandardScaler or MinMaxScaler before clustering. I've seen people debug for hours only to realize they forgot this step.
  • Exclude noise when computing metrics. Silhouette, Calinski-Harabasz, and Davies-Bouldin scores should only include clustered points when using DBSCAN or HDBSCAN. Including noise points will tank your scores unfairly.
  • Visualize with UMAP. For high-dimensional data, project to 2D with UMAP before plotting. If the clusters don't look reasonable visually, the numbers probably aren't telling the full story either.
  • Combine algorithms. Run KMeans for a quick baseline, then try HDBSCAN to see if density-based clustering reveals structure that KMeans missed. Comparing both gives you more confidence in your results.
  • Watch for the min_samples gotcha. In scikit-learn's HDBSCAN, min_samples includes the point itself. If you're migrating from the external hdbscan package, add 1 to your previous value to get equivalent behavior.

Frequently Asked Questions

What is the difference between DBSCAN and HDBSCAN?

DBSCAN uses a single global density threshold (eps) to define clusters, meaning all clusters need to share roughly the same density. HDBSCAN removes that constraint by scanning across all possible density levels and extracting the most stable clusters. In practice, HDBSCAN requires less parameter fiddling and handles datasets with mixed-density clusters much better.

How do I choose the right number of clusters for KMeans?

The two go-to methods are the Elbow Method (plot inertia vs. k, look for the bend) and Silhouette Analysis (pick the k with the highest score). I'd recommend using both together — the elbow narrows your candidates, and the silhouette score helps you pick the winner.
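A compact way to combine them is a single sweep that records both metrics and lets silhouette break the tie (sketched here on blob data like the earlier examples):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=800, centers=4, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    results[k] = {"inertia": km.inertia_, "silhouette": silhouette_score(X, labels)}

# Inertia narrows the candidates; silhouette picks the winner
best_k = max(results, key=lambda k: results[k]["silhouette"])
print(f"Best k by silhouette: {best_k}")
```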

Can I use HDBSCAN for large datasets?

Absolutely. Scikit-learn's implementation is well-optimized and handles hundreds of thousands of points without breaking a sweat. For truly massive datasets (millions of rows), reduce dimensions with PCA first. Scikit-learn's HDBSCAN also accepts an algorithm parameter ("auto" by default, which picks a tree-based approach when the metric allows); note that the external hdbscan package's "boruvka_kdtree" value isn't valid here.

Why does DBSCAN label all my points as noise?

Nine times out of ten, this means your eps is too small. Plot a k-distance graph and pick the value at the elbow. And double-check that your features are scaled — unscaled data with wildly different ranges is the most common culprit.

Should I scale my data before clustering?

Yes, almost always. KMeans, DBSCAN, and HDBSCAN all rely on distance calculations. If one feature spans 0-1000 while another spans 0-1, the big one dominates everything. A quick StandardScaler pass gives every feature equal weight and usually improves results noticeably.
