CPSC 330 Lecture 15: DBSCAN, Hierarchical Clustering

iClicker Exercise 15.1

Select all of the following statements which are TRUE.

    1. Similar to K-nearest neighbours, K-Means is a non-parametric model.
    2. The meaning of K in K-nearest neighbours and K-Means clustering is very similar.
    3. Scaling of input features is crucial in clustering.
    4. In clustering, it’s almost always a good idea to find equal-sized clusters.

Limitations of K-means

Shape of clusters

  • Good for spherical clusters of more or less equal sizes

K-Means: failure case 1

  • K-Means performs poorly if the clusters have more complex shapes, such as the two-moons data (sketched below).
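A minimal sketch of this failure (assumed setup; plotting omitted): with K=2 on the two-moons data, K-Means draws a roughly straight boundary that cuts across both moons instead of separating them.

# K-Means on two moons: the straight-line boundary splits each moon in half.
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.08, random_state=42)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # each cluster mixes points from both moons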

K-Means: failure case 2

  • Again, K-Means is unable to capture complex cluster shapes.

K-Means: failure case 3

  • It assumes that all directions are equally important within each cluster and fails to identify elongated, non-spherical clusters (sketched below).
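A hedged sketch of this case (the linear transform is an assumption for illustration): stretching blob data produces elongated clusters, which K-Means splits across its round cluster boundaries.

# Anisotropic (stretched) blobs: K-Means assumes round clusters and fails here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])  # stretch the blobs into ellipses
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)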

Can we do better?

DBSCAN

  • Density-Based Spatial Clustering of Applications with Noise
  • A density-based clustering algorithm

DBSCAN

# Fit DBSCAN on the two-moons data; eps sets the neighbourhood radius.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.08, random_state=42)
dbscan = DBSCAN(eps=0.2)
dbscan.fit(X)
plot_original_clustered(X, dbscan, dbscan.labels_)  # course-provided plotting helper
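Note that we never told DBSCAN how many clusters to find. A quick way to inspect the result: points DBSCAN considers noise get the label -1 (the output in the comment is what we would expect on this data, not a guarantee).

import numpy as np
print(np.unique(dbscan.labels_))  # e.g., [-1  0  1]: two clusters plus noise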

How does it work?

DBSCAN Analogy

Consider DBSCAN in a social context (mapped onto code in the sketch after this list):

  • Social butterflies (🦋): Core points
  • Friends of social butterflies who are not social butterflies: Border points
  • Lone wolves (🐺): Noise points
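In scikit-learn terms, this analogy maps onto DBSCAN's attributes. A sketch reusing the dbscan object fitted above: core points are listed in core_sample_indices_, noise points get label -1, and everything else is a border point.

import numpy as np

core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True   # social butterflies: core points
noise_mask = dbscan.labels_ == -1               # lone wolves: noise points
border_mask = ~core_mask & ~noise_mask          # friends of butterflies: border points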

Two main hyperparameters

  • eps: determines what it means for two points to be “close” (the neighbourhood radius)
  • min_samples: the minimum number of points required within a point’s eps-neighbourhood for that point to count as a core point (both are varied in the sketch below)
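A rough sketch of how the two hyperparameters interact (the specific values are assumptions): larger eps tends to merge clusters together, while larger min_samples labels more points as noise.

# Vary eps and min_samples on the moons data and count the resulting clusters.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.08, random_state=42)
for eps in [0.1, 0.2, 0.5]:
    for min_samples in [3, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude noise
        print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters")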

DBSCAN: failure cases

  • Let’s consider a dataset with three clusters of varying densities (sketched below).
  • Here K-Means actually performs better than DBSCAN, but only because it is given the number of clusters K in advance.
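A sketch of this failure case (the cluster_std values are assumptions for illustration): with blobs of very different densities, no single eps works for all three clusters at once.

# Three blobs with very different spreads: one global eps under- or over-segments.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, cluster_std=[1.0, 2.5, 0.5], random_state=170)
labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)  # suits one density only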

Break

Let’s take a break!

Dendrogram

Definition: a dendrogram is a visual representation of a tree; in hierarchical clustering, it shows the order in which data points are merged into clusters and the distances at which those merges happen.
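A minimal plotting sketch (the blob data is an assumption) using scipy's hierarchy tools:

# Build a merge history with Ward linkage and draw the corresponding dendrogram.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, centers=3, random_state=42)
Z = linkage(X, method="ward")  # each row of Z records one merge
dendrogram(Z)
plt.show()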


Example: Languages


Hierarchical clustering

Flat clusters

  • This is good, but how can we get cluster labels from a dendrogram?
  • We can bring the clustering to a “flat” format using fcluster, as in the sketch below
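A minimal sketch (assumed data): build the merge tree with linkage, then cut it into a fixed number of flat clusters with fcluster.

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
Z = linkage(X, method="ward")                    # the dendrogram's underlying data
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into (at most) 3 flat clusters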

Linkage criteria

  • When we create a dendrogram, we need to calculate the distance between clusters. How do we measure distances between clusters?
  • The linkage criterion determines how similarity between clusters is computed.
  • Some example linkage criteria (compared in the sketch after this list):
    • Single linkage: smallest minimum distance between points in the two clusters; leads to loose, chain-like clusters
    • Complete linkage: smallest maximum distance between points in the two clusters; leads to tight clusters
    • Average linkage: smallest average distance between all pairs of points in the two clusters
    • Ward linkage: smallest increase in within-cluster variance; tends to produce relatively equally sized clusters
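A sketch comparing the four criteria on the same assumed data; the flat clusters can come out quite differently depending on the linkage.

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels[:10])  # first few labels under each criterion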

Activity

  • Fill in the table below
| Clustering Method | KMeans | DBSCAN | Hierarchical Clustering |
|-------------------|--------|--------|-------------------------|
| Approach          |        |        |                         |
| Hyperparameters   |        |        |                         |
| Shape of clusters |        |        |                         |
| Handling noise    |        |        |                         |
| Examples          |        |        |                         |

Break

Let’s take a break!

Group Work: Class Demo & Live Coding

For this demo, each student should click this link to create a new repo in their accounts, then clone that repo locally to follow along with the demo from today.

All credit to Dr. Varada Kolhatkar for putting this together!