Clustering is a technique for grouping objects so that similar objects end up in the same cluster and dissimilar objects end up in different clusters. Clustering is widely used in industry to solve a variety of problems.
This article is organized as follows:
- Introduction
- Divisive and Agglomerative clustering
- Agglomerative clustering
- K-means vs Hierarchical clustering
Introduction:
Hierarchical clustering is one of the most popular clustering algorithms. It builds a hierarchy of clusters. There are two methods of hierarchical clustering: Divisive and Agglomerative. Please refer to the image below to get a sense of how hierarchical clusters look.
Figure: Hierarchical clustering
Divisive and Agglomerative clustering:
The divisive method is a top-down approach: we start by assigning all the data points to a single cluster, split that cluster into two, and keep splitting the sub-clusters until we reach n clusters (each containing a single data point).
The agglomerative method is a bottom-up approach: initially, each of the n data points is its own cluster, and we keep merging clusters until all the data points belong to a single cluster.
Figure: Hierarchical clustering - Divisive vs Agglomerative methods
The main focus of this article is the agglomerative hierarchical clustering method. Let's discuss it in depth.
Agglomerative clustering:
As discussed before, this is a bottom-up method. If you observe the diagram, we start at the bottom, where every data point is its own cluster, and as we move up we merge points into larger clusters until we are left with a single cluster. Let's walk through the algorithm.
- Assign each data point to its own cluster (n clusters). To understand better, let's assume n=6 and that each data point is represented by an m-dimensional vector.
- Find the two most similar clusters (say p1 and p2) and merge them into one cluster. Initially we had 6 clusters; now we have 5 clusters [(p1, p2), p3, p4, p5, p6].
- Repeat the merge step until all the points belong to a single cluster. In total, (n-1) merges are needed; in this case, 5 merges produce the hierarchy shown in the diagram.
Figure: Hierarchical clustering - Different k values
Here is the step-by-step process of hierarchical clustering for the example shown in the diagram (a short code sketch follows the list):
- [p1, p2, p3, p4, p5, p6] → 6 clusters
- [(p1, p2), p3, p4, p5, p6] → 5 clusters
- [(p1, p2), p3, (p4, p5), p6] → 4 clusters
- [(p1, p2, p3), (p4, p5), p6] → 3 clusters
- [(p1, p2, p3, p4, p5), p6] → 2 clusters
- [(p1, p2, p3, p4, p5, p6)] → 1 cluster
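To make the merge sequence concrete, here is a minimal sketch using SciPy on six made-up 2-dimensional points (the coordinates are illustrative only and are not taken from the diagram):

```python
# A minimal sketch of agglomerative clustering with SciPy.
# The six points below are made up purely for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-dimensional points p1..p6 (any m-dimensional vectors would work).
points = np.array([
    [1.0, 1.0],   # p1
    [1.2, 1.1],   # p2
    [1.1, 1.5],   # p3
    [5.0, 5.0],   # p4
    [5.1, 5.2],   # p5
    [9.0, 9.0],   # p6
])

# Each row of Z records one merge: the two clusters joined, the distance
# between them, and the size of the new cluster. n - 1 = 5 merges take us
# from 6 clusters down to 1.
Z = linkage(points, method="single", metric="euclidean")
print(Z)

# Cut the hierarchy at k = 3 clusters, as in the k=3 figure.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # cluster label for each of the 6 points
```

Here `method="single"` is just one choice; the different linkage options are discussed later in this article.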
Figure: Hierarchical clustering, k=3
Challenges:
There are two challenges here, and you may already have thought of them.
- Which metric to use for calculating the similarity between two clusters?
- How do we calculate the similarity between clusters? Many cases are possible here. If each cluster has only one data point, the similarity is simply the distance between those two points. But what if one cluster has 3 data points and the other has only one? One option is to represent each cluster by a single vector (for example, its centroid); the linkage methods discussed below are other ways of defining the distance between clusters. A small sketch of the centroid idea follows this list.
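As a quick illustration of the "single representative vector" idea, here is a small sketch (with made-up points) that represents a cluster by its centroid and then measures its distance to a single-point cluster:

```python
# A tiny sketch: represent a cluster of 3 points by its centroid, then
# compare it with a cluster that has a single point. Points are made up.
import numpy as np

cluster = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 5.0]])  # 3 data points
other = np.array([10.0, 10.0])                            # single-point cluster

centroid = cluster.mean(axis=0)                  # [2.0, 3.0]
distance = np.linalg.norm(centroid - other)      # Euclidean distance ~10.63
print(centroid, distance)
```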
Which metric to use?
We can use Euclidean distance or Manhattan distance as the metric for measuring the distance between data points (and hence between clusters). The choice of metric depends on the use case.
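For example, here is a small sketch of the two metrics on a pair of made-up 3-dimensional points:

```python
# Euclidean vs Manhattan distance between two example points (made up).
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 4 + 0) ~ 3.61
manhattan = np.sum(np.abs(a - b))           # 3 + 2 + 0 = 5
print(euclidean, manhattan)
```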
How to calculate the similarity between clusters?
There are three different methods:
- Single linkage: In single-linkage hierarchical clustering, the distance between two clusters is the shortest distance between a pair of points, one from each cluster. For example, suppose the first cluster has 2 points and the second has 3 points; there are 6 pairs. Calculate the distance for all 6 pairs using the chosen metric; the smallest of these 6 distances is taken as the distance between the two clusters.
- Complete linkage: In complete-linkage hierarchical clustering, the distance between two clusters is the longest distance between a pair of points, one from each cluster. In the above example, the largest of the 6 distances is taken as the distance between the two clusters.
- Average linkage: In average-linkage hierarchical clustering, the distance between two clusters is the average distance over all pairs of points, one from each cluster. In the above example, the average of the 6 distances is taken as the distance between the two clusters (see the sketch below).
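Here is a small sketch computing all three linkage distances for the example above (2 points in the first cluster, 3 in the second; the coordinates are made up):

```python
# Single, complete and average linkage between two small clusters,
# using Euclidean distance. The coordinates are made up for illustration.
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])               # 2 points
cluster_b = np.array([[4.0, 0.0], [5.0, 0.0], [6.0, 0.0]])   # 3 points

# 2 x 3 = 6 pairwise distances between the two clusters.
pairwise = cdist(cluster_a, cluster_b)

print("single  :", pairwise.min())    # shortest of the 6 distances
print("complete:", pairwise.max())    # longest of the 6 distances
print("average :", pairwise.mean())   # average of the 6 distances
```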
Now, reread the three steps of the algorithm above to connect all the dots and understand it better.
K-means vs Hierarchical clustering:
If you don’t know the K-means clustering algorithm, I highly suggest you refer to this blog.
- If we run K-means multiple times, we can get different results, because K-means initializes the cluster centroids randomly. Hierarchical clustering, by contrast, gives the same result every time we run it on the same data.
- K-means can handle big data: its cost grows roughly linearly with the number of points n. Hierarchical clustering, on the other hand, cannot handle huge datasets because we have to compute the distances between a huge number of cluster pairs; its time and memory requirements grow at least quadratically, O(n^2), with n. A short sketch contrasting the two follows this list.
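To see the first difference in practice, here is a small sketch using scikit-learn (the toy data and parameters are assumptions for illustration): K-means depends on its random initialization, while agglomerative clustering has no randomness at all.

```python
# K-means vs agglomerative clustering: only K-means depends on a random seed.
# The toy data below is generated purely for illustration.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in (0, 3, 6)])

# Different seeds can give different K-means solutions (at the very least,
# differently numbered labels; on harder data, different local optima).
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print("k-means, seed", seed, "-> inertia:", round(km.inertia_, 3))

# Agglomerative clustering has no random initialization:
# the same input always produces the same clustering.
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print("agglomerative labels:", agg.labels_[:10])
```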
There are many other clustering algorithms. You can read about the K-means and DBSCAN clustering algorithms as a next step.
Thank you so much for reading my blog and supporting me. Stay tuned for my next article. If you want to receive email updates, don’t forget to subscribe to my blog. If you have any queries, please do comment in the comment section below. I will be more than happy to help you. Keep learning and sharing!!
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi/
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.