DBSCAN is one of the most popular clustering algorithms after K-means. It is very good at finding clusters of arbitrary shape, whereas K-means works well only for roughly spherical clusters, which is one of its main disadvantages. In this article, we will see how DBSCAN works.
DBSCAN is a density-based, non-parametric clustering algorithm. As the name suggests, it clusters the data based on density, i.e., neighbouring points form a cluster. The algorithm is also good at detecting outliers, or noise. The main advantage of DBSCAN is that we need not choose the number of clusters in advance.
Before going into the depth of the algorithm, we should be familiar with a few terms that will be used to explain the algorithm.
In K-means, k is a parameter that must be chosen by the user. Similarly, DBSCAN has two parameters, ϵ and MinPts: ϵ is the radius around a point, and MinPts is the minimum number of points required within that ϵ radius.
In DBSCAN, each point can be a core point, a border point or an outlier. The algorithm will classify each point into one of these classes.
- A point is a core (dense) point if there are at least MinPts points within its ϵ radius. Euclidean distance is used as the metric.
- A point is a border point if it is not a core point but lies within the ϵ radius of at least one core point.
- A point is an outlier if it is neither a core point nor a border point.
Let’s understand the above definitions better. Consider a point in two-dimensional space, draw a circle with radius ϵ around that point and count the number of points inside the circle.
- If the count is at least MinPts, the point is a core point.
- If it’s not a core point, check whether it lies within the ϵ radius of some core point. If yes, it is a border point.
- Otherwise, it is an outlier/noise.
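The classification above can be sketched in a few lines of NumPy. This is a brute-force illustration for small datasets only; `classify_points` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point in X as 'core', 'border' or 'noise'.

    X is an (n, d) array; eps is the radius; min_pts the density threshold.
    """
    # Pairwise Euclidean distances (O(n^2) memory: fine for a sketch).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # A point's neighbourhood includes the point itself.
    neighbours = dists <= eps
    core = neighbours.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbours[i][core].any():   # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# A tight group of three points plus one far-away point.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0]])
print(classify_points(X, eps=0.5, min_pts=3))
```

With ϵ = 0.5 and MinPts = 3, each of the three close points has three neighbours (itself included) and is a core point, while the far point is noise.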
We have now classified each data point as a core point, border point or outlier, based on the parameters (ϵ and MinPts) we chose. Next, we will learn a few more terms:
- Direct density reachable
- Density reachable
- Density connected
Consider two points p and q.
- Direct density reachable: a point q is directly density reachable from a point p if p is a core point and q is inside the ϵ radius of p.
- Density reachable: a point q is density reachable from a point p if there is a chain of directly density reachable points from p to q. For example, take three points p1, p2 and p3: if p2 is directly density reachable from p1 and p3 is directly density reachable from p2, then p3 is density reachable from p1. The chain can be extended until we reach the border of the cluster, i.e., a border point.
- Density connected: two points p and q are density connected if there is a point r from which both p and q are density reachable. In this case, p and q themselves might be border points.
Check the below image to understand better.
Now we have learnt everything required to understand how the algorithm works. Let’s go through it step by step:
- The algorithm picks a random unvisited point p from the dataset. If p is not a core point, it is skipped. If p is a core point, all the points inside the ϵ radius of p form a cluster C. [All points that are directly density reachable from p]
- If q is a point in cluster C and q is a core point, then all the points inside the ϵ radius of q are also merged into C. The cluster is expanded this way as far as possible. [All points that are density reachable from p]
- Sometimes, a border point r lies within the ϵ radius of core points from two different clusters C1 and C2. In that case, both clusters C1 and C2 are merged into one cluster.
- Repeat the above steps until every point in the dataset has been visited. The points that do not belong to any cluster are outliers/noise points.
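The steps above can be sketched as a minimal, brute-force implementation. This is an illustrative sketch, not an optimised one; `dbscan` is a hypothetical helper name, and real libraries use spatial indexes instead of a full distance matrix.

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Return one cluster label per point in X; -1 marks noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbours]
    labels = np.full(n, -1)              # -1 = noise / not yet visited
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue                     # skip visited points and non-core seeds
        # Grow a new cluster from core point i: a breadth-first walk
        # over everything density reachable from i.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            if not core[p]:
                continue                 # border points do not expand the cluster
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    return labels

# Two tight blobs plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [10.0, 10.0]])
print(dbscan(X, eps=0.5, min_pts=3))
```

The two blobs come out as clusters 0 and 1, and the isolated point is labelled -1 (noise).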
In the below image, you can see the output of the DBSCAN algorithm.
If you analyze the above image, you can clearly see that DBSCAN clusters the data points in the way we expect, which K-means cannot do. It detects the outliers too.
Every algorithm has advantages and disadvantages. Let’s discuss them now.
Advantages:
- It can cluster arbitrary shaped clusters.
- It is very robust to outliers/noise points.
- We need not specify the number of clusters.
Disadvantages:
- We must choose ϵ and MinPts carefully. A thorough understanding of the data helps, but it usually takes a few iterations to arrive at a good set of parameters.
- If the density is not the same across the data, DBSCAN struggles: a single choice of ϵ and MinPts may produce bad clusters.
You can use the scikit-learn library to easily import and use DBSCAN.
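For example, scikit-learn’s `DBSCAN` can cluster two interleaving half-moons, a shape K-means cannot separate. The parameter values below are illustrative, not a recommendation.

```python
# Requires scikit-learn: pip install scikit-learn
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons with a little noise.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps plays the role of ϵ, min_samples the role of MinPts.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 (if present) marks noise
```

After fitting, `db.labels_` holds one cluster id per point, so you can count clusters or filter out the noise points directly.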
I hope you enjoyed reading this and learnt something new. If you have any queries, comment in the comments section below. I would be more than happy to answer your queries.
Thank you for reading my blog and supporting me. Stay tuned for my next article. If you want to receive email updates, don’t forget to subscribe to my blog. Keep learning and sharing!!
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi/
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.