

Building ML Pipelines Using Scikit-Learn and Hyperparameter Tuning

Data scientists often build machine learning pipelines that involve preprocessing (imputing null values, feature transformation, creating new features), modeling, and hyperparameter tuning. Many transformations need to be applied before modeling, in a particular order. Scikit-learn provides the Pipeline class to perform those transformations in one go. Pipeline serves multiple purposes here (from the documentation):

- Convenience and encapsulation: you only have to call fit and predict once on your data to fit a whole sequence of estimators.
- Joint parameter selection: you can grid search over the parameters of all estimators in the pipeline at once (hyperparameter tuning/optimization).
- Safety: pipelines help avoid leaking statistics from your test data into the trained model during cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

In this article, I will show you how to build a complete pipeline.
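The purposes listed above can be sketched in a few lines; this is a minimal illustration, not the full pipeline from the article, and the steps, grid, and synthetic data below are my own choices for the example.

```python
# A minimal sketch of the Pipeline + joint parameter selection idea.
# The estimators and the parameter grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Preprocessing and modeling chained into one estimator:
# one fit/predict call runs the whole sequence.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Joint parameter selection: parameters of any step are addressed
# as <step_name>__<param_name> and searched in one grid.
search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the imputer and scaler are inside the pipeline, each cross-validation fold fits them on its own training split only, which is exactly the leakage-safety property described above.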
Recent posts

A year of experience as a Data Scientist

On June 3rd, 2019, I joined ZS Associates as a Data Scientist after graduating from IIIT SriCity. It was my first job, and I was very happy to get placed as a Data Scientist through lateral hiring. If you haven’t read my Data Science journey, please read it here :) After joining, I had some awesome moments that I had never experienced before. I got a chance to stay in 4-star and 5-star hotels multiple times. I got a chance to travel by flight. I travelled to Pune, Delhi and Bangalore, and saw the Vizag, Pune, Delhi and Bangalore airports in less than six months. I loved it. There were a few office parties, and outings during the Diwali and New Year celebrations. These are some of the moments I can never forget. My first job allowed me to experience these firsts. Enjoying life is more important than anything; if you don’t enjoy your life, you cannot achieve anything big. Okay, let’s go into the main topic in detail. Me (inner voice during BTech):

Lessons Learned Using Dask - Best Practices

Dask is open source and freely available. It provides advanced parallelism for analytics, enabling performance at scale for the tools you love, and is developed in coordination with other community projects like NumPy, Pandas and Scikit-Learn. Dask’s syntax is the same as Pandas’, so it doesn’t take much time to get onboarded if you are familiar with Pandas. But it’s not as simple as you might think: you need to understand how Dask works and which functions and parameters are actually implemented. Introduction: For example, Pandas has a sort_values function that sorts a data frame by a column. Dask has no equivalent function, because it stores the data in multiple partitions; to sort the data frame, it would have to bring all the data into one place and then sort it. That is not feasible, because the data may not fit entirely in RAM, and the operation might fail if it doesn’t. What the developers suggest instead is to set the column as an index, and this will en

Introduction to Pandas, Dask and PySpark. Which one to use?

If you are a data scientist, you must have used Pandas at least once. Pandas is a very simple, flexible and powerful library for data wrangling and modeling. In this article, we will discuss which one to use among Pandas, Dask and PySpark. The answer is “it depends on the data, the resources and the objective”. Let’s discuss them one by one. Pandas: Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. If the data size is around 1–5 GB, we can use Pandas, assuming we have enough RAM (>=16 GB). Please note that these numbers are relative and depend on the processing you are doing. One of the limitations of Pandas is that it expects the entire data set to fit in RAM. In big data scenarios, we will have data in GBs; if we don’t have enough RAM, we cannot do the processing. If we have enough RAM and after we complete the processing, the data becomes
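The "does it fit in RAM" consideration above can be checked concretely; the 1–5 GB guidance is from the article, while the snippet below is my own sketch using Pandas' built-in memory accounting.

```python
# A sketch of measuring a data frame's in-memory footprint, to compare
# against available RAM before committing to a Pandas-only workflow.
import pandas as pd

# Synthetic frame standing in for a real data set.
df = pd.DataFrame({"a": range(1_000_000), "b": [0.5] * 1_000_000})

# memory_usage(deep=True) reports actual bytes, including the payload
# of object (string) columns, not just pointer sizes.
size_gb = df.memory_usage(deep=True).sum() / 1024**3
print(f"in-memory size: {size_gb:.3f} GB")
```

Note that intermediate results during joins, group-bys and sorts can transiently need several times this footprint, which is why the RAM headroom in the guidance above matters.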