

Showing posts from June, 2020

Lessons Learned Using Dask - Best Practices

Dask is open source and freely available. It provides advanced parallelism for analytics, enabling performance at scale for the tools you love, and it is developed in coordination with other community projects like NumPy, Pandas and Scikit-Learn. Dask's syntax closely mirrors Pandas, so it doesn't take much time to get onboarded if you are already familiar with Pandas. But it's not as simple as it looks: you need to understand how Dask works and which functions and parameters are actually implemented.

Introduction

For example, Pandas has a sort_values function, which sorts the data frame by a column. Dask has no direct equivalent because it stores the data in multiple partitions. To sort the data frame, it would have to bring all the data into one place first, and we cannot do that: the data may not fit entirely in RAM, and the operation might fail if it doesn't. What the developers suggest instead is to set the column as an index, and this will ensure the data frame is sorted by that column.
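The index-instead-of-sort idea can be sketched in plain Pandas. The frame and column names below are made up for illustration; in real Dask, df.set_index("key") shuffles rows so that partitions are range-ordered by the key, which is why it serves as the sorting mechanism:

```python
import pandas as pd

# Hypothetical toy data; in Dask this would be a partitioned dataframe.
df = pd.DataFrame({"key": [3, 1, 2], "val": ["c", "a", "b"]})

# Pandas: sort directly by a column.
by_sort = df.sort_values("key")

# Dask-style workaround: set the column as the index. Dask's set_index
# range-orders the partitions by the key; in plain pandas we add
# sort_index() to get the equivalent ordered result.
by_index = df.set_index("key").sort_index()

print(list(by_index["val"]))  # ['a', 'b', 'c']
```

Both approaches yield the same row order; the difference is that only the second one maps onto Dask's partitioned execution model.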

Introduction to Pandas, Dask and PySpark. Which one to use?

If you are a data scientist, you have almost certainly used Pandas at least once. Pandas is a very simple, flexible and powerful library for data wrangling and modeling. In this article, we will discuss which one to use among Pandas, Dask and PySpark. The answer is "it depends on the data, the resources and the objective". Let's discuss them one by one.

Pandas: Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. If the data size is around 1-5 GB, we can use Pandas, assuming we have enough RAM (>= 16 GB). Please note that these numbers are relative and depend on the processing you are doing. One of the limitations of Pandas is that it expects the entire data set to fit in RAM. In big data scenarios, we will have data in the tens of GBs, and if we don't have enough RAM, we cannot do the processing. If we have enough RAM and, after we complete the processing, the data becomes…
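A practical first step in that "does it fit in RAM?" decision is to measure how much memory a frame actually occupies. A minimal sketch using Pandas' memory_usage (the frame and its sizes are made up for illustration):

```python
import pandas as pd

# Hypothetical small frame standing in for a real data set.
df = pd.DataFrame({"a": range(1_000), "b": ["x"] * 1_000})

# deep=True counts the real bytes held by object (string) columns,
# not just the pointer sizes, so it reflects the true footprint.
mem_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"in-memory size: {mem_mb:.2f} MB")
```

If this number, extrapolated to the full data set, approaches a meaningful fraction of available RAM, that is the signal to consider Dask or PySpark instead of plain Pandas.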