Dask is open source and freely available. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. It is developed in coordination with other community projects like Numpy, Pandas and Scikit-Learn.
The syntax of Dask is the same as Pandas. It doesn’t take much time for you to get onboarded if you are familiar with Pandas. But, it’s not as simple as you think. You need to understand how Dask works and what are the functions or parameters that were implemented.
Introduction |
For example, you have sort_values function in Pandas, which helps you to sort the data frame based on a column. In Dask, you don’t have a similar function because Dask stores the data in multiple partitions. To sort the data frame, it has to bring the whole data in-place and then sort. We cannot do this, because the data may not fit entirely in RAM and might fail if it doesn’t. What developers suggest here is to set the column as an index and this will ensure that the data frame is sorted based on that column.
I will give you one more example. You must have used the drop_duplicates function in Pandas. There is a parameter keep through which you can specify whether to keep the first occurrence(“first”) or the last occurrence (“last”). In Dask, you can pass the parameters, but it will not work. It will randomly select a row. If you don’t QC the results, you are at loss. If you are new to Dask, I suggest referring to the documentation of the function you are using until you get familiar with what works and what doesn’t.
It’s difficult to implement some functions(like the above) by managing memory efficiently across nodes and producing accurate results. We have to agree on losing one of them. I hope you got the gist of what I am trying to communicate here. Let’s get started.
Here are some of the best practices you can follow while using Dask:
- If you know that the processed data would fit in RAM, convert the Dask dataframe into Pandas data frame. Using Dask operations on small data unnecessarily increases overhead time.
- Don’t load the entire data frame into one partition. Make use of the blocksize parameter in the dask read_csv function. You can also store the data in multiple CSV or parquet files which helps the dask to load the data in multiple partitions.
- Use dask functions to load data. When I was new to Dask, I loaded the large data through Pandas and then converted to Dask, which is inefficient. If you load directly using Dask, it will maintain those chunks and partitions. Otherwise, it has to move large objects with its metadata.
- Unlike Pandas, Dask does lazy execution. It means that it doesn’t execute the function or graph until we explicitly call compute on dask objects. This will help Dask intelligently manage memory while parallelizing the tasks. At the same time, lazy execution increases the complexity of debugging. Please note that the compute method will create a single Pandas or Numpy result. So, you should carefully call the compute method. Make sure that the computed result will surely fit in the memory.
- Be careful while choosing the number of partitions or chunksize. The chunksize shouldn’t be too small or very large when compared to the memory you have. If you have a machine with 100 GB and 10 cores, then you might want to choose chunks in the 1GB range. You have space for ten chunks per core which gives Dask a healthy margin, without having tasks that are too small.
- [This was very useful] Use the persist function instead of compute if you are working on a cluster with distributed memory. When you call compute, it returns a single Pandas or Numpy object which expects to fit in the memory. When you persist, it triggers the computation and returns the Dask object which uses distributed memory across nodes.
- When you want to parallelize for loop, you can make use of Dask delayed. This will delay the operations inside the loop and execute all of them parallelly. You can make use of delayed in many other ways. Please check this.
- Last but not the least, Use the Dask dashboard. This was helpful for me. It provides information on memory usage, parallel processing of nodes, and task graph execution. Sometimes Dask keeps hanging in the middle. Because of lazy execution, we cannot predict what the issue is. In this case, the dask dashboard helps in debugging the issue. I highly suggest taking advantage of this.
These are some of the lessons learned while using Dask. If you are starting, please go through the documentation multiple times. They also have a YouTube channel to get started. Please find the below resources that you can read further.
Next steps:
Best Practices: https://docs.dask.org/en/latest/best-practices.html
Dataframe best practices
Array best practices
Delayed best practices
Dask - YouTube
Hope you got a clear idea after reading this article. If you have any queries, comment in the comments section below. I would be more than happy to answer your queries.
Thank you so much for reading my blog and supporting me. Stay tuned for my next article. If you want to receive email updates, don’t forget to subscribe to my blog.
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.
Really good information, this information is excellent and essential for everyone. I am very very thankful to you for providing this kind of information.
ReplyDeletehttps://aditidigitalsolutions.com/data-science-training-hyderabad/
This comment has been removed by the author.
ReplyDeletehttps://www.inetsoft.com is a web-based reporting and analytics tool. It is one of the best reporting software that helps you to perform data aggregation and create user-friendly detailed reports.
ReplyDelete