If you are a data scientist, you have almost certainly used Pandas at least once. Pandas is a simple, flexible and powerful library for data wrangling and modeling. In this article, we will discuss when to choose Pandas, Dask or PySpark.
The short answer is "It depends on the data, the resources and the objective". Let's discuss each one.
Pandas:
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
If the data size is around 1-5 GB, we can use Pandas, assuming we have enough RAM (>= 16 GB). Please note that these numbers are rough and depend on the kind of processing you are doing.
One of the limitations of Pandas is that it expects the entire dataset to fit in RAM. In big data scenarios, the data runs into tens of GBs; without enough RAM, we simply cannot process it. And even if we provision enough RAM for the heavy step, the data usually shrinks once the processing is done, the extra memory sits idle, and we end up wasting resources.
For most problems, we don't need big data solutions; with enough RAM, Pandas runs smoothly.
Dask:
Dask is open source and freely available. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love. It is developed in coordination with other community projects like NumPy, Pandas and scikit-learn.
For example, assume you have a laptop with 4 GB of RAM and 5 GB of data to work on. Can you use Pandas? In a way, yes: you can load the data in chunks using the chunksize parameter of Pandas read functions such as read_csv, as sketched below. But we cannot parallelize the tasks this way, and for a smooth run we would really want at least 5 GB of RAM, which means replacing the laptop or adding RAM.
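As a rough sketch (the file name large_dataset.csv and the category column are hypothetical), chunked reading in Pandas looks something like this:

import pandas as pd

# Read the CSV in 1-million-row chunks instead of loading it all at once.
chunks = pd.read_csv("large_dataset.csv", chunksize=1_000_000)

# Process each chunk independently and combine only the (much smaller) partial results.
partial_counts = [chunk["category"].value_counts() for chunk in chunks]
total_counts = pd.concat(partial_counts).groupby(level=0).sum()
print(total_counts)

This keeps memory usage bounded, but each chunk is still processed one after another on a single core.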
Now, assume you have an extra laptop with 4 GB of RAM. In total, we have two laptops with 4 GB of RAM each and 5 GB of data. This is where the Dask framework comes to the rescue: using Dask, we can parallelize and distribute tasks across multiple nodes/laptops. We need not upgrade our laptop :)
Even if you don't have an extra laptop, you can easily scale with Dask on a single machine, using multiple cores for computation and the local disk for spilling excess data, as in the sketch below.
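As a minimal sketch (assuming the dask.distributed package is installed; worker counts and memory limits are illustrative), a local "cluster" on one laptop can be set up like this:

from dask.distributed import Client, LocalCluster

# Use the laptop's own cores as workers; each worker gets a memory budget
# and spills data that does not fit in memory to local disk.
cluster = LocalCluster(n_workers=4, memory_limit="1GB")
client = Client(cluster)
print(client)

# For the two-laptop scenario, you would instead point Client at the address of a
# Dask scheduler running elsewhere, e.g. Client("tcp://<scheduler-ip>:8786").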
The question is "How much data can it scale to?". There is no exact answer, but experts say it can scale up to hundreds of GBs of data. Under the hood, a Dask DataFrame is made up of many Pandas DataFrames, and the syntax is very similar to Pandas.
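For instance, a Dask version of a typical Pandas groupby (again using the hypothetical large_dataset.csv, with an assumed amount column) might look like this:

import dask.dataframe as dd

# Dask splits the file into many Pandas partitions under the hood.
ddf = dd.read_csv("large_dataset.csv")

# The API mirrors Pandas; nothing is executed until .compute() is called.
result = ddf.groupby("category")["amount"].mean().compute()
print(result)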
PySpark:
PySpark is the Python API for Spark. Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
If you have terabytes or petabytes of data, you can close your eyes and choose PySpark. The spark.ml package helps you build models and create pipelines, as sketched below.
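As a hedged sketch of what a spark.ml pipeline can look like (the file name and the feature1/feature2/label columns are hypothetical):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-example").getOrCreate()

# Read the (hypothetical) dataset as a Spark DataFrame.
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

# Assemble raw columns into a feature vector and fit a model inside a pipeline.
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)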
The disadvantage is that you may not find all the algorithms that scikit-learn offers. For example, XGBoost has a Spark-compatible version you can use, but coverage is uneven: some algorithms are implemented for Spark and some are not. Here, Pandas looks better, as it integrates easily with other libraries.
Summary
In summary, there is no single library or framework that I recommend; each has its advantages and disadvantages. Pandas cannot scale beyond the RAM of a single machine. Dask can scale to hundreds of GBs of data, and PySpark to terabytes and beyond. Pandas integrates easily with many other libraries, while PySpark does not.
What I suggest is: do the heavy pre-processing in Dask/PySpark. Once the data is reduced or processed, you can switch to Pandas in either case, provided it now fits in RAM; the Dask documentation itself suggests switching to a Pandas DataFrame once the data fits in memory, as illustrated below. If the data is small to begin with, just use Pandas; otherwise Dask/PySpark adds the overhead of data transfer, task scheduling, etc.
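As an illustrative sketch (same hypothetical file and columns as above), the hand-off back to Pandas looks like this in both libraries:

import dask.dataframe as dd
from pyspark.sql import SparkSession

# Dask: filter at scale, then materialise the reduced result as a regular Pandas DataFrame.
ddf = dd.read_csv("large_dataset.csv")
small_pdf = ddf[ddf["amount"] > 100].compute()

# PySpark: the equivalent is toPandas(); only call it once the result fits in RAM.
spark = SparkSession.builder.appName("to-pandas").getOrCreate()
sdf = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
small_pdf_spark = sdf.filter(sdf["amount"] > 100).toPandas()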
The above content is based on my experience and understanding.
Hope you got a clear idea after reading this blog. If you have any queries, drop them in the comments section below. I would be more than happy to answer them.
Thank you for reading my blog and supporting me. Stay tuned for my next post. If you want to receive email updates, don’t forget to subscribe to my blog. Keep learning and sharing!!
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi/
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.