Data Science is not just modelling. To extract value from Data Science, it needs to be integrated with the business, and the product needs to be deployed so that it is available to users. Building a Data Science product involves several steps. In this article, I will discuss the complete Data Science pipeline.
Steps involved in building a Data Science product:
- Understanding the Business Problem
- Data Collection
- Data Cleaning
- Exploratory Data Analysis
- Modelling
- Deployment
Let us discuss each step in detail.
Understanding the business problem:
We use Data Science to solve a problem. Without understanding the problem, we cannot apply Data Science to solve it. Understanding the business is very important in building a Data Science product. The model we build depends entirely on the problem we are solving. If the requirements are different, we need to adjust our algorithm so that it solves the problem.
For example, suppose we are building a recommendation system to recommend products, and the main goal is to recommend products to new users. If we build a recommendation engine that ignores this constraint, it is of no use. The recommendations should be completely different for new users and existing users, so we have to build a customized algorithm to recommend products to new users. We cannot use the same algorithm for all types of problems. This is why understanding the business problem is very important.
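To make the new-user case concrete, here is a minimal sketch of a cold-start fallback. The purchase log and the popularity-based strategy are assumptions for illustration, not part of the original article: new users have no click history, so one common customization is to recommend the overall most popular products to them.

```python
from collections import Counter

# Hypothetical purchase log of existing customers (product IDs)
purchases = ["A", "B", "A", "C", "A", "B"]

def recommend_for_new_user(purchase_log, k=2):
    """New users have no history, so fall back to the k most popular products."""
    return [product for product, _ in Counter(purchase_log).most_common(k)]

print(recommend_for_new_user(purchases))  # most popular products first
```

For existing users, the same system would instead use their own history (for example, collaborative filtering), which is exactly why the two cases need different algorithms.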
Data Collection:
After understanding the business problem, we should collect data accordingly. With more data, we can build more robust models. The data should also be accurate: if the data itself contains many outliers, even a sophisticated model is of no use, so we should be careful here. Data collection is tedious and difficult, and it requires a lot of patience and time.
In the example above, we should collect data from new customers who are using the platform for the first time and track their clicks. We should not use the data of existing customers, because it would add noise to the dataset. So, data collection is a very important step in solving a problem.
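As a small sketch of this filtering step (the column names and the "first-day click" rule are assumptions made up for this example), we can keep only the clicks of first-time users with Pandas:

```python
import pandas as pd

# Hypothetical clickstream log: one row per click
clicks = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "signup_date": pd.to_datetime(["2020-01-10", "2020-01-10", "2019-06-01", "2020-02-01"]),
    "click_date": pd.to_datetime(["2020-01-10", "2020-01-11", "2020-01-12", "2020-02-01"]),
    "product_id": ["A", "B", "A", "C"],
})

# Keep only clicks made by users on their first day on the platform,
# so existing-customer behaviour does not add noise to the dataset.
new_user_clicks = clicks[clicks["click_date"] == clicks["signup_date"]]
print(new_user_clicks[["user_id", "product_id"]])
```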
Data Cleaning:
After collecting the data, it should be processed, cleaned and stored in a structured form. Here, domain knowledge plays an important role. We should be able to detect outliers based on the business problem, and extract the features that make sense from a business perspective. The more noise we remove, the more robust a model we can build. The cleaned data will be used directly for analysis and modelling.
For example, if we are scraping data from a website, it will include HTML tags, which we should remove because they are not necessary. In some cases, we cannot extract data for all the fields; in that case, we should fill the missing fields with NA. We should handle null values and errors. If it is image data, we should remove noisy images. So, cleaning data is one of the most important phases in the pipeline.
Python and R can be used for data cleaning. If the data is very large, we can use SQL.
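A minimal Python sketch of the cleaning steps mentioned above (the sample records and the median-imputation choice for prices are assumptions for illustration):

```python
import re
import pandas as pd

def strip_html(text):
    """Remove HTML tags from scraped text with a simple regex."""
    return re.sub(r"<[^>]+>", "", text) if isinstance(text, str) else text

# Hypothetical scraped product records with HTML tags and missing fields
raw = pd.DataFrame({
    "title": ["<b>Phone</b>", "<i>Laptop</i>", None],
    "price": [299.0, None, 99.0],
})

raw["title"] = raw["title"].apply(strip_html)
raw["title"] = raw["title"].fillna("NA")                   # fields we could not extract
raw["price"] = raw["price"].fillna(raw["price"].median())  # handle null values
print(raw)
```

Note that for messy real-world HTML, a proper parser such as BeautifulSoup is more reliable than a regex.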
Exploratory Data Analysis:
After cleaning the data, we should start analyzing it. We should extract insights and hidden information from the data and relate them to the business. Visualization tools help us a lot in this step. We should spend a lot of time understanding the data, and business understanding helps here too. We can create new features that make sense. We also have to make sure that the distributions of the training and test (real-world) data are the same.
Some visualization libraries are Matplotlib, Seaborn and ggplot2. The Pandas library is very helpful for exploring data.
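As a small EDA sketch with Pandas (the customer-segment data here is made up for illustration), grouping and summary statistics are often the first step before any plots:

```python
import pandas as pd

# Hypothetical purchase data with a business-relevant segment column
df = pd.DataFrame({
    "segment": ["new", "existing", "new", "existing", "new"],
    "purchase_amount": [20.0, 150.0, 35.0, 90.0, 10.0],
})

# Summary statistics per customer segment reveal how the groups differ
stats = df.groupby("segment")["purchase_amount"].agg(["count", "mean"])
print(stats)

# For distributions, a histogram is a natural next step:
# df["purchase_amount"].hist()  # requires Matplotlib
```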
Refer to this article to learn more about EDA and data preprocessing steps.
Modelling:
We use Machine Learning and Deep Learning algorithms to solve the problem. Building different models is the most exciting and fun part. Think of a model as a black box that converts the input data into the output. We should try different methods and choose one based on performance. We cannot use the same method for solving every problem. Business-related features are very helpful in creating robust models.
We can use Scikit-learn for building Machine Learning models and Keras/TensorFlow for Deep Learning models. There are many deep learning frameworks out there; compare them on different aspects before choosing one.
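A short Scikit-learn sketch of "try different methods and choose based on performance" (using the built-in Iris dataset as a stand-in, since the article's own data is not available):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit several candidate models and compare held-out accuracy
scores = {}
for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]:
    model.fit(X_train, y_train)
    scores[type(model).__name__] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

The model with the better held-out score (or a better business-specific metric) is the one to carry forward.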
Deployment:
After building a model, it should be made accessible to users and scalable. So, we need to deploy the model using AWS, Google Cloud, etc. This is the last step of the pipeline. Once we have deployed it, we have successfully built a Data Science product that can be used by the end user.
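One common pattern is to wrap the trained model in a small web API and host that API on a cloud provider. Here is a minimal Flask sketch; the endpoint name, input format and the placeholder `predict` function are assumptions for illustration (a real app would load a saved model, e.g. with `joblib`):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Placeholder standing in for model.predict() on a loaded model
    return "popular" if sum(features) > 10 else "niche"

@app.route("/predict", methods=["POST"])
def serve():
    # Expects JSON like {"features": [5, 6]}
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

# To serve locally: app.run(host="0.0.0.0", port=8080)
# On AWS or Google Cloud, this app would run behind a production
# WSGI server such as gunicorn.
```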
In conclusion, we cannot skip any step in the pipeline. Each stage is important in building a successful product. In this article, you have learned the different steps involved in the Data Science pipeline.
Thank you for reading my blog and supporting me. Stay tuned for my next article. If you want to receive email updates, don’t forget to subscribe to my blog. If you have queries, please do comment in the comment section below. I will be more than happy to help you. Keep learning and sharing!!
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi/
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.