
Exploratory Data Analysis and Data Preprocessing steps

Exploratory Data Analysis (EDA) is the first step in solving a Data Science problem, and a thorough EDA often takes you most of the way to a solution. We should understand the importance of exploring the data: in general, Data Scientists spend most of their time exploring and preprocessing it, and EDA is the key to building high-performance models. In this article, I will explain the importance of EDA and the preprocessing steps you can perform before you dive into modeling.

I have divided the article into two parts:
  1. Exploratory Data Analysis
  2. Data Preprocessing Steps

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an art. It is all about understanding the data and extracting insights from it. When you solve a problem using Data Science, domain knowledge is very important: it helps us interpret the insights in the context of the business problem and find the "magic" features that boost model performance. With EDA, we can:
  • Get comfortable with the data you are working on.
  • Learn more about individual features and the relationships between various features.
  • Extract insights and discover patterns from the data.
  • Check whether the data makes sense and matches intuition.
  • Understand how the data was generated.
  • Discover outliers and anomalies in the data.
To explore the data, there are excellent data visualization libraries in both Python and R. You can use Matplotlib, Seaborn, or Plotly if you use Python, and ggplot2 if you use R.

Data Scientists use at least one of the above-mentioned libraries in their day-to-day work. Based on my experience, I suggest you learn one of them well and keep it handy so you can do EDA quickly.

Data Preprocessing Steps

It is highly recommended to explore and preprocess the data before you start the modeling phase. In practice, preprocessed data usually yields better performance than raw data. Based on my experience, I have listed the most commonly used preprocessing steps below.
  • Remove duplicate data-points: If the training data is large, the same data point may occur multiple times. It is better to remove duplicates from the dataset, as they introduce bias during modeling.
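As a minimal sketch, deduplication is a one-liner with pandas (the DataFrame below is made up for illustration):

```python
import pandas as pd

# Toy dataset in which the first and third rows are identical
df = pd.DataFrame({
    "age": [25, 32, 25, 41],
    "income": [50000, 64000, 50000, 72000],
})

# Drop exact duplicate rows and reset the index
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), "->", len(deduped))  # 4 -> 3
```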

  • Highly correlated features: If two features are highly correlated, it is suggested to remove one of them, since the two features provide essentially the same information. You can recognize such pairs by plotting a correlation matrix, optionally with clustering, which groups highly correlated features together and makes them easy to spot. If the correlation between two features is more than 0.99, it is usually safe to remove one of them. You can choose the threshold based on the problem you are solving.
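A sketch of this check with pandas and NumPy, using synthetic data in which `feat_b` is (by construction) an almost exact copy of `feat_a`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "feat_a": x,
    "feat_b": 2 * x + rng.normal(scale=0.01, size=200),  # near-duplicate of feat_a
    "feat_c": rng.normal(size=200),                      # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop one feature from every pair whose correlation exceeds 0.99
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
print(to_drop)  # ['feat_b']
```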

  • Low-variance features: If the variance of a feature is very low, i.e., the feature is (nearly) constant across the dataset, remove it. Such a feature cannot explain the variation in the target variable.
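scikit-learn provides `VarianceThreshold` for exactly this; a minimal sketch on a toy array whose second column is constant:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1.0, 5.2, 0.0],
    [2.0, 5.2, 1.0],
    [3.0, 5.2, 0.0],
])  # the second column is constant

# threshold=0.0 removes features with zero variance
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```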

  • Imbalanced data: If we come across an imbalanced dataset, we can oversample the class that has fewer data points or undersample the class that has more. For oversampling, we can duplicate minority-class data points or use techniques like SMOTE. For undersampling, we can remove some of the similar majority-class data points.
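SMOTE itself lives in the separate imbalanced-learn (imblearn) package; as a self-contained sketch, here is the simpler random-oversampling variant using scikit-learn's `resample` on a made-up dataset:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 8 + [1] * 2,  # class 1 is the minority
})

majority = df[df.label == 0]
minority = df[df.label == 1]

# Oversample the minority class by sampling with replacement
# until it matches the majority class size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print((balanced.label == 1).sum())  # 8, now equal to the majority class
```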

  • Treating missing values: There are different ways of handling this problem:
    1. If a feature has more than 40-50% missing values, drop the feature from the data.
    2. If only a few rows have missing values, drop those rows.
    3. Otherwise, impute the missing values using the mean or median (for numerical features) or the mode (for categorical features).
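The third option (imputation) can be sketched with pandas on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],       # numerical feature with a gap
    "city": ["NY", "SF", None, "NY"],  # categorical feature with a gap
})

# Numerical: fill with the median; categorical: fill with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df.isna().sum().sum())  # 0
```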

  • Encoding categorical features: Most models take only numeric data as input, so we have to convert categorical features to numbers. If the number of categories is small, we can use one-hot encoding or label encoding. For high-cardinality features, techniques such as the supervised ratio work better; it is worth reading up on categorical encoding methods to learn more.
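One-hot encoding is a one-liner with pandas' `get_dummies`; a minimal sketch on a made-up `colour` column:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One binary column per category
one_hot = pd.get_dummies(df, columns=["colour"])
print(sorted(one_hot.columns))
# ['colour_blue', 'colour_green', 'colour_red']
```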

  • Feature scaling: In general, the features in a dataset are on different scales, and features with larger scales can dominate the others. It is therefore suggested to bring all the features onto the same scale.
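A minimal sketch with scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance (the toy array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column now has mean ~0 and standard deviation 1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```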

  • Dimensionality reduction: Datasets are often very large nowadays. If we have hundreds or thousands of features, we can use dimensionality reduction techniques. One popular technique is Principal Component Analysis (PCA), which transforms the features into new features that are linear combinations of the original ones. It is suggested to standardize the features before applying PCA.
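The standardize-then-PCA recipe can be sketched with scikit-learn on random data (5 features projected down to 2 components):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))  # 100 samples, 5 features

# Standardize first, then project onto the top 2 principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape)  # (100, 2)
```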

  • Other preprocessing checks:
    1. Make sure that the distributions of the train and test sets are the same. If they differ, the analysis you do on the training set will not carry over, and you will end up with worse performance.
    2. Check whether the dataset is shuffled, so the model sees a mix of different data points in each iteration.
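For the shuffling check, scikit-learn's `shuffle` utility permutes features and labels together so the pairs stay aligned; a minimal sketch:

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
y = np.array([0, 0, 0, 1, 1])

# Shuffle X and y with the same permutation so rows and labels stay paired
X_shuf, y_shuf = shuffle(X, y, random_state=0)
print(X_shuf.shape, y_shuf.shape)  # (5, 2) (5,)
```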
This article covers preprocessing steps for tabular datasets. If you are working with text or images, the preprocessing steps are different; I will cover those in a future article.

I hope you got a clear idea after reading this article. If you have any queries, leave them in the comments section below; I would be more than happy to answer them.

Thank you for reading my article and supporting me. Stay tuned for my next post. If you want to receive email updates, don't forget to subscribe to my blog. Keep learning and sharing!

If you are looking for any specific blog, please do comment in the comment section below.


  1. Hi, I have a query: if we have a class imbalance problem and split the data into train and test sets with stratified sampling, do we still need oversampling or SMOTE, or will stratified sampling alone solve the class imbalance issue?

    1. Hi,
      Stratified sampling only ensures that the classes are distributed in equal proportions between the train and test sets. But suppose you have two classes A and B, where class A has 1000 records and class B has only 50. In this case your model won't be able to train adequately on class B. Hence you will have to generate more records for class B, and that is where SMOTE is useful.

  2. You can use the argument class_weight='balanced', which is available for most ML methods.

  3. Hello, I am working on an imbalanced dataset and I have used SMOTE to balance the data, but my RF classifier still doesn't give good accuracy.
    What can be done in this case?

    1. Hi Abhi,

      SMOTE does not improve the performance in all cases; you should try some other methods, like:

      1. Collect more data, or duplicate minority-class data points.
      2. Use XGBoost instead of RF.
      3. Use the F1 score as a metric instead of accuracy.

      Hope this helps.


