Exploratory Data Analysis (EDA) is the first step in solving a Data Science problem, and doing it well takes you a large part of the way to a solution. We should understand the importance of exploring the data: in general, Data Scientists spend most of their time exploring and preprocessing it, and EDA is the key to building high-performance models. In this article, I will explain the importance of EDA and the preprocessing steps you can take before you dive into modeling.
I have divided the article into two parts:
- Exploratory Data Analysis
- Data Preprocessing Steps
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an art: it is all about understanding the data and extracting insights from it. When you solve a problem with Data Science, domain knowledge is very important; it helps you interpret the insights in the context of the business problem and find the "magic" features that boost performance. With EDA we can:
- Get comfortable with the data you are working on.
- Learn more about individual features and the relationships between various features.
- Extract insights and discover patterns from the data.
- Check whether the data makes sense and matches your intuition.
- Understand how the data was generated and created.
- Discover outliers and anomalies in the data.
To explore the data, there are excellent data visualization libraries in both Python and R. You can use Matplotlib, Seaborn, or Plotly if you work in Python, and ggplot2 if you work in R.
Data Scientists use at least one of the above-mentioned libraries in their day-to-day work. Based on my experience, I suggest you learn one of them well and keep it handy so you can do EDA quickly.
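To give a feel for how this looks in practice, here is a minimal EDA sketch in Python with pandas and Seaborn; the file name and column name (data.csv, age) are placeholders, not a real dataset.

```python
# Minimal EDA sketch with pandas, Seaborn and Matplotlib.
# "data.csv" and the column "age" are hypothetical placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

print(df.shape)            # number of rows and columns
print(df.dtypes)           # data type of each feature
print(df.describe())       # summary statistics of the numeric features
print(df.isnull().sum())   # missing values per feature

sns.histplot(df["age"], kde=True)   # distribution of a single numeric feature
plt.show()

# Pairwise correlations between the numeric features
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```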
Data Preprocessing Steps
It is highly recommended to explore and preprocess the data before you start the modeling phase: models trained on preprocessed data generally perform better than models trained on raw data. Based on my experience, I have listed the most commonly used preprocessing steps below, with short Python sketches after the list.
- Remove duplicate data points: If the training data is huge, the same data point may well occur multiple times. It is better to remove duplicates from the dataset, as they introduce bias during modeling.
- Highly correlated features: If two features are highly correlated, it is suggested to remove one of them, since they provide essentially the same information. You can spot this by plotting a correlation matrix with clustering (a clustered heatmap), which groups the highly correlated features together and makes our life easier. If the correlation between two features is more than 99%, it is usually safe to drop one; you can choose the threshold based on the problem you are solving.
- Low-variance features: If the variance of a feature is very low, i.e., the feature is nearly constant across the dataset, remove it from the data. Such a feature cannot explain the variation in the target variable.
- Imbalanced data: If we come across an imbalanced dataset, we can oversample the class that has fewer data points or undersample the class that has more. For oversampling, we can duplicate data points or use techniques like SMOTE. For undersampling, we can remove some of the similar data points.
- Treating missing values: There are different ways of handling this problem:
- If the feature has more than 40 to 50% missing values, drop the feature from the data.
- If very few values are missing, we can drop the rows that have missing values.
- Otherwise, we can impute the missing values using the mean or median (for numerical features) or the mode (for categorical features).
- Encoding categorical features: Most models take only numeric data as input, so we have to convert categorical features to numbers. If the number of categories is small, we can use one-hot encoding or label encoding. If the dataset has high-cardinality features, you can use techniques such as the supervised ratio. Do explore categorical encoding techniques further to learn more.
- Feature scaling: In general, the features in a dataset are on different scales, and features with larger values may dominate the others. So it is suggested to bring all the features onto the same scale.
- Dimensionality reduction: Datasets are very large nowadays. If we have hundreds or thousands of features, we can use dimensionality reduction techniques. One of the most popular is Principal Component Analysis (PCA), which transforms the features into new features that are linear combinations of the original ones. It is suggested to standardize the features before applying PCA.
- Other preprocessing checks:
- Make sure that the distributions of the train and test sets are the same. If they are different, the whole analysis makes little sense, and you will likely end up with poor performance.
- Check whether the dataset is shuffled. Shuffling helps the model see a mix of different data points in each iteration.
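To make the first few steps concrete, here is a rough sketch with pandas and scikit-learn; it assumes a DataFrame df holding only numeric feature columns, and the 99% cut-off is just the example threshold mentioned above.

```python
# Sketch: drop duplicates, highly correlated features and constant features.
# Assumes `df` holds only numeric feature columns; thresholds are illustrative.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

df = df.drop_duplicates()                         # remove duplicate data points

corr = df.corr().abs()                            # absolute pairwise correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
df = df.drop(columns=to_drop)                     # keep one feature from each correlated pair

selector = VarianceThreshold(threshold=0.0)       # remove zero-variance (constant) features
selector.fit(df)
df = df.loc[:, selector.get_support()]
```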
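For imbalanced data, SMOTE is available in the imbalanced-learn library. A minimal sketch, assuming numeric features X and labels y from the training split:

```python
# Sketch: oversample the minority class with SMOTE (imbalanced-learn).
# X and y are assumed to be the training features and labels; resample the training data only.
from collections import Counter
from imblearn.over_sampling import SMOTE

print("before:", Counter(y))
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_resampled))
```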
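The three missing-value strategies could look roughly like this; the 50% cut-off and the column names (price, age, city) are placeholders:

```python
# Sketch: drop mostly-empty features, drop sparse rows, impute the rest.
# The 50% cut-off and the columns "price", "age" and "city" are hypothetical.
missing_ratio = df.isnull().mean()                               # fraction missing per column
df = df.drop(columns=missing_ratio[missing_ratio > 0.5].index)   # drop features with >50% missing

df = df.dropna(subset=["price"])                                 # drop the few rows missing a key value

df["age"] = df["age"].fillna(df["age"].median())                 # numeric: impute with the median
df["city"] = df["city"].fillna(df["city"].mode()[0])             # categorical: impute with the mode
```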
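Encoding and scaling can be sketched with pandas and scikit-learn; again, the column names are placeholders:

```python
# Sketch: one-hot encode low-cardinality categoricals and scale the numeric features.
# The columns "city", "gender", "age" and "income" are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.get_dummies(df, columns=["city", "gender"])        # one-hot encoding

numeric_cols = ["age", "income"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])  # same scale for all numerics
```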
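Finally, a sketch of a shuffled, stratified train/test split followed by standardization and PCA; X, y and the number of components are assumptions for illustration:

```python
# Sketch: shuffled, stratified split, then standardize and reduce dimensionality with PCA.
# X, y and n_components=10 are illustrative assumptions.
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)  # stratify keeps the class proportions similar in train and test

scaler = StandardScaler().fit(X_train)           # standardize before PCA, fit on train only
pca = PCA(n_components=10).fit(scaler.transform(X_train))

X_train_reduced = pca.transform(scaler.transform(X_train))
X_test_reduced = pca.transform(scaler.transform(X_test))   # reuse the same transforms on test
```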
This article covers preprocessing steps for tabular datasets. If you are working on text or images, the preprocessing steps would be different; I will cover those in a future article.
I hope you got a clear idea after reading this article. If you have any queries, leave them in the comments section below; I would be more than happy to answer them.
Thank you for reading my article and supporting me. Stay tuned for my next post. If you want to receive email updates, don’t forget to subscribe to my blog. Keep learning and sharing!!
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi/
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.
Hi, I have a query: if we have a class imbalance problem and split the data into train and test with stratified sampling, do we still need oversampling or SMOTE, or will stratified sampling alone solve the class imbalance issue?
Hi,
Stratified sampling will just help distribute the data into the train and test sets in equal proportions. But suppose you have two classes, A and B: class A has 1000 records whereas class B has only 50. In this case, your model won't be able to train adequately for class B, so you will have to generate more records for class B, and that is where SMOTE is useful.
You can also use the argument class_weight='balanced', which is available in most scikit-learn classifiers.
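(A minimal sketch of that argument, assuming the training data X_train and y_train already exist; the choice of classifier is just an example.)

```python
# Sketch: class_weight='balanced' in a scikit-learn classifier; X_train and y_train are assumed.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)   # minority-class errors get a larger weight during training
```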
Hello, I am working on an imbalanced dataset and I have used SMOTE to balance the data, but my RF classifier still doesn't give good accuracy.
What can be done in this case?
Hi Abhi,
SMOTE does not improve the performance in all cases; you should try some other methods, like:
1. Try to increase the amount of data or duplicate data points.
2. Use XGBoost instead of RF.
3. Use the F1 score as a metric instead of accuracy.
Hope this helps.
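(A hedged sketch of points 2 and 3, assuming the train/test split already exists; XGBClassifier comes from the xgboost package.)

```python
# Sketch: XGBoost instead of Random Forest, evaluated with the F1 score.
# X_train, y_train, X_test and y_test are assumed to exist.
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)
print("F1:", f1_score(y_test, model.predict(X_test)))
```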