Feature Engineering is one of the most important steps in solving a Data Science problem. It helps us build robust, high-performance models, which is why Data Scientists spend time understanding and preprocessing the data before diving into modelling.
In this article, I will explain some feature engineering techniques that I learned from the course “How to Win a Data Science Competition: Learn from Top Kagglers”. If you have time, I recommend taking this course.
Feature Engineering mainly involves feature preprocessing and feature generation. Most real-world data is noisy, and we cannot expect all features to be numerical: a dataset may also contain strings, timestamps, etc. It is generally suggested to convert all features into numerical or categorical ones (some models can handle categorical features directly). Using the existing features, we can also create new features based on domain knowledge; these might contribute to predicting the target.
In my previous article, I discussed the importance of EDA and some preprocessing steps such as removing duplicate data points, handling highly correlated features, low-variance features, imbalanced data, treating missing values, encoding categorical features, feature scaling, and dimensionality reduction techniques. I recommend reading it before diving into this article.
When we work on tabular data, we mainly come across numerical, categorical, and time-related features. Let’s discuss feature engineering techniques in detail for each data type.
Data Engineering for Numerical features (continuous):
Continuous features may range from -inf to +inf, depending on the feature. For example, age, height, salary, and population are continuous features. Some preprocessing techniques are:
- Feature scaling: How we preprocess continuous features depends on which model we plan to apply to the data. For a tree-based model, we need not worry about scaling. For a non-tree-based model, we should rescale the features so that they are on a comparable scale. We can use a Min Max Scaler (which brings each feature to [0, 1]) or a Standard Scaler (which standardizes each feature to zero mean and unit variance).
- Min Max Scaler: The features are transformed to [0, 1]: X_scaled = (X - X.min()) / (X.max() - X.min())
- Standard Scaler: The features are standardized by removing the mean and scaling to unit variance: X_scaled = (X - X.mean()) / X.std()
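The two scalers above can be sketched directly in NumPy (the age values are made up for illustration):

```python
import numpy as np

# A toy continuous feature (e.g. ages); values assumed for illustration.
ages = np.array([18.0, 25.0, 40.0, 60.0])

# Min Max Scaler: maps the feature to [0, 1].
minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standard Scaler: zero mean, unit variance.
standard = (ages - ages.mean()) / ages.std()
```

In practice you would typically use scikit-learn’s `MinMaxScaler` and `StandardScaler`, which also remember the fitted statistics so the same transformation can be applied to test data.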
- Remove outliers: Outliers can distort the model and decrease its performance. We can limit their influence by clipping the data, e.g. with NumPy’s clip method: values below the 1st percentile or above the 99th percentile are capped at those thresholds. The minimum and maximum thresholds can be chosen based on the feature distribution. Tree-based models are robust to outliers.
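A minimal sketch of percentile clipping, using synthetic data with a few injected extreme values:

```python
import numpy as np

rng = np.random.default_rng(0)
# A feature with two extreme values injected for illustration.
x = np.concatenate([rng.normal(50, 5, 1000), [500.0, -400.0]])

# Compute the 1st and 99th percentile thresholds from the data itself.
lo, hi = np.percentile(x, [1, 99])

# Cap everything outside [lo, hi] at the thresholds.
x_clipped = np.clip(x, lo, hi)
```

Note that clipping keeps every data point (capped at the thresholds) rather than dropping rows, which preserves the dataset size.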
- Feature Transformation: You can apply a rank transformation, log transformation, etc., based on the feature distribution. A rank transformation replaces each value with its rank in ascending or descending order. A log transformation helps bring skewed data closer to a normal distribution. You can use SciPy’s rankdata method and NumPy’s log1p method for the respective transformations.
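Both transformations in a short sketch, on a made-up right-skewed salary feature:

```python
import numpy as np
from scipy.stats import rankdata

# A right-skewed feature (e.g. salaries); values assumed for illustration.
salaries = np.array([30_000.0, 35_000.0, 40_000.0, 1_000_000.0])

# Rank transformation: each value is replaced by its ascending rank.
ranks = rankdata(salaries)

# Log transformation: log1p (= log(1 + x)) compresses the long right tail.
logged = np.log1p(salaries)
```

After `log1p`, the outlier salary is much closer to the rest of the values, while the ordering of the data points is preserved.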
Data Engineering for Categorical features:
We can divide categorical features into two main types: nominal and ordinal. The difference is that the values of an ordinal feature can be ordered, while the values of a nominal feature cannot be ordered or compared. For example, Sex (M/F) is a nominal feature, while Education (BTech, MTech, PhD) is an ordinal feature: we can compare education levels and assign ranks. Some preprocessing techniques are:
- One hot encoding:
- This technique converts categories into numbers. For example, if a feature has 5 categories, we create 5 features/columns, one per category, and assign 1 in the column matching the category and 0 elsewhere. One hot encoding is widely used when we apply non-tree-based methods.
- If the cardinality of a feature is very high, this technique greatly increases the dimension of the data. In those cases, it is recommended to use binary encoding, which represents each category as a string of binary bits (e.g. 1000100100). For example, suppose we have 1000 categories. One hot encoding creates 1000 features, whereas binary encoding creates only 10 (since 2^10 = 1024 ≥ 1000): the same 1000 categories are represented in 10 features.
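A small sketch of both encodings using pandas; the colour values are made up, and the binary encoding is written out by hand from the category codes (libraries such as category_encoders provide a ready-made BinaryEncoder):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# One hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Binary encoding sketch: map each category to an integer code,
# then spread the code's bits over ceil(log2(k)) columns.
codes = df["colour"].astype("category").cat.codes.to_numpy()
n_bits = max(int(codes.max()).bit_length(), 1)
binary = pd.DataFrame(
    {f"colour_bit{i}": (codes >> i) & 1 for i in range(n_bits)}
)
```

Here 3 categories need 3 one-hot columns but only 2 binary columns; with 1000 categories the gap becomes 1000 columns versus 10.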
- Label encoding: This technique maps each category of a feature to a unique integer. Tree-based methods can handle categorical features, so it is sufficient to use label encoding and specify that the features are categorical when initializing the model. It also doesn’t increase the dimension of a high-cardinality feature.
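A minimal label-encoding sketch with pandas, reusing the Education example from above; for the ordinal case we can pass an explicit order so the integer codes respect the ranking:

```python
import pandas as pd

df = pd.DataFrame({"education": ["BTech", "MTech", "PhD", "BTech"]})

# Plain label encoding: categories map to integers (alphabetical order here).
df["education_code"] = df["education"].astype("category").cat.codes

# Ordinal variant: supply the order explicitly so codes reflect the ranking.
order = ["BTech", "MTech", "PhD"]
df["education_rank"] = pd.Categorical(
    df["education"], categories=order, ordered=True
).codes
```

scikit-learn’s `LabelEncoder` gives the same kind of integer mapping; for tree libraries such as LightGBM you can instead keep the column as a pandas `category` dtype and let the model handle it.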
Data Engineering for Time-related features:
In addition to numerical and categorical features, date-time features are also generated in some use cases. We can extract a lot of information from date-time features and make use of it. This helps us capture seasonality, trends over time, etc. Some preprocessing techniques are:
- From a date-time feature, we can extract the day of the week, the month, the season, and the year from the date part, and the seconds, minutes, and hour from the time part. Some of these are categorical and some are numerical features.
- If we have a date-time feature, we can create a new feature “days since” using an anchor date or “days remaining”. It completely depends on the context.
- If we have two date-time features, we can create a feature that is the difference between them, e.g. the number of days or months between the two dates.
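The three techniques above can be sketched with pandas; the column names, dates, and anchor date are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "signup":    pd.to_datetime(["2020-01-05", "2020-03-10"]),
    "last_seen": pd.to_datetime(["2020-02-04", "2020-03-15"]),
})

# 1) Extract calendar parts from a single date-time feature.
df["signup_dayofweek"] = df["signup"].dt.dayofweek  # Monday=0 .. Sunday=6
df["signup_month"] = df["signup"].dt.month

# 2) "Days since" a fixed anchor date.
anchor = pd.Timestamp("2020-01-01")
df["days_since_anchor"] = (df["signup"] - anchor).dt.days

# 3) Difference between two date-time features, in days.
df["days_active"] = (df["last_seen"] - df["signup"]).dt.days
```

The extracted day of week and month are categorical, while the two difference features are numerical.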
Data generation techniques:
In addition to the above techniques, here are some examples of how we can generate new features.
- We have two numerical features, “Number of wheels” and “Number of spare wheels”. In this case, you can create a new feature, “Total number of wheels”, by adding them.
- There are two colours, “black” and “blue”. Suppose we know, for each of 100 people, whether they like “black” and whether they like “blue”, as two independent features. In this case, we can create a new boolean feature, “Do they like both colours?”, by combining the two.
- Most of the techniques covered in “time-related features” involve creating new features.
The generation of new features requires domain knowledge and a thorough understanding of the data.
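The two examples above can be sketched in a few lines of pandas (column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "n_wheels":       [4, 6],
    "n_spare_wheels": [1, 2],
    "likes_black":    [True, False],
    "likes_blue":     [True, True],
})

# New feature from the sum of two numerical features.
df["total_wheels"] = df["n_wheels"] + df["n_spare_wheels"]

# New boolean feature from the interaction of two binary features.
df["likes_both"] = df["likes_black"] & df["likes_blue"]
```

Sums, differences, ratios, and boolean combinations like these are cheap to compute, but deciding which ones are meaningful is where the domain knowledge comes in.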
I hope you got some idea of feature engineering techniques and how to handle data. If you have any queries, leave them in the comments section below; I would be more than happy to answer them.
Thank you for reading my blog and supporting me. Stay tuned for my next article. If you want to receive email updates, don’t forget to subscribe to my blog. Keep learning and sharing!!
Follow me here:
GitHub: https://github.com/Abhishekmamidi123
LinkedIn: https://www.linkedin.com/in/abhishekmamidi/
Kaggle: https://www.kaggle.com/abhishekmamidi
If you are looking for any specific blog, please do comment in the comment section below.