Need: Feature engineering is one of the most important skills in data science and machine learning. It has a major influence on the performance of machine learning models and even on the quality of insights derived during exploratory data analysis (EDA).

What is Feature Engineering?

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work (Wikipedia). It is the act of extracting important features from raw data and transforming them into formats suitable for machine learning.

To perform feature engineering, a data scientist combines domain knowledge (knowledge about a specific field) with math and programming skills to transform or come up with new features that will help a machine learning model perform better.

How to handle categorical features

Machine learning models cannot work with categorical features the way they are. These features must be converted to numerical forms before they can be used. The process of converting categorical features to numerical form is called encoding.

The two most popular techniques are ordinal encoding and one-hot encoding.

One-Hot Encoding: One-hot encoding uses binary values to represent classes. It creates one feature per category, and can quickly become inefficient as the number of classes in the categorical feature increases.
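As a quick sketch, one-hot encoding with pandas (the `color` column below is a hypothetical example, not from the original dataset):

```python
import pandas as pd

# Hypothetical toy column with three classes
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pd.get_dummies creates one binary column per category
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

With a high-cardinality feature this quickly inflates the number of columns, which is the inefficiency mentioned above.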

Ordinal Encoding: In ordinal encoding, each unique category value is assigned an integer value. For example, red is 1, green is 2, and blue is 3. This is also called integer encoding, and it is easily reversible. Often, integer values starting at zero are used. For some variables, an ordinal encoding may be enough: the integer values have a natural ordered relationship, and machine learning algorithms may be able to understand and harness this relationship.
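The red/green/blue example above can be sketched directly as a mapping (a minimal illustration; the column values are assumed):

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "red"])

# Explicit mapping mirroring the example in the text: red=1, green=2, blue=3
mapping = {"red": 1, "green": 2, "blue": 3}
encoded = colors.map(mapping)

# Reversing the encoding is just the inverse mapping
inverse = {v: k for k, v in mapping.items()}
decoded = encoded.map(inverse)
```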

Label Encoding: If you have a large number of classes in a categorical feature, you can use label encoding. Label encoding assigns a unique integer label to each class.
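A minimal sketch using scikit-learn's `LabelEncoder` (the city names are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically and numbered from 0
labels = le.fit_transform(["paris", "tokyo", "paris", "amsterdam"])
print(labels)  # [1 2 1 0]
```

Note that scikit-learn's `LabelEncoder` is intended for target labels; for input features, `OrdinalEncoder` does the same job column-wise.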

Hash Encoding: Hash encoding or feature hashing is a fast and space-efficient way of encoding features. It’s very efficient for categorical features with large numbers of classes. A hash encoder works by applying a hash function to the features. We demonstrate how to use this below.

  • First, we specify the features we want to hash encode.
  • Next, we create a hash encoder object and specify the length of the hash vector to be used.

  • Finally, we fit-transform the dataset.
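The steps above can be sketched with scikit-learn's `FeatureHasher` (the original notebook may use a different hashing encoder, such as the one in the `category_encoders` package; the `city` values here are assumed). `FeatureHasher` is stateless, so a plain `transform` plays the role of fit-transform:

```python
from sklearn.feature_extraction import FeatureHasher

# Step 1: the feature we want to hash encode -- a hypothetical "city" column
cities = [["London"], ["Paris"], ["London"], ["Delhi"]]

# Step 2: create the hasher and fix the length of the hash vector
hasher = FeatureHasher(n_features=8, input_type="string")

# Step 3: transform the data into an (n_samples, 8) sparse matrix
hashed = hasher.transform(cities)
```

Identical category values always hash to identical vectors, and the output width stays fixed at `n_features` no matter how many classes appear.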

Target Encoding: In target encoding, we calculate the average of the target value by a specific category and replace that categorical feature with the result. Target encoding helps preserve useful properties of the feature and can sometimes help improve classification models—however, it can sometimes lead to severe overfitting.
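A minimal pandas sketch of target (mean) encoding, with assumed toy data:

```python
import pandas as pd

# Hypothetical data: a categorical column and a binary target
df = pd.DataFrame({
    "city":   ["a", "a", "b", "b", "b"],
    "target": [1,   0,   1,   1,   0],
})

# Mean of the target per category, mapped back onto the column
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
```

In practice the means should be computed on the training fold only (or with smoothing/cross-fold schemes), which is how the overfitting risk mentioned above is usually controlled.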

Count Encoding: Count encoding converts each categorical value to its frequency, i.e., the number of times it appears in the dataset.
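Count encoding reduces to a frequency lookup; a sketch with pandas (the values are assumed):

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "red", "blue", "green"])

# Replace each category with the number of times it appears in the data
counts = colors.value_counts()
encoded = colors.map(counts)
print(encoded.tolist())  # [3, 2, 3, 3, 2, 1]
```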

How to handle numerical/continuous features

Numerical/continuous features are the most common type of feature found in datasets. They represent measurable quantities and can take any value within a given range.

Log Transformation: Log transformation helps reduce skew, bringing the distribution of the data closer to normal. This can help many machine learning models perform better.
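A small sketch with NumPy; `log1p` (i.e. log(1 + x)) is a common choice because it handles zeros safely (the input values are assumed):

```python
import numpy as np

# Hypothetical right-skewed feature (e.g. incomes spanning several magnitudes)
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log1p compresses large values while preserving order
x_log = np.log1p(x)
```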

Normalization of Features

Normalization helps change the values of numeric features to a common scale, without distorting differences in the range of values or losing information. Normalization is very important for distance-based models like KNNs, and it also helps speed up training in neural networks.

  1. StandardScaler: Standardize features by subtracting the mean and scaling to unit variance.
  2. RobustScaler: Scale features using statistics that are robust to outliers.
  3. MinMaxScaler: Normalize features by scaling each feature to a specified range (range depends on you!).
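The three scalers can be compared on a small assumed array (note how each handles the outlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# Hypothetical single-feature data; 100.0 is an outlier
X = np.array([[1.0], [2.0], [3.0], [100.0]])

standard = StandardScaler().fit_transform(X)   # zero mean, unit variance
robust = RobustScaler().fit_transform(X)       # centers on median, scales by IQR
minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)  # squeezes into [0, 1]
```

With the outlier present, `MinMaxScaler` compresses the three small values into a narrow band near 0, while `RobustScaler` leaves them spread out, which is why it is preferred for outlier-heavy data.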

Quantile Transformation: As mentioned above, some machine learning algorithms work best when the distribution of our data is uniform or normal. A quantile transformation maps a feature to such a distribution using its empirical quantiles.
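A sketch with scikit-learn's `QuantileTransformer`, using randomly generated skewed data:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical skewed feature: exponentially distributed samples
rng = np.random.RandomState(0)
X = rng.exponential(size=(1000, 1))

# Map the feature to a uniform [0, 1] distribution;
# pass output_distribution="normal" for a Gaussian target instead
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=0)
X_uniform = qt.fit_transform(X)
```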

Github Link

https://github.com/drstatsvenu/data-engineering

Source:

https://heartbeat.comet.ml/a-practical-guide-to-feature-engineering-in-python-8326e40747c8

https://github.com/risenW/Practical_feature_engineering_guide/blob/master/Practical%20Featture%20Engineering%20Guide.ipynb

Feature Engineering

Venugopal Manneni


A doctor in statistics from Osmania University. I have been working in the fields of analytics and research for the last 15 years. My expertise is architecting solutions to data-driven problems using statistical methods, machine learning, and deep learning algorithms for both structured and unstructured data. I have also published papers in these fields. I love to play cricket and badminton.

