
Beginner's Guide to Feature Selection

In real-life data science and machine learning scenarios, we often deal with very large datasets. Working with such datasets is challenging and can easily become a bottleneck when building a model.

Digging deeper, we find that it is usually the number of features that makes a dataset large, although a large number of instances does not always come with a large number of features; that, however, is not the point of discussion here. In high-dimensional datasets we also very often find many irrelevant or insignificant features that contribute little or nothing to predictive modelling, and they can even affect it negatively. Here are some of the ways such features can hurt efficient predictive modelling:

  • Unnecessary memory and resources are allocated to such features, which slows down the process.
  • Machine learning algorithms perform poorly because such features act as noise for them.
  • Modelling data with high-dimensional features takes more time than modelling data with fewer dimensions.

So feature selection comes in as a saviour here, and it is also an economical solution. In this article, we are going to cover the following topics:

Table of contents

  • What is Feature Selection?
  • Feature Selection Methods
  • Difference Between Filter, Wrapper and Embedded Methods for Feature Selection
  • A Case Study in Python

What is Feature Selection?

Feature selection is the process of selecting a subset of features from a dataset that has a large number of features. While selecting features, we should consider how much each feature can potentially contribute before using the data for machine learning and statistical modelling.

The motive behind this procedure is to reduce the number of input features used for final modelling, while making sure the selected features are the most important ones for the model. In terms of impact, this procedure simplifies the machine learning model and improves accuracy and efficiency. It also often protects models from overfitting.

The point worth noting here is that feature selection is different from feature engineering: feature engineering refers to the process of creating new features or variables that are not explicitly present in the original dataset but may be useful in improving the performance of a model, whereas feature selection is concerned with selecting the most relevant features from a given set of features.

There are different methods of feature selection, such as filter, wrapper and embedded methods. Let's take a look at these basic methods of feature selection.

Feature Selection Methods

In general, feature selection methods can be classified into three main categories:

Filter methods: these methods help us select important features by evaluating statistical properties of the dependent and independent features, such as correlation, mutual information, or significance tests, independently of any learning algorithm.

Some examples of this type of method are as follows:

  • Correlation-based Feature Selection (CFS): in this type of feature selection, we evaluate the correlation between the independent features and the target. We then select the subset of features with the highest correlation with the target feature.
  • Mutual Information: this method is similar to CFS, but it works on the mutual information between the independent variables and the target. Based on this evaluation, we eliminate the features that have the lowest mutual information with the target variable.

  • Principal Component Analysis (PCA): using this method, we reduce the dimensionality of the data and obtain a smaller set of principal components that explain most of the variance in the data. Strictly speaking, PCA is feature extraction rather than feature selection, since the components are new combinations of the original features. A minimal sketch of the filter idea follows this list.
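To make the filter idea concrete, here is a minimal sketch that scores features by correlation and by mutual information without training any model. The file name, the DataFrame `df` and the `target` column are hypothetical placeholders, not part of the case study below:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("data.csv")          # hypothetical file with a binary "target" column
X = df.drop(columns=["target"])
y = df["target"]

# Filter scores come from the data alone, independent of any learning algorithm
corr_scores = X.corrwith(y).abs().sort_values(ascending=False)   # linear association with the target
mi_scores = pd.Series(mutual_info_classif(X, y), index=X.columns).sort_values(ascending=False)
print(corr_scores)
print(mi_scores)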

Wrapper methods: in these methods, we evaluate the performance of a model with different subsets of features, using a specific algorithm to select the best subset. A wrapper method assesses the performance of a predictive model on a particular subset of features and iteratively searches for the subset that yields the highest performance.

Some examples of wrapper methods for feature selection are as follows:

  • Forward Selection: the chosen algorithm starts with an empty set of features and iteratively adds one feature at a time, evaluating the performance of the predictive model at each step. This process continues until the desired number of features is reached or until performance stops improving.
  • Backward Elimination: we can think of this as the opposite of forward selection; it starts with the whole set of features and removes one feature in every iteration. The process continues until the desired number of features is reached or until performance stops improving. A sketch of both directions appears after this list.
  • Recursive Feature Elimination (RFE): with this method, we recursively remove features from the model based on their importance in the modelling procedure, stopping when we get optimal results from the model or an optimal subset of features.
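As a rough illustration of forward and backward selection, scikit-learn's SequentialFeatureSelector wraps an estimator and adds (or removes) one feature at a time; the feature matrix `X`, target `y` and the choice of logistic regression here are assumptions for the sketch:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),  # any estimator with fit/predict works
    n_features_to_select=3,             # stop once three features are selected
    direction="forward",                # "backward" gives backward elimination instead
    cv=5,                               # performance judged by cross-validation at each step
)
sfs.fit(X, y)
print(sfs.get_support())                # boolean mask of the selected features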

Embedded Methods: as the name suggests, this type of feature selection method performs feature selection and model training simultaneously. In embedded methods, feature selection happens during model training, with the aim of selecting the most relevant features for the specific model being used. A variety of algorithms, such as decision trees, support vector machines, and linear regression, can work with embedded feature selection methods.

Some examples of embedded methods for feature selection include LASSO (Least Absolute Shrinkage and Selection Operator), which performs regularisation by shrinking the coefficients of the less important features all the way to zero, so that the features with non-zero coefficients form the selected subset; Ridge Regression, which also shrinks coefficients but does not set them exactly to zero and is therefore more of a regularisation technique than a strict selector; and decision trees with pruning for tree-based models.
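A minimal sketch of the embedded idea, assuming a feature matrix `X` and target `y`: fit an L1-regularised model and keep only the features it assigns non-zero coefficients, here via scikit-learn's SelectFromModel:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# The L1 penalty drives the coefficients of weak features to exactly zero,
# so selection happens as a by-product of training the model itself.
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)
print(selector.get_support())       # boolean mask of the surviving features
X_selected = selector.transform(X)  # data restricted to those features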

Difference Between Filter, Wrapper and Embedded Methods for Feature Selection

Above, we have seen the basic classification of feature selection methods into three broad categories. Some basic differences between these methods are as follows:

  • Filter methods are independent of any specific machine learning model, whereas Wrapper methods are used to improve the performance of any specific machine learning model. Embedded methods select features during the model training process.
  • Filter methods rank the features based on their ability to explain the target variable; Wrapper methods evaluate the relevance of features based on their ability to improve the performance of a specific ML model; Embedded methods incorporate the feature selection process into model training itself, with the aim of selecting the most relevant features for the specific model being used.
  • Filter methods may not always identify the optimal subset of features when there is insufficient data to capture the statistical relationships between the features and the target. In contrast, Wrapper and Embedded methods can come closer to the best subset because they evaluate the performance of a model on different subsets of features, either over iterations or during training.
  • Wrapper methods are generally more computationally expensive and time-consuming than filter methods, while embedded methods are usually more efficient than wrapper methods.
  • Using features selected by wrapper methods in the final machine learning model may increase the risk of overfitting, since the model has already been trained on those features over multiple iterations. For embedded methods, the risk of overfitting depends on the complexity of the model being trained, the quality of the selected features, and the regularisation techniques used. In contrast, filter methods select a subset of features based on their relevance to the target variable without directly incorporating model performance into the selection process.

Good enough!

Now let's take a look at a basic implementation of feature selection.

A Case Study in Python

Here, we are going to use the Pima Indians Diabetes Dataset, where the objective is to diagnostically predict whether or not a patient has diabetes based on certain diagnostic measurements included in the dataset.

Let's start by importing some basic libraries, modules and packages that we will need along the way.

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression, Lasso  # Lasso is used in the embedded-method example below

Now, let’s import the dataset.

data = pd.read_csv("/content/diabetes.csv")

After successfully importing the data, let’s take a look at some of the rows.

data.head()

Above, we can see the eight features of the dataset along with the Outcome column, which tells us whether the patient is diabetic (1) or not (0). Regarding missing values, the NaN values appear to have been replaced by 0. Anyone can deduce this from the definitions of the columns, because it is impractical to have zero values in columns such as body mass index and insulin.
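To check this, we can count the zeros in the columns where a zero value is physically implausible; the column names below follow the standard Pima dataset and are an assumption about this particular CSV:

zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print((data[zero_invalid] == 0).sum())   # number of zeros per column

# One common (optional) fix is to treat these zeros as missing and impute them,
# for example with the column median:
# data[zero_invalid] = data[zero_invalid].replace(0, np.nan)
# data[zero_invalid] = data[zero_invalid].fillna(data[zero_invalid].median())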

Now we can convert the data into a NumPy array for faster computation and separate the features from the target.

array = data.values
# features
X = array[:, 0:8]
# target
Y = array[:, 8]

Filter Method

Here, we will perform a chi-squared statistical test, which works with non-negative feature values, and select the four best features from the data. The chi-squared test belongs to the filter family of feature selection methods.

test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
print(fit.scores_)

Output:

Here, we can see the chi-squared score of each feature. Now we can transform the data to keep only the selected features. Let's take a look.

features = fit.transform(X)
print(features[0:5, :])

Output:

Here are the first five rows of the four features selected from the dataset based on the chi-squared test.
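As a small addition (not shown in the original output), we can map the boolean support mask back to the column names to see exactly which four features were kept; this assumes the target Outcome is the last column of `data`, as loaded above:

selected_cols = data.columns[:-1][fit.get_support()]   # names of the columns SelectKBest retained
print(list(selected_cols))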

Wrapper Method

Next, we will take a look at the implementation of Recursive Feature Elimination, which belongs to the wrapper family of feature selection methods. We explained above how this method works.

We know that wrapper methods are used to improve the performance of a specific machine learning model, so here we will work with a logistic regression model.

model = LogisticRegression(max_iter=1000)  # extra iterations so the solver converges on this unscaled data
rfe = RFE(model, n_features_to_select=3, step=3)
fit = rfe.fit(X, Y)

Output:

Here, we have applied RFE-based feature selection with the logistic regression model. Let's see the results now.

print("Num Features: \n", fit.n_features_)
print("Selected Features: \n", fit.support_)
print("Feature Ranking: \n", fit.ranking_)

Output:

Here we can see the ranking of the features of the dataset; in the second output, the boolean support mask shows which features were selected. Now let's take a look at the embedded method.
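To make the mask easier to read, the same trick as before maps it back to the column names (again assuming the target is the last column of `data`):

rfe_cols = data.columns[:-1][fit.support_]   # names of the three features RFE retained
print("Selected by RFE:", list(rfe_cols))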

Embedded Method

Here, we will use Lasso regression for feature selection. It is a regression technique that adds a penalty term to the cost function, proportional to the sum of the absolute values of the coefficients (the L1 norm), which encourages sparsity in the coefficients.

In practice, Lasso can be used as a feature selection method by fitting a Lasso regression model on a dataset and examining the resulting coefficient vector to determine which features are important. Features with non-zero coefficients are considered important, while those with zero coefficients can be discarded.

Let's make a Lasso regression object and fit it on the data.

# Fit Lasso model
lasso = Lasso(alpha=0.1)
lasso.fit(X, Y)

Let's check the importance of all the features.

# Extract coefficients and print feature importance
coef = np.abs(lasso.coef_)
print("Feature importance:\n")
for i in range(len(coef)):  # iterate over the eight feature columns only, not the Outcome column
    print(f"{data.columns[i]}: {coef[i]}")

Output:

Here we can see the magnitude of each feature's Lasso coefficient; the features with larger coefficients matter more, and those with zero coefficients can be dropped.
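As a small follow-up sketch, the selection step implied above simply keeps the features whose Lasso coefficient is non-zero:

feature_names = data.columns[:-1]                 # the eight feature columns
print("Features kept by Lasso:", list(feature_names[coef > 0]))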

Final words

So far, we have discussed feature selection, the different methods of feature selection, and a basic implementation in Python. Feature selection is a big topic in its own right, so in future articles we will look at it in more detail and explain the variants of each feature selection method one by one.

About DSW

DSW, specializing in Artificial Intelligence and Data Science, provides platforms and solutions for leveraging data through AI and advanced analytics. With offices located in Mumbai, India, and Dublin, Ireland, the company serves a broad range of customers across the globe.

Our mission is to democratize AI and Data Science, empowering customers with informed decision-making. Through fostering the AI ecosystem with data-driven, open-source technology solutions, we aim to benefit businesses, customers, and stakeholders and make AI available for everyone.

Our flagship platform ‘UnifyAI’ aims to streamline the data engineering process, provide a unified pipeline, and integrate AI capabilities to support businesses in transitioning from experimentation to full-scale production, ultimately enhancing operational efficiency and driving growth.