In real-life data science and machine learning scenarios, we often deal with large datasets. Working with tremendously large datasets is challenging and can easily become a bottleneck when modelling an algorithm.
When we go deeper, we find that it is usually the number of features that makes a dataset large. A large number of instances does not always come with a large number of features, but that is not the point of discussion here. It is also very common, in a high-dimensional dataset, to find many irrelevant or insignificant features, because they contribute little or nothing when the data is used for predictive modelling. They can even impact modelling negatively. Here are some possible impacts these features have on efficient predictive modelling:
- Unnecessary memory and resources are allocated to such features, which slows down the process.
- Machine learning algorithms perform poorly because such features act as noise for them.
- Modelling data with high-dimensional features takes more time than modelling data with low dimensions.
So feature selection comes in as a saviour here, and it is also an economical solution. In this article, we are going to talk about the following topics:
Table of Contents
- What is Feature Selection?
- Feature Selection Methods
- Difference Between Filter, Wrapper and Embedded Methods for Feature Selection
- A Case Study in Python
What is Feature Selection?
Feature selection is the process of extracting or selecting a subset of features from a dataset that has a large number of features. While selecting features from a dataset, we should consider their potential contribution before applying them to machine learning and statistical modelling.
The motive behind this procedure is to reduce the number of input features used for final modelling. At the same time, the selected features should be the most important ones for the model. Talking about the impact, this procedure simplifies the machine learning model and improves its accuracy and efficiency. Many times it also saves models from overfitting.
The point worth noting here is that feature selection is different from feature engineering: feature engineering refers to the process of creating new features or variables that are not explicitly present in the original dataset but may be useful in improving the performance of a model. Feature selection, on the other hand, is concerned with selecting the most relevant features from a given set of features.
There are different methods of feature selection, such as filter, wrapper and embedded methods. Let’s take a look at the basic methods of feature selection.
Feature Selection Methods
In general, feature selection methods can be classified into three main categories:
Filter methods: These methods help us select important features by evaluating the statistical properties of the dependent and independent features, such as correlation, mutual information, or significance tests, independently of the learning algorithm. The image below illustrates these methods further.

Some examples of this type of method are as follows (a short code sketch illustrating two of them follows this list):
- Correlation-based Feature Selection (CFS): In this type of feature selection procedure, we evaluate the correlation between the dependent and independent features of the data. Here we select the subset of features that have the highest correlation with the target feature.
- Mutual Information: This method is similar to the CFS method, but it works based on the mutual information between the dependent and independent variables. Based on this evaluation, we eliminate the features that share the lowest mutual information with the target variable.
- Principal Component Analysis (PCA): Using this method, we reduce the dimensionality of the data and try to get a smaller set of principal components that explain most of the variance in the data.
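To make this concrete, here is a minimal sketch of filter-style selection with scikit-learn, showing a correlation ranking and mutual-information-based selection. The breast cancer dataset bundled with scikit-learn and the choice of keeping 10 features are illustrative assumptions, not part of the original discussion.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Illustrative data: any numeric feature matrix X and target y works the same way
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Correlation-based ranking: absolute correlation of each feature with the target
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print(correlations.head(10))

# Mutual-information-based selection: keep the 10 features that share the most
# information with the target, independently of any downstream model
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()].tolist())
```

Because neither step fits a predictive model, these scores can be computed once and reused with any learning algorithm afterwards, which is what makes filter methods cheap.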
Wrapper methods: In these methods, we evaluate the performance of the model with different subsets of features, using a specific algorithm to select the best subset. This type of method assesses the performance of a predictive model with a particular subset of features and iteratively searches for the subset that yields the highest performance. The picture below gives us a glimpse of wrapper methods for feature selection:

Some examples of wrapper methods for feature selection are as follows (a short code sketch follows this list):
- Forward Selection: In this method, the chosen algorithm starts modelling the data with an empty set of features and iteratively adds one feature at a time, evaluating the performance of the predictive model at each step. This process continues until the desired number of features is reached or no further gain in performance is obtained.
- Backward Elimination: We can think of this method as the opposite of forward selection; it starts with the whole set of features and removes one feature in every iteration. This process continues until the desired number of features is reached or no further gain in performance is obtained.
- Recursive Feature Elimination (RFE): With this method, we recursively remove features from the model based on their importance in the modelling procedure, stopping when the model reaches optimal results or an optimal subset of features.
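As a rough sketch of how wrapper methods look in code, the snippet below runs RFE and greedy forward selection with scikit-learn, wrapping a logistic regression model. The dataset, the wrapped estimator and the target of 10 features are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data; scaling helps the wrapped linear model converge
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

estimator = LogisticRegression(max_iter=1000)

# Recursive Feature Elimination: fit the model, drop the weakest feature, repeat
rfe = RFE(estimator=estimator, n_features_to_select=10)
rfe.fit(X_scaled, y)
print("RFE keeps:", X.columns[rfe.support_].tolist())

# Forward selection: start from an empty set and greedily add the feature that
# most improves cross-validated performance
sfs = SequentialFeatureSelector(estimator, n_features_to_select=10,
                                direction="forward", cv=5)
sfs.fit(X_scaled, y)
print("Forward selection keeps:", X.columns[sfs.get_support()].tolist())
```

Note how much more work this does than a filter method: every candidate subset triggers a fresh model fit, which is exactly the cost/accuracy trade-off discussed later in this article.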
Embedded Methods: As the name suggests, this type of feature selection method performs feature selection and model training simultaneously. In embedded methods, feature selection is performed during model training, with the aim of selecting the most relevant features for the specific model being used. A variety of algorithms, such as decision trees, support vector machines, and linear regression, can work with embedded feature selection methods.

Some examples of embedded methods for feature selection include LASSO (Least Absolute Shrinkage and Selection Operator) and Ridge Regression. Both perform regularisation by shrinking the coefficients of the less important features; LASSO can shrink them to exactly zero, so for linear regression it effectively selects the subset of features with non-zero coefficients, while Ridge only shrinks coefficients towards zero without eliminating them. Decision trees with pruning are another example, for tree-based models.
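As a hedged illustration of embedded selection, the sketch below uses scikit-learn's SelectFromModel with an L1-penalised (LASSO-style) logistic regression, since the example target is categorical; for a continuous target, sklearn.linear_model.Lasso plays the same role. The dataset and the regularisation strength C=0.1 are assumptions for demonstration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative data, standardised so that coefficient magnitudes are comparable
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# L1 regularisation drives the coefficients of less useful features to exactly
# zero while the model itself is being fitted, so selection happens inside training
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model)
selector.fit(X_scaled, y)

print("Features kept by the embedded selector:",
      X.columns[selector.get_support()].tolist())
```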
Difference Between Filter, Wrapper and Embedded Methods for Feature Selection
Above, we have seen the basic classification of feature selection methods into three broad categories. Some basic differences between these methods are as follows:
- Filter methods are independent of any specific machine learning model, whereas wrapper methods are used to improve the performance of a specific machine learning model. Embedded methods select features during the model training process.
- Filter methods rank the features based on their ability to explain the target variable; wrapper methods evaluate the relevance of features based on their ability to improve the performance of a specific ML model; embedded methods incorporate the feature selection process into model training itself, with the aim of selecting the most relevant features for the specific model being used.
- Filter methods may not always identify the optimal subset of features when there is insufficient data to capture the statistical relationships between the features and the target. In contrast, wrapper and embedded methods can provide a better subset of features because they evaluate the performance of a model with different subsets of features, either iteratively or during training.
- Wrapper methods are generally more computationally expensive and time-consuming than filter methods, while embedded methods can be more efficient than wrapper methods.
- Using features selected by wrapper methods in the final machine learning model may increase the risk of overfitting, as the model has already been trained on those features over multiple iterations. With embedded methods, the risk of overfitting depends on the complexity of the model being trained, the quality of the selected features, and the regularisation techniques used. In contrast, filter methods typically select a subset of features based on their relevance to the target variable, without directly incorporating model performance into the selection process.
Good enough!
Now, let's take a look at a basic implementation of feature selection.
A Case Study in Python
Here, we are going to use the Pima Indians Diabetes dataset, where the objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
Let’s start by importing some basic libraries, modules and packages that we will need along the way.
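The exact imports used in the original case study are not reproduced here, so the block below is a reasonable starting point: pandas and NumPy for data handling, plus the scikit-learn pieces commonly used for feature selection on this dataset. The file name diabetes.csv is an assumed local path for the Pima Indians Diabetes CSV; adjust it to wherever your copy lives.

```python
# Core data-handling libraries
import numpy as np
import pandas as pd

# Scikit-learn utilities commonly used for feature selection and modelling
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

# Load the Pima Indians Diabetes data; 'diabetes.csv' is an assumed file name
df = pd.read_csv("diabetes.csv")
print(df.shape)
print(df.head())
```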