e-tinkerer

A Place for my train of thought, mainly electronics, MCUs and math

Cases of feature selection

In this post I share my solution to a school submission on feature selection. Have a good day!


Submission

When we train our models with a large number of features, we increase the computational resources needed to make a prediction. In some cases a feature does not provide any meaningful information, or the same information is already given by another feature. We should discard such features whenever possible. There are multiple methods with which this can be done.

Feature selection methods fall into three categories: 1) filter methods, 2) wrapper methods and 3) embedded methods. In this post I will go through some examples of each category.


Filter methods use statistical analysis to evaluate features. These methods are computationally less demanding than cross-validation based methods.

Information gain source

Decision trees use this method to find suitable features for the hypothesis space. Features are ranked by how much the label entropy decreases when the data is split into two groups according to a feature. This can be used to find correlation between the input data and the labels. A split resulting in little entropy reduction is ranked low and vice versa.
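The snippet below is a minimal sketch of the idea rather than an actual decision tree implementation: it measures how much the label entropy drops when a single numeric feature is split at a threshold. The toy data and the threshold are made up for illustration.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, threshold):
    # Entropy reduction when the samples are split into two groups
    # by comparing a numeric feature against a threshold.
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy data: the feature separates the classes cleanly, so the split
# removes all entropy and the gain equals the original entropy (1 bit).
feature = np.array([1.0, 1.2, 0.9, 3.1, 3.3, 2.9])
labels = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(feature, labels, threshold=2.0))  # 1.0
```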

Correlation Coefficient source

We can use Pearson’s correlation to determine whether features are linearly correlated with one another. Linearly correlated features do not provide any additional information for classification beyond what either of them provides alone. Calculating these coefficients as a correlation matrix is a useful way to discover whether some of the features are codependent. The selected features should still be correlated with the label classes.
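As a quick sketch, the Pearson correlation matrix of the Iris features (computed with scikit-learn and pandas) shows which feature pairs are close to perfectly correlated and therefore redundant; the dataset choice is just for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# Pearson correlation matrix of the features; pairs with a coefficient
# close to +/-1 carry largely redundant information.
print(iris.data.corr(method="pearson").round(2))

# Correlation of each feature with the class label, to check that the
# features we keep are still related to the target.
print(iris.frame.corr(method="pearson")["target"].round(2))
```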


Wrapper methods use a classifier and some metric to determine the best features to use. These often yield better results than filter methods but are computationally more demanding.

Leave Out One Feature (LOFO) source

LOFO tests the loss in model accuracy by leaving out one feature at a time in every training iteration. The features causing the largest accuracy loss are ranked as the most important ones.
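Below is a hand-rolled sketch of LOFO on scikit-learn's breast cancer dataset (dedicated LOFO libraries exist, but the plain loop shows the idea). The model and cross-validation setup are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Baseline cross-validated accuracy with all features included.
baseline = cross_val_score(model, X, y, cv=5).mean()

# Drop one feature at a time and record how much accuracy is lost;
# the largest drops point to the most important features.
importance = {}
for col in X.columns:
    score = cross_val_score(model, X.drop(columns=col), y, cv=5).mean()
    importance[col] = baseline - score

for name, drop in sorted(importance.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {drop:.4f}")
```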

Forward Feature Selection source

In forward feature selection we go through all possible features to predict a given label and select the one that provides the best accuracy. We then combine the selected feature with each of the remaining ones and see which combination improves the model accuracy the most. We keep doing this until a sufficient accuracy or a limit on the number of features is reached.
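scikit-learn ships a SequentialFeatureSelector that performs this greedy search; the sketch below uses it with an arbitrary target of five features and a logistic regression model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Greedily add one feature at a time, keeping the feature whose addition
# improves the cross-validated score the most, until five are selected.
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=5
)
sfs.fit(X, y)
print(list(X.columns[sfs.get_support()]))
```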

Exhaustive Feature Selection source

Exhaustive feature selection is a brute-force method that selects a group of features based on a scoring metric, for example the ROC AUC. It takes the minimum and maximum number of features as parameters, goes through all possible combinations in that range and returns the group of features with the best score.
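Libraries such as mlxtend provide an exhaustive feature selector, but the brute-force loop is short enough to sketch directly with itertools. The feature subset, the size limits and the model below are chosen arbitrarily to keep the search small.

```python
from itertools import combinations

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X = X.iloc[:, :6]  # keep only six features so the search stays small
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

min_features, max_features = 1, 3
best_score, best_subset = -1.0, None

# Score every subset between the minimum and maximum size with ROC AUC
# and keep the best one.
for k in range(min_features, max_features + 1):
    for subset in combinations(X.columns, k):
        score = cross_val_score(
            model, X[list(subset)], y, cv=5, scoring="roc_auc"
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, round(best_score, 4))
```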


Embedded methods combine the strengths of filter methods and wrapper methods to find good feature combinations at a reasonable computational cost.

Random Forest Importance source

A random forest classifier chooses split features based on Gini impurity. With a large number of decision trees we can examine all the trees and their nodes to find which features end up in the nodes near the tree roots. The closer a feature is to the root, the more important it is.
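The sketch below fits a random forest on the breast cancer dataset and prints scikit-learn's built-in impurity-based importances; the dataset and hyperparameters are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# feature_importances_ aggregates the Gini impurity decrease produced by
# every split on a feature, averaged over all trees in the forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(X.columns, forest.feature_importances_), key=lambda kv: -kv[1])
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")
```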

Lasso Regression source

Lasso uses L1 regularization to determine which features to use for prediction and which to discard.

“L1 regularization adds a penalty that is equal to the absolute value of the magnitude of the coefficient. This regularization type can result in sparse models with few coefficients. Some coefficients might become zero and get eliminated from the model. Larger penalties result in coefficient values that are closer to zero (ideal for producing simpler models).”
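As a minimal sketch with scikit-learn, a Lasso model fitted on standardized features keeps only the features whose coefficients do not shrink to zero. The penalty strength alpha is chosen arbitrarily here and would in practice be tuned, for example with cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# The L1 penalty drives some coefficients exactly to zero; the features
# with non-zero coefficients are the ones the model decided to keep.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.05)).fit(X, y)
coefficients = model.named_steps["lasso"].coef_

kept = [name for name, coef in zip(X.columns, coefficients) if coef != 0.0]
print(f"{len(kept)} of {X.shape[1]} features kept:", kept)
```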


Conclusion

There are many possible feature selection methods to choose from. The choice of method depends on the number of data dimensions and the available computational resources.

More sources