Data Science: Techniques for Data Reduction in Data Pre-processing

YASH PATEL
5 min read · Oct 26, 2021

Data reduction is a process that reduces the volume of the original data and represents it in a much smaller form. Data reduction techniques preserve the integrity of the data while shrinking its size.

The scikit-learn website provides many feature selection techniques. We will examine the performance of several of them on the same dataset.

Dataset

The dataset used for data reduction is the ‘Iris’ dataset from the sklearn.datasets module.

Importing all required libraries to load the dataset

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel, SelectKBest, f_classif, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.datasets import load_iris

Loading and visualizing the dataset
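The original code and plots for this step are not reproduced here; a minimal sketch, assuming the standard load_iris() interface, might look like this:

# Load the Iris dataset and inspect the raw features
iris = load_iris()
X, y = iris.data, iris.target
print(iris.feature_names)    # ['sepal length (cm)', 'sepal width (cm)', ...]
print(X.shape, y.shape)      # (150, 4) (150,)

# Quick scatter plot of the first two features, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()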

Adding Noise and Splitting Data

The dataset contains four attributes. We add some noisy features to the dataset to test the effectiveness of the various feature selection techniques.

After adding the noise, the dataset has 14 attributes. We must split the data before applying any feature selection technique, because features should be chosen using information from the training set only, not the entire dataset. To evaluate both the feature selection and the model, we keep a portion of the data aside as a test set; its information stays hidden while we select features and train the model.

The feature selection will therefore be determined using X_train and y_train.
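A minimal sketch of this step, assuming ten uniform noise columns are appended to reach the 14 attributes mentioned above (the seed and the split ratio are assumptions):

# Append 10 random noise columns to the 4 original features (4 + 10 = 14 attributes)
rng = np.random.RandomState(42)                  # assumed seed for reproducibility
E = rng.uniform(0, 0.1, size=(X.shape[0], 10))
X_noisy = np.hstack((X, E))

# Split before feature selection so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=42)   # assumed 70/30 split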

Variance Threshold

The Variance Threshold method is a simple baseline approach to feature selection. It removes all features whose variance falls below a given threshold; by default, it removes only zero-variance features. Because our dataset has no zero-variance feature, the data is unaffected in this case. To learn more, see the scikit-learn documentation on VarianceThreshold.
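A minimal sketch of applying VarianceThreshold with its default threshold:

# With the default threshold=0.0, only zero-variance features are removed;
# our data has none, so the number of columns stays the same
selector = VarianceThreshold()                   # threshold=0.0 by default
X_train_vt = selector.fit_transform(X_train)
print(X_train.shape, X_train_vt.shape)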

Univariate Feature Selection

  • In univariate feature selection, the best features are chosen using univariate statistical tests.
  • Each feature is compared to the target variable to check whether there is a statistically significant relationship between them.
  • We ignore the other features while analyzing the relationship between one feature and the target variable. That is why it is called ‘univariate’.
  • Each feature gets its own test score.
  • Finally, all of the test scores are compared, and the features with the highest scores are selected.
  • These objects accept a scoring function that returns univariate scores and p-values (or only scores, in the case of SelectKBest and SelectPercentile):

For regression: f_regression, mutual_info_regression

For classification: chi2, f_classif, mutual_info_classif

  1. f_classif — also known as the ANOVA F-test.
  2. chi2 — the chi-squared test statistic.
  3. mutual_info_classif — mutual information; counterparts exist for both classification (mutual_info_classif) and regression (mutual_info_regression).
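A minimal sketch of running these scoring functions through SelectKBest (keeping k=4 features is an assumption matching the four real attributes):

# Score every feature against the target and keep the k highest-scoring ones
for score_func in (f_classif, chi2, mutual_info_classif):
    selector = SelectKBest(score_func=score_func, k=4)
    selector.fit(X_train, y_train)
    print(score_func.__name__, selector.get_support(indices=True))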

Recursive Feature Elimination

Recursive feature elimination (RFE) selects features by recursively examining smaller and smaller sets of features given an external estimator that gives weights to features (e.g., the coefficients of a linear model).

The estimator is first trained on the initial set of features, and the importance of each feature is obtained either through a coef_ attribute or a feature_importances_ attribute. The least important features are then pruned from the current set. This procedure is repeated recursively on the pruned set until the desired number of features to select is reached.
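A minimal sketch of RFE, using logistic regression coefficients as the importance measure (the estimator and the number of features to keep are assumptions):

# Recursively drop the least important feature until 4 remain
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4)
rfe.fit(X_train, y_train)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected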

Principal Component Analysis (PCA)

Because PCA produces a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales. Even though all features in the Iris dataset were measured in centimeters, we still transform the data onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.
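A minimal sketch of the standardization step (StandardScaler and PCA are not in the import block above, so they are imported here):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Transform the 4 original Iris features onto unit scale (mean=0, variance=1)
X_scaled = StandardScaler().fit_transform(X)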

PCA Projection to 2D

The original dataset has four columns (sepal length, sepal width, petal length, and petal width). The code in this section transforms the four-dimensional data into two dimensions. The new components represent the two principal axes of variation.

finalDf is the final DataFrame used for plotting, created by concatenating the component DataFrame and the target along axis=1.
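A minimal sketch of the 2D projection and the finalDf construction described above (pandas is assumed for the DataFrames):

import pandas as pd

pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(principal_components,
                           columns=['principal component 1', 'principal component 2'])

# Concatenate the components and the target along axis=1 for plotting
finalDf = pd.concat([principalDf, pd.Series(y, name='target')], axis=1)
print(pca.explained_variance_ratio_)   # share of variance captured by each component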

PCA Projection to 3D

The original dataset has four columns (sepal length, sepal width, petal length, and petal width). The code in this section transforms the four-dimensional data into three dimensions. The new components represent the three principal axes of variation.
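A minimal sketch of the 3D projection and plot, assuming a recent matplotlib with the built-in 3d projection:

pca3 = PCA(n_components=3)
components_3d = pca3.fit_transform(X_scaled)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(components_3d[:, 0], components_3d[:, 1], components_3d[:, 2], c=y)
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
plt.show()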

Differences Between Before and After Using Feature Selection

  1. Before using Feature Selection

2. After using Feature Selection

Precision, recall, F1-score, and accuracy are nearly the same in both outcomes, even though the second model uses far fewer features. This illustrates the value of feature selection: comparable performance can be achieved with a much smaller set of features.
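The original results are shown as screenshots in the post; a minimal sketch of how such a comparison can be produced (the random forest classifier and the SelectKBest selector are assumptions here):

# Train on all 14 features
clf_all = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf_all.predict(X_test)))

# Train on the 4 features kept by univariate selection
selector = SelectKBest(score_func=f_classif, k=4).fit(X_train, y_train)
clf_sel = RandomForestClassifier(random_state=42).fit(selector.transform(X_train), y_train)
y_pred = clf_sel.predict(selector.transform(X_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))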

Summary

In this blog, I analyzed and compared the outcomes of several feature selection techniques on the same dataset. The model performs just as well when only the features retained after feature selection are used as it does when all of the features are used for training. After feature selection, PCA was used to visualize the data frame in 2D and 3D with a reduced number of components.
