Background on Dimension Reduction
When we reduce the dimensionality of a dataset, we lose some percentage (usually 1%-15%, depending on the number of components or features we keep) of the variability in the original data. Even so, dimension reduction offers the following advantages.
- It helps prevent overfitting, a phenomenon in which the model learns the training dataset too well and fails to generalize to unseen real-world data.
- Fewer dimensions mean shorter training times and lower computational cost, which improves the overall performance of machine learning algorithms.
- Dimensionality reduction is extremely useful for data visualization. Data in 2 or 3 dimensions is easier to visualize.
- Dimensionality reduction helps remove noise from the data.
As mentioned above, dimension reduction methods can be classified into two categories.
1. Feature Selection
a. Variance Seeking
The variance method of dimension reduction selects the subset of features that capture the most variance in the data, reducing the dimensionality while retaining as much information as possible.
There are a number of ways to select the most important features using the variance method. One common approach is to calculate the variance of each feature and select the features with the highest variance. Another approach is to use a feature selection algorithm, such as mutual information or the ANOVA F-test, to identify the most important features.
The variance method of dimension reduction is often used in combination with other dimension reduction techniques, such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA), to further reduce the dimensionality of the data.
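As a concrete illustration, here is a minimal sketch of variance-based feature selection using scikit-learn's VarianceThreshold; the synthetic features and the threshold value of 0.5 are assumptions made purely for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(0, 5.0, 100),   # high-variance feature
    rng.normal(0, 1.0, 100),   # moderate-variance feature
    np.full(100, 3.0),         # constant feature (carries no information)
])

# Keep only the features whose variance exceeds the chosen threshold.
selector = VarianceThreshold(threshold=0.5)
X_reduced = selector.fit_transform(X)

print(selector.variances_)  # variance of each original feature
print(X_reduced.shape)      # (100, 2): the constant column is dropped
```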
b. Backward Elimination
This method eliminates (removes) features from a dataset through a recursive feature elimination (RFE) process. The algorithm first trains the model on the initial set of features in the dataset and calculates its performance (usually the accuracy score for a classification model and RMSE for a regression model). It then drops one feature (variable) at a time, retrains the model on the remaining features, and recalculates the performance score, keeping the removal that hurts performance the least. The algorithm keeps eliminating features until it detects only a small (or no) change in the model's performance score, and stops there.
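Below is a minimal sketch of backward elimination using scikit-learn's SequentialFeatureSelector with direction="backward"; the dataset, estimator, scoring metric, and target number of features are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Start from all 30 features and drop one at a time, keeping the subset
# with the best cross-validated accuracy at each step.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=10,
    direction="backward",
    scoring="accuracy",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the retained features
```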
c. Forward Selection
This method can be considered the opposite of backward elimination. Instead of eliminating features recursively, the algorithm first trains the model on a single feature in the dataset and calculates its performance (usually the accuracy score for a classification model and RMSE for a regression model). It then adds (selects) one feature (variable) at a time, retrains the model on the growing feature set, and recalculates the performance score. The algorithm keeps adding features until it detects only a small (or no) change in the model's performance score, and stops there.
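The same selector can run in the opposite direction. Here is a minimal sketch of forward selection for a regression model, scored with (negated) RMSE; the synthetic dataset, estimator, and stopping size are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Start from an empty set and add the feature that most improves RMSE
# (negated so that higher is better), stopping at the requested size.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,
    direction="forward",
    scoring="neg_root_mean_squared_error",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```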
d. Important Features from the Decision Trees of a Random Forest
Random forest is a tree-based ensemble model that is widely used for regression and classification tasks on non-linear data. It can also be used for feature selection via its built-in feature_importances_ attribute, which scores each feature using the 'gini' criterion (a measure of the quality of the splits at internal nodes) while the model is being trained.
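A minimal sketch of ranking features with a random forest's feature_importances_ follows; the wine dataset and the number of trees are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ holds the mean impurity decrease per feature across all trees.
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order[:5]:  # the five most important features
    print(f"{data.feature_names[idx]}: {forest.feature_importances_[idx]:.3f}")
```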
2. Feature Extraction
Linear Algorithms
a. Principal component analysis (PCA): This method projects the data onto a lower-dimensional space by identifying the directions of maximum variance in the data.
b. Linear discriminant analysis (LDA): This method projects the data onto a lower-dimensional space while maximizing the separation between different classes in the data.
c. Singular value decomposition (SVD): This method decomposes the data matrix into the product of three matrices, which can be used to identify the principal components of the data.
d. Independent component analysis (ICA): This method seeks to identify statistically independent latent factors that underlie the observed data.
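For a quick feel of these linear methods, here is a minimal side-by-side sketch that projects the same data down to two components with each one; the iris dataset and the component count are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                            # directions of max variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # max class separation (needs y)
X_svd = TruncatedSVD(n_components=2).fit_transform(X)                   # truncated matrix factorization
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)        # independent latent factors

print(X_pca.shape, X_lda.shape, X_svd.shape, X_ica.shape)  # all (150, 2)
```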
Non-Linear Algorithms
e. Autoencoders: These are neural network architectures that are trained to reconstruct the input data from a lower-dimensional representation, effectively learning a compressed representation of the data.
f. Kernel PCA: This method extends PCA to non-linear data by using a kernel function to map the data into a higher-dimensional space before performing PCA.
g. t-distributed stochastic neighbor embedding (t-SNE): This method projects the data onto a lower-dimensional space while preserving the local structure of the data.
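The sketch below shows two of these non-linear methods, Kernel PCA and t-SNE, on a toy two-moons dataset (autoencoders are omitted to keep the example short); the kernel, gamma, and perplexity values are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Kernel PCA: implicitly map the data through an RBF kernel, then apply PCA.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15).fit_transform(X)

# t-SNE: embed the data while preserving local neighborhood structure (mainly for visualization).
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_kpca.shape, X_tsne.shape)  # both (200, 2)
```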
These are just a few of the many methods available for dimension reduction in machine learning. The appropriate method to use will depend on the specific characteristics of the data and the goals of the analysis.
Why PCA?
With everything said and done, the usual choice for feature extraction is Principal Component Analysis (PCA). Why? Because it is simple and runs without much hyperparameter tuning. One reason for its popularity is that it is relatively simple to implement and understand, as it involves finding the eigenvectors and eigenvalues of the covariance matrix of the data. Another reason is that it has been well studied and has a solid theoretical foundation.
PCA is also computationally efficient and can handle large datasets, making it suitable for many practical applications. In addition, it has been shown to work well on a wide range of data types, including continuous data and, with suitable encodings, categorical and binary data.
Another reason PCA is popular is that its results are easy to interpret: the principal components are ranked by their explained variance, which lets users easily see which components (and the features that load on them) account for most of the variance in the data.
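As a minimal sketch of this, the snippet below fits PCA and prints the explained-variance ratio of each component; the breast-cancer dataset and the choice of five components are assumptions made only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=5).fit(X_scaled)
# Components are ordered by the fraction of total variance they explain.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance retained
```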
Overall, the simplicity, efficiency, and versatility of PCA make it a popular choice for dimension-reduction tasks. However, it's important to keep in mind that other dimension-reduction techniques may be more suitable for certain tasks, depending on the characteristics of the data and the requirements of the application. For example, Linear Discriminant Analysis (LDA) is a very solid technique that performs better than PCA in many cases. Both PCA and LDA reduce the number of dimensions in a dataset while retaining as much information as possible; unlike PCA, though, the main goal of LDA is to maximize the separation between classes while minimizing the variance within each class. The extra input of the target variable gives LDA an advantage that shows up as improved performance, particularly on classification datasets.
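To make the contrast concrete, here is a minimal sketch comparing PCA and LDA as preprocessing steps for the same classifier; the wine dataset, the naive Bayes classifier, and the two-component setting are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_wine(return_X_y=True)

# PCA is unsupervised: it ignores y when finding directions of maximum variance.
pca_clf = make_pipeline(PCA(n_components=2), GaussianNB())
# LDA is supervised: it uses y to maximize between-class separation.
lda_clf = make_pipeline(LinearDiscriminantAnalysis(n_components=2), GaussianNB())

print("PCA + NB accuracy:", cross_val_score(pca_clf, X, y, cv=5).mean())
print("LDA + NB accuracy:", cross_val_score(lda_clf, X, y, cv=5).mean())
```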
Below are some areas to consider when choosing a dimension-reduction technique.
1. Ensemble dimension reduction: Using feature extraction on top of feature selection, which could further increase the performance of machine learning algorithms (see the sketch at the end of this section).
2. The type of data: Different dimension reduction techniques are better suited to different types of data. For example, Principal Component Analysis (PCA) is a good choice for continuous data, while Linear Discriminant Analysis (LDA) is better suited when the data comes with categorical class labels.
3. The goal of the analysis: Different dimension reduction techniques have different goals. Some techniques, such as PCA, aim to maximize the variance in the data, while others, such as LDA, aim to maximize the separation between different classes. It's important to choose a technique that aligns with the goals of the analysis.
4. The number of dimensions: Some dimension reduction techniques are better suited to high-dimensional data, while others are more effective for low-dimensional data. For example, t-SNE is a good choice for visualizing high-dimensional data, while PCA is more suitable for reducing the dimensionality of large datasets.
5. The complexity of the data: Some dimension reduction techniques, such as PCA, are relatively simple to implement and understand, while others, such as Independent Component Analysis (ICA), are more complex and may require more expertise to use effectively. It's important to choose a technique that is appropriate for the level of complexity of the data.
Overall, it's important to carefully consider the characteristics of the data and the goals of the analysis when choosing a dimension reduction technique. It may be necessary to try multiple techniques and compare the results to determine the most suitable technique for the task at hand.
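Closing with the first point above, here is a minimal sketch of "ensemble" dimension reduction: feature selection followed by feature extraction inside one pipeline. The dataset, the variance threshold, the component count, and the classifier are all illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = make_pipeline(
    VarianceThreshold(threshold=0.01),  # feature selection: drop near-constant columns
    StandardScaler(),                   # put the surviving features on a common scale
    PCA(n_components=10),               # feature extraction on what remains
    LogisticRegression(max_iter=5000),
)
print(cross_val_score(pipeline, X, y, cv=5).mean())  # accuracy of the full pipeline
```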