Saturday, December 31, 2022

Dimension Reduction in Machine Learning. Why PCA?

Background on Dimension Reduction

Dimensionality Reduction in Machine Learning is the process of reducing the number of dimensions in the data by excluding less useful features (Feature Selection) or transforming the data into lower dimensions (Feature Extraction). Putting it simply, Dimension Reduction refers to the process of reducing the number of attributes in a dataset while keeping as much of the variation in the original dataset as possible.

When we reduce the dimensionality of a dataset, we lose some percentage (usually 1%-15%, depending on the number of components or features we keep) of the variability in the original data. Even so, it offers the following advantages:

  • It helps prevent overfitting. Overfitting is a phenomenon in which the model learns the training dataset too well and fails to generalize to unseen real-world data.
  • Fewer dimensions mean less training time and fewer computational resources, and they improve the overall performance of machine learning algorithms.
  • Dimensionality reduction is extremely useful for data visualization, since data in 2 or 3 dimensions is easier to visualize.
  • Dimensionality reduction can remove noise in the data.

 



As mentioned above, Dimension Reduction methods can be classified into two categories.

1. Feature Selection

a. Variance Seeking

The variance method of dimension reduction reduces the number of dimensions in a dataset by selecting the subset of features that capture the most variance in the data. The goal is to reduce the dimensionality of the data while retaining as much information as possible.

 

There are a number of ways to select the most important features using the variance method. One common approach is to calculate the variance of each feature and select the features with the highest variance. Another approach is to use a feature selection algorithm, such as mutual information or the ANOVA F-test, to identify the most important features.
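As a minimal sketch of the first approach, the snippet below uses scikit-learn's VarianceThreshold to keep only the features whose variance exceeds a cutoff; the toy data and the threshold value are illustrative assumptions, not recommendations.

# A minimal sketch of variance-based feature selection with scikit-learn.
# The toy data and the threshold value (0.1) are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.1, 1.0],
              [0.1, 0.3, 1.0],
              [0.0, 1.7, 1.0],
              [0.1, 0.9, 1.0]])   # toy data: the first column barely varies, the last is constant

selector = VarianceThreshold(threshold=0.1)   # keep features whose variance exceeds 0.1
X_reduced = selector.fit_transform(X)

print(selector.variances_)        # variance of each original feature
print(selector.get_support())     # boolean mask of the features that were kept
print(X_reduced.shape)            # fewer columns than the original X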

 

The variance method of dimension reduction is often used in combination with other dimension reduction techniques, such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA), to further reduce the dimensionality of the data.

b. Backward Elimination

This method removes (eliminates) features from a dataset through a recursive elimination process, often called sequential backward selection. The algorithm first trains the model on the full set of features in the dataset and calculates the model's performance (usually the accuracy score for a classification model and RMSE for a regression model). Then it drops one feature (variable) at a time, retrains the model on the remaining features, and recalculates the performance score. It keeps eliminating features until it detects only a small (or no) change in the model's performance, and stops there (see the sketch after the forward selection description below).

c. Forward Selection

This method can be considered the opposite of backward elimination. Instead of eliminating features recursively, the algorithm first trains the model on a single feature in the dataset and calculates the model's performance (usually the accuracy score for a classification model and RMSE for a regression model). Then it adds (selects) one feature (variable) at a time, retrains the model on the selected features, and recalculates the performance score. It keeps adding features until it detects only a small (or no) change in the model's performance, and stops there.
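Below is a minimal sketch of both forward selection and backward elimination using scikit-learn's SequentialFeatureSelector. It assumes scikit-learn 1.1 or newer (where n_features_to_select="auto" and the tol stopping criterion are available); the dataset, estimator, and tol values are illustrative assumptions.

# A minimal sketch of forward selection and backward elimination with scikit-learn.
# Assumes scikit-learn >= 1.1; the dataset, estimator, and tol values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward selection: start with no features and add one at a time while the
# cross-validated score still improves by at least tol.
forward = SequentialFeatureSelector(
    model, n_features_to_select="auto", tol=0.001, direction="forward", cv=5)
forward.fit(X, y)

# Backward elimination: start with all features and drop one at a time as long as
# the score does not fall by more than |tol| (a negative tol tolerates a small loss).
backward = SequentialFeatureSelector(
    model, n_features_to_select="auto", tol=-0.001, direction="backward", cv=5)
backward.fit(X, y)

print("forward selection kept:", forward.get_support().sum(), "features")
print("backward elimination kept:", backward.get_support().sum(), "features")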

d. Important Features from a Random Forest

Random forest is a tree-based ensemble model that is widely used for regression and classification tasks on non-linear data. It can also be used for feature selection through its built-in feature_importances_ attribute, which scores each feature by how much it improves the quality of the splits at internal nodes (by default, the 'gini' impurity criterion) while the model is being trained.
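Here is a minimal sketch of this approach using scikit-learn's RandomForestClassifier; the iris dataset, the number of trees, and the decision to keep the top two features are illustrative assumptions.

# A minimal sketch of feature selection via random-forest feature importances.
# The dataset and the number of features kept are illustrative assumptions.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# feature_importances_ holds one impurity-based (gini) score per feature.
importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False))

# Keep, say, the two most important features.
top_features = importances.sort_values(ascending=False).index[:2]
X_selected = pd.DataFrame(X, columns=data.feature_names)[top_features]
print(X_selected.shape)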

 

2. Feature Extraction


Feature extraction techniques aim to transform the data from a high-dimensional space into a lower-dimensional space while preserving as much information as possible. These techniques can be either linear or non-linear, and they often involve creating new features from the original features through a mathematical transformation. It is best to visualize the data before this step to observe its shape (linear or non-linear), though it is important to realize that we can only visualize data in 2 or 3 dimensions. For that, we can first reduce the number of dimensions with one of the feature selection methods above (for example, the variance method) and visualize the most important dimensions.

Linear Algorithms

a. Principal component analysis (PCA): This method projects the data onto a lower-dimensional space by identifying the directions of maximum variance in the data.

b. Linear discriminant analysis (LDA): This method projects the data onto a lower-dimensional space while maximizing the separation between different classes in the data.

c. Singular value decomposition (SVD): This method decomposes the data matrix into the product of three matrices, which can be used to identify the principal components of the data.

d. Independent component analysis (ICA): This method seeks to identify independent latent factors that explain the variance in the data.
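To make the contrast concrete, below is a minimal sketch that applies each of these four linear methods to the same labeled dataset with scikit-learn; the wine dataset, the scaling step, and the choice of two components are illustrative assumptions. Note that LDA is the only one of the four that uses the target labels y.

# A minimal sketch comparing the linear feature-extraction methods above.
# The dataset and the number of components (2) are illustrative assumptions.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scaling matters for variance-based methods

X_pca = PCA(n_components=2).fit_transform(X)                  # directions of maximum variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses the labels y
X_svd = TruncatedSVD(n_components=2).fit_transform(X)         # plain SVD of the (scaled) data matrix
X_ica = FastICA(n_components=2, max_iter=1000, random_state=0).fit_transform(X)  # independent components

for name, Z in [("PCA", X_pca), ("LDA", X_lda), ("SVD", X_svd), ("ICA", X_ica)]:
    print(name, Z.shape)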

Non-Linear Algorithms

e. Autoencoders: These are neural network architectures that are trained to reconstruct the input data from a lower-dimensional representation, effectively learning a compressed representation of the data.

f. Kernel PCA: This method extends PCA to non-linear data by using a kernel function to map the data into a higher-dimensional space before performing PCA.

g. t-distributed stochastic neighbor embedding (t-SNE): This method projects the data onto a lower-dimensional space while preserving the local structure of the data.
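As a minimal sketch of why the non-linear methods matter, the snippet below contrasts plain PCA with Kernel PCA on a two-moons dataset that no linear projection can untangle; the dataset, the RBF kernel, and the gamma value are illustrative assumptions.

# A minimal sketch contrasting linear PCA with Kernel PCA on non-linear data.
# The dataset, kernel, and gamma value are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# Linear PCA can only rotate/project the data; the two moons stay entangled.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel implicitly maps the data into a higher-dimensional
# feature space first, where the two moons become much easier to separate.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15).fit_transform(X)

print(X_pca.shape, X_kpca.shape)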

 


These are just a few of the many methods available for dimension reduction in machine learning. The appropriate method to use will depend on the specific characteristics of the data and the goals of the analysis.


 

Why PCA?

With everything said and done, the simple choice for anyone aiming for Feature Extraction is Principal Component Analysis (PCA). Why? Because it is simple and runs without much hyperparameter tuning. One reason for its popularity is that it is relatively easy to implement and understand, as it involves finding the eigenvectors and eigenvalues of the covariance matrix of the data. Another reason is that it has been well studied and has a solid theoretical foundation.
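To make that concrete, here is a minimal sketch that finds the principal components directly from the eigenvectors and eigenvalues of the covariance matrix and checks the explained-variance ratios against scikit-learn's PCA; the iris dataset is an illustrative choice.

# A minimal sketch: PCA as the eigen-decomposition of the covariance matrix,
# checked against scikit-learn's PCA. The iris dataset is an illustrative assumption.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_centered = X - X.mean(axis=0)

# Eigenvectors/eigenvalues of the covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)        # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]                   # sort by decreasing variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio_manual = eigenvalues / eigenvalues.sum()
X_projected = X_centered @ eigenvectors[:, :2]          # project onto the top two components

pca = PCA().fit(X)                                      # keep all four components
print(explained_ratio_manual)
print(pca.explained_variance_ratio_)                    # should match the manual computation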

PCA is also computationally efficient and can handle large datasets, making it suitable for use in many practical applications. In addition, it works on a wide range of data types, including continuous data and, with appropriate encoding, categorical and binary data.

Another reason that PCA is popular is that its results are easy to interpret: the principal components are ranked by the variance they explain, and by inspecting the component loadings users can see which original features contribute most to that variance.

Overall, the simplicity, efficiency, and versatility of PCA make it a popular choice for dimension-reduction tasks. However, it's important to keep in mind that other dimension-reduction techniques may be more suitable for certain tasks, depending on the characteristics of the data and the requirements of the application. For example, Linear Discriminant Analysis (LDA) is a very solid technique that performs better than PCA in many cases. Both PCA and LDA reduce the number of dimensions in a dataset while retaining as much information as possible; unlike PCA, however, the main goal of LDA is to maximize the separation between classes in the data while minimizing the variance within each class. Because LDA also takes the target variable as an input, it has extra information to work with, which often shows up as improved performance, particularly on classification datasets.
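As a rough illustration of that comparison, the sketch below reduces a labeled dataset to two dimensions with PCA and with LDA and scores the same classifier on top of each; the digits dataset, the logistic regression classifier, and the choice of two components are illustrative assumptions, and the outcome will vary with the data.

# A minimal sketch comparing PCA and LDA as 2-D preprocessing for the same classifier.
# The dataset, classifier, and number of components are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# PCA ignores the labels; LDA uses them to maximize class separation.
pca_clf = make_pipeline(StandardScaler(), PCA(n_components=2),
                        LogisticRegression(max_iter=1000))
lda_clf = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2),
                        LogisticRegression(max_iter=1000))

print("PCA (2 components):", cross_val_score(pca_clf, X, y, cv=5).mean())
print("LDA (2 components):", cross_val_score(lda_clf, X, y, cv=5).mean())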

Below are some areas to consider when choosing dimension-reduction techniques.

 

1. Ensemble dimension reduction: Using feature extraction on top of feature selection can further increase the performance of machine learning algorithms (see the sketch after this list).

2. The type of data: Different dimension reduction techniques are better suited to different types of data. For example, Principal Component Analysis (PCA) is a good choice for continuous data, while Linear Discriminant Analysis (LDA) is better suited to labeled (classification) data.

3. The goal of the analysis: Different dimension reduction techniques have different goals. Some techniques, such as PCA, aim to capture the maximum variance in the data, while others, such as LDA, aim to maximize the separation between different classes. It's important to choose a technique that aligns with the goals of the analysis.

4. The number of dimensions: Some dimension reduction techniques are better suited to high-dimensional data, while others are more effective for low-dimensional data. For example, t-SNE is a good choice for visualizing high-dimensional data, while PCA is more suitable for reducing the dimensionality of large datasets.

5. The complexity of the data: Some dimension reduction techniques, such as PCA, are relatively simple to implement and understand, while others, such as Independent Component Analysis (ICA), are more complex and may require more expertise to use effectively. It's important to choose a technique that is appropriate for the level of complexity of the data.
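As promised in point 1, here is a minimal sketch of ensemble dimension reduction that chains a feature selection step with a feature extraction step in a single scikit-learn pipeline; the breast cancer dataset, the variance threshold, and the number of components are illustrative assumptions.

# A minimal sketch of "ensemble" dimension reduction: feature selection followed
# by feature extraction in one pipeline. Dataset and parameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = make_pipeline(
    VarianceThreshold(threshold=0.01),   # feature selection: drop near-constant features
    StandardScaler(),                    # scale before PCA
    PCA(n_components=5),                 # feature extraction on the surviving features
    LogisticRegression(max_iter=1000),
)

print("CV accuracy:", cross_val_score(pipeline, X, y, cv=5).mean())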

 

Overall, it's important to carefully consider the characteristics of the data and the goals of the analysis when choosing a dimension reduction technique. It may be necessary to try multiple techniques and compare the results to determine the most suitable technique for the task at hand.