Dimensionality reduction: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 07:21, 5 May 2010

In statistics, dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.

Feature selection

Feature selection approaches try to find a subset of the original variables (also called features or attributes). Two strategies are filter approaches (e.g. ranking features by information gain) and wrapper approaches (e.g. a search guided by the accuracy of a model built on each candidate subset). See also combinatorial optimization problems.
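A filter approach based on information gain can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes discrete features and labels, and the helper names (`entropy`, `information_gain`, `select_top_k`) and the toy data are made up for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each subset's entropy by the fraction of samples it holds.
        conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional

def select_top_k(X, y, k):
    """Return the indices of the k highest-gain features, plus all gains."""
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(gains)[::-1][:k], gains

# Toy data (hypothetical): feature 0 determines the label exactly,
# feature 1 is only weakly informative.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y = np.array([0, 0, 1, 1, 0, 1])
selected, gains = select_top_k(X, y, k=1)
```

Because each feature is scored independently of any downstream model, a filter is cheap to compute; a wrapper would instead retrain and evaluate a model for every candidate subset.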

In some cases, data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.

Feature extraction

Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal components analysis, but many non-linear techniques also exist.

The main linear technique for dimensionality reduction, principal components analysis (PCA), performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance matrix of the data (or the correlation matrix, if the variables are standardized first) is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can then be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behaviour of the system. The original space (whose dimension is the number of variables) has been reduced (with data loss, but ideally retaining the most important variance) to the space spanned by a few eigenvectors.
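The eigendecomposition procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: it uses the covariance matrix of mean-centered data, and the function name `pca` and the synthetic data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)        # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]        # (d, n_components)
    projected = X_centered @ components           # low-dimensional scores
    explained = eigvals[:n_components].sum() / eigvals.sum()
    return projected, components, explained

# Synthetic 3-D data (made up) that varies mostly along a single direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2.0 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])
Z, W, explained = pca(X, n_components=1)
```

Here `explained` is the fraction of total variance retained by the kept components, which is the quantity PCA maximizes for a given target dimension.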

Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique, called kernel PCA, is capable of constructing nonlinear mappings that maximize the variance in the data. Other prominent nonlinear techniques include manifold learning techniques such as locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.
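As a rough sketch of kernel PCA, the example below builds a kernel matrix, centers it in feature space, and takes its leading eigenvectors. The choice of a Gaussian (RBF) kernel, the value of `gamma`, the function name `rbf_kernel_pca`, and the two-ring test data are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma):
    """Embed the rows of X with kernel PCA using a Gaussian (RBF) kernel."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))   # clip rounding negatives

    # Center the kernel matrix, i.e. center the data in feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    eigvals, eigvecs = np.linalg.eigh(K_centered)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Projections of the training points onto the kernel principal components.
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

# Two noisy concentric rings (made-up data) that no linear map can separate.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=40)
radii = np.where(np.arange(40) < 20, 1.0, 3.0) + 0.05 * rng.normal(size=40)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
Z = rbf_kernel_pca(X, n_components=2, gamma=0.5)
```

Only the kernel matrix is ever decomposed, so the nonlinear feature space is never constructed explicitly; that is the practical content of the kernel trick.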

An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include classical multidimensional scaling (which is equivalent to PCA when the input distances are Euclidean), Isomap (which uses geodesic distances in the data space), diffusion maps (which use diffusion distances in the data space), t-SNE (which minimizes the divergence between distributions over pairs of points), and curvilinear component analysis.
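The relationship between classical multidimensional scaling and PCA can be checked numerically. The sketch below assumes Euclidean input distances; the function name `classical_mds` and the random data are hypothetical. It double-centers the squared distance matrix, embeds the points with its leading eigenvectors, and compares the result with PCA scores obtained from a singular value decomposition.

```python
import numpy as np

def classical_mds(D, n_components):
    """Classical (Torgerson) MDS from a matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Random data (made up) with a different spread along each axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# Euclidean distances between all pairs of rows.
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
Y_mds = classical_mds(D, n_components=2)

# PCA scores of the same data, computed from the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(X_centered, full_matrices=False)
Y_pca = U[:, :2] * S[:2]

# The two embeddings agree up to the sign of each axis.
same_up_to_sign = all(
    np.allclose(np.abs(Y_mds[:, j]), np.abs(Y_pca[:, j]), atol=1e-6)
    for j in range(2)
)
```

The agreement holds because double-centering squared Euclidean distances recovers the Gram matrix of the centered data, whose eigenvectors are the left singular vectors used by PCA. With non-Euclidean distances, as in Isomap, the two methods no longer coincide.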

A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feed-forward neural network with a bottleneck hidden layer. The training of deep autoencoders is typically performed using greedy layer-wise pre-training (e.g., using a stack of restricted Boltzmann machines), followed by a fine-tuning stage based on backpropagation.
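The bottleneck idea can be illustrated with a deliberately small example. The sketch below is an assumption-laden simplification: it trains a single-hidden-layer autoencoder from random initial weights with plain gradient descent, and omits the layer-wise pre-training used for deep networks. The function name `train_autoencoder`, the hyperparameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def train_autoencoder(X, n_hidden, lr=0.1, epochs=1000, seed=0):
    """Train a one-hidden-layer autoencoder by backpropagation.

    Returns an encoder function and the reconstruction loss per epoch.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_features))
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # encoder: bottleneck activations
        X_hat = H @ W2 + b2               # linear decoder
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))

        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_W2 = H.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_hidden = (g_out @ W2.T) * (1.0 - H ** 2)   # derivative of tanh
        g_W1 = X.T @ g_hidden
        g_b1 = g_hidden.sum(axis=0)

        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2

    def encode(A):
        return np.tanh(A @ W1 + b1)

    return encode, losses

# Synthetic 3-D data (made up) lying close to a 1-D line.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.outer(t, [1.0, -0.5, 0.5]) + 0.05 * rng.normal(size=(200, 3))
X = X - X.mean(axis=0)

encode, losses = train_autoencoder(X, n_hidden=1)
codes = encode(X)                         # 1-D representation of each row
```

The hidden layer has fewer units than the input has features, so the network can only reconstruct its input well by learning a compressed code; that code is the reduced representation.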

See also