Data Mining - Mehmed Kantardzic [44]
In the second part, features are evaluated one by one, starting from the end of the S2 ranked feature list. The monotonicity property suggests that the backward-elimination search strategy fits feature selection best. That is, one can start with the full feature set and successively eliminate features, one at a time, from the bottom of the ranked list if their interaction does not contribute to a better correlation with the output. The criterion could, for example, be based on a covariance matrix. If a feature, together with the previously selected features, influences the output by less than a threshold value (which could be expressed through some covariance-matrix factor), the feature is removed; otherwise it is retained. Backward elimination allows every feature to be evaluated together with the features it may interact with. The algorithm repeats until all features in the S2 list are checked.
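The elimination loop described above can be sketched in a few lines. This is a hypothetical illustration, not the book's exact procedure: the scoring criterion (correlation between the output and a least-squares fit on a feature subset), the threshold value, and all names are assumptions.

```python
import numpy as np

def subset_score(X, y, cols):
    """Hypothetical criterion: correlation between the output y and a
    least-squares fit of y on the feature subset `cols`."""
    A = np.column_stack([X[:, cols], np.ones(len(y))])  # add intercept
    pred = A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(pred, y)[0, 1]

def backward_eliminate(X, y, ranked, threshold=0.01):
    """Backward elimination over a ranked feature list (sketch).

    ranked: feature indices ordered best-first (the S2 ranked list);
    threshold: assumed minimum gain in correlation with the output
    that a feature must contribute to stay selected.
    """
    selected = list(ranked)
    # Evaluate features one at a time from the bottom of the ranked list.
    for f in reversed(ranked):
        remaining = [i for i in selected if i != f]
        if not remaining:
            break
        # Remove f if, together with the other currently selected features,
        # it improves the correlation with the output by less than threshold.
        if subset_score(X, y, selected) - subset_score(X, y, remaining) < threshold:
            selected = remaining
    return selected
```

On data where the output depends only on the first two features, such a sketch retains those two and drops the irrelevant one from the bottom of the list.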
3.2.2 Feature Extraction
The art of data mining starts with the design of appropriate data representations. Better performance is often achieved using features derived from the original input. Building a feature representation is an opportunity to incorporate domain knowledge into the data and can be very application-specific. Transforming the input set into a new, reduced set of features is called feature extraction. If the features extracted are carefully chosen, it is expected that the new feature set will extract the relevant information from the input data in order to perform the desired data-mining task using this reduced representation.
Feature-transformation techniques aim to reduce the dimensionality of data to a small number of dimensions that are linear or nonlinear combinations of the original dimensions. Therefore, we distinguish two major types of dimension-reduction methods: linear and nonlinear. Linear techniques result in k new derived features instead of the initial p (k ≪ p). Each component of a new feature is a linear combination of the original features:

F_i = w_i1 x_1 + w_i2 x_2 + … + w_ip x_p ,   i = 1, 2, …, k

or, in matrix form,

F = W x

where W_{k×p} is the linear-transformation weight matrix, x is the vector of p original features, and F is the vector of k derived features. Such linear techniques are simpler and easier to implement than more recent methods based on nonlinear transforms.
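For concreteness, the matrix form can be exercised in a few lines of numpy; the weight values in W and the sample x below are made up purely for illustration.

```python
import numpy as np

# Hypothetical weight matrix W: k = 2 derived features from p = 3 originals.
W = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])      # shape (k, p) = (2, 3)

x = np.array([2.0, 4.0, 6.0])        # one sample with p = 3 original features

F = W @ x                            # F = Wx: the k = 2 derived features
# F is [3.0, 5.0]: each new feature averages two of the originals
```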
In general, the process reduces feature dimensions by combining features instead of by deleting them. It results in a new set of fewer features with totally new values. One well-known approach is merging features by principal components. The features are examined collectively, merged, and transformed into a new set of features that, it is hoped, will retain the original information content in a reduced form. The basic transformation is linear. Given p features, they can be transformed into a single new feature F′ by the simple application of weights:
Most likely, a single set of weights, w(j), will not be an adequate transformation for a complex multidimensional data set, and up to p transformations are generated. Each vector of p weights w used to combine the original features is called a principal component, and it defines a new transformed feature. The first vector of weights is expected to be the strongest, and the remaining vectors are ranked according to their expected usefulness in reconstructing the original data. Eliminating the bottom-ranked transformations reduces the dimensionality. The complexity of computation increases significantly with the number of features. The main weakness of the method is its a priori assumption of a linear model that maximizes the variance of features. Formalization of PCA and the basic steps of the corresponding algorithm for selecting features are given in Section 3.4.
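The principal-components idea can be illustrated with a short numpy routine. This is a minimal sketch via eigendecomposition of the covariance matrix; the function name and shapes are assumptions, and the book's formal algorithm appears in Section 3.4.

```python
import numpy as np

def pca_transform(X, k):
    """Project data onto the first k principal components (sketch).

    X: (n_samples, p) data matrix. Returns the (n_samples, k) reduced
    representation and the (k, p) weight matrix W of component vectors.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]          # rank components by variance
    W = vecs[:, order[:k]].T                # top-k weight vectors, shape (k, p)
    return Xc @ W.T, W
```

On data that is essentially one-dimensional (a line plus small noise), keeping only the first component preserves nearly all of the variance, which is exactly the ranking-and-elimination step described above.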
Examples of additional methodologies in feature extraction include factor analysis (FA), independent component analysis (ICA), and multidimensional scaling (MDS). The last one is probably the most popular, and it represents the basis for some new, recently developed techniques. Given n samples in a p-dimensional space and an n × n matrix of distance measures among the samples, MDS produces a k-dimensional (k ≪ p) representation of the