1. The eigenvalues of the n × n covariance matrix S are λ1, λ2, … , λn, where λ1 ≥ λ2 ≥ … ≥ λn ≥ 0.
2. The eigenvectors e1, e2, … , en correspond to eigenvalues λ1, λ2, … , λn, and they are called the principal axes.
Principal axes are new, transformed axes of the n-dimensional space; the new variables are uncorrelated, and the variance of the ith component equals the ith eigenvalue. Because the λi's are sorted in decreasing order, most of the information about the data set is concentrated in the first few principal components. The fundamental question is how many principal components are needed to obtain a good representation of the data. In other words, what is the effective dimensionality of the data set? The easiest way to answer this question is to analyze the proportion of variance. Dividing the sum of the first m eigenvalues by the sum of all the variances (all eigenvalues) gives a measure of the quality of representation based on the first m principal components. The result is expressed as a percentage; if, for example, the projection accounts for over 90% of the total variance, it is considered good. More formally, the criterion for feature selection is based on the ratio of the sum of the m largest eigenvalues of S to the trace of S, that is, the fraction of the variance retained in the m-dimensional space. If the eigenvalues are labeled so that λ1 ≥ λ2 ≥ … ≥ λn, the ratio can be written as

R = (λ1 + λ2 + … + λm) / (λ1 + λ2 + … + λn)
When the ratio R is sufficiently large (greater than a chosen threshold value), analyses based on the subset of m features give a good initial approximation of the full n-dimensional space. This method is computationally inexpensive, but it requires characterizing the data with the covariance matrix S.
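As an illustration of this criterion, the short sketch below computes the eigenvalues of the covariance matrix and returns the smallest m whose retained-variance ratio R exceeds a given threshold. It is a minimal example assuming NumPy; the function name effective_dimensionality and its threshold argument are illustrative choices, not taken from the text.

```python
import numpy as np

def effective_dimensionality(X, threshold=0.95):
    """Estimate how many principal components retain at least
    `threshold` of the total variance (illustrative helper).

    X : (samples x features) data matrix.
    """
    # Covariance matrix S of the data (columns are features)
    S = np.cov(X, rowvar=False)

    # Eigenvalues of the symmetric matrix S, sorted in descending order
    eigenvalues = np.linalg.eigvalsh(S)[::-1]

    # Cumulative ratio R_m = (lambda_1 + ... + lambda_m) / trace(S)
    ratios = np.cumsum(eigenvalues) / np.sum(eigenvalues)

    # Smallest m whose retained-variance ratio reaches the threshold
    m = int(np.argmax(ratios >= threshold)) + 1
    return m, ratios
```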
We will use one example from the literature to show the advantages of PCA. The initial data set is the well-known Iris data set, available on the Internet for data-mining experimentation. It has four features, so every sample is a four-dimensional vector. The correlation matrix, calculated from the Iris data after normalization of all values, is given in Table 3.3.
TABLE 3.3. The Correlation Matrix for Iris Data
Based on the correlation matrix, the calculation of eigenvalues is straightforward (in practice, one of the standard statistical packages is usually used), and the final results for the Iris data are given in Table 3.4.
TABLE 3.4. The Eigenvalues for Iris Data
Feature Eigenvalue
Feature 1 2.91082
Feature 2 0.92122
Feature 3 0.14735
Feature 4 0.02061
By setting a threshold value of R* = 0.95, we choose the first two principal components as the reduced feature set for further data-mining analysis, because

R = (2.91082 + 0.92122) / (2.91082 + 0.92122 + 0.14735 + 0.02061) = 3.83204 / 4.00000 ≈ 0.958 > 0.95 = R*
For the Iris data, the first two principal components are therefore an adequate description of the characteristics of the data set. The third and fourth components have small eigenvalues and therefore contain very little of the variation; their influence on the information content of the data set is minimal. Additional analysis shows that, based on the reduced set of features, models built with different data-mining techniques have the same quality as with the original features (sometimes the results were even better).
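The numbers above can be checked with a few lines of Python. The sketch below assumes scikit-learn's bundled copy of the Iris data; the eigenvalues it prints may differ slightly in the last decimals from Table 3.4, depending on the data version and the normalization used.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the four-feature Iris data (150 samples x 4 features)
X = load_iris().data

# Correlation matrix of the normalized features (compare with Table 3.3)
C = np.corrcoef(X, rowvar=False)

# Eigenvalues in descending order (compare with Table 3.4)
eigenvalues = np.sort(np.linalg.eigvalsh(C))[::-1]
print("Eigenvalues:", np.round(eigenvalues, 5))

# Retained-variance ratio R for the first two principal components
R = eigenvalues[:2].sum() / eigenvalues.sum()
print("R for m = 2:", round(R, 3))   # roughly 0.96, above the 0.95 threshold
```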
The interpretation of the principal components can be difficult at times. Although they are uncorrelated features constructed as linear combinations of the original features and have some desirable properties, they do not necessarily correspond to meaningful physical quantities. In some cases such loss of interpretability is not acceptable to domain scientists, who then prefer other methods, usually feature-selection techniques.
3.6 VALUE REDUCTION
A reduction in the number of discrete values for a given feature is based on the second set of techniques in the data-reduction phase; these are the feature-discretization techniques. The task of feature-discretization techniques is to discretize the values of continuous