Data Mining - Mehmed Kantardzic [99]
Let us introduce the notation. Denote the contingency table as $X_{m \times n}$, with $x_{ji}$ the observed value in row j and column i. The row totals for the table are

$$X_{j+} = \sum_{i=1}^{n} x_{ji}$$

and they are valid for every row (j = 1, … , m). Similarly, we can define the column totals as

$$X_{+i} = \sum_{j=1}^{m} x_{ji}$$

for every column (i = 1, … , n). The grand total is defined as a sum of row totals:

$$X_{++} = \sum_{j=1}^{m} X_{j+}$$

or as a sum of column totals:

$$X_{++} = \sum_{i=1}^{n} X_{+i}$$

Using these totals we can calculate the contingency table of expected values under the assumption that there is no association between the row variable and the column variable. The expected values are

$$E_{ji} = \frac{X_{j+} \cdot X_{+i}}{X_{++}}, \qquad j = 1, \dots, m, \quad i = 1, \dots, n$$

and they are computed for every position in the contingency table. The result of this first step is a new table that consists only of expected values and has the same dimensions as the original table.
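The first step can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `expected_values` and the counts in `observed` are hypothetical and are not the data of Table 5.5.

```python
def expected_values(table):
    """Return the table of expected counts under the no-association assumption.

    `table` is a list of rows; every row must have the same length.
    """
    row_totals = [sum(row) for row in table]          # one total per row
    col_totals = [sum(col) for col in zip(*table)]    # one total per column
    grand_total = sum(row_totals)                     # equals sum(col_totals)
    # Expected value = (row total * column total) / grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical 2 x 2 table of observed counts (NOT the data of Table 5.5).
observed = [[10, 20],
            [30, 40]]
expected = expected_values(observed)
for row in expected:
    print(row)
```

The returned table has the same dimensions as the input, and its row and column totals match those of the observed table.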
For our example in Table 5.5, all sums (columns, rows, and grand total) are already represented in the contingency table. Based on these values we can construct the contingency table of expected values. The expected value at the intersection of the first row and the first column is the product of the first row total and the first column total, divided by the grand total:

$$E_{11} = \frac{X_{1+} \cdot X_{+1}}{X_{++}}$$
Similarly, we can compute the other expected values and the final contingency table with expected values will be as given in Table 5.6.
TABLE 5.6. A 2 × 2 Contingency Table of Expected Values for the Data Given in Table 5.5
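The next step in the analysis of categorical-attributes dependency is the application of the chi-squared test of association. The initial hypothesis H0 is the assumption that the two attributes are unrelated, and it is tested by Pearson’s chi-squared formula:

$$\chi^2 = \sum_{j=1}^{m} \sum_{i=1}^{n} \frac{(x_{ji} - E_{ji})^2}{E_{ji}}$$

where $x_{ji}$ is the observed value and $E_{ji}$ is the expected value in row j and column i.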
The greater the value of χ2, the greater the evidence against the hypothesis H0 is. For our example, comparing Tables 5.5 and 5.6, the test gives the following result:
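with the degrees of freedom (d.f.) for an m × n dimensional table computed as

$$\text{d.f.} = (m - 1)(n - 1)$$

which gives d.f. = (2 − 1)(2 − 1) = 1 for the 2 × 2 table of our example.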
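In general, the hypothesis H0 is rejected at the level of significance α if

$$\chi^2 \ge T(\alpha)$$

where T(α) is the threshold value from the χ2 distribution table usually given in textbooks on statistics. For our example, with d.f. = 1 for a 2 × 2 table, selecting α = 0.05 we obtain the threshold

$$T(0.05) = \chi^2(1 - \alpha, \text{d.f.}) = \chi^2(0.95, 1) = 3.84$$

A simple comparison shows that

$$\chi^2 \ge T(0.05) = 3.84$$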
and therefore, we can conclude that hypothesis H0 is rejected; the attributes analyzed in the survey show a statistically significant dependency. In other words, the attitude about abortion differs between the male and the female populations.
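The whole procedure (expected values, Pearson's statistic, degrees of freedom, and comparison with the threshold) can be sketched as follows. The function name `chi_squared_test`, the lookup table `CRITICAL_005`, and the counts in `observed` are assumptions made for illustration; the counts are not those of Table 5.5. The critical values are the standard χ2 thresholds for α = 0.05, and the sketch assumes that every row and column total is positive.

```python
# Standard chi-squared critical values for alpha = 0.05, keyed by d.f.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_squared_test(observed):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for j, row in enumerate(observed):
        for i, x in enumerate(row):
            # Expected count under the no-association hypothesis H0
            e = row_totals[j] * col_totals[i] / grand_total
            chi2 += (x - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Hypothetical 2 x 2 table of observed counts (NOT the data of Table 5.5).
observed = [[30, 20],
            [20, 30]]
chi2, df = chi_squared_test(observed)
reject_h0 = chi2 >= CRITICAL_005[df]   # reject H0 at alpha = 0.05?
print(chi2, df, reject_h0)
```

For larger tables or other significance levels, the threshold would have to be taken from a fuller χ2 table or from a statistics library rather than from this small dictionary.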
The same procedure may be generalized and applied to contingency tables where the categorical attributes have more than two values. The next example shows how the previously explained procedure can be applied without modification to a 3 × 3 contingency table. The values given in Table 5.7a are compared with the expected values given in Table 5.7b, and the corresponding test statistic is calculated as χ2 = 3.229. Note that in this case the parameter d.f. = (m − 1)(n − 1) = (3 − 1)(3 − 1) = 4.
TABLE 5.7. Contingency Tables for Categorical Attributes with Three Values
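As a brief worked check of the 3 × 3 example, assuming the same significance level α = 0.05 as in the 2 × 2 case (the text does not state the level used here), the standard χ2 threshold for four degrees of freedom gives:

$$
\text{d.f.} = (3 - 1)(3 - 1) = 4, \qquad T(0.05) = \chi^2(0.95, 4) \approx 9.488, \qquad \chi^2 = 3.229 < 9.488
$$

Under that assumption, the hypothesis H0 would not be rejected for the data in Table 5.7.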
We have to be very careful about drawing additional conclusions and further analyzing the given data set. The sample size is clearly not large, and the number of observations in many cells of the table is small. This is a serious problem, and additional statistical analysis is necessary to check whether the sample is a good representation of the total population. We do not cover this analysis here because in most real-world data-mining problems the data set is large enough to eliminate these deficiencies.
That was one level of generalization for an analysis of contingency tables with categorical data. The other direction of generalization is the inclusion of more than two categorical attributes in the analysis. The methods for three- and higher-dimensional contingency table analysis are described in many books on advanced statistics; they explain procedures for discovering dependencies between several attributes that are analyzed simultaneously.
5.8 LDA
LDA is concerned with classification problems where the dependent variable is categorical (nominal or ordinal) and the independent variables are metric. The objective of LDA is to construct