Data Mining_ Concepts and Techniques - Jiawei Han [64]
χ2 Correlation Test for Nominal Data
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ2 (chi-square) test. Suppose A has c distinct values, namely a1, a2, …, ac, and B has r distinct values, namely b1, b2, …, br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Every possible (Ai, Bj) joint event has its own cell (or slot) in the table. The χ2 value (also known as the Pearson χ2 statistic) is computed as
\[ \chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \tag{3.1} \]
where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as
\[ e_{ij} = \frac{\mathit{count}(A = a_i) \times \mathit{count}(B = b_j)}{n}, \tag{3.2} \]
where n is the number of data tuples, count(A = ai) is the number of tuples having value ai for A, and count(B = bj) is the number of tuples having value bj for B. The sum in Eq. (3.1) is computed over all of the r × c cells. Note that the cells that contribute the most to the χ2 value are those for which the actual count is very different from that expected.
The χ2 statistic tests the hypothesis that A and B are independent, that is, there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom. We illustrate the use of this statistic in Example 3.1. If the hypothesis can be rejected, then we say that A and B are statistically correlated.
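The two formulas and the degrees-of-freedom rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the small `proportional` table are hypothetical, and the code assumes every row and column total is nonzero so that no expected frequency is zero.

```python
def expected_frequencies(observed):
    """Expected count per cell under independence, following Eq. (3.2)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square(observed):
    """Pearson chi-square statistic, following Eq. (3.1).

    Assumes no expected frequency is zero (no empty row or column).
    """
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(observed):
    """(r - 1) * (c - 1) for a table with r rows and c columns."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Hypothetical 2 x 3 table whose rows are exactly proportional,
# so the observed counts coincide with the expected counts.
proportional = [[10, 20, 30],
                [20, 40, 60]]
chi2_value = chi_square(proportional)
dof = degrees_of_freedom(proportional)
```

When the rows are exactly proportional, as in this made-up table, every observed count equals its expected count and the statistic is zero. The further the observed counts drift from the expected ones, the larger the statistic becomes.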
Example 3.1 Correlation analysis of nominal attributes using χ2
Suppose that a group of 1500 people was surveyed. The gender of each person was noted. Each person was polled as to whether his or her preferred type of reading material was fiction or nonfiction. Thus, we have two attributes, gender and preferred_reading. The observed frequency (or count) of each possible joint event is summarized in the contingency table shown in Table 3.1, where the numbers in parentheses are the expected frequencies. The expected frequencies are calculated based on the data distribution for both attributes using Eq. (3.2).
Table 3.1 Example 3.1's 2 × 2 Contingency Table Data
Note: Are gender and preferred_reading correlated?
              male        female       Total
fiction       250 (90)    200 (360)    450
non_fiction   50 (210)    1000 (840)   1050
Total         300         1200         1500
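Using Eq. (3.2), we can verify the expected frequencies for each cell. For example, the expected frequency for the cell (male, fiction) is
\[ e_{11} = \frac{\mathit{count}(\text{male}) \times \mathit{count}(\text{fiction})}{n} = \frac{300 \times 450}{1500} = 90, \]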
and so on. Notice that in any row, the sum of the expected frequencies must equal the total observed frequency for that row, and the sum of the expected frequencies in any column must also equal the total observed frequency for that column.
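The margin property stated here can be checked directly on the counts in Table 3.1. The snippet below is a small sketch; the variable names are illustrative, and the table is laid out with rows for fiction and non_fiction and columns for male and female.

```python
import math

# Observed counts from Table 3.1
# (rows: fiction, non_fiction; columns: male, female).
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]        # fiction, non_fiction
col_totals = [sum(col) for col in zip(*observed)]  # male, female
n = sum(row_totals)

# Eq. (3.2): e_ij = count(A = a_i) * count(B = b_j) / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Each row and each column of expected counts should sum to the
# corresponding observed total.
rows_match = all(
    math.isclose(sum(exp_row), rt)
    for exp_row, rt in zip(expected, row_totals)
)
cols_match = all(
    math.isclose(sum(exp_col), ct)
    for exp_col, ct in zip(zip(*expected), col_totals)
)
```

The property holds by construction: summing rt × ct / n over all columns gives rt × n / n = rt, and the same argument applies to each column.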
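Using Eq. (3.1) for χ2 computation, we get
\[ \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93. \]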
For this 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. For 1 degree of freedom, the χ2 value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ2 distribution, typically available from any textbook on statistics). Since our computed value is above this, we can reject the hypothesis that gender and preferred_reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.
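The decision step can be reproduced numerically. The sketch below recomputes the statistic from the observed and expected counts in Table 3.1 and compares it with the critical value 10.828 quoted in the text. As an assumption beyond the text, it also uses the identity that, for 1 degree of freedom only, the upper-tail probability of the χ2 distribution equals erfc(√(x/2)); the helper name `chi2_upper_tail_1df` is hypothetical.

```python
import math

# Observed counts and expected counts (the values in parentheses) from Table 3.1.
observed = [[250, 200], [50, 1000]]
expected = [[90.0, 360.0], [210.0, 840.0]]

# Eq. (3.1): sum of (o - e)^2 / e over all cells.
chi2_value = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Critical value quoted in the text for 1 degree of freedom at the 0.001 level.
critical_value = 10.828
reject_independence = chi2_value > critical_value


def chi2_upper_tail_1df(x):
    """Upper-tail probability of the chi-square distribution.

    Valid for 1 degree of freedom only: P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Sanity check: the tail probability at the critical value is about 0.001.
p_at_critical = chi2_upper_tail_1df(critical_value)
# Tail probability at the computed statistic; far smaller than 0.001.
p_value = chi2_upper_tail_1df(chi2_value)
```

For tables with more than 1 degree of freedom this closed form does not apply, and a statistics library or a printed χ2 table would be needed instead.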
Correlation Coefficient for Numeric Data
For numeric attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson's product moment coefficient, named after its inventor, Karl Pearson). This is
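\[ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}, \tag{3.3} \]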
where n is the number of tuples, ai and bi are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, σA and σB are the respective standard deviations of A and B (as defined in Section 2.2.2), and Σ(aibi) is the sum of the AB cross-product