Data Mining_ Concepts and Techniques - Jiawei Han [230]
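$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}, \tag{8.31}$$
where
$$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2. \tag{8.32}$$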
To determine whether M1 and M2 are significantly different, we compute t and select a significance level, sig. In practice, a significance level of 5% or 1% is typically used. We then consult a table for the t-distribution, available in standard textbooks on statistics. This table is usually arranged with degrees of freedom as rows and significance levels as columns. Suppose we want to ascertain whether the difference between M1 and M2 is significant for 95% of the population, that is, sig = 5% or 0.05. We need to find the t-distribution value corresponding to k − 1 degrees of freedom (or 9 degrees of freedom for our example) from the table. However, because the t-distribution is symmetric, typically only the upper percentage points of the distribution are shown. Therefore, we look up the table value for z = sig/2, which in this case is 0.025, where z is also referred to as a confidence limit. If t > z or t < −z, then our value of t lies in the rejection region, within the distribution's tails. This means that we can reject the null hypothesis that the means of M1 and M2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we conclude that any difference between M1 and M2 can be attributed to chance.
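The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes the statistic is the mean of the per-round error differences divided by the square root of (variance of the differences / k), with the variance taken as the mean squared deviation (divisor k). The function names and the error rates are hypothetical. The critical value 2.262 is the standard t-table entry for 9 degrees of freedom at an upper-tail probability of 0.025.

```python
import math

def paired_t_statistic(err_m1, err_m2):
    """t-statistic for two models evaluated on the same k cross-validation rounds."""
    if len(err_m1) != len(err_m2) or len(err_m1) < 2:
        raise ValueError("need two equal-length lists with at least 2 rounds")
    k = len(err_m1)
    diffs = [a - b for a, b in zip(err_m1, err_m2)]
    mean_diff = sum(diffs) / k
    # Mean squared deviation of the per-round differences (divisor k, an assumption).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k
    if var_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / math.sqrt(var_diff / k)

def significantly_different(t, z):
    """True if t falls in either tail of the rejection region."""
    return t > z or t < -z

# Table value for 9 degrees of freedom, upper-tail probability sig/2 = 0.025.
T_CRIT_9DF = 2.262

# Hypothetical error rates from 10-fold cross-validation (illustrative only).
err_m1 = [0.20, 0.22, 0.19, 0.25, 0.21, 0.23, 0.20, 0.24, 0.22, 0.21]
err_m2 = [0.17, 0.20, 0.18, 0.21, 0.19, 0.20, 0.18, 0.20, 0.19, 0.19]

t = paired_t_statistic(err_m1, err_m2)
print("t =", round(t, 3), "reject null:", significantly_different(t, T_CRIT_9DF))
```

With a different number of rounds or a different significance level, the critical value must be looked up again for the matching row and column of the table.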
If two test sets are available instead of a single test set, then a nonpaired version of the t-test is used, where the variance between the means of the two models is estimated as
$$var(M_1 - M_2) = \sqrt{\frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}}, \tag{8.33}$$
and k1 and k2 are the numbers of cross-validation samples (in our case, 10-fold cross-validation rounds) used for M1 and M2, respectively. This is also known as the two-sample t-test.⁹ When consulting the table of the t-distribution, the number of degrees of freedom used is the smaller of the two models' degrees of freedom.
⁹ This test was used in sampling cubes for OLAP-based mining in Chapter 5.
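As a rough sketch of the nonpaired case, the following Python uses the conventional two-sample form of the statistic: the difference of the mean error rates divided by the square root of var(M1)/k1 + var(M2)/k2. Several things here are assumptions rather than details from the text: the sample variance uses the k − 1 divisor (as computed by `statistics.variance`), the function name and data are hypothetical, and the exact form may differ in detail from the estimate given above. The degrees of freedom follow the rule stated in the text, taking the smaller of k1 − 1 and k2 − 1.

```python
import math
import statistics

def two_sample_t(err_m1, err_m2):
    """Return (t, degrees_of_freedom) for error rates measured on different test sets."""
    k1, k2 = len(err_m1), len(err_m2)
    if k1 < 2 or k2 < 2:
        raise ValueError("each model needs at least 2 error measurements")
    mean1 = statistics.mean(err_m1)
    mean2 = statistics.mean(err_m2)
    # Sample variances (divisor k - 1); this choice of divisor is an assumption.
    var1 = statistics.variance(err_m1)
    var2 = statistics.variance(err_m2)
    std_err = math.sqrt(var1 / k1 + var2 / k2)
    if std_err == 0:
        raise ValueError("zero variance in both samples; t is undefined")
    t = (mean1 - mean2) / std_err
    # Use the smaller of the two models' degrees of freedom.
    dof = min(k1 - 1, k2 - 1)
    return t, dof

# Hypothetical error rates; M1 and M2 were evaluated on different test sets.
t_value, dof = two_sample_t([0.21, 0.24, 0.19, 0.22], [0.18, 0.17, 0.20, 0.16, 0.19])
print("t =", round(t_value, 3), "degrees of freedom =", dof)
```

The resulting t is compared against the table value for `dof` degrees of freedom in the same way as in the paired test.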
8.5.6. Comparing Classifiers Based on Cost–Benefit and ROC Curves
The true positives, true negatives, false positives, and false negatives are also useful in assessing the costs and benefits (or risks and gains) associated with a classification model. The cost associated with a false negative (such as incorrectly predicting that a cancerous patient is not cancerous) is far greater than that of a false positive (incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such cases, we can weight one type of error more heavily than another by assigning a different cost to each. These costs may consider the danger to the patient, financial costs of resulting therapies, and other hospital costs. Similarly, the benefits associated with a true positive decision may be different from those of a true negative. Up to now, to compute classifier accuracy, we have assumed equal costs and essentially divided the sum of true positives and true negatives by the total number of test tuples.
Alternatively, we can incorporate costs and benefits by instead computing the average cost (or benefit) per decision. Other applications involving cost–benefit analysis include loan application decisions and target marketing mailouts. For example, the cost of loaning to a defaulter greatly exceeds that of the lost business incurred by denying a loan to a nondefaulter. Similarly, in an application that tries to identify households that are likely to respond to mailouts of certain promotional material, the cost of mailouts to numerous households that do not respond may outweigh the cost of lost business from not mailing to households that would have responded. Other costs to consider in the overall analysis include the costs to collect the data and to develop the classification tool.
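To make the average cost per decision concrete, here is a small Python sketch. The confusion-matrix counts and the cost values (a false negative costing 20 times a false positive) are hypothetical, chosen only to show that two models with identical accuracy can differ sharply once unequal error costs are taken into account.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of test tuples classified correctly (treats all errors as equal)."""
    return (tp + tn) / (tp + tn + fp + fn)

def average_cost(tp, tn, fp, fn, cost_tp=0.0, cost_tn=0.0, cost_fp=1.0, cost_fn=1.0):
    """Average cost per decision, weighting each outcome by its assigned cost."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no test tuples")
    return (tp * cost_tp + tn * cost_tn + fp * cost_fp + fn * cost_fn) / total

# Hypothetical confusion-matrix counts for two models on 1000 test tuples.
model_a = dict(tp=90, tn=860, fp=40, fn=10)   # few false negatives
model_b = dict(tp=60, tn=890, fp=10, fn=40)   # few false positives

# Hypothetical costs: a false negative is 20 times as costly as a false positive.
costs = dict(cost_fp=1.0, cost_fn=20.0)

for name, counts in (("A", model_a), ("B", model_b)):
    print(name, "accuracy:", accuracy(**counts),
          "average cost:", average_cost(**counts, **costs))
```

Both models classify 950 of the 1000 tuples correctly, yet model A has the lower average cost under these assumed costs because it makes fewer of the expensive false-negative errors. Negative cost values could be used for `cost_tp` and `cost_tn` to represent benefits.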
Receiver operating characteristic (ROC) curves are a useful visual tool for comparing two classification models. ROC curves come from signal detection theory that was developed during World War II for the analysis of radar images. An ROC curve for a given model shows the trade-off between the true