Data Mining: Concepts and Techniques - Jiawei Han [229]
7e is the base of natural logarithms, that is, $e = 2.718\ldots$.
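The footnote above relates to the .632 bootstrap: each of the $d$ draws (with replacement) misses a given tuple with probability $1 - 1/d$, so the chance a tuple is never drawn is $(1 - 1/d)^d$, which approaches $e^{-1} \approx 0.368$ as $d$ grows. A quick numeric check of that limit (the values of $d$ here are arbitrary illustrations):

```python
import math

# Probability that a given tuple is never chosen when we draw d
# tuples with replacement from a data set of d tuples: (1 - 1/d)^d.
# As d grows this approaches e^{-1} ~= 0.368, which is why roughly
# 63.2% of the original tuples end up in the bootstrap sample.
for d in (10, 100, 1000, 100000):
    p_not_chosen = (1 - 1 / d) ** d
    print(d, round(p_not_chosen, 4))

print("limit:", round(math.exp(-1), 4))  # 0.3679
```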
We can repeat the sampling procedure k times, where in each iteration, we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model, M, is then estimated as
$$\mathrm{Acc}(M) = \frac{1}{k}\sum_{i=1}^{k}\bigl(0.632 \times \mathrm{Acc}(M_i)_{test\_set} + 0.368 \times \mathrm{Acc}(M_i)_{train\_set}\bigr) \qquad (8.30)$$
where $\mathrm{Acc}(M_i)_{test\_set}$ is the accuracy of the model obtained with bootstrap sample $i$ when it is applied to test set $i$, and $\mathrm{Acc}(M_i)_{train\_set}$ is the accuracy of the model obtained with bootstrap sample $i$ when it is applied to the original set of data tuples. Bootstrapping tends to be overly optimistic. It works best with small data sets.
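Eq. (8.30) can be sketched directly in code. The per-iteration accuracies below are hypothetical placeholders, not values from the text; the point is only the 0.632/0.368 weighting and the averaging over the $k$ bootstrap iterations:

```python
# Hypothetical per-iteration accuracies from k = 5 bootstrap rounds:
# acc_test[i]  = Acc(M_i) on test set i (the ~36.8% of tuples that
#                were never drawn into bootstrap sample i)
# acc_train[i] = Acc(M_i) on the full original set of data tuples
acc_test = [0.70, 0.74, 0.68, 0.72, 0.71]
acc_train = [0.90, 0.92, 0.88, 0.91, 0.89]

k = len(acc_test)
# .632 bootstrap estimate of overall accuracy, Eq. (8.30)
acc_M = sum(0.632 * te + 0.368 * tr
            for te, tr in zip(acc_test, acc_train)) / k
print(round(acc_M, 4))
```

The heavier 0.632 weight goes to the test-set accuracy because, on average, a bootstrap sample contains about 63.2% of the original tuples.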
8.5.5. Model Selection Using Statistical Tests of Significance
Suppose that we have generated two classification models, M1 and M2, from our data. We have performed 10-fold cross-validation to obtain a mean error rate8 for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases. There can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M1 and M2 may appear different, that difference may not be statistically significant. What if any difference between the two can be attributed merely to chance? This section addresses these questions.
8Recall that the error rate of a model, $M$, is $1 - \mathrm{Acc}(M)$, where $\mathrm{Acc}(M)$ is the accuracy of $M$.
To determine if there is any “real” difference in the mean error rates of two models, we need to employ a test of statistical significance. In addition, we want to obtain some confidence limits for our mean error rates so that we can make statements like, “Any observed mean will not vary by ± two standard errors 95% of the time for future samples” or “One model is better than the other by a margin of error of ± 4%.”
What do we need to perform the statistical test? Suppose that for each model, we did 10-fold cross-validation, say, 10 times, each time using a different 10-fold data partitioning. Each partitioning is independently drawn. We can average the 10 error rates obtained each for M1 and M2, respectively, to obtain the mean error rate for each model. For a given model, the individual error rates calculated in the cross-validations may be considered as different, independent samples from a probability distribution. In general, they follow a t-distribution with $k - 1$ degrees of freedom where, here, $k = 10$. (This distribution looks very similar to a normal, or Gaussian, distribution even though the functions defining the two are quite different. Both are unimodal, symmetric, and bell-shaped.) This allows us to do hypothesis testing where the significance test used is the t-test, or Student's t-test. Our hypothesis is that the two models are the same, or in other words, that the difference in mean error rate between the two is zero. If we can reject this hypothesis (referred to as the null hypothesis), then we can conclude that the difference between the two models is statistically significant, in which case we can select the model with the lower error rate.
In data mining practice, we may often employ a single test set, that is, the same test set can be used for both M1 and M2. In such cases, we do a pairwise comparison of the two models for each 10-fold cross-validation round. That is, for the ith round of 10-fold cross-validation, the same cross-validation partitioning is used to obtain an error rate for M1 and an error rate for M2. Let $err(M_1)_i$ (or $err(M_2)_i$) be the error rate of model M1 (or M2) on round $i$. The error rates for M1 are averaged to obtain a mean error rate for M1, denoted $\overline{err}(M_1)$. Similarly, we can obtain $\overline{err}(M_2)$. The variance of the difference between the two models is denoted $var(M_1 - M_2)$. The t-test computes the t-statistic with $k - 1$ degrees of freedom for $k$ samples. In our example we have $k = 10$ since, here, the $k$ samples are our error rates obtained from ten 10-fold cross-validations.
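For this paired setting, the t-statistic is conventionally computed as $t = \dfrac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$, with $var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\bigl[err(M_1)_i - err(M_2)_i - (\overline{err}(M_1) - \overline{err}(M_2))\bigr]^2$. A minimal sketch, using hypothetical per-round error rates (not values from the text):

```python
import math

# Hypothetical paired error rates from k = 10 rounds of 10-fold CV,
# where each round uses the same partitioning for both models.
err_m1 = [0.22, 0.25, 0.21, 0.24, 0.23, 0.26, 0.22, 0.24, 0.25, 0.23]
err_m2 = [0.27, 0.26, 0.25, 0.28, 0.26, 0.29, 0.27, 0.26, 0.28, 0.27]

k = len(err_m1)
diffs = [a - b for a, b in zip(err_m1, err_m2)]
mean_diff = sum(diffs) / k          # mean(err(M1)) - mean(err(M2))
# Variance of the difference between the two models, var(M1 - M2)
var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k
# t-statistic with k - 1 = 9 degrees of freedom
t = mean_diff / math.sqrt(var_diff / k)
print(round(t, 3))
```

The resulting value of $t$ would then be compared against a t-distribution table at the chosen significance level with $k - 1$ degrees of freedom; a magnitude exceeding the table value rejects the null hypothesis that the two models are the same.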