Data Mining - Mehmed Kantardzic [141]
8.5 REVIEW QUESTIONS AND PROBLEMS
1. Explain the basic idea of ensemble learning, and discuss why the ensemble mechanism is able to improve the prediction accuracy of a model.
2. When designing an ensemble model, several factors directly affect the accuracy of the ensemble. Explain those factors and the approaches to each of them.
3. Bagging and boosting are well-known ensemble approaches. Both generate a single predictive model from each of several different training sets. Discuss the differences between bagging and boosting, and explain the advantages and disadvantages of each.
4. Propose an efficient boosting approach for a large data set.
5. In the bagging methodology, a subset is formed by samples that are randomly selected, with replacement, from the training samples. On average, approximately what percentage of the distinct training samples does such a subset contain?
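Question 5 can be checked empirically. The sketch below (an illustration, not part of the book) draws bootstrap subsets and measures what fraction of the distinct training samples each one contains; analytically, a given sample is missed with probability (1 − 1/n)^n → 1/e, so the expected fraction is about 1 − 1/e ≈ 63.2%.

```python
import random

def unique_fraction(n: int, trials: int = 2000) -> float:
    """Estimate the expected fraction of distinct training samples
    that appear in a bootstrap subset of size n (sampling with replacement)."""
    total = 0.0
    for _ in range(trials):
        subset = [random.randrange(n) for _ in range(n)]
        total += len(set(subset)) / n
    return total / trials

# A bootstrap subset covers roughly 1 - 1/e of the training set.
print(round(unique_fraction(100), 2))  # ≈ 0.63
```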
6. In Figure 8.7, draw a picture of the next distribution D4.
Figure 8.7. AdaBoost iterations.
7. Suppose that in equation (2) of the AdaBoost algorithm (Fig. 8.8) one term is replaced by another. Explain how and why this change influences the AdaBoost algorithm.
Figure 8.8. AdaBoost algorithm.
Figure 8.9. Top competitors in 2007/2008 for the Netflix Prize.
8. Consider the following data set, where there are 10 samples with one dimension and two classes:
Training samples:
(a) Determine ALL the best one-level binary decision trees.
(e.g., IF f1 ≤ 0.35 THEN Class is 1, and IF f1 > 0.35 THEN Class is −1; the accuracy of that tree is 80%.)
(b) We have the following five training data sets randomly selected from the above training samples. Apply the bagging procedure using those training data sets.
(i) Construct the best one-level binary decision tree from each training data set.
(ii) Predict the training samples using each constructed one-level binary decision tree.
(iii) Combine the outputs predicted by each decision tree using the voting method.
(iv) What is the accuracy rate provided by bagging?
Training Data Set 1:
Training Data Set 2:
Training Data Set 3:
Training Data Set 4:
Training Data Set 5:
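The procedure in steps (i)-(iv) can be sketched in code. Since the book's actual training samples and the five bootstrap data sets are not reproduced above, the data below is a hypothetical 1-D, two-class set (chosen so that the example stump from part (a), f1 ≤ 0.35 with 80% accuracy, holds); the bootstrap replicates are drawn randomly rather than taken from the book's tables.

```python
import random

# Hypothetical (f1, class) training samples — not the book's data.
samples = [(0.1, 1), (0.2, 1), (0.3, 1), (0.4, -1), (0.5, 1),
           (0.6, -1), (0.7, -1), (0.8, 1), (0.9, -1), (1.0, -1)]

def best_stump(data):
    """Return (threshold, left_class) of the best one-level binary tree:
    IF f1 <= threshold THEN left_class ELSE -left_class."""
    best = None
    xs = sorted(x for x, _ in data)
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    for t in thresholds:
        for left in (1, -1):
            acc = sum((y == left) == (x <= t) for x, y in data) / len(data)
            if best is None or acc > best[0]:
                best = (acc, t, left)
    return best[1], best[2]

def stump_predict(stump, x):
    t, left = stump
    return left if x <= t else -left

random.seed(0)
# (i) build the best stump on each of five bootstrap replicates
stumps = [best_stump(random.choices(samples, k=len(samples))) for _ in range(5)]
# (ii)-(iii) predict every training sample and combine by majority vote
votes = [sum(stump_predict(s, x) for s in stumps) for x, _ in samples]
preds = [1 if v >= 0 else -1 for v in votes]
# (iv) accuracy of the bagged ensemble on the training samples
accuracy = sum(p == y for p, (_, y) in zip(preds, samples)) / len(samples)
print(accuracy)
```

On this invented data, `best_stump(samples)` recovers the threshold 0.35 with class 1 on the left, matching the 80%-accuracy example in part (a).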
(c) Applying the AdaBoost algorithm (Fig. 8.8) to the above training samples, we generate the following initial one-level binary decision tree from those samples:
To generate the next decision tree, what is the probability (D2 in Fig. 8.8) that each sample is selected into the training data set? (αt is defined as the accuracy rate of the initial decision tree on the training samples.)
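As a point of reference for this computation, the sketch below implements the standard AdaBoost distribution update, D2(i) ∝ D1(i)·exp(∓α1), with the conventional α1 = ½ ln((1 − ε1)/ε1). Note the question above defines αt as an accuracy rate instead, so the numbers here illustrate the standard update only; which samples the initial stump misclassifies is invented, since the book's data is not reproduced.

```python
import math

# Hypothetical: the initial stump misclassifies samples 3 and 7
# out of n = 10 equally weighted training samples.
n = 10
misclassified = {3, 7}
D1 = [1 / n] * n

eps = sum(D1[i] for i in misclassified)   # weighted error of the stump
alpha = 0.5 * math.log((1 - eps) / eps)   # standard AdaBoost classifier weight

# D2(i) = D1(i) * exp(+alpha) if misclassified, exp(-alpha) if correct,
# then normalize so the weights again sum to 1.
raw = [D1[i] * math.exp(alpha if i in misclassified else -alpha)
       for i in range(n)]
Z = sum(raw)
D2 = [w / Z for w in raw]

# Misclassified samples get weight 0.25 each; correct ones get 0.0625.
print([round(w, 4) for w in D2])
```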
9. For classifying a new sample into four classes, C1, C2, C3, and C4, we have an ensemble that consists of three different classifiers: Classifier 1, Classifier 2, and Classifier 3, with accuracy rates on the training samples of 0.9, 0.6, and 0.6, respectively. When a new sample, X, is given, the outputs of the three classifiers are as follows:
Each number in the above table describes the probability that a classifier predicts the class of a new sample as a corresponding class. For example, the probability that Classifier 1 predicts the class of X as C1 is 0.9.
When the ensemble combines predictions of each of them, as a combination method:
(a) If the simple sum is used, which class is X classified as and why?
(b) If the weighted sum is used, which class is X classified as and why?
(c) If the rank-level fusion is used, which class is X classified as and why?
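The three combination methods in (a)-(c) can be sketched as follows. Since the book's probability table is not reproduced above, the entries below are invented, except the 0.9 probability that Classifier 1 assigns to C1, which is given in the text; the weights in (b) are the stated accuracy rates.

```python
classes = ["C1", "C2", "C3", "C4"]
# Hypothetical class-probability table (only the 0.9 entry for
# Classifier 1 / C1 comes from the text; the rest are invented).
P = {
    "Classifier 1": [0.9, 0.0, 0.1, 0.0],
    "Classifier 2": [0.0, 0.6, 0.3, 0.1],
    "Classifier 3": [0.1, 0.5, 0.2, 0.2],
}
acc = {"Classifier 1": 0.9, "Classifier 2": 0.6, "Classifier 3": 0.6}

# (a) simple sum: add the class probabilities of all classifiers
simple = [sum(p[j] for p in P.values()) for j in range(4)]

# (b) weighted sum: weight each classifier's probabilities by its accuracy
weighted = [sum(acc[c] * P[c][j] for c in P) for j in range(4)]

# (c) rank-level fusion: convert each classifier's probabilities to ranks
# (most probable class gets rank 4, least probable rank 1; ties broken
# arbitrarily by sort order) and add the ranks across classifiers.
def ranks(probs):
    order = sorted(range(4), key=lambda j: probs[j])
    r = [0] * 4
    for rank, j in enumerate(order, start=1):
        r[j] = rank
    return r

rank_sum = [sum(r) for r in zip(*(ranks(P[c]) for c in P))]

for name, scores in [("simple sum", simple),
                     ("weighted sum", weighted),
                     ("rank-level fusion", rank_sum)]:
    print(name, "->", classes[scores.index(max(scores))])
```

With this (invented) table, the simple sum and rank-level fusion pick C2, while the weighted sum picks C1 because Classifier 1's high accuracy amplifies its confident vote, illustrating how the choice of combination rule can change the ensemble's decision.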
10. Suppose you have a drug discovery data set, which has 1950 samples and 100,000 features. You must classify chemical compounds represented by structural molecular features as active or inactive using ensemble learning. In order to generate diverse and independent classifiers for an ensemble, which ensemble methodology would you choose? Explain the reason for selecting that methodology.
11. Which of the following is a fundamental difference between bagging and boosting?
(a) Bagging is used for supervised learning. Boosting is used with unsupervised clustering.