Data Mining - Mehmed Kantardzic [35]

no mechanism for processing categorical data with no implicit ordering.

2.7 REVIEW QUESTIONS AND PROBLEMS

1. Generate the tree structure of data types explained in Section 2.1.

2. If one attribute in the data set is student grade with values A, B, C, D, and F, what type are these attribute values? Give a recommendation for preprocessing of the given attribute.

3. Explain why “the curse of dimensionality” principles are especially important in understanding large data sets.

4. Every attribute in a 6-D sample is described with one out of three numeric values {0, 0.5, 1}. If there exist samples for all possible combinations of attribute values, what will be the number of samples in a data set and what will be the expected distance between points in a 6-D space?
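An answer to Problem 4 can be checked numerically by enumerating every combination of the three values across the six attributes and averaging the pairwise distances. The sketch below assumes Euclidean distance and reads "expected distance" as the mean over all distinct pairs of samples. Both are assumptions, and the names `VALUES`, `DIMENSIONS`, and `mean_pairwise_distance` are illustrative.

```python
import itertools
import math

VALUES = (0.0, 0.5, 1.0)   # the three allowed attribute values
DIMENSIONS = 6             # number of attributes per sample

# Every possible 6-D sample: one of the three values per attribute.
samples = list(itertools.product(VALUES, repeat=DIMENSIONS))


def mean_pairwise_distance(points):
    """Average Euclidean distance over all distinct pairs (an assumed reading)."""
    total = 0.0
    pairs = 0
    for a, b in itertools.combinations(points, 2):
        total += math.dist(a, b)
        pairs += 1
    return total / pairs


print(len(samples))                     # number of samples in the data set
print(mean_pairwise_distance(samples))  # mean distance between distinct samples
```

A different distance measure, or an average that includes a point's distance to itself, would give a different number, so the convention should be stated alongside the result.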

5. Derive the formula for min–max normalization of data on [−1, 1] interval.
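One possible route for Problem 5, assuming the [0, 1] min–max formula has the usual form v' = (v − min)/(max − min), is to rescale that result linearly onto [−1, 1]:

```latex
\begin{align*}
v'  &= \frac{v - \min(v)}{\max(v) - \min(v)} \in [0, 1] \\
v'' &= 2v' - 1 \\
    &= \frac{2\,(v - \min(v)) - (\max(v) - \min(v))}{\max(v) - \min(v)} \\
    &= \frac{2v - \min(v) - \max(v)}{\max(v) - \min(v)} \in [-1, 1]
\end{align*}
```

As a check, v = min(v) gives −1 and v = max(v) gives +1.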

6. Given 1-D data set X = {−5.0, 23.0, 17.6, 7.23, 1.11}, normalize the data set using

(a) decimal scaling on interval [−1, 1],

(b) min–max normalization on interval [0, 1],

(c) min–max normalization on interval [−1, 1], and

(d) standard deviation normalization.

Compare the results of previous normalizations and discuss the advantages and disadvantages of the different techniques.
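The four normalizations in Problem 6 can be sketched as follows, using only the Python standard library. The function names are illustrative. Decimal scaling is assumed to divide by the smallest power of 10 that brings every absolute value below 1, and the standard deviation normalization is assumed to use the sample standard deviation (n − 1 in the denominator); the population version would give slightly different values.

```python
import math
import statistics

X = [-5.0, 23.0, 17.6, 7.23, 1.11]


def decimal_scaling(values):
    """Divide by 10**k, with k the smallest integer such that max |v| / 10**k < 1."""
    max_abs = max(abs(v) for v in values)
    k = 0
    while max_abs / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]


def min_max(values, lo=0.0, hi=1.0):
    """Map the smallest value to lo and the largest to hi, linearly in between."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def std_normalize(values):
    """Subtract the mean and divide by the sample standard deviation (assumption)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


print(decimal_scaling(X))        # (a)
print(min_max(X))                # (b)
print(min_max(X, -1.0, 1.0))     # (c)
print(std_normalize(X))          # (d)
```

Printing the four results side by side makes the comparison requested in the problem easier: the min–max variants depend only on the two extreme values, while the standard deviation version depends on all of them.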

7. Perform data smoothing using a simple rounding technique for a data set

and present the new data set when the rounding is performed to the precision of

(a) 0.1 and

(b) 1.

8. Given a set of 4-D samples with missing values,

if the domains for all attributes are [0, 1, 2], what will be the number of “artificial” samples if missing values are interpreted as “don’t care values” and they are replaced with all possible values for a given domain?

9. A 24-h, time-dependent data set X is collected as a training data set to predict values 3 h in advance. If the data set X is

(a) What will be a standard tabular representation of data set X if

(i) the window width is 6, and a prediction variable is based on the difference between the current value and the value after 3 h. What is the number of samples?

(ii) the window width is 12, and the prediction variable is based on ratio. What is the number of samples?

(b) Plot discrete X values together with computed 6- and 12-h MA.

(c) Plot time-dependent variable X and its 4-h EMA.

10. The number of children for different patients in a database is given with a vector

Find the outliers in set C using standard statistical parameters mean and variance.

If the threshold value is changed from ±3 standard deviations to ±2 standard deviations, what additional outliers are found?

11. For a given data set X of 3-D samples,

(a) find the outliers using the distance-based technique if

(i) the threshold distance is 4, and threshold fraction p for non-neighbor samples is 3, and

(ii) the threshold distance is 6, and threshold fraction p for non-neighbor samples is 2.

(b) Describe the procedure and interpret the results of outlier detection based on mean values and variances for each dimension separately.

12. Discuss the applications in which you would prefer to use EMA instead of MA.

13. If your data set contains missing values, discuss the basic analyses and corresponding decisions you will take in the preprocessing phase of the data-mining process.

14. Develop a software tool for the detection of outliers if the data for preprocessing are given in the form of a flat file with n-dimensional samples.
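A minimal starting point for Problem 14 might look like the sketch below. It assumes the flat file holds one sample per line as comma-separated numbers, and it flags a sample when any coordinate lies more than k sample standard deviations from that dimension's mean. The file layout, the function names `load_samples` and `flag_outliers`, and the default k = 3 are all assumptions; a distance-based criterion could be substituted in `flag_outliers`.

```python
import csv
import statistics


def load_samples(handle):
    """Read one n-dimensional sample per line (comma-separated numbers assumed)."""
    return [[float(field) for field in row] for row in csv.reader(handle) if row]


def flag_outliers(samples, k=3.0):
    """Return indices of samples with any coordinate beyond mean +/- k * sd."""
    columns = list(zip(*samples))                     # one tuple per dimension
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample sd (assumption)
    flagged = []
    for index, sample in enumerate(samples):
        if any(sd > 0 and abs(value - mean) > k * sd
               for value, mean, sd in zip(sample, means, sds)):
            flagged.append(index)
    return flagged
```

With a file opened as `handle`, `flag_outliers(load_samples(handle), k=2.0)` returns the zero-based positions of the flagged samples. Lowering k from 3 to 2 can only add samples to the flagged list, never remove them.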

15. The set of seven 2-D samples is given in the following table. Check if we have outliers in the data set. Explain and discuss your answer.

Sample #    X    Y
1           1    3
2           7    1
3           2    4
4           6    3
5           4    2
6           2    2
7           7    2

16. Given the data set of ten 3-D samples: {(1,2,0), (3,1,4), (2,1,5), (0,1,6), (2,4,3), (4,4,2), (5,2,1), (7,7,7), (0,0,0), (3,3,3)}, is the sample S4 = (0,1,6) an outlier if the threshold values are d = 6 for the distance and p > 2 for the number of samples in the neighborhood? (Note: Use the distance-based outlier-detection technique.)
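The counting step behind Problem 16 can be sketched as below, assuming Euclidean distance and treating a sample as a neighbor when its distance is at most d. How the count is then compared with the threshold p depends on the convention used for the distance-based technique, so the sketch reports both the neighbor and non-neighbor counts and leaves the decision to the reader. The function name `neighbor_counts` is illustrative.

```python
import math

samples = [(1, 2, 0), (3, 1, 4), (2, 1, 5), (0, 1, 6), (2, 4, 3),
           (4, 4, 2), (5, 2, 1), (7, 7, 7), (0, 0, 0), (3, 3, 3)]


def neighbor_counts(points, index, d):
    """Return (neighbors, non_neighbors) of points[index] at threshold distance d."""
    target = points[index]
    neighbors = sum(
        1 for j, other in enumerate(points)
        if j != index and math.dist(target, other) <= d
    )
    return neighbors, len(points) - 1 - neighbors


# S4 is the fourth sample, so its zero-based index is 3.
near, far = neighbor_counts(samples, index=3, d=6)
print(near, far)
```

Running the same function over every index gives the per-sample counts needed to screen the whole data set rather than a single point.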

17. What is the difference between nominal and ordinal data? Give examples.

18. Using the method of distance-based outliers
