Two additional techniques, stratified sampling and inverse sampling, may be convenient for some data-mining applications. Stratified sampling is a technique in which the entire data set is split into nonoverlapping subsets, or strata, and sampling is performed for each stratum independently of the others. The combination of all the small subsets from the different strata forms the final, total subset of data samples for analysis. This technique is used when each stratum is relatively homogeneous, so that the variance of the overall estimate is smaller than that arising from a simple random sample. Inverse sampling is used when a feature occurs in a data set with rare frequency, and even a large subset of samples may not give enough information to estimate the feature's value. In that case, sampling is dynamic: it starts with a small subset and continues until some condition on the required number of feature values is satisfied.
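A minimal sketch of both techniques in Python is given below; the function names and parameters (stratum_of, is_rare, required) and the example record keys are illustrative assumptions, not notation from the text.

import random
from collections import defaultdict

def stratified_sample(samples, stratum_of, fraction, seed=0):
    # Split the data set into nonoverlapping strata, then draw a simple
    # random sample of the given fraction from each stratum independently.
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[stratum_of(s)].append(s)
    subset = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        subset.extend(rng.sample(members, k))
    return subset  # the combined per-stratum samples form the final subset

def inverse_sample(samples, is_rare, required, seed=0):
    # Dynamic sampling: grow the subset until the required number of
    # rare feature values has been observed.
    rng = random.Random(seed)
    order = list(samples)
    rng.shuffle(order)
    subset, rare_seen = [], 0
    for s in order:
        subset.append(s)
        if is_rare(s):
            rare_seen += 1
        if rare_seen >= required:
            break
    return subset

# Hypothetical usage: 20% from each class, and a subset with at least 5 rare cases.
# subset = stratified_sample(data, stratum_of=lambda s: s["class"], fraction=0.2)
# rare_subset = inverse_sample(data, is_rare=lambda s: s["class"] == "fraud", required=5)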
For some specialized types of problems, alternative techniques can be helpful in reducing the number of cases. For example, for time-dependent data the number of samples is determined by the frequency of sampling. The sampling period is specified based on knowledge of the application. If the sampling period is too short, most samples are repetitive and few changes occur from case to case. For some applications, increasing the sampling period causes no harm and can even be beneficial in obtaining a good data-mining solution. Therefore, for time-series data the windows for sampling and measuring features should be optimized, and that requires additional preparation and experimentation with available data.
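As a minimal illustration of these choices, lengthening the sampling period and fixing a measurement window might look like the following sketch (downsample and sliding_windows are hypothetical helper names, and the hourly series is made-up data):

def downsample(series, period):
    # Lengthen the sampling period by keeping every period-th measurement.
    return series[::period]

def sliding_windows(series, width, step):
    # Fixed-width windows over the series; features are then measured per window.
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

# Example: 48 hourly readings reduced to every 4th reading (12 samples),
# then grouped into 6-sample windows that overlap by half.
hourly = list(range(48))
reduced = downsample(hourly, period=4)
windows = sliding_windows(reduced, width=6, step=3)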
3.9 REVIEW QUESTIONS AND PROBLEMS
1. Explain what we gain and what we lose with dimensionality reduction in large data sets in the preprocessing phase of data mining.
2. Use one typical application of data mining in the retail industry to explain the monotonicity and interruptibility of data-reduction algorithms.
3. Given the data set X with three input features and one output feature representing the classification of samples:
(a) Rank the features using a comparison of means and variances.
(b) Rank the features using the Relief algorithm. Use all samples for the algorithm (m = 7).
4. Given four-dimensional samples in which the first two dimensions are numeric and the last two are categorical:
(a) Apply a method for unsupervised feature selection based on an entropy measure to remove one dimension from the given data set.
(b) Apply the Relief algorithm under the assumption that X4 is the output (classification) feature.
5.
(a) Perform bin-based value reduction with the best cutoffs for the following:
(i) the feature I3 in problem 3 using mean values as representatives for two bins.
(ii) the feature X2 in problem 4 using the closest boundaries for two bin representatives.
(b) Discuss the possibility of applying approximation by rounding to reduce the values of numeric attributes in problems 3 and 4.
6. Apply the ChiMerge technique to reduce the number of values for numeric attributes in problem 3.
(a) Reduce the number of numeric values for feature I1 and find the final, reduced number of intervals.
(b) Reduce the number of numeric values for feature I2 and find the final, reduced number of intervals.
(c) Reduce the number of numeric values for feature I3 and find the final, reduced number of intervals.
(d) Discuss the results and benefits of dimensionality reduction obtained in (a), (b), and (c).
7. Explain the differences between averaged and voted combined solutions when random samples are used to reduce the dimensionality of a large data set.
8. How can the incremental-sample approach and the average-sample approach be combined to reduce cases in large data sets?
9. Develop a software