
Data Mining - Mehmed Kantardzic [39]

constraints for the final solution.

Algorithms that perform all basic operations for data reduction are not simple, especially when they are applied to large data sets. It is therefore useful to enumerate the desired properties of these algorithms before giving their detailed descriptions. Recommended characteristics of data-reduction algorithms, which may serve as guidelines for designers of these techniques, are as follows:

1. Measurable Quality. The quality of approximated results using a reduced data set can be determined precisely.

2. Recognizable Quality. The quality of approximated results can be easily determined at run time of the data-reduction algorithm, before application of any data-mining procedure.

3. Monotonicity. The algorithms are usually iterative, and the quality of results is a nondecreasing function of time and input data quality.

4. Consistency. The quality of results is correlated with computation time and input data quality.

5. Diminishing Returns. The improvement in the solution is large in the early stages (iterations) of the computation, and it diminishes over time.

6. Interruptibility. The algorithm can be stopped at any time and provide some answer.

7. Preemptability. The algorithm can be suspended and resumed with minimal overhead.
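Properties 5 through 7 together describe what is often called an anytime algorithm: one that always holds a usable answer and only refines it with more time. The following sketch is purely illustrative (it is not from the text): an iterative, sampling-based estimator of a mean that can be interrupted after any iteration and still return its current approximation, with the largest quality gains in the earliest iterations.

```python
import random

def anytime_mean(data, seed=0):
    """Illustrative anytime estimator: after every sampled value it yields
    the current approximation of the mean of `data`, so the caller may
    interrupt the computation at any point and still obtain an answer."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    while True:
        total += rng.choice(data)
        count += 1
        # Expected quality is a nondecreasing function of time (monotonicity),
        # and each additional sample improves the estimate less (diminishing
        # returns).
        yield total / count

data = list(range(1000))  # true mean is 499.5
estimates = anytime_mean(data)
early = [next(estimates) for _ in range(10)]     # coarse answer, available early
late = [next(estimates) for _ in range(9990)]    # refined answer after more work
print(early[-1], late[-1])
```

Because the generator yields after every iteration, stopping it is just ceasing to call `next()`, which also makes suspension and resumption (preemptability) trivial: the generator retains its state between calls.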

3.2 FEATURE REDUCTION


Most real-world data-mining applications are characterized by high-dimensional data, in which not all of the features are important. For example, high-dimensional data (i.e., data sets with hundreds or even thousands of features) can contain a great deal of irrelevant, noisy information that may severely degrade the performance of a data-mining process. Even state-of-the-art data-mining algorithms cannot overcome the presence of a large number of weakly relevant and redundant features. This is usually attributed to the “curse of dimensionality,” or to the fact that irrelevant features decrease the signal-to-noise ratio. In addition, many algorithms become computationally intractable when the dimensionality is high.

Data such as images, text, and multimedia are high-dimensional in nature, and this dimensionality poses a challenge to data-mining tasks. Researchers have found that reducing the dimensionality of data results in faster computation while maintaining reasonable accuracy. In the presence of many irrelevant features, mining algorithms tend to overfit the model. Therefore, many features can be removed without deteriorating the performance of the mining process.

When we talk about data quality and the improved performance of reduced data sets, the issue concerns not only noisy or contaminated data (problems mainly solved in the preprocessing phase) but also irrelevant, correlated, and redundant data. Recall that data with corresponding features are not usually collected solely for data-mining purposes. Therefore, dealing with relevant features alone can be far more effective and efficient. Basically, we want to choose features that are relevant to our data-mining application in order to achieve maximum performance with minimum measurement and processing effort. A feature-reduction process should result in

1. less data so that the data-mining algorithm can learn faster;

2. higher accuracy of a data-mining process so that the model can generalize better from the data;

3. simple results of the data-mining process so that they are easier to understand and use; and

4. fewer features so that in the next round of data collection, savings can be made by removing redundant or irrelevant features.

Let us start our detailed analysis with possible column-reduction techniques, in which features are eliminated from the data set based on a given criterion. To address the curse of dimensionality, dimensionality-reduction techniques are proposed as a data-preprocessing step. This process identifies a suitable low-dimensional representation of the original data. Reducing the dimensionality improves the computational efficiency and accuracy of the data analysis, and it also improves the comprehensibility of a data-mining model.
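As a concrete sketch of criterion-based column reduction (an illustration of the idea, not a technique prescribed by the text): drop features whose variance falls below a threshold (nearly constant, hence irrelevant), then drop any feature that is almost perfectly correlated with one already kept (redundant). The function names and threshold values below are assumptions chosen for the example.

```python
def variance(col):
    """Population variance of one feature column."""
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def correlation(a, b):
    """Pearson correlation coefficient between two feature columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def reduce_columns(columns, var_min=0.01, corr_max=0.95):
    """Keep a column only if its variance exceeds var_min and it is not
    almost perfectly correlated with an already-kept column."""
    kept = {}
    for name, col in columns.items():
        if variance(col) <= var_min:
            continue  # irrelevant: (nearly) constant feature
        if any(abs(correlation(col, kc)) >= corr_max for kc in kept.values()):
            continue  # redundant: duplicates information already kept
        kept[name] = col
    return kept

# Toy data set: f2 is constant, f3 is a linear copy of f1.
cols = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [5.0, 5.0, 5.0, 5.0],
    "f3": [2.0, 4.0, 6.0, 8.0],
    "f4": [4.0, 1.0, 3.0, 2.0],
}
print(sorted(reduce_columns(cols)))  # → ['f1', 'f4']
```

Real column-reduction methods use more principled criteria (the feature-selection and feature-projection techniques discussed in the remainder of this chapter), but the structure is the same: each column is retained or eliminated according to an explicit, measurable criterion.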
