Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [24]

By Root 712 0

curse-of-dimensionality concepts for data-mining tasks. They generated randomly 500 points for different n-dimensional spaces. The number of dimensions was between 2 and 50. Then, they measured in each space all distances between any pair of points and calculated the parameter P:

where n is the number of dimensions, and MAX-DIST and MIN-DIST are maximum and minimum distances in the given space, respectively. The results are presented in Figure 2.4. What is interesting from the graph is that as the number of dimensions increases, the parameter Pn approaches the value of 0. That means maximum and minimum distances are becoming very close in these spaces; in other words, there are no differences in distances between any two points in these large-dimensional spaces. It is an experimental confirmation that traditional definitions of density and distance between points, which are critical for many data-mining tasks, change their meaning. When dimensionality of a data set increases, data become increasingly sparse, with mostly outliers in the space that they occupy. Therefore, we have to revisit and reevaluate traditional concepts from statistics: distance, similarity, data distribution, mean, standard deviation, and so on.

Figure 2.4. With large number of dimensions, the concept of a distance changes the meaning.

2.2 CHARACTERISTICS OF RAW DATA

All raw data sets initially prepared for data mining are often large; many are related to human beings and have the potential for being messy. A priori, one should expect to find missing values, distortions, misrecording, inadequate sampling, and so on in these initial data sets. Raw data that do not appear to show any of these problems should immediately arouse suspicion. The only real reason for the high quality of data could be that the presented data have been cleaned up and preprocessed before the analyst sees them, as in data of a correctly designed and prepared data warehouse.

Let us see what the sources and implications of messy data are. First, data may be missing for a huge variety of reasons. Sometimes there are mistakes in measurements or recordings, but in many cases, the value is unavailable. To cope with this in a data-mining process, one must be able to model with the data that are presented, even with their values missing. We will see later that some data-mining techniques are more or less sensitive to missing values. If the method is robust enough, then the missing values are not a problem. Otherwise, it is necessary to solve the problem of missing values before the application of a selected data-mining technique. The second cause of messy data is misrecording of data, and that is typical in large volumes of data. We have to have mechanisms to discover some of these “unusual” values, and in some cases, even to work with them to eliminate their influence on the final results. Further, data may not be from the population they are supposed to be from. Outliers are typical examples here, and they require careful analysis before the analyst can decide whether they should be dropped from the data-mining process as anomalous or included as unusual examples from the population under study.

It is very important to examine the data thoroughly before undertaking any further steps in formal analysis. Traditionally, data-mining analysts had to familiarize themselves with their data before beginning to model them or use them with some data-mining algorithms. However, with the large size of modern data sets, this is less feasible or even entirely impossible in many cases. Here we must rely on computer programs to check the data for us.

Distorted data, incorrect choice of steps in methodology, misapplication of data-mining tools, too idealized a model, a model that goes beyond the various sources of uncertainty and ambiguity in the data—all these represent possibilities for taking the wrong direction in a data-mining process. Therefore, data mining is not just a matter of simply applying a directory of tools to a given problem, but rather a process of critical assessments,

Online Book Reader

Data Mining - Mehmed Kantardzic [24]

®Online Book Reader