Data Mining - Mehmed Kantardzic [90]
Once we have obtained estimates for the model parameters w from some data set T, we may use the resulting model (given analytically as a function f(X*, w)) to make predictions about Y when we know the corresponding value of the vector X*. The difference between the prediction f(X*, w) and the real value Y is called the prediction error, and it should preferably take values close to 0. A natural quality measure of a model f(X*, w), as a predictor of Y, is the expected mean-squared error for the entire data set T:
$$E_T\left[\left(Y - f(X^*, w)\right)^2\right]$$
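As a minimal sketch of this quality measure, the snippet below estimates the mean-squared error by averaging the squared prediction errors over a small data set. The linear model `f`, the parameter values `w`, and the sample data are hypothetical and were chosen only for illustration; they do not come from the text.

```python
def f(x, w):
    """Hypothetical linear model f(x, w) = w[0] + w[1] * x (an assumption)."""
    return w[0] + w[1] * x


def mean_squared_error(xs, ys, w):
    """Average of the squared prediction errors (y - f(x, w))^2 over the data."""
    squared_errors = [(y - f(x, w)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)


# Made-up data set T and assumed parameter estimates w
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 10.0]
w = (1.0, 2.0)

mse = mean_squared_error(xs, ys, w)
print(mse)  # only the last sample is mispredicted (by 1), so this prints 0.25
```

A value close to 0 indicates that the predictions track the observed values closely on this data set.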
In statistical testing, on the other hand, one has to decide whether a hypothesis concerning the value of the population characteristic should be accepted or rejected in light of an analysis of the data set. A statistical hypothesis is an assertion or conjecture concerning one or more populations. The truth or falsity of a statistical hypothesis can never be known with absolute certainty, unless we examine the entire population. This, of course, would be impractical in most situations, sometimes even impossible. Instead, we test a hypothesis on a randomly selected data set. Evidence from the data set that is inconsistent with the stated hypothesis leads to a rejection of the hypothesis, whereas evidence supporting the hypothesis leads to its acceptance, or more precisely, it implies that the data do not contain sufficient evidence to refute it. The structure of hypothesis testing is formulated with the use of the term null hypothesis. This refers to any hypothesis that we wish to test and is denoted by H0. H0 is only rejected if the given data set, on the basis of the applied statistical tests, contains strong evidence that the hypothesis is not true. The rejection of H0 leads to the acceptance of an alternative hypothesis about the population.
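The reject-or-fail-to-reject logic described above can be sketched with a simple two-sided z-test. This particular test, the assumption that the population standard deviation `sigma` is known, the significance level `alpha`, and the sample values are all illustrative choices that are not taken from the text.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_test_two_sided(sample, mu0, sigma, alpha=0.05):
    """Test H0: population mean == mu0, assuming a known standard deviation sigma.

    Returns the z statistic, the two-sided p-value, and whether H0 is rejected.
    """
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Made-up sample; H0 states that the population mean is 5.0
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.7, 5.9, 6.1, 5.5]
z, p_value, reject = z_test_two_sided(sample, mu0=5.0, sigma=0.6)

if reject:
    print("Strong evidence against H0: reject it in favor of the alternative.")
else:
    print("Insufficient evidence to refute H0.")
```

Note that the second branch does not claim that H0 is true; it only reports that the data do not contain enough evidence to reject it.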
In this chapter, some statistical estimation and hypothesis-testing methods are described in great detail. These methods have been selected primarily based on the applicability of the technique in a data-mining process on a large data set.
5.2 ASSESSING DIFFERENCES IN DATA SETS
For many data-mining tasks, it is useful to learn the general characteristics of a given data set, regarding both central tendency and data dispersion. These simple parameters are obvious descriptors for assessing differences between data sets. Typical measures of central tendency include the mean, median, and mode, while measures of data dispersion include the variance and standard deviation.
The most common and effective numeric measure of the center of the data set is the mean value (also called the arithmetic mean). For the set of n numeric values x1, x2, … , xn, for the given feature X, the mean is
$$\text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
and it is a built-in function (like all other descriptive statistical measures) in most modern statistical software tools. For each numeric feature in the n-dimensional set of samples, it is possible to calculate the mean value as a central tendency characteristic for this feature. Sometimes, each value xi in a set may be associated with a weight wi, which reflects the frequency of occurrence, significance, or importance attached to the value. In this case, the weighted arithmetic mean or the weighted average value is
$$\text{mean} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
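The two averages described above can be computed in a few lines. The following is a minimal sketch; the function names, feature values, and weights are hypothetical and serve only as an illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) divided by sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Made-up feature values and weights (e.g. frequencies of occurrence)
x = [2.0, 4.0, 6.0, 8.0]
w = [1, 1, 1, 5]

m = mean(x)
wm = weighted_mean(x, w)
print(m)   # 5.0
print(wm)  # 6.5, pulled toward 8.0 because that value carries the largest weight
```

With all weights equal, the weighted mean reduces to the ordinary arithmetic mean.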
Although the mean is the most useful quantity that we use to describe a set of data, it is not the only one. For skewed data sets, a better measure of the center of data is the median. It is the middle value of the ordered set of feature values if the set consists of an odd number of elements and it is the average of the middle two values if the number of elements in the set is even. If x1, x2, … , xn represents a data set of size n, arranged in increasing