Data Mining - Mehmed Kantardzic [97]
The matrix R has the residual sum of squares for each of the c dimensions stored on its leading diagonal. The off-diagonal elements are the residual sums of cross-products for pairs of dimensions. If we wish to compare two nested linear models to determine whether certain β’s are equal to 0, then we can construct an extra sum of squares matrix and apply a method similar to ANOVA—MANOVA. While we had an F-statistic in the ANOVA methodology, MANOVA is based on matrix R with four commonly used test statistics: Roy’s greatest root, the Lawley-Hotteling trace, the Pillai trace, and Wilks’ lambda. Computational details of these tests are not explained in the book, but most textbooks on statistics will explain these; also, most standard statistical packages that support MANOVA support all four statistical tests and explain which one to use under what circumstances.
Classical multivariate analysis also includes the method of principal component analysis, where the set of vector samples is transformed into a new set with a reduced number of dimensions. This method has been explained in Chapter 3 when we were talking about data reduction and data transformation as preprocessing phases for data mining.
5.6 LOGISTIC REGRESSION
Linear regression is used to model continuous-value functions. Generalized regression models represent the theoretical foundation on that the linear regression approach can be applied to model categorical response variables. A common type of a generalized linear model is logistic regression. Logistic regression models the probability of some event occurring as a linear function of a set of predictor variables.
Rather than predicting the value of the dependent variable, the logistic regression method tries to estimate the probability that the dependent variable will have a given value. For example, in place of predicting whether a customer has a good or bad credit rating, the logistic regression approach tries to estimate the probability of a good credit rating. The actual state of the dependent variable is determined by looking at the estimated probability. If the estimated probability is greater than 0.50 then the prediction is closer to YES (a good credit rating), otherwise the output is closer to NO (a bad credit rating is more probable). Therefore, in logistic regression, the probability p is called the success probability.
We use logistic regression only when the output variable of the model is defined as a categorical binary. On the other hand, there is no special reason why any of the inputs should not also be quantitative, and, therefore, logistic regression supports a more general input data set. Suppose that output Y has two possible categorical values coded as 0 and 1. Based on the available data we can compute the probabilities for both values for the given input sample: P(yj = 0) = 1 − pj and P(yj = 1) = pj. The model with which we will fit these probabilities is accommodated linear regression:
This equation is known as the linear logistic model. The function log (pj/[1−pj]) is often written as logit(p). The main reason for using the logit form of output is to prevent the predicting probabilities from becoming values out of the required range [0, 1]. Suppose that the estimated model, based on a training data set and using the linear regression procedure, is given with a linear equation
and also suppose that the new sample for classification has input values {x1, x2, x3} = {1, 0, 1}. Using the linear logistic model, it is possible to estimate the probability of the output value 1, (p[Y = 1]) for this sample. First, calculate the corresponding logit(p):
and then the probability of the output value 1 for the given inputs:
Based on the final value for probability p, we may conclude that output value Y = 1 is more probable than the other categorical value Y = 0. Even this simple example shows that logistic regression is a very simple yet powerful classification tool in data-mining