In linear regression, the response (dependent) variable Y is modeled as a linear function of another random variable X (called a predictor variable). Given n samples or data points of the form (x1, y1), (x2, y2), … , (xn, yn), where xi ∈ X and yi ∈ Y, linear regression can be expressed as

Y = α + β · X
where α and β are regression coefficients. With the assumption that the variance of Y is a constant, these coefficients can be solved by the method of least squares, which minimizes the error between the actual data points and the estimated line. The residual sum of squares is often called the sum of squares of the errors about the regression line, and it is denoted by SSE (sum of squares error):

SSE = Σ(i=1 to n) (yi − yi′)² = Σ(i=1 to n) (yi − α − β · xi)²
where yi is the real output value given in the data set, and yi′ is the response value obtained from the model. Differentiating SSE with respect to α and β, we have

∂(SSE)/∂α = −2 · Σ(i=1 to n) (yi − α − β · xi)
∂(SSE)/∂β = −2 · Σ(i=1 to n) (yi − α − β · xi) · xi
Setting these partial derivatives equal to zero (minimization of the total error) and rearranging the terms, we obtain the equations

Σ yi = n · α + β · Σ xi
Σ (xi · yi) = α · Σ xi + β · Σ xi²
which may be solved simultaneously to yield the computing formulas for α and β. Using standard relations for the mean values, the regression coefficients for this simple case of optimization are

β = Σ (xi − meanx) · (yi − meany) / Σ (xi − meanx)²
α = meany − β · meanx
where meanx and meany are the mean values for random variables X and Y given in a training data set. It is important to remember that our values of α and β, based on a given data set, are only estimates of the true parameters for the entire population. The equation y = α + βx may be used to predict the mean response y0 for the given input x0, which is not necessarily from the initial set of samples.
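
To make these computing formulas concrete, the following is a minimal Python sketch of simple linear regression by least squares; the function name fit_simple_linear and the variable names are illustrative, not taken from the text:

    def fit_simple_linear(x, y):
        """Estimate the regression coefficients alpha and beta by least
        squares, using the closed-form formulas derived above."""
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        # beta = sum((xi - meanx)(yi - meany)) / sum((xi - meanx)^2)
        beta = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
                / sum((xi - mean_x) ** 2 for xi in x))
        # alpha = meany - beta * meanx
        alpha = mean_y - beta * mean_x
        return alpha, beta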

For example, if the sample data set is given in the form of a table (Table 5.2), and we are analyzing the linear regression between two variables (predictor variable A and response variable B), then the linear regression can be expressed as

B = α + β · A
where the α and β coefficients can be calculated based on the previous formulas (using meanA = 5.4 and meanB = 6), and they have the values

β = Σ (Ai − meanA) · (Bi − meanB) / Σ (Ai − meanA)² = 60/65.2 ≈ 0.92
α = meanB − β · meanA = 6 − 0.92 · 5.4 ≈ 1.03
TABLE 5.2. A Database for the Application of Regression Methods

A     B
1     3
8     9
11    11
4     5
3     2

The optimal regression line is

B = 1.03 + 0.92 · A
The initial data set and the regression line are graphically represented in Figure 5.4 as a set of points and a corresponding line.

Figure 5.4. Linear regression for the data set given in Table 5.2.
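
As a quick check of the computed coefficients, the hypothetical fit_simple_linear function sketched above can be applied to the data of Table 5.2:

    a = [1, 8, 11, 4, 3]   # predictor variable A
    b = [3, 9, 11, 5, 2]   # response variable B
    alpha, beta = fit_simple_linear(a, b)
    print(alpha, beta)     # approximately 1.03 and 0.92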

Multiple regression is an extension of linear regression involving more than one predictor variable. The response variable Y is modeled as a linear function of several predictor variables. For example, if the predictor attributes are X1, X2, and X3, then the multiple linear regression is expressed as

Y = α + β1 · X1 + β2 · X2 + β3 · X3
where α, β1, β2, and β3 are coefficients found by using the method of least squares. For a linear regression model with more than two input variables, it is useful to analyze the process of determining the β parameters through a matrix calculation:

Y′ = X · β
where β = {β0, β1, … , βn}, β0 = α, and X and Y are input and output matrices for a given training data set. The residual sum of squares of errors SSE also has the matrix representation

SSE = (Y − X · β)ᵀ · (Y − X · β)
and after optimization

∂(SSE)/∂β = 0
the final β vector satisfies the matrix equation

β = (Xᵀ · X)⁻¹ · Xᵀ · Y
where β is the vector of estimated coefficients in a linear regression. Matrices X and Y have the same dimensions as the training data set. Therefore, an optimal solution for the β vector is relatively easy to find in problems with several hundred training samples. For real-world data-mining problems, the number of samples may increase to several million. In these situations, because of the extreme dimensions of the matrices and the correspondingly increased complexity of the algorithm, it is necessary to find modifications and/or approximations in the algorithm, or to use entirely different regression methods.
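
A minimal NumPy sketch of this matrix solution is shown below; it forms (Xᵀ · X)⁻¹ directly to mirror the equation above, although in practice a numerically more stable routine such as numpy.linalg.lstsq would be preferred. The function name fit_multiple_linear is illustrative:

    import numpy as np

    def fit_multiple_linear(X, y):
        """Solve beta = (X^T X)^(-1) X^T y for a design matrix X (n samples
        by k predictors) and a response vector y of length n."""
        # Prepend a column of ones so that beta[0] plays the role of alpha.
        X1 = np.column_stack([np.ones(len(X)), X])
        return np.linalg.inv(X1.T @ X1) @ (X1.T @ y)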

There is a large class of regression problems, initially nonlinear, that can be converted into the form of the general linear model. For example, a polynomial relationship such as

Y = α + β1 · X1 + β2 · X2 + β3 · X1 · X3 + β4 · X2 · X3
can be converted to the linear form by setting new variables X4 = X1 · X3 and X5 = X2 · X3. Also, polynomial regression can be modeled by adding polynomial terms to the basic linear model. For example, a cubic polynomial curve has the form

Y = α + β1 · X + β2 · X² + β3 · X³
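
As an illustration of this conversion, the sketch below fits the cubic model by treating X, X², and X³ as three separate predictors and reusing the hypothetical fit_multiple_linear function from above; the data are made up for the example:

    import numpy as np

    # Illustrative data: y follows an exact cubic in x, so the fit
    # should recover alpha = 1.0, beta1 = 2.0, beta2 = -0.5, beta3 = 0.1.
    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
    y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * x**3

    X_poly = np.column_stack([x, x**2, x**3])  # polynomial terms as predictors
    beta = fit_multiple_linear(X_poly, y)      # [alpha, beta1, beta2, beta3]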