Online Book Reader


Data Mining - Mehmed Kantardzic [95]

a form

By applying a transformation to the predictor variables (X1 = X, X2 = X^2, and X3 = X^3), it is possible to linearize the model and transform it into a multiple-regression problem, which can be solved by the method of least squares. It should be noted that the term linear in the general linear model applies to the dependent variable being a linear function of the unknown parameters. Thus, a general linear model might also include some higher order terms of independent variables, for example, terms such as X1^2, e^(βX), X1·X2, 1/X, or X2^3. The key, however, is to select the proper transformation of input variables or their combinations. Some useful transformations for linearization of the regression model are given in Table 5.3.
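The substitution of powers of X as separate predictors can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the function names (solve_linear_system, least_squares, fit_cubic) are hypothetical, the cubic form Y = b0 + b1·X + b2·X^2 + b3·X^3 is assumed, and the least-squares solution is obtained from the normal equations by Gaussian elimination, which is adequate for small, well-conditioned problems.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix a is square and nonsingular.
    """
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def fit_cubic(x, y):
    """Fit Y = b0 + b1*X + b2*X^2 + b3*X^3 as a multiple regression.

    Each sample X is expanded into the transformed predictors
    X1 = X, X2 = X^2, X3 = X^3 (plus a constant column for b0).
    """
    rows = [[1.0, xi, xi ** 2, xi ** 3] for xi in x]
    return least_squares(rows, y)


if __name__ == "__main__":
    xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
    ys = [1.0 + 2.0 * v - 0.5 * v ** 2 + 0.1 * v ** 3 for v in xs]
    print(fit_cubic(xs, ys))  # coefficients b0, b1, b2, b3
```

The model stays nonlinear in X but linear in the unknown coefficients, which is why an ordinary least-squares solver can be reused without change.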

TABLE 5.3. Some Useful Transformations to Linearize Regression

Function                       Proper Transformation       Form of Simple Linear Regression

Exponential: Y = α e^(βx)      Y* = ln Y                   Regress Y* against x

Power: Y = α x^β               Y* = log Y; x* = log x      Regress Y* against x*

Reciprocal: Y = α + β(1/x)     x* = 1/x                    Regress Y against x*

Hyperbolic: Y = x/(α + βx)     Y* = 1/Y; x* = 1/x          Regress Y* against x*
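The transformations in Table 5.3 can be illustrated with a short sketch. It assumes strictly positive data wherever a logarithm or a reciprocal is taken, and the function names (simple_regression, fit_exponential, fit_power, fit_reciprocal, fit_hyperbolic) are hypothetical names introduced only for this example. Each function transforms the data, runs a simple linear regression, and maps the intercept and slope back to α and β.

```python
import math


def simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_exponential(x, y):
    """Y = alpha * e^(beta*x): regress ln Y against x."""
    intercept, slope = simple_regression(x, [math.log(v) for v in y])
    return math.exp(intercept), slope  # alpha, beta


def fit_power(x, y):
    """Y = alpha * x^beta: regress log Y against log x."""
    intercept, slope = simple_regression(
        [math.log(v) for v in x], [math.log(v) for v in y]
    )
    return math.exp(intercept), slope  # alpha, beta


def fit_reciprocal(x, y):
    """Y = alpha + beta*(1/x): regress Y against 1/x."""
    intercept, slope = simple_regression([1.0 / v for v in x], y)
    return intercept, slope  # alpha, beta


def fit_hyperbolic(x, y):
    """Y = x/(alpha + beta*x): 1/Y = beta + alpha*(1/x)."""
    intercept, slope = simple_regression(
        [1.0 / v for v in x], [1.0 / v for v in y]
    )
    return slope, intercept  # alpha, beta


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * math.exp(0.5 * v) for v in xs]
    print(fit_exponential(xs, ys))  # close to alpha = 2, beta = 0.5
```

Note that least squares is applied on the transformed scale, so with noisy data the fitted parameters minimize the error in Y* rather than in the original Y.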

The major effort, on the part of a user, in applying multiple-regression techniques lies in identifying the relevant independent variables from the initial set and in selecting the regression model using only relevant variables. Two general approaches are common for this task:

1. Sequential Search Approach. It consists primarily of building a regression model with an initial set of variables and then selectively adding or deleting variables until some overall criterion is satisfied or optimized.

2. Combinatorial Approach. It is, in essence, a brute-force approach, where the search is performed across all possible combinations of independent variables to determine the best regression model.

Irrespective of whether the sequential or combinatorial approach is used, the maximum benefit to model building occurs from a proper understanding of the application domain.
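The combinatorial approach can be sketched as an exhaustive loop over all non-empty subsets of predictors. The text does not fix a selection criterion, so the sketch below assumes adjusted R^2 as the measure of the "best" model; that choice, the function names (solve_linear_system, fit_sse, best_subset), and the sample data are all illustrative assumptions. The number of candidate models grows as 2^k − 1 for k predictors, so this is practical only for small k.

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting (a assumed nonsingular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_sse(columns, y):
    """Fit y on the given predictor columns (plus intercept); return the SSE."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def best_subset(predictors, y):
    """Try every non-empty subset of predictors; keep the highest adjusted R^2.

    Requires more samples than parameters (n > k + 1 for the largest subset).
    """
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    best, best_score = None, float("-inf")
    for size in range(1, len(predictors) + 1):
        for subset in combinations(range(len(predictors)), size):
            sse = fit_sse([predictors[j] for j in subset], y)
            score = 1.0 - (sse / (n - size - 1)) / (sst / (n - 1))
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


if __name__ == "__main__":
    x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    x1 = [5.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0]
    x2 = [2.0, 7.0, 1.0, 8.0, 3.0, 6.0, 4.0, 5.0]
    noise = [0.05, -0.03, 0.02, -0.04, 0.01, 0.03, -0.02, -0.02]
    y = [3.0 + 2.0 * a - 4.0 * c + e for a, c, e in zip(x0, x2, noise)]
    print(best_subset([x0, x1, x2], y))
```

A sequential search would replace the exhaustive loop with a greedy one: start from some initial set, then add or remove one variable at a time for as long as the chosen criterion improves.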

Additional postprocessing steps may estimate the quality of the linear regression model. Correlation analysis attempts to measure the strength of a relationship between two variables (in our case this relationship is expressed through the linear regression equation). One parameter, which shows this strength of linear association between two variables by means of a single number, is called a correlation coefficient r. Its computation requires some intermediate results in a regression analysis.

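r = β · sqrt(Sxx / Syy) = Sxy / sqrt(Sxx · Syy)

where β is the slope of the regression line and

Sxx = Σ (xi − x̄)^2,  Syy = Σ (yi − ȳ)^2,  Sxy = Σ (xi − x̄)(yi − ȳ)

with x̄ and ȳ denoting the sample means and each sum taken over all n samples.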

The value of r is between −1 and 1. Negative values for r correspond to regression lines with negative slopes and a positive r shows a positive slope. We must be very careful in interpreting the r value. For example, values of r equal to 0.3 and 0.6 only mean that we have two positive correlations, the second somewhat stronger than the first. It is wrong to conclude that r = 0.6 indicates a linear relationship twice as strong as that indicated by the value r = 0.3.
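The computation of r from the intermediate sums of a regression analysis can be sketched as follows. The function name correlation_coefficient and the sample values are hypothetical and are not the data of the running example; the formula used is the standard one, r = Sxy / sqrt(Sxx · Syy), where Sxx, Syy, and Sxy are sums of squared and cross deviations from the sample means.

```python
import math


def correlation_coefficient(x, y):
    """Return r = Sxy / sqrt(Sxx * Syy) for paired samples x and y.

    Assumes both variables actually vary (Sxx > 0 and Syy > 0).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Illustrative data only: Sxx = 10, Syy = 6, Sxy = 6.
    a = [1.0, 2.0, 3.0, 4.0, 5.0]
    b = [2.0, 4.0, 5.0, 4.0, 5.0]
    r = correlation_coefficient(a, b)
    print(r, r ** 2)  # r is about 0.77, r^2 is about 0.6
```

The sign of r follows the sign of Sxy, which is also the sign of the regression slope Sxy / Sxx, and r^2 gives the fraction of the variation in the second variable accounted for by the linear relationship.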

For our simple example of linear regression given at the beginning of this section, the model obtained was B = 0.8 + 0.92A. We may estimate the quality of the model using the correlation coefficient r as a measure. Based on the available data in Figure 4.3, we obtained intermediate results

and the final correlation coefficient:

A correlation coefficient r = 0.85 indicates a good linear relationship between the two variables. Additional interpretation is possible. Because r^2 = 0.72, we can say that approximately 72% of the variation in the values of B is accounted for by a linear relationship with A.

5.5 ANOVA


Often the problem of analyzing the quality of the estimated regression line and the influence of the independent variables on the final regression is handled through an ANOVA approach. This is a procedure where the total variation in the dependent variable is subdivided into meaningful components that are then observed and treated in a systematic fashion. ANOVA is a powerful tool that is used in many data-mining applications.
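One common way to subdivide the total variation, shown here only as a sketch for simple linear regression, is SST = SSR + SSE: the total sum of squares splits into a part explained by the regression line and a residual part. The function name anova_decomposition and the sample data are hypothetical, and the F ratio computed at the end assumes a single predictor, so SSR has 1 degree of freedom and SSE has n − 2.

```python
def anova_decomposition(x, y):
    """Return (SST, SSR, SSE, F) for the least-squares line y = a + b*x.

    Assumes n > 2 samples and a fit that is not perfect (SSE > 0).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]

    sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    ssr = sum((fi - mean_y) ** 2 for fi in fitted)         # explained by regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # residual variation
    f_ratio = (ssr / 1.0) / (sse / (n - 2))
    return sst, ssr, sse, f_ratio


if __name__ == "__main__":
    # Illustrative data only.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.0, 4.0, 5.0, 4.0, 5.0]
    sst, ssr, sse, f_ratio = anova_decomposition(xs, ys)
    print(sst, ssr, sse, f_ratio)  # about 6.0, 3.6, 2.4, 4.5
```

The ratio SSR / SST equals r^2 for simple linear regression, which links this decomposition to the correlation coefficient.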

ANOVA is primarily a method of identifying which of the
