Data Mining - Mehmed Kantardzic [204]
1. The SAR model is a classification technique that decomposes a classifier into two parts, spatial autoregression and logistic transformation. Spatial dependencies are modeled using the framework of logistic regression analysis. If the spatially dependent values yi are related to each other, then the traditional regression equation can be modified as
where W is the neighborhood relationship contiguity matrix and ρ is a parameter that reflects the strength of the spatial dependencies between the elements of the dependent variable. After the correction term ρWy is introduced, the components of the residual error vector ε are then assumed to be generated from independent and identical standard normal distributions. As in the case of classical regression, the proposed equation has to be transformed via the logistic function for binary dependent variables, and we refer to this equation as the SAR model. Notice that when ρ = 0, this equation collapses to the classical regression model. If the spatial autocorrelation coefficient is statistically significant, then SAR will quantify the presence of spatial autocorrelation in the classification model. It will indicate the extent to which variations in the dependent variable (y) are influenced by the average of neighboring observation values.
2. A spatial outlier is a spatially referenced object whose nonspatial attribute values differ significantly from those of other spatially referenced objects in its spatial neighborhood. This kind of outlier shows a local instability in values of nonspatial attributes. It represents spatially referenced objects whose nonspatial attributes are extreme relative to its neighbors, even though the attributes may not be significantly different from the entire population. For example, a new house in an old neighborhood of a growing metropolitan area is a spatial outlier based on the nonspatial attribute house age.
A variogram-cloud technique displays data points related by neighborhood relationships. For each pair of samples, the square-root of the absolute difference between attribute values at the locations versus the Euclidean distance between the locations is plotted. In data sets exhibiting strong spatial dependence, the variance in the attribute differences will increase with increasing distance between locations. Locations that are near to one another, but with large attribute differences, might indicate a spatial outlier, even though the values at both locations may appear to be reasonable when examining the dataset nonspatially. For example, the spatial data set is represented with six five-dimensional samples given in Figure 12.28a. Traditional nonspatial analysis will not discover any outliers especially because the number of samples is relatively small. However, after applying a variogram-cloud technique, assuming that the first two attributes are X-Y spatial coordinates, and the other three are characteristics of samples, the conclusion could be significantly changed. Figure 12.29 shows the variogram-cloud for this data set. This plot has some pairs of points that are out of main dense region of common distances.
Figure 12.28. An example of a variogram-cloud graph. (a) Spatial data set; (b) a critical sample’s relations in a variogram-cloud.
Figure 12.29. A variogram-cloud technique discovers an outlier.
Computation of spatial distances and distances of samples, as a part of a variogram technique, shows that