Data Mining - Mehmed Kantardzic [207]
Sites exchange their results by communicating with other sites, often through message passing.
Communication between the sites is expensive and often represents a bottleneck.
Sites have resource constraints, for example, battery power in distributed sensor systems.
Sites have privacy and/or security concerns.
The system should have the ability to efficiently scale up because distributed systems today may consist of millions of nodes.
The system should have the ability to function correctly in the presence of local site failures, and also missing or incorrect data.
Obviously, the emphasis in DDM algorithms is on local computation and communication. Local algorithms for DDM can be broadly classified under two categories:
Exact Local Algorithms. These algorithms are guaranteed to terminate with precisely the same result that would be found by a centralized algorithm. Exact local algorithms are obviously more desirable but are more difficult to develop, and in some cases seemingly impossible.
Approximate Local Algorithms. These algorithms cannot guarantee the accuracy of a centralized solution. They strike a balance between the quality of the solution and the system’s response time.
The selection of a type of local algorithm depends on the data-mining problem and application domain, including the amount of data and their dynamics. In general, approximate approaches are used in cases where the balance between accuracy and efficiency is important, and communication between sites represents a bottleneck. We will illustrate this balance between local computation and communication with a simple approximate algorithm useful in many data-mining applications. For example, if we want to compare the data vectors observed at different sites, the centralized approach would collect these vectors at a central computer and then compare them using whatever metric is appropriate for the domain. DDM technology offers a more efficient solution for the problem using a simple randomized technique.
Vectors a = (a1, a2, … , am) and b = (b1, b2, … , bm) are given at two distributed sites A and B, respectively. We want to approximate the Euclidean distance between them using a small number of messages and reduced data transfer between sites A and B. The centralized solution requires that one vector be transferred to the other site, that is, m components of one vector are transferred. How does one obtain the same result with less than m data transfer? Note that the problem of computing the Euclidean distance between a pair of vectors a and b can be represented as the problem of computing the inner products as follows:

d²(a, b) = (a • a) + (b • b) − 2 (a • b)
where (a • b) represents the inner product between vectors a and b defined as Σ ai bi, and (a • a) is a special case of the inner product representing the square of the magnitude of the vector a. The reader can easily check the previous relation. If, for example, the vectors a and b are a = (1,2,3) and b = (2,1,2), then the Euclidean distance may be calculated as d2 = 14 + 9 − 2×10 = 3. While the products (a • a) and (b • b) can be computed locally, and each result is a single value, the core challenge is to develop an algorithm for distributed inner-product computation (a • b). A simple, communication-efficient randomized technique for computing this inner product between two vectors observed at two different sites may consist of the following steps:
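The decomposition of the squared Euclidean distance into inner products can be checked with a small Python sketch (not part of the original text) using the example vectors above:

```python
# Verify d^2(a, b) = (a . a) + (b . b) - 2(a . b) for the example vectors.
a = [1, 2, 3]
b = [2, 1, 2]

def inner(u, v):
    """Inner product: sum of element-wise products."""
    return sum(ui * vi for ui, vi in zip(u, v))

# (a . a) = 14 and (b . b) = 9 can each be computed locally at its own site;
# only (a . b) = 10 requires communication between the sites.
d_squared = inner(a, a) + inner(b, b) - 2 * inner(a, b)

# Cross-check against the direct Euclidean definition.
direct = sum((ui - vi) ** 2 for ui, vi in zip(a, b))
print(d_squared, direct)  # 3 3
```

This confirms the worked example in the text: d² = 14 + 9 − 2×10 = 3.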
1. Vectors a and b are given on two sites, A and B, respectively. Site A sends site B a random-number-generator seed. (This is the only message passed in this step.)
2. Both sites A and B cooperatively generate a random matrix R with dimensions k × m, where k ≪ m. Each entry in matrix R is generated independently and identically from some fixed distribution with mean 0 and a finite variance.
3. Based on matrix R, sites A and B compute their own local matrix products: â = R a and b̂ = R b.
The new local vectors â and b̂ have dimension k, significantly smaller than the original length m.
4. Site A sends the resulting vector