Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [208]

By Root 742 0
∧a to the site B. (This represents k passed messages.)

5. Site B computes approximate inner product (a • b) = (∧aT • ∧b)/k

So, instead of sending an m-dimensional vector to the other site, the algorithm sends only a (k + 1)-dimensional vector where k m (k is a user-defined parameter). The inner product of vectors can still be estimated accurately with lower communication load.

In the DDM literature, one of two assumptions is commonly adopted as to how data are distributed across sites: (1) homogeneously or horizontally partitioned, or (2) heterogeneously or vertically partitioned. Both viewpoints assume that the data tables at each distributed site are partitions of a single global table. It is important to stress that the global table viewpoint is strictly conceptual. It is not necessarily assumed that such a table was physically realized and partitioned to form the tables at each site. In the homogeneous case, the global table is horizontally partitioned. The tables at each site are subsets of the global table; they have exactly the same attributes. Figure 12.30a illustrates the homogeneously distributed case using an example from weather data where both tables use the same three attributes. In the heterogeneous case, the table is vertically partitioned where each site contains a subset of columns. That means sites do not have the same attributes. However, samples at each site are assumed to contain a unique identifier to facilitate matching, and Figure 12.30b illustrates this case. The tables at distributed sites have different attributes, and samples are linked through a unique identifier, Patient ID.

Figure 12.30. Horizontally versus vertically partitioned data. (a) Horizontally partitioned data; (b) vertically partitioned data.

DDM technology supports different data-mining tasks including classification, prediction, clustering, market-basket analysis, and outliers’ detection. A solution for each of these tasks may be implemented with a variety of DDM algorithms. For example, distributed Apriori has several versions for frequent itemset generation in distributed transactional database. They usually require multiple synchronizations and communication steps. Most of these implementations assume that platforms are homogeneous and therefore the data sets are partitioned evenly among the sites. However, in practice, both the data sets and the processing platforms are more likely to be heterogeneous, running multiple and different systems and tools. This leads to unbalanced data-set distributions and workloads causing additional problems in implementation.

One recent trend is online-mining technology used for monitoring in distributed sensor networks. This is because deployments of large-scale distributed sensor networks are now possible owing to hardware advances and increasing software support. Online data mining, also called data-stream mining, is concerned with extracting patterns, detecting outliers, or developing dynamic models of a system’s behavior from continuous data streams such as those generated by sensor networks. Because of the massive amount of data and the speed of which the data are generated, many data-mining applications in sensor networks require in-network processing such as aggregation to reduce sample size and communication overhead. Online data mining in sensor networks offers many additional challenges, including:

limited communication bandwidth,

constraints on local computing resources,

limited power supply,

need for fault tolerance, and

asynchronous nature of the network.

Obviously, data-mining systems have evolved in a short period of time from stand-alone programs characterized by single algorithms with little support for the entire knowledge-discovery process to integrated systems incorporating several mining algorithms, multiple users, communications, and various and heterogeneous data formats and distributed data sources. Although many DDM algorithms are developed and deployed in a variety of applications, the trend will be illustrated in this book with only one example of

Return Main Page Previous Page Next Page

®Online Book Reader