Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [187]

By Root 634 0

of data. (a) Chemical compound; (b) social network; (c) genome co-expression network.

Many data sets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and a single-link type, or richer, heterogeneous networks, in which there may be multiple object and link types, and possibly other semantic information. Examples of homogeneous networks include single-mode social networks, such as people connected by friendship links, or the World Wide Web (WWW), a collection of linked Web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments, and contacts, or in bibliographic domains describing publications, authors, and venues. Graph-mining techniques explicitly consider these links when building predictive or descriptive models of the linked data.

The requirement of different applications with graph-based data sets is not very uniform. Thus, graph models and mining algorithms that work well in one domain may not work well in another. For example, chemical data is often represented as graphs in which the nodes correspond to atoms, and the links correspond to bonds between the atoms. The individual graphs are quite small although there are significant repetitions among the different nodes. Biological data are modeled in a similar way as chemical data. However, the individual graphs are typically much larger. Protein interaction networks link proteins that must work together to perform some particular biological functions. A single biological network could easily contain thousands of nodes. In the case of computer networks and the Web, the number of nodes in the underlying graph may be massive. Computer networks consist of routers/computers representing nodes, and the links between them. Since the number of nodes is massive, this can lead to a very large number of distinct edges. Social networks may be modeled with large graphs that are defined by people who appear as nodes, and links that correspond to communications or relationships between these different people. The links in the social network can be used to determine relevant communities, members with particular expertise sets, and the flow of information in the social network. For example, the problem of community detection in social networks is related to the problem of node clustering of very large graphs. In this case, we wish to determine dense clusters of nodes based on the underlying linkage structure. It is clear that the design of a particular mining algorithm depends upon the application domain at hand.

Before introducing some illustrative examples of graph-mining techniques, some basic concepts from graph theory will be summarized. Graph theory provides a vocabulary that can be used to label and denote many structural properties in data. Also, graph theory gives us mathematical operations and ideas with which many of these properties can be quantified and measured.

A graph G = G(N, L) consists of two sets of information: a set of nodes N = {n1, n2, … , nk} and a set of links between pairs of nodes L = {l1, l2,…, lm}. A graph with nodes and without links is called an empty graph, while the graph with only one node is a trivial graph. Two nodes ni and nj are adjacent if there is a link between them. A graph G(N, L) can be presented as a diagram as in Figure 12.2a in which points depict nodes, and lines between two points are links. A graph G′(N′, L′) is a subgraph of G(N, L) if N′ ⊆ N and L′ ⊆ L.

Figure 12.2. Undirected, directed, and weighted graphs.

An induced subgraph of a graph G has a subset of the nodes of G, and the same links between pairs of nodes as in G. For example, the subgraph (b) in Figure 12.3 is an induced subgraph of the graph (a), but the subgraph (c) is a general subgraph but not an induced subgraph of G, since the incoming link L1 of the node labeled N2 in (a) is not retained in (c) while the node labeled N2 is included.

Figure 12.3. An induced subgraph and a general

Online Book Reader

Data Mining - Mehmed Kantardzic [187]

®Online Book Reader