Data Mining - Mehmed Kantardzic [192]
By specifying a “minimum support” value, subgraphs Gs, whose support values are above threshold, are mined as candidates or components of candidates for maximum frequent subgraphs. The main difference in an a priori implementation is that we need to define the join process of two subgraphs a little differently. Two graphs of size k can be joined if they have a structure of size (k − 1) in common. The size of this structure could be defined in terms of either nodes or edges. The algorithm starts by finding all frequent single- and double-link subgraphs. Then, in each iteration, it generates candidate subgraphs by expanding the subgraphs found in the previous iteration by one edge. The algorithm checks how many times the candidate subgraph with the extension occurs within an entire graph or set of graphs. The candidates whose frequency is below a user-defined level are pruned. The algorithm returns all subgraphs occurring more frequently than the given threshold. A naïve approach of a subgraph extension from kk − 1 size to size k is computationally very expensive as illustrated in Figure 12.12. Therefore, the candidate generation of a frequently induced subgraph is done with some constraints. Two frequent graphs are joined only when the following conditions are satisfied to generate a candidate of frequent graph of size k + 1. Let Xk and Yk be adjacency matrices of two frequent graphs G(Xk) and G(Yk) of size k. If both G(Xk) and G(Yk) have equal elements of the matrices except for the elements of the kth row and the kth column, then they may be joined to generate Zk + 1 as an adjacency matrix for a candidate graph of size k + 1.
Figure 12.12. Free extensions in graphs.
In this matrix representations, Xk−1 is the common adjacency matrix representing the graph whose size is kk − 1, while xi and yi (i = 1, 2) are (kk − 1) × 1 column vectors. These column vectors represent the differences between two graphs prepared for the join operation.
The process suffers from computational intractability when the graph size becomes too large. One problem is subgraph isomorphism, with NP complexity, as a core step in graph matching. Also, all frequent patterns may not be equally relevant in the case of graphs. In particular, patterns that are highly connected (which means dense subgraphs) are much more relevant. This additional analysis requires more computations. One possible application of discovering frequent subgraphs is a summarized representation of larger, complex graphs. After extracting common subgraphs, it is possible to simplify large graphs by condensing these subgraphs into new nodes. An illustrative example is given in Figure 12.13, where a subgraph of four nodes is replaced in the set of graphs with a single node. The resulting graphs represent summarized representation of the initial graph set.
Figure 12.13. Graph summarization through graph compression.
In recent years, significant attention has focused on studying the structural properties of networks such as the WWW, online social networks, communication networks, citation networks, and biological networks. Across these large networks, an important characteristic is that they can be characterized by the nature of the underlying graphs and subgraphs, and clustering is an often used technique for miming these large networks. The problem of graph clustering arises in two different contexts: a single large graph or large set of smaller graphs. In the first case, we wish to determine dense node clusters in a single large graph minimizing the intercluster similarity for a fixed number of clusters. This problem arises in the context of a number of applications such as graph partitioning and the minimum-cut