Data Mining_ Concepts and Techniques - Jiawei Han [355]
13.1.2. Mining Graphs and Networks
Graphs represents a more general class of structures than sets, sequences, lattices, and trees. There is a broad range of graph applications on the Web and in social networks, information networks, biological networks, bioinformatics, chemical informatics, computer vision, and multimedia and text retrieval. Hence, graph and network mining have become increasingly important and heavily researched. We overview the following major themes: (1) graph pattern mining; (2) statistical modeling of networks; (3) data cleaning, integration, and validation by network analysis; (4) clustering and classification of graphs and homogeneous networks; (5) clustering, ranking, and classification of heterogeneous networks; (6) role discovery and link prediction in information networks; (7) similarity search and OLAP in information networks; and (8) evolution of information networks.
Graph Pattern Mining
Graph pattern mining is the mining of frequent subgraphs (also called (sub)graph patterns) in one or a set of graphs. Methods for mining graph patterns can be categorized into Apriori-based and pattern growth–based approaches. Alternatively, we can mine the set of closed graphs where a graph g is closed if there exists no proper supergraph that carries the same support count as g. Moreover, there are many variant graph patterns, including approximate frequent graphs, coherent graphs, and dense graphs. User-specified constraints can be pushed deep into the graph pattern mining process to improve mining efficiency.
Graph pattern mining has many interesting applications. For example, it can be used to generate compact and effective graph index structures based on the concept of frequent and discriminative graph patterns. Approximate structure similarity search can be achieved by exploring graph index structures and multiple graph features. Moreover, classification of graphs can also be performed effectively using frequent and discriminative subgraphs as features.
Statistical Modeling of Networks
A network consists of a set of nodes, each corresponding to an object associated with a set of properties, and a set of edges (orlinks) connecting those nodes, representing relationships between objects. A network is homogeneous if all the nodes and links are of the same type, such as a friend network, a coauthor network, or a web page network. A network is heterogeneous if the nodes and links are of different types, such as publication networks (linking together authors, conferences, papers, and contents), and health-care networks (linking together doctors, nurses, patients, diseases, and treatments).
Researchers have proposed multiple statistical models for modeling homogeneous networks. The most well-known generative models are the random graph model (i.e., the Erdös-Rényi model), the Watts-Strogatz model, and the scale-free model. The scale-free model assumes that the network follows the power law distribution (also known as the Pareto distribution or the heavy-tailed distribution). In most large-scale social networks, a small-world phenomenon is observed, that is, the network can be characterized as having a high degree of local clustering for a small