Online Book Reader


Data Mining_ Concepts and Techniques - Jiawei Han [199]


software bugs. Copy-and-paste bugs in large software programs can be identified by extended sequential pattern analysis of source programs. Plagiarized software programs can be identified based on their essentially identical program flow/loop structures. Authors' commonly used sentence substructures can be identified and used to distinguish articles written by different authors.

Frequent and discriminative patterns can be used as primitive indexing structures (known as graph indices) to help search large, complex, structured data sets and networks. These support a similarity search in graph-structured data such as chemical compound databases or XML-structured databases. Such patterns can also be used for data compression and summarization.

Furthermore, frequent patterns have been used in recommender systems, where people can find correlations, clusters of customer behaviors, and classification models based on commonly occurring or discriminative patterns (Chapter 13).

Finally, studies of efficient computation methods in pattern mining and studies of scalable computation in other areas enhance one another. For example, computing and materializing iceberg cubes with the BUC and Star-Cubing algorithms (Chapter 5) has much in common with computing frequent patterns with the Apriori and FP-growth algorithms (Chapter 6), respectively.
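
The common thread is pruning by a count threshold: once an itemset (or a cube cell) falls below the minimum count, nothing built on top of it needs to be examined. The following is a minimal level-wise sketch of that idea in the style of Apriori, not the textbook's pseudocode. The function name `apriori` and the parameter `min_count` are illustrative assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)

        # Count the surviving candidates against the data.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent
```

The loop stops as soon as a level produces no frequent itemsets, which is the same early termination that lets an iceberg cube computation skip the descendants of a cell that fails the threshold.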

7.7. Summary

■ The scope of frequent pattern mining research reaches far beyond the basic concepts and methods introduced in Chapter 6 for mining frequent itemsets and associations. This chapter presented a road map of the field, where topics are organized with respect to the kinds of patterns and rules that can be mined, mining methods, and applications.

■ In addition to mining for basic frequent itemsets and associations, advanced forms of patterns can be mined such as multilevel associations and multidimensional associations, quantitative association rules, rare patterns, and negative patterns. We can also mine high-dimensional patterns and compressed or approximate patterns.

■ Multilevel associations involve data at more than one abstraction level (e.g., “buys computer” and “buys laptop”). These may be mined using multiple minimum support thresholds. Multidimensional associations contain more than one dimension. Techniques for mining such associations differ in how they handle repetitive predicates. Quantitative association rules involve quantitative attributes. Discretization, clustering, and statistical analysis that discloses exceptional behavior can be integrated with the pattern mining process.
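
As a rough illustration of mining with multiple minimum support thresholds, the sketch below counts support at two abstraction levels and applies a separate threshold to each. The hierarchy `PARENT`, the function name `frequent_by_level`, and the item names are hypothetical and chosen only for this example.

```python
# Hypothetical concept hierarchy: low-level item -> higher-level category.
PARENT = {"laptop": "computer", "desktop": "computer", "inkjet": "printer"}

def frequent_by_level(transactions, min_sup):
    """Return {level: {item: support}} for items meeting that level's threshold.

    min_sup maps a level number (1 = category, 2 = specific item) to a
    minimum relative support for that level.
    """
    n = len(transactions)
    counts = {1: {}, 2: {}}
    for t in transactions:
        items = set(t)
        # A transaction supports a category once, however many of its
        # members it contains.
        categories = {PARENT[i] for i in items if i in PARENT}
        for c in categories:
            counts[1][c] = counts[1].get(c, 0) + 1
        for i in items:
            counts[2][i] = counts[2].get(i, 0) + 1
    return {
        level: {x: c / n for x, c in level_counts.items()
                if c / n >= min_sup[level]}
        for level, level_counts in counts.items()
    }
```

With a single uniform threshold, specific items such as "laptop" tend to be lost because their support is necessarily no higher than that of their category. Giving the lower level a smaller threshold keeps them.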

■ Rare patterns occur rarely but are of special interest. Negative patterns are patterns with components that exhibit negatively correlated behavior. Care should be taken in the definition of negative patterns, with consideration of the null-invariance property. Rare and negative patterns may highlight exceptional behavior in the data, which is likely of interest.
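
To see why null-invariance matters here, the sketch below compares lift with the Kulczynski measure, one null-invariant measure, on the same two itemsets while only the number of transactions containing neither itemset changes. The function names and the counts are illustrative assumptions.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Average of P(A|B) and P(B|A), computed from absolute support counts."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def lift(sup_a, sup_b, sup_ab, n):
    """P(A and B) / (P(A) * P(B)), where n is the total number of transactions."""
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))

# A and B each occur in 100 transactions but together in only 1.
# Only the total n (and hence the number of null transactions) differs.
small_db = lift(100, 100, 1, 200)        # few null transactions
large_db = lift(100, 100, 1, 1_000_000)  # many null transactions
kulc = kulczynski(100, 100, 1)           # does not depend on n at all
```

Lift falls below 1 in the small database and rises above 1 in the large one, so the same pair looks negatively correlated in one case and positively correlated in the other. The Kulczynski value stays low in both, so a definition of negative patterns built on it is not swayed by null transactions.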

■ Constraint-based mining strategies can be used to help direct the mining process toward patterns that match users' intuition or satisfy certain constraints. Many user-specified constraints can be pushed deep into the mining process. Constraints can be categorized into pattern-pruning and data-pruning constraints. Properties of such constraints include monotonicity, antimonotonicity, data-antimonotonicity, and succinctness. Constraints with such properties can be properly incorporated into efficient pattern mining processes.
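
A small sketch of pushing an antimonotonic constraint into level-wise enumeration follows. It assumes non-negative prices and the constraint "total price <= budget": once an itemset violates it, every superset does too, so violators are never extended. The function name `itemsets_within_budget` and the price table are hypothetical, and a real miner would combine this check with support counting.

```python
def itemsets_within_budget(prices, budget):
    """Enumerate, level by level, all itemsets whose total price is <= budget.

    Assumes all prices are non-negative, which is what makes the
    constraint antimonotonic.
    """
    items = sorted(prices)
    level = [(i,) for i in items if prices[i] <= budget]
    result = list(level)
    while level:
        next_level = []
        for itemset in level:
            total = sum(prices[i] for i in itemset)
            for j in items:
                # Extend only with later items so each itemset is built once,
                # and only if the extension still satisfies the constraint.
                if j > itemset[-1] and total + prices[j] <= budget:
                    next_level.append(itemset + (j,))
        result.extend(next_level)
        level = next_level
    return result
```

For prices `{"a": 10, "b": 20, "c": 40}` and a budget of 30, item "c" is discarded at the first level, so no itemset containing it is ever generated.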

■ Methods have been developed for mining patterns in high-dimensional space. These include a pattern growth approach based on row enumeration for mining data sets where the number of dimensions is large and the number of data tuples is small (e.g., for microarray data), as well as mining colossal patterns (i.e., patterns of very long length) by a Pattern-Fusion method.

■ To reduce the number of patterns returned in mining, we can instead mine compressed patterns or approximate patterns. Compressed patterns can be mined with representative patterns defined based on the concept of clustering, and approximate
