Data Mining: Concepts and Techniques - Jiawei Han
“Which context units should we select as context indicators?” Although a context unit can be an item, a transaction, or a pattern, typically, frequent patterns provide the most semantic information of the three. There are usually a large number of frequent patterns associated with a pattern, p. Therefore, we need a systematic way to select only the important and nonredundant frequent patterns from a large pattern set.
Because the set of closed patterns is a lossless compression of the set of frequent patterns, we can first derive the closed patterns by applying efficient closed pattern mining methods. However, as discussed in Section 7.5, a closed pattern set is still not compact enough, so pattern compression needs to be performed. We could use the pattern compression methods introduced in Section 7.5.1, or explore alternative compression methods, such as microclustering with the Jaccard coefficient (Chapter 2) and then selecting the most representative pattern from each cluster.
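As a rough illustration of the clustering idea, here is a minimal Python sketch (not the book's algorithm) that greedily groups patterns whose supporting transaction sets have high Jaccard similarity and keeps one representative per cluster; the pattern names and the similarity threshold are hypothetical:

```python
def jaccard(a, b):
    """Jaccard similarity between two patterns' supporting transaction sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def representatives(patterns, threshold=0.8):
    """One-pass greedy microclustering (a simplification).

    patterns: dict mapping pattern -> set of supporting transaction IDs.
    Patterns are scanned in descending support order; each pattern joins
    the first existing representative whose supporting set is similar
    enough, otherwise it becomes a new representative.
    """
    reps = {}  # representative pattern -> its supporting set
    for pat, tids in sorted(patterns.items(), key=lambda kv: -len(kv[1])):
        if not any(jaccard(tids, rep_tids) >= threshold
                   for rep_tids in reps.values()):
            reps[pat] = tids
    return list(reps)

# Hypothetical patterns: ("a", "b") is nearly redundant with ("a",).
patterns = {
    ("a",): {1, 2, 3, 4, 5},
    ("a", "b"): {1, 2, 3, 4},  # Jaccard with ("a",) is 4/5 = 0.8 -> merged
    ("c",): {6, 7, 8},
}
print(representatives(patterns))  # keeps ("a",) and ("c",)
```

A production method would also weigh pattern quality when picking each cluster's representative, rather than relying purely on support order as this sketch does.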
“How, then, can we assign weights for each context indicator?” A good weighting function should obey the following properties: (1) the best semantic indicator of a pattern, p, is p itself; (2) two patterns that are equally strong should receive the same score; and (3) if two patterns are independent, neither can indicate the meaning of the other. The meaning of a pattern, p, can be inferred from either the appearance or the absence of its indicators.
Mutual information is one of several possible weighting functions. It is widely used in information theory to measure the mutual dependency of two random variables. Intuitively, it measures how much information one random variable tells us about the other. Given two frequent patterns, pα and pβ, let X = {0, 1} and Y = {0, 1} be two random variables representing the appearance of pα and pβ, respectively. Mutual information I(X; Y) is computed as
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}, \tag{7.18}$$
where, with $D_\alpha$ and $D_\beta$ denoting the sets of transactions containing $p_\alpha$ and $p_\beta$, respectively, and $T$ the complete transaction set, $P(x{=}1, y{=}1) = \frac{|D_\alpha \cap D_\beta|}{|T|}$, $P(x{=}1, y{=}0) = \frac{|D_\alpha| - |D_\alpha \cap D_\beta|}{|T|}$, $P(x{=}0, y{=}1) = \frac{|D_\beta| - |D_\alpha \cap D_\beta|}{|T|}$, and $P(x{=}0, y{=}0) = \frac{|T| - |D_\alpha \cup D_\beta|}{|T|}$. Standard Laplace smoothing can be used to avoid zero probability.
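The computation of Eq. (7.18) can be sketched in Python, assuming each pattern is represented by the set of IDs of the transactions that contain it; the function name and the add-one form of Laplace smoothing are illustrative choices, not the book's implementation:

```python
import math

def mutual_information(d_alpha, d_beta, n_transactions, smoothing=1.0):
    """Mutual information I(X; Y) between the appearance of two patterns,
    computed from their supporting transaction sets (Eq. 7.18), with
    Laplace smoothing to avoid zero probabilities.
    """
    both = len(d_alpha & d_beta)
    # Raw counts for the four (x, y) outcomes.
    counts = {
        (1, 1): both,
        (1, 0): len(d_alpha) - both,
        (0, 1): len(d_beta) - both,
        (0, 0): n_transactions - len(d_alpha | d_beta),
    }
    # Laplace smoothing: add `smoothing` to each cell.
    total = n_transactions + 4 * smoothing
    p = {xy: (c + smoothing) / total for xy, c in counts.items()}
    # Marginal probabilities P(x) and P(y).
    px = {1: p[(1, 1)] + p[(1, 0)], 0: p[(0, 1)] + p[(0, 0)]}
    py = {1: p[(1, 1)] + p[(0, 1)], 0: p[(1, 0)] + p[(0, 0)]}
    return sum(p[(x, y)] * math.log(p[(x, y)] / (px[x] * py[y]))
               for x in (0, 1) for y in (0, 1))

# Hypothetical supporting sets over 10 transactions:
mi_correlated = mutual_information({0, 1, 2, 3, 4}, {0, 1, 2, 3}, 10)
mi_independent = mutual_information({0, 1, 2, 3, 4}, {0, 1, 5, 6}, 10)
print(mi_correlated > mi_independent)  # prints True
```

Note that the correlated pair scores well above the (statistically independent) pair, which is exactly the behavior wanted from an indicative-strength weight.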
Mutual information favors strongly correlated units and thus can be used to model the indicative strength of the context units selected. With context modeling, pattern annotation can be accomplished as follows:
1. To extract the most significant context indicators, we can use cosine similarity (Chapter 2) to measure the semantic similarity between pairs of context vectors, rank the context indicators by their weight strength, and extract the strongest ones.
2. To extract representative transactions, represent each transaction as a context vector and rank the transactions by their semantic similarity to the context model of the pattern, p.
3. To extract semantically similar patterns, rank each frequent pattern, pα, by the semantic similarity between its context model and the context model of p.
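All three steps above reduce to ranking context vectors by cosine similarity. A minimal sketch, with hypothetical sparse context vectors stored as dicts mapping indicator to weight:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse context vectors
    (dicts mapping context indicator -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_similarity(target, candidates):
    """Rank candidates (dict: name -> context vector) by descending
    cosine similarity to the target context vector."""
    return sorted(candidates,
                  key=lambda name: cosine(target, candidates[name]),
                  reverse=True)

# Hypothetical context model of a pattern p, and three transactions
# represented as context vectors over the same indicators.
p_context = {"fractal": 0.9, "graph": 0.7, "index": 0.2}
transactions = {
    "t1": {"fractal": 1.0, "graph": 0.5},
    "t2": {"index": 1.0},
    "t3": {"graph": 0.3, "index": 0.8},
}
print(rank_by_similarity(p_context, transactions))  # ['t1', 't3', 't2']
```

The same `rank_by_similarity` call serves step 3 as well, by passing the context models of other frequent patterns as the candidates.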
Based on these principles, experiments have been conducted on large data sets to generate semantic annotations. Example 7.16 illustrates one such experiment.
Example 7.16 Semantic annotations generated for frequent patterns from the DBLP Computer Science Bibliography.
Table 7.4 shows annotations generated for frequent patterns from a portion of the DBLP data set.3 The DBLP data set contains papers from the proceedings of 12 major conferences in the fields of database systems, information retrieval, and data mining. Each transaction consists of two parts: the authors and the title of the corresponding paper.
3www.informatik.uni-trier.de/~ley/db/.
Table 7.4 Annotations Generated for Frequent Patterns in the DBLP Data Set
Pattern             Type                           Annotations
christos_faloutsos  Context indicator              spiros_papadimitriou; fast; use fractal; graph; use correlate
                    Representative transactions    multi-attribute hash use gray code
                    Representative transactions    recovery latent time-series observe sum network tomography particle filter
                    Representative transactions    index multimedia database tutorial
                    Semantically similar patterns  spiros_papadimitriou&christos_faloutsos; spiros_papadimitriou; flip_korn; timos_k_selli; ramakrishnan_srikant; ramakrishnan_srikant&rakesh_agrawal