Data Mining - Mehmed Kantardzic
angular braces, while the items bought in a single transaction are enclosed in round braces. For example, customer 3 made four visits to the supermarket. In his/her first visit he/she bought only item A, in the second visit items B and D, and so on.

The temporal patterns of interest are sequences of itemsets. A sequence S of itemsets is denoted by {s1 s2 · · · sn}, where sj is an itemset. Since S has n itemsets, it is called an n-sequence. A sequence A = {a1 a2 · · · an} is said to be contained in another sequence B = {b1 b2 · · · bm} if there exist integers i1 < i2 < · · · < in such that a1 ⊆ bi1, a2 ⊆ bi2, . . ., an ⊆ bin. That is, an n-sequence A is contained in a sequence B if there exists an n-itemset subsequence of B in which each itemset contains the corresponding itemset of A. For example, the sequence {(A)(BC)} is contained in {(AB) (F) (BCE) (DE)} but not in {(BC) (AB) (C) (DEF)}. Further, a sequence is said to be maximal in a set of sequences if it is not contained in any other sequence. In the set of example customer-transaction sequences listed above, all are maximal (with respect to the given set of sequences) except the sequence of customer 4, which is contained in the transaction sequences of customers 3 and 5.
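The containment test can be sketched as a short helper (an illustrative implementation, not code from the book). Each itemset of A is matched greedily against the earliest subsequent itemset of B that contains it; greedy matching is sufficient here, because taking the earliest possible match never rules out a match for the remaining itemsets:

```python
def contains(a, b):
    """Return True if sequence `a` (a list of itemsets, here Python sets)
    is contained in sequence `b`: each itemset of `a` must be a subset of
    a distinct itemset of `b`, in the same relative order."""
    j = 0  # current search position in b
    for itemset in a:
        # advance through b until an itemset containing `itemset` is found
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False  # ran out of itemsets in b
        j += 1  # the next itemset of `a` must match strictly later in `b`
    return True

# The two examples from the text:
print(contains([{"A"}, {"B", "C"}],
               [{"A", "B"}, {"F"}, {"B", "C", "E"}, {"D", "E"}]))  # True
print(contains([{"A"}, {"B", "C"}],
               [{"B", "C"}, {"A", "B"}, {"C"}, {"D", "E", "F"}]))  # False
```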

The Apriori algorithm described earlier can be used to find frequent sequences, except for a small difference in the definition of support. Earlier, the support of an itemset was defined as the fraction of all transactions that contained the itemset. Now, the support of an arbitrary sequence A is the fraction of customer-transaction sequences in the database D that contain A. For our example database, the sequence {(D)(GH)} has a support of 0.4, since it is contained in two out of the five transaction sequences (namely, those of customers 3 and 5). The user specifies a minimum support threshold. Any sequence of itemsets with support greater than or equal to the threshold value is called a large sequence. If a sequence A is large and maximal, then it is regarded as a sequential pattern. The discovery process is an Apriori-style iterative algorithm that starts by finding frequent one-element sequences. These are then combined to form candidate two-element sequences, and by counting their frequencies, two-element frequent sequences are obtained. This process continues until frequent sequences of all lengths are found. The task of sequence mining is to systematically discover all sequential patterns in database D.
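The modified support definition can be sketched as follows. The book's customer table is not reproduced in this excerpt, so the small database below is an illustrative stand-in in which {(D)(GH)} likewise appears in two of five customer sequences:

```python
def contains(a, b):
    """True if sequence `a` (a list of itemsets as sets) is contained in `b`."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def support(seq, db):
    """Fraction of customer sequences in `db` that contain `seq`."""
    return sum(contains(seq, customer) for customer in db) / len(db)

# Illustrative database of five customer-transaction sequences
# (hypothetical data, chosen only to reproduce the 0.4 support figure)
db = [
    [{"D"}, {"G", "H"}],
    [{"A"}, {"D"}, {"B", "G", "H"}],
    [{"B"}],
    [{"C"}, {"D"}],
    [{"F"}],
]
print(support([{"D"}, {"G", "H"}], db))  # 0.4
```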

Counting frequencies of parallel itemsets is straightforward and is handled by the traditional algorithms for frequent-itemset detection. Counting serial sequences, on the other hand, requires more computational resources. For example, unlike for parallel itemsets, we need finite-state automata to recognize serial episodes. More specifically, an appropriate (l + 1)-state automaton can be used to recognize occurrences of an l-node serial sequence. For example, for the sequence (A → B → A → A), there is the five-state automaton (FSA) given in Figure 12.23. It transits from its first state on seeing an event of type A and then waits for an event of type B to transit to its next state, and so on. We need such an automaton for each episode whose frequency is being counted.

Figure 12.23. FSA for the sequence A → B → A → A. *, any other symbol.
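A minimal sketch of such a recognizer (illustrative Python, not the book's code): state i means that the first i events of the pattern have been seen, and any other symbol ('*') leaves the state unchanged. For simplicity this version restarts after each complete occurrence, which is a simplification of the windowed occurrence counting used in episode mining:

```python
def make_recognizer(pattern):
    """Build an (l + 1)-state recognizer for an l-node serial sequence.
    State i means the first i events of `pattern` have been matched."""
    def count_occurrences(stream):
        state = 0
        hits = 0
        for event in stream:
            if event == pattern[state]:
                state += 1
                if state == len(pattern):  # accepting state reached
                    hits += 1
                    state = 0  # restart to look for the next occurrence
            # any other symbol ('*') leaves the automaton in its state
        return hits
    return count_occurrences

# The automaton of Figure 12.23, fed an event stream containing
# one occurrence of A -> B -> A -> A:
recognize = make_recognizer(["A", "B", "A", "A"])
print(recognize(["A", "C", "B", "A", "D", "A"]))  # 1
```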

While we described the framework using an example of mining a database of customer transaction sequences for temporal buying patterns, this concept of sequential patterns is quite general and can be used in many other situations as well. Indeed, the problem of motif discovery in a database of protein sequences can also be easily addressed in this framework. Another example is Web-navigation mining. Here the database contains a sequence of Web sites that a user navigates through in each browsing session. Sequential pattern mining can be used to discover sequences of Web sites that are frequently visited. Temporal associations are particularly appropriate as candidates for causal rules’ analysis in temporally related
