Data Mining - Mehmed Kantardzic [200]
A pattern is a local structure in the database. In the sequential-pattern framework, we are given a collection of sequences, and the task is to discover sequences of items called sequential patterns that occur in sufficiently many of those sequences. In the frequent episodes analysis, the data set may be given in a single long sequence or in a large set of shorter sequences. An event sequence is denoted by {(E1, t1), (E2, t2), …, (En,, tn)}, where Ei takes values from a finite set of event types E, and ti is an integer denoting the time stamp of the ith event. The sequence is ordered with respect to the time stamps so that ti ≤ ti + 1 for all i = 1, 2, … , n. The following is an example event sequence S with 10 events in it:
An episode is a partially ordered set of events. When the order among the events of an episode is total, it is called a serial episode, and when there is no order at all, the episode is called a parallel episode. For example, (A → B → C) is a three-node serial episode. The arrows in our notation serve to emphasize the total order. In contrast, parallel episodes are somewhat similar to itemsets, and so, we can denote a three-node parallel episode with event types A, B, and C, as (ABC).
An episode is said to occur in an event sequence if there exist events in the sequence occurring with exactly the same order as that prescribed in the episode. For instance, in the example the events (A, 2), (B, 3) and (C, 8) constitute an occurrence of the serial episode (A → B → C) while the events (A, 7), (B, 3), and (C, 8) do not, because for this serial episode to occur, A must occur before B and C. Both these sets of events, however, are valid occurrences of the parallel episode (ABC), since there are no restrictions with regard to the order in which the events must occur for parallel episodes. Let α and β be two episodes. β is said to be a sub-episode of α if all the event types in β appear in α as well, and if the partial order among the event types of β is the same as that for the corresponding event types in α. For example, (A → C) is a two-node sub-episode of the serial episode (A → B → C) while (B → A) is not. In case of parallel episodes, this order constraint is not required.
The sequential pattern-mining framework may extend the frequent itemsets idea described in the chapter on association rules with temporal order. The database D of itemsets is considered no longer just some unordered collection of transactions. Now, each transaction in D carries a time stamp as well as a customer ID. Each transaction, as earlier, is simply a collection of items. The transactions associated with a single customer can be regarded as a sequence of itemsets ordered by time, and D would have one such transaction sequence corresponding to each customer. Consider an example database with five customers whose corresponding transaction sequences are as follows:
Customer ID Transaction Sequence
1 ({A,B}{A,C,D}{B,E})
2 ({D,G} {A,B,E,H})
3 ({A}{B,D}{A,B,E,F}{G,H})
4 ({A}{F})
5 ({A,D} {B,E,G,H} {F})
Each customer’s transaction sequence is enclosed in