Online Book Reader

Home Category

Choose a category
All
Classic-Fiction

Data Mining - Mehmed Kantardzic [200]

By Root 687 0

search and retrieval, and pattern discovery. The first four have been investigated extensively in traditional time-series analysis, pattern recognition, and information retrieval. We will concentrate in this text on illustrative examples of algorithms for pattern discovery in large databases, which are of more recent origin and showing wide applicability. The problem of pattern discovery is to find and evaluate all “interesting” patterns in the data. There are many ways of defining what constitutes a pattern in the data and we shall discuss some generic approaches. There is no universal notion for interestingness of a pattern either. However, one concept that is found very useful in data mining is that of frequent patterns. A frequent pattern is one that occurs many times in the data. Much of the data-mining literature is concerned with formulating useful pattern structures and developing efficient algorithms for discovering all patterns that occur frequently in the data.

A pattern is a local structure in the database. In the sequential-pattern framework, we are given a collection of sequences, and the task is to discover sequences of items called sequential patterns that occur in sufficiently many of those sequences. In the frequent episodes analysis, the data set may be given in a single long sequence or in a large set of shorter sequences. An event sequence is denoted by {(E1, t1), (E2, t2), …, (En,, tn)}, where Ei takes values from a finite set of event types E, and ti is an integer denoting the time stamp of the ith event. The sequence is ordered with respect to the time stamps so that ti ≤ ti + 1 for all i = 1, 2, … , n. The following is an example event sequence S with 10 events in it:

An episode is a partially ordered set of events. When the order among the events of an episode is total, it is called a serial episode, and when there is no order at all, the episode is called a parallel episode. For example, (A → B → C) is a three-node serial episode. The arrows in our notation serve to emphasize the total order. In contrast, parallel episodes are somewhat similar to itemsets, and so, we can denote a three-node parallel episode with event types A, B, and C, as (ABC).

An episode is said to occur in an event sequence if there exist events in the sequence occurring with exactly the same order as that prescribed in the episode. For instance, in the example the events (A, 2), (B, 3) and (C, 8) constitute an occurrence of the serial episode (A → B → C) while the events (A, 7), (B, 3), and (C, 8) do not, because for this serial episode to occur, A must occur before B and C. Both these sets of events, however, are valid occurrences of the parallel episode (ABC), since there are no restrictions with regard to the order in which the events must occur for parallel episodes. Let α and β be two episodes. β is said to be a sub-episode of α if all the event types in β appear in α as well, and if the partial order among the event types of β is the same as that for the corresponding event types in α. For example, (A → C) is a two-node sub-episode of the serial episode (A → B → C) while (B → A) is not. In case of parallel episodes, this order constraint is not required.

The sequential pattern-mining framework may extend the frequent itemsets idea described in the chapter on association rules with temporal order. The database D of itemsets is considered no longer just some unordered collection of transactions. Now, each transaction in D carries a time stamp as well as a customer ID. Each transaction, as earlier, is simply a collection of items. The transactions associated with a single customer can be regarded as a sequence of itemsets ordered by time, and D would have one such transaction sequence corresponding to each customer. Consider an example database with five customers whose corresponding transaction sequences are as follows:

Customer ID Transaction Sequence

1 ({A,B}{A,C,D}{B,E})

2 ({D,G} {A,B,E,H})

3 ({A}{B,D}{A,B,E,F}{G,H})

4 ({A}{F})

5 ({A,D} {B,E,G,H} {F})

Each customer’s transaction sequence is enclosed in

Online Book Reader

Data Mining - Mehmed Kantardzic [200]

®Online Book Reader