Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [163]

By Root 911 0
This second scan directly determines all frequent itemsets in the database as a subset of a previously defined union.

In some applications, the transaction database has to be mined frequently to capture customer behavior. In such applications, the efficiency of data mining could be a more important factor than the complete accuracy of the results. In addition, in some applications the problem domain may be vaguely defined. Missing some marginal cases that have confidence and support levels at the borderline may have little effect on the quality of the solution to the original problem. Allowing imprecise results can in fact significantly improve the efficiency of the applied mining algorithm.

As the database size increases, sampling appears to be an attractive approach to data mining. A sampling-based algorithm typically requires two scans of the database. The algorithm first takes a sample from the database and generates a set of candidate itemsets that are highly likely to be frequent in the complete database. In a subsequent scan over the database, the algorithm counts these itemsets’ exact support and the support of their negative border. If no itemset in the negative border is frequent, then the algorithm has discovered all frequent itemsets. Otherwise, some superset of an itemset in the negative border could be frequent, but its support has not yet been counted. The sampling algorithm generates and counts all such potentially frequent itemsets in subsequent database scans.

Because it is costly to find frequent itemsets in large databases, incremental updating techniques should be developed to maintain the discovered frequent itemsets (and corresponding association rules) so as to avoid mining the whole updated database again. Updates on the database may not only invalidate some existing frequent itemsets but also turn some new itemsets into frequent ones. Therefore, the problem of maintaining previously discovered frequent itemsets in large and dynamic databases is nontrivial. The idea is to reuse the information of the old frequent itemsets and to integrate the support information of the new frequent itemsets in order to substantially reduce the pool of candidates to be reexamined.

In many applications, interesting associations among data items often occur at a relatively high concept level. For example, one possible hierarchy of food components is presented in Figure 10.5, where M (milk) and B (bread), as concepts in the hierarchy, may have several elementary sub-concepts. The lowest level elements in the hierarchy (M1, M2, … , B1, B2, …) are types of milk and bread defined with their bar-codes in the store. The purchase patterns in a transaction database may not show any substantial regularities at the elementary data level, such as at the bar-code level (M1, M2, M3, B1, B2, … ), but may show some interesting regularities at some high concept level(s), such as milk M and bread B.

Figure 10.5. An example of concept hierarchy for mining multiple-level frequent itemsets.

Consider the class hierarchy in Figure 10.5. It could be difficult to find high support for purchase patterns at the primitive-concept level, such as chocolate milk and wheat bread. However, it would be easy to find in many databases that more than 80% of customers who purchase milk may also purchase bread. Therefore, it is important to mine frequent itemsets at a generalized abstraction level or at multiple-concept levels; these requirements are supported by the Apriori generalized-data structure.

One extension of the Apriori algorithm considers an is-a hierarchy on database items, where information about multiple abstraction levels already exists in the database organization. An is-a hierarchy defines which items are a specialization or generalization of other items. The extended problem is to compute frequent itemsets that include items from different hierarchy levels. The presence of a hierarchy modifies the notation of when an item is contained in a transaction. In addition to the items listed explicitly, the transaction contains their

Return Main Page Previous Page Next Page

®Online Book Reader