Data Mining_ Concepts and Techniques - Jiawei Han [141]
The results are returned in ranked order so that the best is at the top. The user-specified preference generally consists of two components: a selection condition and a ranking function. Top-k queries are common in many applications like searching web databases, k-nearest-neighbor searches with approximate matches, and similarity queries in multimedia databases.
A top-k query
Consider an online used-car database, R, that maintains the following information for each car: producer (e.g., Ford, Honda), model (e.g., Taurus, Accord), type (e.g., sedan, convertible), color (e.g., red, silver), transmission (e.g., auto, manual), price, mileage, and so on. A typical top-k query over this database is
Within the dimensions (or attributes) for R, producer and type are used here as selection dimensions. The ranking function is given in the order-by clause. It specifies the ranking dimensions, price and mileage. Q1 searches for the top-5 sedans made by Ford. The entries found are ranked or sorted in ascending (asc) order, according to the ranking function. The ranking function is formulated so that entries that have price and mileage closest to the user's specified values of $10K and 30K, respectively, appear toward the top of the list.
The database may have many dimensions that could be used for selection, describing, for example, whether a car has power windows, air conditioning, or a sunroof. Users may pick any subset of dimensions and issue a top-k query using their preferred ranking function. There are many other similar application scenarios. For example, when searching for hotels, ranking functions are often constructed based on price and distance to an area of interest. Selection conditions can be imposed on, say, the hotel location district, the star rating, and whether the hotel offers complimentary treats or Internet access. The ranking functions may be linear, quadratic, or any other form.
As shown in the preceding examples, individual users may not only propose ad hoc ranking functions, but also have different data subsets of interest. Users often want to thoroughly study the data via multidimensional analysis of the top-k query results. For example, if unsatisfied by the top-5 results returned by Q1, the user may roll up on the producer dimension to check the top-5 results on all sedans. The dynamic nature of the problem imposes a great challenge to researchers. OLAP requires offline precomputation so that multidimensional analysis can be performed on-the-fly, yet the ad hoc ranking functions prohibit full materialization. A natural compromise is to adopt a semi-offline materialization and semi-online computation model.
Suppose a relation R has selection dimensions (A1, A2, …, AS) and ranking dimensions (N1, N2, …, NR). Values in each ranking dimension can be partitioned into multiple intervals according to the data and expected query distributions. Regarding the price of used cars, for example, we may have, say, these four partitions (or value ranges): ≤ 5 K, [5 − 10 K), [10 − 15 K), and ≥ 15 K. A ranking cube can be constructed by performing multidimensional aggregations on selection dimensions. We can store the count for each partition of each ranking dimension, thereby making the cube “rank-aware.” The top-k queries can be answered by first accessing the cells in the more preferred value ranges before consulting the cells in the less preferred value ranges.
Using a ranking cube to answer a top-k query
Suppose Table 5.11 shows CMT, a materialized (i.e., precomputed) cuboid of a ranking cube for used-car sales. The cuboid, CMT, is for the selection dimensions producer and type. It shows the count and corresponding tuple IDs (TIDs) for various partitions