Data Mining_ Concepts and Techniques - Jiawei Han [125]
■ The base cuboid, denoted by ABC (from which all the other cuboids are directly or indirectly computed). This cube is already computed and corresponds to the given 3-D array.
■ The 2-D cuboids, AB, AC, and BC, which respectively correspond to the group-by's AB, AC, and BC. These cuboids must be computed.
■ The 1-D cuboids, A, B, and C, which respectively correspond to the group-by's A, B, and C. These cuboids must be computed.
■ The 0-D (apex) cuboid, denoted by all, which corresponds to the group-by (); that is, there is no group-by here. This cuboid must be computed. It consists of only one value. If, say, the data cube measure is count, then the value to be computed is simply the total count of all the tuples in ABC.
Figure 5.3 A 3-D array for the dimensions A, B, and C, organized into 64 chunks. Each chunk is small enough to fit into the memory available for cube computation. The ∗'s indicate the chunks from 1 to 13 that have been aggregated so far in the process.
Let's look at how the multiway array aggregation technique is used in this computation. There are many possible orderings with which chunks can be read into memory for use in cube computation. Consider the ordering labeled from 1 to 64, shown in Figure 5.3. Suppose we want to compute the b0c0 chunk of the BC cuboid. We allocate space for this chunk in chunk memory. By scanning ABC chunks 1 through 4, the b0c0 chunk is computed. That is, the cells for b0c0 are aggregated over a0 to a3. The chunk memory can then be assigned to the next chunk, b1c0, which completes its aggregation after the scanning of the next four ABC chunks: 5 through 8. Continuing in this way, the entire BC cuboid can be computed. Therefore, only one BC chunk needs to be in memory at a time, for the computation of all the BC chunks.
In computing the BC cuboid, we will have scanned each of the 64 chunks. “Is there a way to avoid having to rescan all of these chunks for the computation of other cuboids such as AC and AB?” The answer is, most definitely, yes. This is where the “multiway computation” or “simultaneous aggregation” idea comes in. For example, when chunk 1 (i.e., a0b0c0) is being scanned (say, for the computation of the 2-D chunk b0c0 of BC, as described previously), all of the other 2-D chunks relating to a0b0c0 can be simultaneously computed. That is, when a0b0c0 is being scanned, each of the three chunks (b0c0, a0c0, and a0b0) on the three 2-D aggregation planes (BC, AC, and AB) should be computed then as well. In other words, multiway computation simultaneously aggregates to each of the 2-D planes while a 3-D chunk is in memory.
Now let's look at how different orderings of chunk scanning and of cuboid computation can affect the overall data cube computation efficiency. Recall that the size of the dimensions A, B, and C is 40, 400, and 4000, respectively. Therefore, the largest 2-D plane is BC (of size 400 × 4000 = 1,600,000). The second largest 2-D plane is AC (of size 40 × 4000 = 160,000). AB is the smallest 2-D plane (of size 40 × 400 = 16,000).
Suppose that the chunks are scanned in the order shown, from chunks 1 to 64. As previously mentioned, b0c0 is fully aggregated after scanning the row containing chunks 1 through 4; b1c0 is fully aggregated after scanning chunks 5 through 8, and so on. Thus, we need to scan four chunks of the 3-D array to fully compute one chunk of the BC cuboid (where BC is the largest of the 2-D planes). In other words, by scanning in this order, one BC chunk is fully computed for each row scanned. In comparison, the complete computation of one chunk of the second largest 2-D plane, AC, requires scanning 13 chunks, given the ordering from 1 to 64. That is, a0c0 is fully aggregated only after the scanning of chunks 1, 5, 9, and 13.
Finally, the complete computation of one chunk of the smallest 2-D plane, AB, requires scanning 49 chunks. For example, a0b0 is fully aggregated after scanning chunks 1, 17, 33, and 49. Hence, AB requires the longest scan