Data Mining_ Concepts and Techniques - Jiawei Han [312]
11.2.3. Types of Biclusters
“How can we model biclusters and mine them?” Let's start with some basic notation. For the sake of simplicity, we will use “genes” and “conditions” to refer to the two dimensions in our discussion. Our discussion can easily be extended to other applications. For example, we can simply replace “genes” and “conditions” by “customers” and “products” to tackle the customer-product biclustering problem.
Let be a set of genes and be a set of conditions. Let be a gene expression data matrix, that is, a gene-condition matrix, where and . A submatrix I × J is defined by a subset of genes and a subset of conditions. For example, in the matrix shown in Figure 11.5, is a submatrix.
Figure 11.5 Gene-condition matrix, a submatrix, and a bicluster.
A bicluster is a submatrix where genes and conditions follow consistent patterns. We can define different types of biclusters based on such patterns.
■ As the simplest case, a submatrix I × J is a bicluster with constant values if for any and , , where c is a constant. For example, the submatrix in Figure 11.5 is a bicluster with constant values.
■ A bicluster is interesting if each row has a constant value, though different rows may have different values. A bicluster with constant values on rows is a submatrix I × J such that for any and , , where αi is the adjustment for row i. For example, Figure 11.6 shows a bicluster with constant values on rows.
Symmetrically, a bicluster with constant values on columns is a submatrix I × J such that for any and , , where βj is the adjustment for column j.
■ More generally, a bicluster is interesting if the rows change in a synchronized way with respect to the columns and vice versa. Mathematically, a bicluster with coherent values (also known as a pattern-based cluster ) is a submatrix I × J such that for any and , , where αi and βj are the adjustment for row i and column j, respectively. For example, Figure 11.7 shows a bicluster with coherent values.
It can be shown that I × J is a bicluster with coherent values if and only if for any and , then . Moreover, instead of using addition, we can define a bicluster with coherent values using multiplication, that is, . Clearly, biclusters with constant values on rows or columns are special cases of biclusters with coherent values.
■ In some applications, we may only be interested in the up- or down-regulated changes across genes or conditions without constraining the exact values. A bicluster with coherent evolutions on rows is a submatrix I × J such that for any and , . For example, Figure 11.8 shows a bicluster with coherent evolutions on rows. Symmetrically, we can define biclusters with coherent evolutions on columns.
Figure 11.6 Bicluster with constant values on rows.
Figure 11.7 Bicluster with coherent values.
Figure 11.8 Bicluster with coherent evolutions on rows.
Next, we study how to mine biclusters.
Biclustering Methods
The previous specification of the types of biclusters only considers ideal cases. In real data sets, such perfect biclusters rarely exist. When they do exist, they are usually very small. Instead, random noise can affect the readings of eij and thus prevent a bicluster in nature from appearing in a perfect shape.
There are two major types of methods for discovering biclusters in data that may come with noise. Optimization-based methods conduct an iterative search. At each iteration, the submatrix with the highest significance score is identified as a bicluster. The process terminates when a user-specified condition is met. Due to cost concerns in computation, greedy search is often employed to find local optimal biclusters. Enumeration methods use a tolerance threshold to specify the degree of noise allowed in the biclusters to be mined, and then tries to enumerate all submatrices of biclusters that satisfy the requirements. We use the δ-Cluster and MaPle algorithms as examples to illustrate these ideas.
Optimization Using the δ-Cluster Algorithm
For a submatrix, I × J, the mean of the i th row is
(11.16)
Symmetrically,