Online Book Reader

Home Category

Data Mining_ Concepts and Techniques - Jiawei Han [323]

By Root 1527 0
that no constraints are violated at every step, it does not require any backtracking. It is a greedy algorithm for generating a clustering that satisfies all constraints, provided that no conflicts exist among the constraints.

Handling Soft Constraints

Clustering with soft constraints is an optimization problem. When a clustering violates a soft constraint, a penalty is imposed on the clustering. Therefore, the optimization goal of the clustering contains two parts: optimizing the clustering quality and minimizing the constraint violation penalty. The overall objective function is a combination of the clustering quality score and the penalty score.

To illustrate, we again use partitioning clustering as an example. Given a data set and a set of soft constraints on instances, the CVQE (Constrained Vector Quantization Error) algorithm conducts k-means clustering while enforcing constraint violation penalties. The objective function used in CVQE is the sum of the distance used in k-means, adjusted by the constraint violation penalties, which are calculated as follows.

■ Penalty of a must-link violation. If there is a must-link constraint on objects x and y, but they are assigned to two different centers, c1 and c2, respectively, then the constraint is violated. As a result, , the distance between c1 and c2, is added to the objective function as the penalty.

■ Penalty of a cannot-link violation. If there is a cannot-link constraint on objects x and y, but they are assigned to a common center, c, then the constraint is violated. The distance, , between c and c′ is added to the objective function as the penalty.

Speeding up Constrained Clustering

Constraints, such as on similarity measurements, can lead to heavy costs in clustering.

Consider the following clustering with obstacles problem: To cluster people as moving objects in a plaza, Euclidean distance is used to measure the walking distance between two points. However, a constraint on similarity measurement is that the trajectory implementing the shortest distance cannot cross a wall (Section 11.4.1). Because obstacles may occur between objects, the distance between two objects may have to be derived by geometric computations (e.g., involving triangulation). The computational cost is high if a large number of objects and obstacles are involved.

The clustering with obstacles problem can be represented using a graphical notation. First, a point, p, is visible from another point, q, in the region R if the straight line joining p and q does not intersect any obstacles. A visibility graph is the graph, , such that each vertex of the obstacles has a corresponding node in V and two nodes, v_1 and v_2, in V are joined by an edge in E if and only if the corresponding vertices they represent are visible to each other. Let be a visibility graph created from VG by adding two additional points, p and q, in V′. E′ contains an edge joining two points in V′ if the two points are mutually visible. The shortest path between two points, p and q, will be a subpath of , as shown in Figure 11.16(a). We see that it begins with an edge from p to either v_1, v_2, or v_3, goes through a path in VG, and then ends with an edge from either v_4 or v_5 to q.

Figure 11.16 Clustering with obstacle objects (o1 and o2): (a) a visibility graph and (b) triangulation of regions with microclusters. Source: Adapted from Tung, Hou, and Han [THH01].

To reduce the cost of distance computation between any two pairs of objects or points, several preprocessing and optimization techniques can be used. One method groups points that are close together into microclusters. This can be done by first triangulating the region R into triangles, and then grouping nearby points in the same triangle into microclusters, using a method similar to BIRCH or DBSCAN, as shown in Figure 11.16(b). By processing microclusters rather than individual points, the overall computation is reduced. After that, precomputation can be performed to build two kinds of join indices based on the computation of the shortest paths: (1) VV

Return Main Page Previous Page Next Page

®Online Book Reader