1. Arbitrarily select a point p.
2. Retrieve all points density-reachable from p with respect to ε and m.
3. If p is a core point, a new cluster is formed or an existing cluster is extended.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue the process with the other points in the database until all of the points have been processed.
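These steps can be traced in a short Python sketch. This is a minimal illustration rather than a reference implementation: it assumes plain Euclidean distance, a brute-force neighborhood search with no indexing structure, and illustrative names (eps and min_pts stand for ε and m; region_query is a hypothetical helper for the neighborhood retrieval).

import math

NOISE, UNVISITED = -1, 0

def region_query(points, i, eps):
    # All points within distance eps of points[i], i.e., its eps-neighborhood.
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue                            # already clustered or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                   # provisional: may later be relabeled
            continue                            # as a border point of some cluster
        cluster_id += 1                         # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                            # expand via density-reachability
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id          # border point absorbed into the cluster
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:     # j is also a core point: grow the frontier
                seeds.extend(j_neighbors)
    return labels

Note that a point first marked as noise is only provisionally labeled; if it later turns out to be density-reachable from a core point, it is absorbed as a border point of that cluster.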
Since global values for ε and m are used, DBSCAN may merge two clusters of different densities into a single cluster if they are "close" to each other, that is, if the distance between the clusters is lower than ε.
Examples of clusters obtained by the DBSCAN algorithm are illustrated in Figure 9.11. Obviously, DBSCAN finds all clusters properly, independent of their size, shape, and location relative to one another.
Figure 9.11. DBSCAN builds clusters of different shapes.
The main advantages of the DBSCAN clustering algorithm are as follows:
1. DBSCAN does not require the number of clusters a priori, as opposed to K-means and some other popular clustering algorithms.
2. DBSCAN can find arbitrarily shaped clusters.
3. DBSCAN has a notion of noise and eliminates outliers from clusters.
4. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
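As an illustration of advantage 4, widely used implementations expose exactly these two parameters. The following example uses scikit-learn's DBSCAN; the data and parameter values are chosen purely for demonstration (scikit-learn labels noise points with -1):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away point that should end up as noise.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [50.0, 50.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # expected: [ 0  0  0  1  1  1 -1], two clusters and one noise point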
DBSCAN also has some disadvantages. The complexity of the algorithm is still high, although with some indexing structures it reaches O(n log n). Finding neighbors is a distance-based operation, generally using the Euclidean distance, and the algorithm may run into the curse-of-dimensionality problem for high-dimensional data sets. Therefore, most applications of the algorithm are to low-dimensional real-world data.
9.7 BIRCH ALGORITHM
BIRCH is an efficient clustering technique for data in Euclidean vector spaces. The algorithm can efficiently cluster the data with a single pass, and it can also deal effectively with outliers. BIRCH is based on the notions of a clustering feature (CF) and a CF tree.
A CF is a compact representation of an underlying cluster consisting of one or more samples. BIRCH builds on the idea that samples that are close enough should always be considered as a group. CFs provide this level of abstraction with a corresponding summarization of the samples in a cluster. The idea is that a cluster of data samples can be represented by a triple of numbers (N, LS, SS), where N is the number of samples in the cluster, LS is the linear sum of the data points (vectors representing samples), and SS is the sum of squares of the data points. More formally, the components of the vectors LS and SS are computed for every attribute X of the data samples in a cluster:

LS(X) = X1 + X2 + ⋯ + XN,    SS(X) = X1² + X2² + ⋯ + XN²

where X1, X2, …, XN are the values of attribute X for the N samples in the cluster.
In Figure 9.12, five 2-D samples represent a cluster, and their CF summary is given by the components N = 5, LS = (16, 30), and SS = (54, 190). These are common statistical quantities, and a number of different cluster characteristics and intercluster distance measures can be derived from them. For example, we can compute the centroid of a cluster from its CF representation without revisiting the original samples: the coordinates of the centroid are obtained by dividing the components of the LS vector by N. In our example the centroid has the coordinates (3.2, 6.0). The reader may check against the graphical interpretation of the data (Fig. 9.12) that this position of the centroid is correct. The obtained summaries are then used instead of the original data for further clustering or manipulation of clusters. For example, if CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint clusters, the CF entry of the cluster formed by merging them is simply the sum CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
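The following Python sketch reproduces these CF computations. Since Figure 9.12 itself is not shown here, the five sample coordinates below are an assumption, chosen only to be consistent with the summary N = 5, LS = (16, 30), SS = (54, 190) quoted above.

# Five assumed 2-D samples whose CF summary matches the one quoted in the text.
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]

N = len(points)
LS = tuple(sum(p[d] for p in points) for d in range(2))       # linear sum per attribute
SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))  # sum of squares per attribute
print(N, LS, SS)        # 5 (16, 30) (54, 190)

# The centroid follows from the CF alone, without revisiting the raw samples.
centroid = tuple(ls / N for ls in LS)
print(centroid)         # (3.2, 6.0)

# CF entries are additive: merging two disjoint clusters just adds the triples.
def merge_cf(cf1, cf2):
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))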