Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [157]

By Root 769 0

(a) Y = {1 ,1 ,0 ,1 ,1} using K = 1

(b) Y = {1 ,1 ,0 ,1 ,1} using K = 3

(c) Z = {0, 1, 0, 0, 0} using K = 1

(d) Z = {0, 1, 0, 0, 0} using K = 5.

10. Implement the hierarchical agglomerative algorithm for samples with categorical values using the SMC measure of similarity.

11. Implement the partitional K-means clustering algorithm. Input samples are given in the form of a flat file.

12. Implement the incremental-clustering algorithm with iterations. Input samples are given in the form of a flat file.

13. Given the similarity matrix between five samples:

(a) Use the similarity matrix in the table to perform complete link hierarchical clustering. Show your results by drawing a dendrogram. The dendrogram should clearly show the order in which the points are merged.

(b) How many clusters exist if the threshold similarity value is 0.5. Give the elements of each cluster.

(c) If DBSCAN algorithm is applied with threshold similarity of 0.6, and MinPts ≥ 2 (required density), what are core, border,and noise points in the set of points pi given in the table. Explain.

14. Given the points x1 = {1, 0}, x2 = {0,1}, x3 = {2, 1}, and x4 = {3, 3}, suppose that these points are randomly clustered into two clusters: C1 = {x1, x3} and C2 = {x2, x4}. Apply one iteration of K-means partitional clustering algorithm and find new distribution of elements in clusters. What is the change in total square-error?

15. Answer True/False to the following statements. Discuss your answer if necessary.

(a) Running K-means with different initial seeds is likely to produce different results.

(b) Initial cluster centers have to be data points.

(c) Clustering stops when cluster centers are moved to the mean of clusters.

(d) K-means can be less sensitive to outliers if standard deviation is used instead of the average.

(e) K-means can be less sensitive to outliers if median is used instead of the average.

16. Identify the clusters in the Figure below using the center-, contiguity-, and density-based clustering. Assume center-based means K-means, contiguity-based means single link hierarchical and density-based means DBSCAN. Also indicate the number of clusters for each case, and give a brief indication of your reasoning. Note that darkness or the number of dots indicates density.

17. Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 (Euclidean) length of 1.

18. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].

19. Distances between samples (A, B, C, D, and E) are given in a graphical form:

Determine single-link and complete-link dendograms for the set of the samples.

20. There is a set S consisting of six points in the plane shown as below, a = (0, 0), b = (8, 0), c = (16, 0), d = (0, 6), e = (8, 6), f = (16, 6). Now we run the k-means algorithm on those points with k = 3. The algorithm uses the Euclidean distance metric (i.e., the straight line distance between two points) to assign each point to its nearest centroid. Also we define the following:

3-starting configuration is a subset of three starting points from S that form the initial centroids, for example, {a, b, c}.

3-partition is a partition of S into k nonempty subsets, for example, {a, b, e}, {c, d}, {f} is a 3-partition.(a) How many 3-starting configurations are there?

(b) Fill in the last two columns of the following table.

3-partition An example of a 3-starting configuration that can arrive at the 3-partition after 0 or more iterations of k-means Number of unique 3-starting configurations

{a,b} {d,e} {c,f}

{a} {d} {b, c, e, f}

{a, b, d} {c} {e, f}

{a, b} {d} {c, e, f}

9.10 REFERENCES FOR FURTHER STUDY


Filippone, M., F. Camastra, F. Masulli, S. Rovetta, A Survey of Kernel and Spectral Methods for Clustering, Pattern Recognition, Vol. 41, 2008, pp. 176–190.

Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus

Return Main Page Previous Page Next Page

®Online Book Reader