Online Book Reader


Data Mining - Mehmed Kantardzic [184]

this simple example the best results occurred when we first transformed the data using LSA and then used cosine similarity to measure the similarity between the initial training documents in a database and a new document for comparison. Using the nearest neighbor classifier, d5 would be classified correctly as a “financial institution” document using cosine similarity for both the 10-d and 2-d cases, or using Euclidean distance for the 2-d case. Euclidean distance in the 10-d case would have classified d5 incorrectly. If we used k-nearest neighbor with k = 3, then Euclidean distance in the 10-d case would also have classified d5 incorrectly. Clearly, the LSA transformation affects the results of document comparisons, even in this very simple example. Results are better because LSA enables a better representation of a document’s semantics.
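The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two training vectors, the query vector, and the "other" label are hypothetical and are not the document vectors from the example. It shows how a nearest neighbor classifier can return different labels depending on whether cosine similarity or Euclidean distance is used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_by_cosine(training, query):
    """Label of the training vector with the highest cosine similarity."""
    return max(training, key=lambda item: cosine_similarity(item[0], query))[1]

def nearest_by_euclidean(training, query):
    """Label of the training vector with the smallest Euclidean distance."""
    return min(training, key=lambda item: euclidean_distance(item[0], query))[1]

# Hypothetical 2-d document vectors (not taken from the example in the text).
training = [
    ([10.0, 10.0], "financial institution"),
    ([1.0, 0.0], "other"),
]
query = [1.0, 1.0]

cosine_label = nearest_by_cosine(training, query)
euclidean_label = nearest_by_euclidean(training, query)
```

With these made-up vectors the query points in the same direction as the first training document but lies much closer, in absolute terms, to the second one, so the two measures disagree. Cosine similarity ignores vector length, which is one reason it is often preferred for documents of different sizes.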

11.8 REVIEW QUESTIONS AND PROBLEMS

1. Give specific examples of where Web-content mining, Web-structure mining, and Web-usage mining would be valuable. Discuss the benefits.

2. Given a table of linked Web pages:

Page    Linked to pages
A       B, D, E, F
B       C, D, E
C       B, E, F
D       A, F, E
E       B, C, F
F       A, B

(a) Find authorities using two iterations of the HITS algorithm.

(b) Find hubs using two iterations of the HITS algorithm.

(c) Find the PageRank scores for each page after one iteration using 0.1 as the dampening factor.

(d) Explain the HITS and PageRank authority rankings obtained in (a) and (c).
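Hand calculations for parts (a) and (b) can be checked with a short script. The following is a minimal sketch, not a reference implementation. It assumes that all hub and authority scores start at 1, that authorities are updated before hubs in each iteration, and that scores are normalized to unit Euclidean length after each update. The function names `hits` and `normalize` are illustrative, and a different normalization changes the numbers but not the ranking.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (an assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations):
    """Return (authority, hub) score dictionaries after the given iterations.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        auth = normalize(new_auth)
        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        hub = normalize(new_hub)
    return auth, hub

# Link table from problem 2.
links_q2 = {
    "A": ["B", "D", "E", "F"],
    "B": ["C", "D", "E"],
    "C": ["B", "E", "F"],
    "D": ["A", "F", "E"],
    "E": ["B", "C", "F"],
    "F": ["A", "B"],
}
authorities, hubs = hits(links_q2, iterations=2)
```

Sorting `authorities` and `hubs` by value gives the rankings to compare against a manual solution.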

3. For the traversal log: {X, Y, Z, W, Y, A, B, C, D, Y, C, D, E, F, D, E, X, Y, A, B, M, N},

(a) find MFR;

(b) find LRS if the threshold value is 0.3 (or 30%);

(c) find MRS.
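For part (a), one common way to extract forward references from a traversal log can be sketched as follows. The sketch assumes that MFR stands for maximal forward references, that a revisit of a page already on the current path counts as a backward move, and that a path is reported only when a backward move follows a forward move. The function name and the short sample log are hypothetical, and the sample is deliberately not the log from the problem.

```python
def maximal_forward_references(log):
    """Split a traversal log into maximal forward reference paths.

    A path grows while the user visits new pages. When the user returns to a
    page already on the path, the path built so far is recorded (if the last
    move was forward) and then cut back to the revisited page.
    """
    path = []
    result = []
    moving_forward = False
    for page in log:
        if page in path:
            # Backward move: record the path once, then truncate it.
            if moving_forward:
                result.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            # Forward move: extend the current path.
            path.append(page)
            moving_forward = True
    # The log may end in the middle of a forward traversal.
    if moving_forward and path:
        result.append(list(path))
    return result

# Hypothetical log: A -> B -> C, back to B, then on to D.
sample_paths = maximal_forward_references(["A", "B", "C", "B", "D"])
```

For the sample log the function returns the two paths A, B, C and A, B, D. Large and maximal reference sequences in parts (b) and (c) would then be derived from the support of subsequences across these paths.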

4. Given the following text documents and assumed decomposition:

Document    Text
A           Web-content mining
B           Web-structure mining
C           Web-usage mining
D           Text mining

(a) create matrix A by using term counts from the original documents;

(b) obtain rank 1, 2, and 3 approximations to the document representations;

(c) calculate the variability preserved by rank 1, 2, and 3 approximations;

(d) manually cluster documents A, B, C, and D into two clusters.

5. Given a table of linked Web pages and a dampening factor of 0.15:

Page    Linked to pages
A       F
B       F
C       F
D       F
E       A, F
F       E

(a) find the PageRank scores for each page after one iteration;

(b) find the PageRank scores after 100 iterations, recording the absolute difference between scores per iteration (be sure to use some programming or scripting language to obtain these scores);

(c) explain the scores and rankings computed previously in parts (a) and (b). How quickly would you say that the scores converged? Explain.
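Since part (b) calls for a script, one possible starting point is sketched below. It rests on several assumptions that should be checked against the chapter's definition: the dampening factor d is treated as the probability of a random jump, so that PR(p) = d/n + (1 − d) Σ PR(q)/outdegree(q) over pages q linking to p; every page starts with score 1/n; and a page without outgoing links (none occur in this table) spreads its score evenly over all pages. The function name `pagerank` and its return values are illustrative.

```python
def pagerank(links, d, iterations):
    """Iterate PageRank and record the total absolute change per iteration.

    links maps each page to the list of pages it links to.
    d is the dampening factor, interpreted here as the random-jump probability.
    Returns (scores, history), where history[i] is the sum of absolute score
    differences produced by iteration i + 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}  # assumed uniform starting scores
    history = []
    for _ in range(iterations):
        new_scores = {p: d / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = (1 - d) * scores[p] / len(targets)
                for q in targets:
                    new_scores[q] += share
            else:
                # Assumed handling of pages with no outgoing links.
                for q in pages:
                    new_scores[q] += (1 - d) * scores[p] / n
        history.append(sum(abs(new_scores[p] - scores[p]) for p in pages))
        scores = new_scores
    return scores, history

# Link table from problem 5.
links_q5 = {
    "A": ["F"],
    "B": ["F"],
    "C": ["F"],
    "D": ["F"],
    "E": ["A", "F"],
    "F": ["E"],
}
scores_1, _ = pagerank(links_q5, d=0.15, iterations=1)
scores_100, differences = pagerank(links_q5, d=0.15, iterations=100)
```

Printing `differences` shows how fast the per-iteration change shrinks, which is the evidence part (c) asks about. If the chapter initializes every score to 1 rather than 1/n, only the starting dictionary needs to change.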

6. Why is the text-refining task very important in a text-mining process? What are the results of text refining?

7. Implement the HITS algorithm and discover authorities and hubs if the input is the table of linked pages.

8. Implement the PageRank algorithm and discover central nodes in a table of linked pages.

9. Develop a software tool for discovering maximal reference sequences in a Web-log file.

10. Search the Web to find the basic characteristics of publicly available or commercial software tools for association-rule discovery. Document the results of your search.

11. Apply LSA to 20 Web pages of your choosing and compare the clusters obtained using the original term counts as attributes against the attributes derived using LSA. Comment on the successes and shortcomings of this approach.

12. What are the two main steps in mining traversal patterns using log data?

13. The XYZ Corporation maintains a set of five Web pages: {A, B, C, D, and E}. The following sessions (listed in timestamp order) have been created:

Suppose that support threshold is 30%. Find all large sequences (after building the tree).

14. Suppose a Web graph is undirected, that is, page i points to page j if and only if page j points to page i. Are the following statements true or false? Justify your answers briefly.

(a) The hubbiness and authority vectors are identical, that is, for each page, its hubbiness is equal to its authority.

(b) The matrix M that we use to compute PageRank is
