Online Book Reader

Home Category

Data Mining - Mehmed Kantardzic [179]

By Root 912 0
with 322 million links converged in 52 iterations. These experiments were evidence that PageRank converges in log(n) time where n is number of links, and it is applicable for a growing Web. Of course, the initial version of the PageRank algorithm, explained in a simplified form in this text, had numerous modifications to evolve into the current commercial version implemented in the Google search engine.

11.6 TEXT MINING


Enormous amounts of knowledge reside today in text documents that are stored either within organizations or are freely available. Text databases are rapidly growing because of the increasing amounts of information available in electronic form, such as electronic publications, digital libraries, e-mail, and the World Wide Web. Data stored in most text databases are semi-structured, and special data-mining techniques, called text mining, have been developed for discovering new information from large collections of textual data.

In general, there are two key technologies that make online text mining possible. One is Internet search capabilities and the other is the text analysis methodology. Internet search has been around for a few years. With the explosion of Web sites in the past decade, numerous search engines designed to help users find content appeared practically overnight. Yahoo, Alta Vista, and Excite are three of the earliest, while Google and Bing are the most popular in recent years. Search engines operate by indexing the content in a particular Web site and allowing users to search these indices. With the new generation of Internet-search tools, users can gain relevant information by processing a smaller amount of links, pages, and indices.

Text analysis, as a field, has been around longer than Internet search. It has been a part of the efforts to make computers understand natural languages, and it is commonly thought of as a problem for artificial intelligence. Text analysis can be used anywhere where there is a large amount of text that needs to be analyzed. Although automatic processing of documents using different techniques does not allow the depth of analysis that a human can bring to the task, it can be used to extract key points in the text, categorize documents, and generate summaries in a situation when a large number of documents makes manual analysis impossible.

To understand the details of text documents you can either search for keywords, or you can try to categorize the semantic content of the document itself. When identifying keywords in text documents, you are looking at defining specific details or elements within documents that can be used to show connections or relationships with other documents. In the IR domain, documents have been traditionally represented in the vector space model. Documents are tokenized using simple syntactic rules (such as white-space delimiters in English), and tokens are transformed to canonical form (e.g., “reading” to “read,” “is,” “was,” and “are” to “be”). Each canonical token represents an axis in a Euclidean space. Documents are vectors in this n-dimensional space. If a token t called term occurs n times in document d, then the tth coordinate of d is simply n. One may choose to normalize the length of the document to 1, using the L1, L2, or L∞ norms:

where n(d,t) is the number of occurrences of a term t in a document d. These representations do not capture the fact that some terms, also called keywords (like “algorithm”), are more important than others (like “the” and “is”) in determining document content. If t occurs in nt out of N documents, nt/N gives a sense of rarity, and hence, the importance of the term. The inverse document frequency, IDF = 1 + log (nt/N), is used to stretch the axes of the vector space differentially. Thus the tth coordinate of document d may be represented with the value (n(d,t)/||d1 ||) × IDF(t) in the weighted vector space model. In spite of being extremely crude and not capturing any aspect of language or semantics, this model often performs well for its intended purpose. Also, in spite of minor variations,

Return Main Page Previous Page Next Page

®Online Book Reader