Two major shortcomings of the use of term counts are synonyms and polysemes. Synonyms are distinct words that share the same or similar meanings. In the vector approach, no match would be found between a query using the term “altruistic” and a document using the word “benevolent,” though the meanings are quite similar. Polysemes, on the other hand, are words that have multiple meanings. The term “bank” could mean a financial institution, the act of relying upon something, or a type of basketball shot. Each sense leads to very different types of documents, which can be problematic for document comparisons.
LSA attempts to solve these problems, not with extensive dictionaries and natural language processing engines, but by using mathematical patterns within the data itself to uncover these relationships. It does this by reducing the number of dimensions used to represent a document, using a matrix operation called singular value decomposition (SVD).
Let us take a look at an example data set. This very simple data set consists of five documents. We will show the dimension-reduction steps of LSA on the first four documents (d1–d4), which will make up our training data. Then we will make distance comparisons to a fifth document (d5) in our test set, using the nearest neighbor classification approach. The initial document set is:
d1: A bank will protect your money.
d2: A guard will protect a bank.
d3: Your bank shot is money.
d4: A bank shot is lucky.
d5: bank guard.
From the text data we derive a vector representation of our documents using only term counts. This vector representation could be thought of as a matrix with rows representing terms and columns representing documents. This representation may be seen in Figure 11.7.
Figure 11.7. Initial term counts.
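As a rough illustration, the following Python sketch builds such a term-count matrix for the four training documents. The tokenization (lowercasing, stripping the period) and the alphabetical ordering of the 10 terms are assumptions made for the sketch and may differ from the ordering used in Figure 11.7.

import numpy as np

# The four training documents (d1-d4).
docs = [
    "A bank will protect your money.",
    "A guard will protect a bank.",
    "Your bank shot is money.",
    "A bank shot is lucky.",
]

# Simple tokenization: lowercase and strip the trailing period.
tokenized = [d.lower().replace(".", "").split() for d in docs]
terms = sorted({t for doc in tokenized for t in doc})  # 10 unique terms

# Term-document count matrix A: rows = terms, columns = documents.
A = np.zeros((len(terms), len(docs)), dtype=int)
for j, doc in enumerate(tokenized):
    for t in doc:
        A[terms.index(t), j] += 1

print(terms)
print(A)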
The first step of LSA is to decompose the matrix representing our original data set of four documents, matrix A, using SVD as follows: $A = USV^T$. The calculation of SVD is beyond the scope of this text, but there are several computing packages that will perform this operation for you (such as R or MATLAB packages). The matrices resulting from this decomposition can be seen in Figure 11.8. The U and $V^T$ matrices provide vectors of weights for terms and documents, respectively. Consider $V^T$: this matrix gives a new 4-D representation to each document, where each document is described by a corresponding column. Each of the new dimensions is derived from the original 10 word-count dimensions. For example, document 1 is now represented as follows: $d_1 = -0.56x_1 + 0.095x_2 - 0.602x_3 + 0.562x_4$, where each $x_n$ represents one of the new, derived dimensions. The S matrix is a diagonal matrix of singular values (the square roots of the eigenvalues of $A^TA$), one for each principal component direction.
Figure 11.8. Singular value decomposition of initial data.
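As a sketch of this step, the decomposition can be reproduced with NumPy's linalg.svd (the text mentions R or MATLAB; NumPy is an assumed, equivalent alternative). Note that such routines return the singular values as a vector, and the signs of corresponding columns of U and rows of $V^T$ may be flipped relative to Figure 11.8.

# Decompose the count matrix built above: A = U diag(s) Vt.
U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)

print(np.round(s, 3))   # singular values; the text reports {3.869, 2.344, 1.758, 0.667}
print(np.round(Vt, 3))  # column j holds the new 4-D representation of document j+1

# Sanity check: the product of the factors reconstructs A.
assert np.allclose(U @ np.diag(s) @ Vt, A)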
With this initial step, we have already reduced the number of dimensions representing each document from 10 to four. We now show how one would go about further reducing the number of dimensions. When the data set contains many documents, the previous step alone does not reduce the number of dimensions meaningfully. To perform further reduction, we first observe that matrix S provides us with a diagonal of singular values in descending order: $\sigma_1, \ldots, \sigma_4 = \{3.869, 2.344, 1.758, 0.667\}$. We will be able to capture most of the variability of the data by retaining only the first k singular values rather than all n (in our matrix S, n = 4). If, for example, k = 2, then we retain $(\sigma_1^2 + \sigma_2^2)/\sum_{i=1}^{4}\sigma_i^2 \approx 0.85$, or 85% of the variability, when we move from the four new dimensions per document to only two. The rank 2 approximation can be seen in Figure 11.9. This rank 2 approximation is achieved by keeping only the first two columns of U, the two largest singular values in S, and the first two rows of $V^T$.
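Continuing the sketch above (reusing U, s, and Vt from the decomposition step; variable names are illustrative), the rank 2 truncation and the retained-variability calculation look like this:

# Keep the first k singular values and the matching columns of U / rows of Vt.
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Fraction of variability retained by the first k singular values
# (ratio of squared singular values), ~0.85 for the values in the text.
retained = np.sum(sk**2) / np.sum(s**2)
print(round(retained, 3))

# Rank 2 approximation of A (Figure 11.9) and the 2-D coordinates
# used for the later document-distance comparisons.
A2 = Uk @ np.diag(sk) @ Vtk
docs_2d = np.diag(sk) @ Vtk  # column j: document j+1 in two dimensions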