"... Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent. ..."
Description taken from: http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
Algorithm for LSI
To perform Latent Semantic Indexing on a group of documents, you perform the following steps:
- First, convert each document in your index into a vector of word occurrences. The number of dimensions your vector exists in is equal to the number of unique words in the entire document set. Most document vectors will have large empty patches, some will be quite full. It is recommended that common words (e.g., "this", "him", "that", "the") are removed.
- Next, scale each vector so that every term reflects the frequency of its occurrence in context. I'll post the math for this step when I get home. Seems he didn't ever get home ;-)
- Next, combine these column vectors into a large term-document matrix. Rows represent terms, columns represent documents.
- Perform SingularValueDecomposition on the term-document matrix. This will result in three matrices commonly called U, S and V. S is of particular interest, it is a diagonal matrix of singular values for your document system.
- Set all but the k highest singular values to 0. k is a parameter that needs to be tuned based on your space. Very low values of k are very lossy, and net poor results. But very high values of k do not change the results much from simple vector search. This makes a new matrix, S'.
- Recombine the terms to form the original matrix (i.e., U * S' * V(t) = M' where (t) signifies transpose).
- Break this reduced rank term-document matrix back into column vectors. Associate these with their corresponding documents.
- Now you have a Latent Semantic Index.
Note that adding word-pairs and word-triples can result in a much greater level of precision (still without ever needing to examine sentence or language structure) at cost of a considerably larger vector space while performing indexing. Determining what constitutes a 'word' is also often a difficult aspect of the above (especially in languages like German and WikiWiki
that concatenate words to form larger words).
Uses for LSI
LSI has been used in several ways. The most obvious and common way is to analyze the similarity between bodies of text. This can be used in dozens of interesting ways, from finding related documents in a group to doing paragraph-wise LSI to find site summaries. It can also be used to facilitate a "smart" search of your document space, and even do document categorization (read: SPAM filtering!)
The most common operation is to take the inner product (dot product) of two of the LSI vectors and perform operations with that information. Sometimes, it may make sense to do this operation on your unit LSI vectors, particularly when you are searching on a vector that is a search term, it will almost never have the magnitude that a full document would have.
A working example of LSI in the RubyLanguage
can be found at http://classifier.rubyforge.org/
. prior link was bad, but rufy.com includes this link
A tutorial, part of a five-part series on SVD and LSI, is given at