Latent Semantic Indexing

"... Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent. ..."

Description taken from: http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm


Algorithm for LSI:

To perform Latent Semantic Indexing on a group of documents, you perform the following steps: Note that adding word-pairs and word-triples can result in a much greater level of precision (still without ever needing to examine sentence or language structure) at cost of a considerably larger vector space while performing indexing. Determining what constitutes a 'word' is also often a difficult aspect of the above (especially in languages like German and WikiWiki that concatenate words to form larger words).


Uses for LSI

LSI has been used in several ways. The most obvious and common way is to analyze the similarity between bodies of text. This can be used in dozens of interesting ways, from finding related documents in a group to doing paragraph-wise LSI to find site summaries. It can also be used to facilitate a "smart" search of your document space, and even do document categorization (read: SPAM filtering!)

The most common operation is to take the inner product (dot product) of two of the LSI vectors and perform operations with that information. Sometimes, it may make sense to do this operation on your unit LSI vectors, particularly when you are searching on a vector that is a search term, it will almost never have the magnitude that a full document would have.


Example Implementation

A working example of LSI in the RubyLanguage can be found at http://classifier.rubyforge.org/. prior link was bad, but rufy.com includes this link

Tutorials

A tutorial, part of a five-part series on SVD and LSI, is given at http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html


EditText of this page (last edited December 6, 2009) or FindPage with title or text search