Lucene and its index

Lucene use the metaphor of the library for searching document by word. An index is used to find in which book at which position I can find a word. Index store information about frequency, position, and document for each words. This is the basic strategy of Lucene. With this information, it can find documents, compute score for sorting, building excerpt to see in which context word is taken. Words are sorted, so a search with the start of the word is available, just like a search between 2 values, and a slow search of near word is available too.

For Lucene, a document is splitted in field (title, header, description, author ...), where sentences is made with token. An index contains term : a field + a token.

Troubles with too few information in the index.

Gather words variants with stemming.

You may wont to gather words like happy and happiness to search contents more by meaning than exact word. Stemming is a nice tool for doing that, and it's included in Lucene. Rather than storing word, it just store its basic form : the stem. If you're using stemming filter with indexing, you can only do stemming search, you can't search exact spelling, neither search starting word to build a word suggestion. Indexing twice is not the right solution.

Finding a near word

Better known as the "do you mean?" feature or spelling suggestion. Lucene can do fuzzy search for a near word, but it must iterate over all the word and computing the distance between word in index and a misspelled once.

Common word with high frequency may be underweighted

The scoring in Lucene weight word with it inverse word frequency, ie : a frequent word weight less. As Eks suggests, some words like brand, or object name can be frequent and relevant.

Words that work better together

Some word are compound, and sometimes with very current words : New York, Alexander the great. Lucene can find new and york, or find the exact group "new york". But, without user specification, this group can't be overweighted in scoring.

A lexicon for Lucene

More informations can be stored for each words. Playing with words is a job for spelling tools. Lot of informations and data can be borrowed from tools like hunspell (which you can find in Openoffice or Firefox). This lexicon can be build from different list of word. Reference words from hunspell, its unmunch applications gives you 600 000 words. Extract from Wikipedia dumps can be nice for a wide vocabulary. Lexicon can also be built from your index.

Similar search

Similar words can be synonyms, but also words with a common stem. Wordnet and other thesaurus gives synonyms list. htdig, an elder search engine gives some synonyms. Openoffice provides dictionnaries, too. Distance from a synonym can be computed and stored just like suggest Automatic Meaning Discovery using Google.

For stemmed word, it's easier. If you store stemmed word with the complete word, it's trivial find words with same stem.

Finding a near word

Spelling can be done with spell tools. Restricting words list with a n-gram pattern, word size, and after computing score in a small list is far more clever than iterating over all words. Word frequency can be used to suggested the best word with minimum transformation. Some language like French is all but What You Ear Is What You Write, phonetic proximity search can be very useful.

Help while typing the query.

Suggest a complete word while typing the query can be easily done with start search in Lucene. Current web search engine can now suggest some words after the typed one. Other use is when your search gives you too many answers, a query with one more word can be suggested.

Cheating with sponsored word.

If you boost some words, answers sorting can be manipulated. Is it ethics?

Implementation

A lexicon is a classical and simple Lucene index. Each word is a Document, each meta informations are Term with a prefixed Field . Spellchecking contrib uses that. Analyzing is done with wrapped TokenAnalyzer : SnowBall, NGramTokenizer ... If you need often updated data, other persistant system like berkeley DB can be used.

Lexicon can be build from flat file, or a Directory. A common lexicon can be corrected while using it, if a word doesn't exist in the main directory, it will be deleted from the lexicon, in order to simplifying it. Lexicon and Directory can be linked, when a new information happens to the Directory, it can be reported to the Lexicon. Log can be used to index later in Lexicon, without slowing Directory writing process. Lexicon and Directory can be unsynchronized without damage.

For an easy use, Query can be filtered for more similar searching, and Searcher can be filtered for suggesting an other query if too few answers is return.

Examples

Suggestions

directory is a directory with a term bio:brown. SuggestiveSearcher wrap a Searcher and a Lexicon. It will print : bio:brown

[java]
LexiconReader lexiconReader = new DirectoryReader(directory);
Lexicon lexicon = new Lexicon(new RAMDirectory());
lexicon.addAnalyser(new NGramAnalyzer());
lexicon.read(lexiconReader);
QueryParser parser = new QueryParser("txt", new WhitespaceAnalyzer());
Query query = parser.parse("bio:brawn");
SuggestiveSearcher searcher = new SuggestiveSearcher(new IndexSearcher(directory), lexicon);
SuggestiveHits hits = searcher.searchWithSuggestions(query);
System.out.println(hits.getSuggestedQuery());

Similarity

A test with a stemming similarity : +txt:happy will become +(+txt:happy txt:happiness^0.7)

[java]
query = QueryUtils.buildSimilarQuery(query, lexicon, 0.7f);

The code

The lexicon svn.

The patch

Here is the Lexicon patch for Lucene.