How does a search engine work? An educational trek through a Lucene Postings Format
A new feature of Lucene 4 – pluggable codecs – allows for the modification of Lucene’s underlying storage engine. Working with codecs and examining their output yields fascinating insights into how exactly Lucene’s search works in its most fundamental form.
The centerpiece of a Lucene codec is it’s postings format. Postings are a commonly thrown around word in the Lucene space. A Postings format is the representation of the inverted search index – the core data structure used to lookup documents that contain a term. I think nothing really captures the logical look-and-feel of Lucene’s postings better than Mike McCandless’s SimpleTextPostingsFormat. SimpleText is a text-based representation of postings created for educational purposes. I’ve indexed a few documents in Lucene using SimpleText to demonstrate how postings are structured to allow for fast search:
doc0: author: Doug Turnbull text: where oh where are my Unicorns! doc1: author: Scott Stults text: Unicorns! Unicorns!
If we tell Lucene to split up words on whitespace (Lucene’s WhitespaceAnalyzer), specify the SimpleText codec, and write these documents out using an IndexWriter, we get Lucene’s postings serialized to disk in this nice human readable format:
field author term Doug doc 0 freq 1 pos 0 term Scott doc 1 freq 1 pos 0 term Stults doc 1 freq 1 pos 1 term Turnbull doc 0 freq 1 pos 1 field text term Unicorns! doc 0 freq 1 pos 5 doc 1 freq 2 pos 0 pos 1 term are doc 0 freq 1 pos 3 term my doc 0 freq 1 pos 4 term oh doc 0 freq 1 pos 1 term where doc 0 freq 2 pos 0 pos 2 END
What we end up with is the text of our documents reorganized into something resembling an index in the back of a book. In search jargon, this is referred to as an “inverted index”. After our text gets tokenized into terms, the resulting terms form the key to the inverted index data structure. Instead of the document pointing to a series of terms, a term points to a list of the documents that contain it. This is what makes the index “inverted”. Now when we search a field, we can simply take a search term, look in the inverted index for that term, and retrieve a list of documents that contain that term. All of this much in the same way that a set of terms in the index in the back of a book points to page numbers that contain that term.
More than just term -> page number
Although the index in a book is a useful metaphor that we can and will dutifully abuse, book indices don’t contain the frequency of that term on a page. Nor do they specify the position on the page the terms occur. This information IS captured in a postings format. The section for the “Unicorns!” term captures this notion well:
term Unicorns! doc 0 freq 1 pos 5 doc 1 freq 2 pos 0 pos 1
Why are term frequency and position very useful to search engines?
The term frequency helps us determine how relevant a document is to that term. If there are comparatively more “Unicorns”” in document 1, then it will be scored higher. This model is known as the TF-IDF model, named for the very simple math that captures this idea. As an example, extend this to our book metaphor. Let’s say you’re searching for “Edgar Allen Poe” in a book about 19th century literature. If you knew through that book’s index that there were more mentions of Edgar Allen Poe on page 15 and fewer on page 5, you’d probably be more likely to flip to page 15. This is because you probably discern that page 15 with its more frequent mentions of “Edgar Allen Poe” is more likely to be the best page to flip to. This notion falls in line with most people’s expectations when searching and forms the basis for Lucene’s search ranking.
Term positions are also useful. Imagine in our previous example if we didn’t list “Edgar Allen Poe” in the book’s index. Instead we simply had individual words “Edgar”, “Allen” and “Poe”. We could, however, figure out if there was a strict mention of the phrase “Edgar Allen Poe” if we recorded the position of each term on a page in our book’s index. For example, “Edgar” is word number 3 on page 5; “Poe” is word number 17 on page 5, etc. This would be painful and annoying for us humans, but we could do it. Nevertheless this is easy for a computer — Computers have an easier time chopping up text on whitespace and a harder time recognizing names of accomplished authors. Therefore, this is precisely what Lucene does. Lucene can quickly determine the distance and arrangement of search terms in a document at query time. If the terms “Edgar”, “Allen”, and “Poe” are hits AND are adjacent and in the correct order, then Lucene can determine this page is a hit. This ability allows Lucene to support term position aware queries such as phrase queries and span queries.
Devil in the details
So how do search engine’s work? That’s about it. Well kind-of. Its more like teaching you how to add and subtract and expecting you to derive algebra. The hard part is often actually creating and using this inverted index for a real, user-facing application. One problem alluded to above is how do you correctly analyze and tokenize content – tricky depending on your language and nature of the content. Whitespace does not cut it in most cases– like maybe you’re indexing snippets of code and want to tokenize on camelCase. Maybe you’d like to normalize your text to the root form of words. Or incorporate synonyms into the index. Also alluded above is that while TF-IDF forms the root of Lucene’s default scoring, it doesn’t capture all the variables included. There’s many tweaks along the way, including norms – biasing search results toward shorter fields – and custom boosting – users giving priority to hits from specific fields or from specific queries. There’s also the question of the search strategy you use for your application. Are you only doing direct lookup? Do you need range queries? Are single fields being searched, or multiple simultaneously? Would you like to include features such as faceting? These and many other considerations come into play when trying to get Lucene’s inverted index to do your bidding.
More fundamentally, there’s the fact that this is of course the educational implementation of an inverted index. A good, fast data structure that won’t chug for minutes to give you an answer is really required. To make a long story short, there’s lots and lots of hardcore data structures thought placed into how exactly that works. If you’d like that CS PhD inducing experience I recommend diving into the Lucene42PostingsFormat at your earliest convenience. Here’s a taste of something related to get you started. Once you get to the bottom of this postings format– call me… for a job :).