Build your own Custom Lucene query and scorer

January 20, 2014 Doug Turnbull
Category: Lucene

Every now and then we’ll come across a search problem that can’t simply be solved with plain Solr relevancy. This usually means a customer knows exactly how documents should be scored. They may have little tolerance for close approximations of this scoring through Solr boosts, function queries, etc. They want a Lucene-based technology for text analysis and performant data structures, but they need to be extremely specific in how documents should be scored relative to each other.

Well, for those extremely specialized cases, we can prescribe a little out-patient surgery to your Solr install – building your own custom Lucene Query.

This is the Nuclear Option

Before we dive in, a word of caution. Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths:

1. You’ve utilized Solr’s extensive set of query parsers and features, including function queries, joins, etc. None of this solved your problem.
2. You’ve exhausted the ecosystem of plugins that extend the capabilities in (1). That didn’t work.
3. You’ve implemented your own query parser plugin that takes user input and generates existing Lucene queries to do this work. This still didn’t solve your problem.
4. You’ve thought carefully about your analyzers – massaging your data so that at index time and query time, text lines up exactly as it should to optimize the behavior of existing search scoring. This still didn’t get what you wanted.
5. You’ve implemented your own custom Similarity that modifies how Lucene calculates the traditional relevancy statistics – query norms, term frequency, etc.
6. You’ve tried to use Lucene’s CustomScoreQuery to wrap an existing Query and alter each document’s score via a callback. This still wasn’t low-level enough for you; you needed even more control.

If you’re still reading you either think this is going to be fun/educational (good for you!) or you’re one of the minority that must control exactly what happens with search. If you don’t know, you can of course contact us for professional services.

Ok back to the action…

Refresher – Lucene Searching 101

Recall that to search in Lucene, we need to get a hold of an IndexSearcher. This IndexSearcher performs search over an IndexReader. Assuming we’ve created an index, with these classes we can perform searches like in this code:

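import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

// Assumes an index has already been written to dir
Directory dir = new RAMDirectory();
IndexReader idxReader = DirectoryReader.open(dir);
IndexSearcher idxSearcher = new IndexSearcher(idxReader);
Query q = new TermQuery(new Term("field", "value"));
TopDocs hits = idxSearcher.search(q, 10);  // fetch the top 10 matches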

Let’s summarize the objects we’ve created:

Directory – Lucene’s interface to a file system. This is pretty straight-forward. We won’t be diving in here.
IndexReader – Access to data structures in Lucene’s inverted index. If we want to look up a term, and visit every document it exists in, this is where we’d start. If we wanted to play with term vectors, offsets, or anything else stored in the index, we’d look here for that stuff as well.
IndexSearcher – wraps an IndexReader for the purpose of taking search queries and executing them.
Query – How we expect the searcher to perform the search, encompassing both scoring and which documents are returned. In this case, we’re searching for “value” in field “field”. This is the bit we want to toy with.

In addition to these classes, we’ll mention a support class that exists behind the scenes:

Similarity – Defines rules/formulas for calculating norms at index time and query normalization.

Now with this outline, let’s think about a custom Lucene Query we can implement to help us learn. How about a query that searches for terms backwards? If the document matches a term backwards (like ananab for banana), we’ll return a score of 5.0. If the document matches the forwards version, let’s still return the document, but with a score of 1.0 instead. We’ll call this Query “BackwardsTermQuery”.

This example is hosted here on github.

A tale of 3 classes – A Query, A Weight, and a Scorer

Before we sling code, let’s talk about general architecture.

A Lucene Query follows this general structure:

A custom Query class, inheriting from Query
A custom Weight class, inheriting from Weight
A custom Scorer class inheriting from Scorer

These three objects wrap each other. A Query creates a Weight, and a Weight in turn creates a Scorer.

A Query is itself a very straight-forward class. One of its main responsibilities when passed to the IndexSearcher is to create a Weight instance. Other than that, there are additional responsibilities to Lucene and users of your Query to consider, that we’ll discuss in the “Query” section below.

A Query creates a Weight. Why? Lucene needs a way to track IndexSearcher level statistics specific to each query while retaining the ability to reuse the query across multiple IndexSearchers. This is the role of the Weight class. When performing a search, IndexSearcher asks the Query to create a Weight instance. This instance becomes the container for holding high-level statistics for the Query scoped to this IndexSearcher (we’ll go over these steps more in the “Weight” section below). The IndexSearcher safely owns the Weight, and can abuse and dispose of it as needed. If later the Query gets reused by another IndexSearcher, a new Weight simply gets created.

Once an IndexSearcher has a Weight, and has calculated any IndexSearcher level statistics, the IndexSearcher’s next task is to find matching documents and score them. To do this, the Weight in turn creates a Scorer. Just as the Weight is tied closely to an IndexSearcher, a Scorer is tied to an individual IndexReader. Now this may seem a little odd – in our code above the IndexSearcher always has exactly one IndexReader, right? Not quite. See, a little hidden implementation detail is that IndexReaders may actually wrap other smaller IndexReaders – each tied to a different segment of the index. Therefore, an IndexSearcher needs to have the ability to score documents across multiple, independent IndexReaders. How your scorer should iterate over matches and score documents is outlined in the “Scorer” section below.

So to summarize, we can expand the search call from our example above…

idxSearcher.search(q, 10);

… into this pseudocode:

Weight w = q.createWeight(idxSearcher);
// IndexSearcher level calculations for weight
Foreach IndexReader idxReader:
    Scorer s = w.scorer(idxReader);    // collect matches and score them
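To make that flow a little more tangible, here is a toy model in plain Java with no Lucene dependency. Everything in it is a hypothetical stand-in: ToyQuery, ToyWeight, ToyScorer and ToySearcher are not Lucene classes, each String[] pretends to be one index segment of one-term documents, and the 1/docFreq score is an arbitrary formula chosen only so the Weight has a searcher-level statistic to hand to its Scorers. Treat it as a sketch of the shape of the interaction, not of Lucene's real implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the Query -> Weight -> Scorer flow. None of these are Lucene classes.
class SearchFlowDemo {

    // Reusable query: holds no searcher-specific state.
    static final class ToyQuery {
        final String term;
        ToyQuery(String term) { this.term = term; }

        ToyWeight createWeight(ToySearcher searcher) {
            return new ToyWeight(this, searcher);
        }
    }

    // Searcher-level statistics for one query, computed once per search.
    static final class ToyWeight {
        final ToyQuery query;
        final int docFreq; // matches across ALL segments of this searcher

        ToyWeight(ToyQuery query, ToySearcher searcher) {
            this.query = query;
            int count = 0;
            for (String[] segment : searcher.segments) {
                for (String doc : segment) {
                    if (doc.equals(query.term)) count++;
                }
            }
            this.docFreq = count;
        }

        ToyScorer scorer(String[] segment) {
            return new ToyScorer(this, segment);
        }
    }

    // Walks the matches of ONE segment and scores them.
    static final class ToyScorer {
        final ToyWeight weight;
        final String[] segment;
        int doc = -1;

        ToyScorer(ToyWeight weight, String[] segment) {
            this.weight = weight;
            this.segment = segment;
        }

        // Moves to the next matching doc in this segment; false when exhausted.
        boolean next() {
            while (doc < segment.length - 1) {
                doc++;
                if (segment[doc].equals(weight.query.term)) return true;
            }
            doc = segment.length;
            return false;
        }

        // Arbitrary illustrative formula: rarer terms score higher.
        float score() { return 1.0f / weight.docFreq; }
    }

    // Each String[] stands in for one segment; each element is a one-term document.
    static final class ToySearcher {
        final List<String[]> segments;
        ToySearcher(List<String[]> segments) { this.segments = segments; }

        List<String> search(ToyQuery q) {
            ToyWeight w = q.createWeight(this);              // once per search
            List<String> hits = new ArrayList<>();
            for (int seg = 0; seg < segments.size(); seg++) {
                ToyScorer s = w.scorer(segments.get(seg));   // once per segment
                while (s.next()) {
                    hits.add("seg" + seg + "/doc" + s.doc + " score=" + s.score());
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        ToySearcher searcher = new ToySearcher(Arrays.asList(
                new String[] {"banana", "apple", "banana"},
                new String[] {"cherry", "banana"}));
        for (String hit : searcher.search(new ToyQuery("banana"))) {
            System.out.println(hit);
        }
    }
}
```

Notice that the same ToyQuery could be handed to a second ToySearcher without any cleanup, because everything searcher-specific lives in the ToyWeight.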

Now that we have the basic flow down, let’s pick apart the three classes in a little more detail for our custom implementation.

Our Custom Lucene Query

What should our custom Lucene Query implementation look like? Query implementations always have two audiences: (1) Lucene and (2) users of your Query implementation. For your users, expose whatever methods you require to modify how a searcher matches and scores with your query. Want to only return as a match 1/3 of the documents that match the query? Want to punish the score because the document length is longer than the query length? Add the appropriate modifier on the query that impacts the scorer’s behavior.

For our BackwardsTermQuery, we don’t expose accessors to modify the behavior of the search. The user simply uses the constructor to specify the term and field to search. In our constructor, we will simply be reusing Lucene’s existing TermQuery for searching individual terms in a document.

private TermQuery backwardsQuery;
private TermQuery forwardsQuery;
public BackwardsTermQuery(String field, String term) {
    // A wrapped TermQuery for the reverse string
    Term backwardsTerm = new Term(field, new StringBuilder(term).reverse().toString());
    backwardsQuery = new TermQuery(backwardsTerm);
    // A wrapped TermQuery for the Forward
    Term forwardsTerm = new Term(field, term);
    forwardsQuery = new TermQuery(forwardsTerm);
}

Just as importantly, be sure your Query meets Lucene’s expectations. At a minimum, you MUST override the following:

createWeight()
hashCode()
equals()

The method createWeight() we’ve discussed. This is where you’ll create a Weight instance for an IndexSearcher. Pass any parameters that will influence the scoring algorithm, as the Weight will in turn be creating a Scorer.

Even though they are not abstract methods, overriding hashCode() and equals() is very important. Lucene and Solr use these methods to cache queries and results: if two queries are equal, there’s no reason to rerun the query. If your implementation reports two different queries as equal, running the second could return the cached results of the first. You’ll see your search for “peas” work great, then you’ll search for “bananas” and see “peas” search results. Override equals() and hashCode() so that “peas” != “bananas”.
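To see why a cache leans so heavily on these two methods, here is a minimal sketch in plain Java. ToyQuery and CachedSearch are hypothetical names, and the HashMap is only a stand-in for a result cache (the real Lucene/Solr caches are more involved). The point is simply that a cache keyed on the query object can only tell “peas” from “bananas” through equals() and hashCode().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class QueryCacheDemo {
    public static void main(String[] args) {
        CachedSearch search = new CachedSearch();
        System.out.println(search.search(new ToyQuery("field", "peas")));
        System.out.println(search.search(new ToyQuery("field", "bananas")));
        // An equal query built from scratch is served from the cache.
        System.out.println(search.search(new ToyQuery("field", "peas")));
        System.out.println("executions: " + search.executions);
    }
}

// Hypothetical stand-in for a custom query; NOT a Lucene class.
final class ToyQuery {
    final String field;
    final String term;

    ToyQuery(String field, String term) {
        this.field = field;
        this.term = term;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (other == null || getClass() != other.getClass()) return false;
        ToyQuery q = (ToyQuery) other;
        return field.equals(q.field) && term.equals(q.term);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, term);
    }
}

// Hypothetical result cache keyed on the query object.
final class CachedSearch {
    private final Map<ToyQuery, String> cache = new HashMap<>();
    int executions = 0; // how many times the "expensive" search really ran

    String search(ToyQuery q) {
        return cache.computeIfAbsent(q, key -> {
            executions++;
            return "results for " + key.term;
        });
    }
}
```

If equals() compared only the field (or hashCode() and equals() were inherited from a base class that ignores the term), the second search would quietly hand back the “peas” results.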

Our BackwardsTermQuery implements createWeight() by creating a custom BackwardsWeight that we’ll cover below:

@Override
public Weight createWeight(IndexSearcher searcher) throws IOException {
    return new BackwardsWeight(searcher);
}

BackwardsTermQuery has a fairly boilerplate equals() and hashCode() that pass through to the wrapped TermQuerys. Be sure equals() includes all the boilerplate stuff such as the check for self-comparison, the call to super.equals(), the class comparison, etc. By using Lucene’s unit test suite, we can get a lot of good checks that our implementation of these is correct.

@Override
public boolean equals(Object other) {
    if (this == other) {
        return true;
    }
    if (!super.equals(other)) {
        return false;
    }
    if (getClass() != other.getClass()) {
        return false;
    }
    BackwardsTermQuery otherQ = (BackwardsTermQuery)(other);
    if (otherQ.getBoost() != getBoost()) {
        return false;
    }
    return otherQ.backwardsQuery.equals(backwardsQuery) && otherQ.forwardsQuery.equals(forwardsQuery);
}

@Override
public int hashCode() {
    return super.hashCode() + backwardsQuery.hashCode() + forwardsQuery.hashCode();
}

Our Custom Weight

You may choose to use Weight simply as a mechanism to create Scorers (where the real meat of search scoring lives). However, your Custom Weight class must at least provide boilerplate implementations of the query normalization methods even if you largely ignore what is passed in:

getValueForNormalization
normalize

These methods participate in a little ritual that IndexSearcher puts your Weight through with the Similarity for query normalization. To summarize the query normalization code in the IndexSearcher:

float v = weight.getValueForNormalization();
float norm = getSimilarity().queryNorm(v);
weight.normalize(norm, 1.0f);

Great, what does this code do? Well, a value is extracted from the Weight. This value is then passed to a Similarity instance that “normalizes” that value. The Weight then receives this normalized value back. In short, this allows the IndexSearcher to give the Weight some information about how its “value for normalization” compares to the rest of the stuff being searched by this searcher.

This is extremely high level. “Value for normalization” could mean anything, but here it generally means “what I think is my weight”, and what the Weight receives back is the searcher saying “no really, here is your weight”. The details of what that means depend on the Similarity and Weight implementation. It’s expected that the Weight’s generated Scorer will use this normalized weight in scoring. You can choose to do whatever you want in your own Scorer, including completely ignoring what’s passed to normalize().
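For a feel of the arithmetic, here is a toy walk-through of the ritual in plain Java. ToyWeight is a hypothetical stand-in, not a Lucene class, and the queryNorm formula used here (one over the square root of the summed squared weights) is my understanding of what Lucene's default Similarity does, so treat it as an assumption rather than a spec.

```java
class NormalizationDemo {
    // Step 2: assumed formula, 1 / sqrt(sum of squared weights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        ToyWeight a = new ToyWeight(3.0f);
        ToyWeight b = new ToyWeight(4.0f);

        // Step 1: collect "what I think is my weight" from every participant.
        float v = a.getValueForNormalization() + b.getValueForNormalization();
        // Step 2: let the (assumed) similarity formula turn that into a norm.
        float norm = queryNorm(v);
        // Step 3: hand the norm back: "no really, here is your weight".
        a.normalize(norm, 1.0f);
        b.normalize(norm, 1.0f);

        System.out.println("norm=" + norm);
        System.out.println("a=" + a.normalizedWeight + " b=" + b.normalizedWeight);
    }
}

// Hypothetical stand-in for a Weight; NOT a Lucene class.
final class ToyWeight {
    private final float rawWeight; // e.g. idf * boost in a real term weight
    float normalizedWeight;        // what a scorer would later multiply by

    ToyWeight(float rawWeight) {
        this.rawWeight = rawWeight;
    }

    // Reports the squared raw weight so the caller can sum squares.
    float getValueForNormalization() {
        return rawWeight * rawWeight;
    }

    // Folds the searcher's norm and any top-level boost into the weight.
    void normalize(float norm, float topLevelBoost) {
        normalizedWeight = rawWeight * norm * topLevelBoost;
    }
}
```

With raw weights of 3 and 4 the sum of squares is 25, so the norm comes out at roughly 0.2 and the two normalized weights keep their 3:4 ratio while shrinking to a comparable scale.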

While our Weight isn’t factoring into the scoring calculation, for consistency’s sake we’ll participate in the little ritual by overriding these methods:

@Override
public float getValueForNormalization() throws IOException {
    return backwardsWeight.getValueForNormalization() +
             forwardsWeight.getValueForNormalization();
}

@Override
public void normalize(float norm, float topLevelBoost) {
    backwardsWeight.normalize(norm, topLevelBoost);
    forwardsWeight.normalize(norm, topLevelBoost);
}

Outside of these query normalization details and implementing scorer(), little else happens in the Weight. However, the Weight constructor is the place to perform any other work that requires an IndexSearcher. In our implementation, we don’t perform any additional steps with the IndexSearcher.

The final and most important requirement of Weight is to create a Scorer. For BackwardsWeight we construct our custom BackwardsScorer, passing scorers created from each of the wrapped queries to work with.

@Override
public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder,
        boolean topScorer, Bits acceptDocs) throws IOException {
    Scorer backwardsScorer = backwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs);
    Scorer forwardsScorer = forwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs);
    return new BackwardsScorer(this, context, backwardsScorer, forwardsScorer);
}

Our Custom Scorer

The Scorer is the real meat of the search work. Responsible for identifying matches and providing scores for those matches, this is where the lion’s share of our customization will occur.

It’s important to note that a Scorer is also a Lucene DocIdSetIterator. A DocIdSetIterator is a cursor into a set of documents in the index. It provides three important methods:

docID() – what is the id of the current document? (this is an internal Lucene ID, not the Solr “id” field you might have in your index)
nextDoc() – advance to the next document
advance(target) – advance (seek) to the target

One uses a DocIdSetIterator by first calling nextDoc() or advance() and then reading the docID to get the iterator’s current location. The docID values only increase as they are iterated over.
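If you’d like to poke at that contract without an index handy, here is a small array-backed imitation in plain Java. ToyDocIterator is a hypothetical class of my own, not a Lucene one. It only mimics the three methods described above, using -1 for “not positioned yet” and Integer.MAX_VALUE as the NO_MORE_DOCS sentinel.

```java
// Array-backed stand-in for the DocIdSetIterator contract. NOT a Lucene class.
class ToyDocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // matching doc IDs in ascending order
    private int pos = -1;     // -1: nextDoc()/advance() not called yet

    ToyDocIterator(int... docs) {
        this.docs = docs;
    }

    // Where is the cursor right now?
    int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // Step to the next matching document.
    int nextDoc() {
        if (pos < docs.length) pos++;
        return docID();
    }

    // Seek to the first matching document at or after target.
    int advance(int target) {
        int doc = nextDoc();
        while (doc < target) {
            doc = nextDoc();
        }
        return doc;
    }

    public static void main(String[] args) {
        ToyDocIterator it = new ToyDocIterator(2, 5, 9, 14);
        System.out.println(it.nextDoc());   // first match
        System.out.println(it.advance(6));  // first match at or after 6
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println(doc);        // whatever is left
        }
    }
}
```

The cursor never moves backwards: once advance(6) has skipped past 5, there is no way to revisit it short of creating a fresh iterator.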

By implementing this interface a Scorer acts as an iterator over matches in the index. A Scorer for the query “field1:cat” can be iterated over in this manner to return all the documents that match the cat query. In fact, if you recall from my article, this is exactly how the terms are stored in the search index.

You can choose to either figure out how to correctly iterate through the documents in a search index, or you can use the other Lucene queries as building blocks. The latter is often the simplest. For example, if you wish to iterate over the set of documents containing two terms, simply use the scorer corresponding to a BooleanQuery for iteration purposes.

The first method of our scorer to look at is docID(). It works by reporting the lowest docID() of our underlying scorers. This scorer can be thought of as being “before” the other in the index, and as we want to report numerically increasing docIDs, we always want to choose this value:

@Override
public int docID() {
    int backwardsDocId = backwardsScorer.docID();
    int forwardsDocId = forwardsScorer.docID();
    if (backwardsDocId <= forwardsDocId && backwardsDocId != NO_MORE_DOCS) {
        currScore = BACKWARDS_SCORE;
        return backwardsDocId;
    } else if (forwardsDocId != NO_MORE_DOCS) {
        currScore = FORWARDS_SCORE;
        return forwardsDocId;
    }
    return NO_MORE_DOCS;
}

Similarly, we always want to advance the scorer with the lowest docID, moving it ahead. Then, we report our current position by returning docID(), which as we’ve just seen will report the docID of the scorer that advanced the least in the nextDoc() operation.

@Override
public int nextDoc() throws IOException {
    int currDocId = docID();
    // increment one or both
    if (currDocId == backwardsScorer.docID()) {
        backwardsScorer.nextDoc();
    }
    if (currDocId == forwardsScorer.docID()) {
        forwardsScorer.nextDoc();
    }
    return docID();
}

In our advance() implementation, we allow each Scorer to advance. An advance() implementation promises to land docID() either exactly on target or past it. After both scorers advance, our call to docID() returns target if at least one of them landed on it; otherwise it returns the lowest docID past target.

@Override
public int advance(int target) throws IOException {
    backwardsScorer.advance(target);
    forwardsScorer.advance(target);
    return docID();
}

What a Scorer adds on top of DocIdSetIterator is the “score” method. When score() is called, a score for the current document (the doc at docID) is expected to be returned. Using the full capabilities of the IndexReader, any amount of information stored in the index can be consulted to arrive at a score, either in score() or while iterating documents in nextDoc()/advance(). Given the docID, you’ll be able to access the term vector for that document (if available) to perform more sophisticated calculations.

In our query, we simply keep track of whether the current docID comes from the wrapped backwards term scorer, indicating a match on the backwards term, or from the forwards scorer, indicating a match on the normal, unreversed term. Recall that docID() is always called at the end of advance()/nextDoc(). You’ll notice we update currScore in docID(), so it is refreshed every time the document advances.

@Override
public float score() throws IOException {
    return currScore;
}
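To watch docID(), nextDoc() and score() cooperate without spinning up an index, here is a self-contained imitation in plain Java. ToyIterator and ToyBackwardsScorer are hypothetical stand-ins for the wrapped term scorers and for BackwardsScorer. The merging logic mirrors the snippets above, but the arrays of doc IDs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BackwardsUnionDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Array-backed stand-in for a wrapped term scorer. NOT a Lucene class.
    static final class ToyIterator {
        private final int[] docs; // ascending doc IDs that match the term
        private int pos = -1;

        ToyIterator(int... docs) { this.docs = docs; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            if (pos < docs.length) pos++;
            return docID();
        }
    }

    // Same merge logic as the article's BackwardsScorer, minus Lucene.
    static final class ToyBackwardsScorer {
        static final float BACKWARDS_SCORE = 5.0f;
        static final float FORWARDS_SCORE = 1.0f;

        private final ToyIterator backwards;
        private final ToyIterator forwards;
        private float currScore;

        ToyBackwardsScorer(ToyIterator backwards, ToyIterator forwards) {
            this.backwards = backwards;
            this.forwards = forwards;
        }

        // Reports the lower of the two positions and remembers which side won.
        int docID() {
            int b = backwards.docID();
            int f = forwards.docID();
            if (b <= f && b != NO_MORE_DOCS) {
                currScore = BACKWARDS_SCORE;
                return b;
            } else if (f != NO_MORE_DOCS) {
                currScore = FORWARDS_SCORE;
                return f;
            }
            return NO_MORE_DOCS;
        }

        // Moves whichever side(s) sit on the current doc, then reports again.
        int nextDoc() {
            int curr = docID();
            if (curr == backwards.docID()) backwards.nextDoc();
            if (curr == forwards.docID()) forwards.nextDoc();
            return docID();
        }

        float score() { return currScore; }
    }

    // Drains the scorer into "docID:score" strings.
    static List<String> collect(int[] backwardsDocs, int[] forwardsDocs) {
        ToyBackwardsScorer scorer = new ToyBackwardsScorer(
                new ToyIterator(backwardsDocs), new ToyIterator(forwardsDocs));
        List<String> hits = new ArrayList<>();
        while (scorer.nextDoc() != NO_MORE_DOCS) {
            hits.add(scorer.docID() + ":" + scorer.score());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Docs 1, 4, 7 contain the reversed term; docs 2, 4, 9 the forward one.
        System.out.println(collect(new int[] {1, 4, 7}, new int[] {2, 4, 9}));
    }
}
```

Running main prints [1:5.0, 2:1.0, 4:5.0, 7:5.0, 9:1.0]. Document 4 matches both terms, and because the backwards side wins ties in docID(), it gets the higher score of 5.0.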

A Note on Unit Testing

Now that we have an implementation of a custom Lucene query, we’ll want to test it! I highly recommend using Lucene’s test framework. Lucene will randomly inject different implementations of various support classes and index implementations to throw your code off balance. Additionally, Lucene creates test implementations of classes such as IndexReader that work to check whether your Query correctly fulfills its contract. In my work, I’ve had numerous cases where tests would fail intermittently, pointing to places where my use of Lucene’s data structures subtly violated the expected contract.

An example unit test is included in the github project associated with this blog post.

Wrapping Up

That’s a lot of stuff! And I didn’t even cover everything there is to know! As an exercise for the reader, you can explore the Scorer methods cost() and freq(), as well as the rewrite() method of Query, which is optionally used for optimization.

Additionally, I haven’t explored how most of the traditional search queries end up using a framework of Scorers/Weights that don’t actually inherit from Scorer or Weight, known as “SimScorer” and “SimWeight”. These support classes consult a Similarity instance to customize the calculation of certain search statistics such as tf, convert a payload to a boost, etc.

In short, there’s a lot here! So tread carefully, there are plenty of fiddly bits out there! But have fun! Creating a custom Lucene query is a great way to really understand how search works, and it’s the last resort in solving relevancy problems short of creating your own search engine.