Blog

The Anatomy of a DisMax Query

While debugging the functionality of a new query parser, I had the poor fortune rare opportunity to dig deep into Solrs search scoring code. I ended up learning a ton about how queries are composed and scored, so I though Id pass the learning on to you so that perhaps you can avoid my fate benefit from my new-found understanding.

Our discussion here will focus upon the composition of queries made through Solrs DisMax query parser (DisMaxQParser). This was the code that I was largely working with, but it serves as an excellent example because it touches upon all of the most important query types that you might regularly deal with: Boolean queries, DisMax queries, Phrase queries, and Term queries.

The Query

So without further adieu lets take a look our sample query:

localhost:8983/solr/select?q=captain james kirk&defType=edismax&qf=Body Title^1.5&pf=Body Title&mm=2&debugQuery=true

(Those who have been following our blog recently may realize that here we are using our favorite demo data set – the SciFi StackExchange data. Its easy to understand and fun to work with. If youd like to give it a try, our Indexing StackOverflow in Solr post will get you started in 10 minutes.) Here were looking for posts containing captain james kirk (q), were using the edismax parser (defType), were looking for pure term matches in the Body and Title fields (note the boost on the Title, qf), and were looking for phrases matches on the same two fields (pf). We insist that at least 2 of the terms must match (mm), and finally, since were interested in understanding how Solr thinks about our query and about scoring, we turn on debug output (debugQuery).

The first thing that is interesting to look at is the parsedquery section of the response. This shows us how Solr breaks down that query we sent it:

+(  ( DisjunctionMaxQuery((Body:captain | Title:captain^1.5))    DisjunctionMaxQuery((Body:james | Title:james^1.5))    DisjunctionMaxQuery((Body:kirk | Title:kirk^1.5)) )~2)DisjunctionMaxQuery((Body:"captain james kirk"))DisjunctionMaxQuery((Title:"captain james kirk"))

Boolean Queries

Here we actually have examples of all the commonly used queries, Boolean, DisjunctionMax, Term, and Phrase. Lets break it down. At the outer-most level we have a Boolean query with three clauses:

  1. +(( DisjunctionMaxQuery((Body:captain | Title:captain^1.5)) DisjunctionMaxQuery((Body:james | Title:james^1.5)) DisjunctionMaxQuery((Body:kirk | Title:kirk^1.5)) )~2)
  2. DisjunctionMaxQuery((Body:"captain james kirk"))
  3. DisjunctionMaxQuery((Title:"captain james kirk"))

Now you might be asking yourself “Where are the ANDs and ORs?” Well, thats not quite how Lucene Boolean queries work. Instead, a Lucene Boolean query is composed of a set of clauses that are modified by either MUST match, MUST_NOT match, or SHOULD match. SHOULD match is the default, MUST match is indicated in the above query by a plus at the front of the clause, and MUST_NOT match, though not present in this example, is indicated by a minus at the front of the clause. So for this query, we are insisting that the first clause MUST match in every document returned from our search, the other two SHOULD match, but if they dont its ok (the document just gets a lower score). And as far as scoring goes, each matching clause of a Boolean query is scored and the final score of the Boolean query is the sum of the score of its clauses.

Lets dig one level deeper on that first clause of the Boolean query. Here we actually find yet another Boolean query, and it has three SHOULD match clauses (these clauses are all DisjunctionMaxQueries… hold your horses, were getting there). One interesting thing to note about this Boolean query is the ~2 at the end of that query. Its there because you earlier specified mm=2 in the original query. This ~2 indicates that of the three optional SHOULD match subqueries, we are insisting that at least two of these queries actually match, otherwise we disregard that document.

DisMax Queries

Lets now take a closer look at the first actual DisMax query: DisjunctionMaxQuery((Body:captain | Title:captain^1.5)). DisjunctionMax is a funny name… but no ones laughing. It means that we should search for the term in question across the disjunction of the possible fields, and return the max score from all these sub-queries. So in this case, if captain matched in the Title and matched in Body, we would find the score in both cases and return the maximum of these two as the score for the DisMax clause

Now why would you want to return only the max of the sub-query scores? Consider an example: Lets say you want to search for harry potter over the Title and Body fields. If we take the sum of the subqueries instead of the max, then a document that has “Harry Potter” in the Title but not in the Body will have the same score as a document that happens to have “Harry” in both Title and Body, but “Potter” nowhere. This is not what you want! If maximization still seems too harsh for your use case, then look up the “tie” parameter which allows you to do something between maximization and summing of clause scores.

Term Queries

Going one level deeper now, lets look at the first clause: Body:captain. Looks pretty simple, right? Were just looking for captain in the Body field. Actually, this is one of the most complex portions of the query. The score of this query is dependent upon several thing, among them:

  • How common the term captain is in this document (Term Frequency)
  • How common the term captain is across all documents (Document Frequency)
  • How large the field is in this document (Field Norm)
  • How much this query is boosted

This is actually a very interesting topic, and possibly worth a future post by itself. However, for the time being, I will just refer you to the very well documented TFIDFSimilarity class that covers these details.

Phrase Queries

This leaves one final piece, the Phrase query. Traversing back up the query parse tree, the next top level Boolean sub-clause is a DisMax query with a single clause: DisjunctionMaxQuery((Body:"captain james kirk")). As a DisMax query, this is a somewhat degenerate case because, as youll recall, DisMax takes the maximum score of all its sub-clauses; in this case there is only one clause. Regardless, its the sub-clause that were interested in right now: Body:"captain james kirk" This is a Phrase query. Basically it seeks all the documents that contain all of the specified terms and then throws away the documents in which these terms are not adjacent to one another and in this order. Scoring proceeds in a manner analogous to Term queries in that the score is proportional to the number of occurrence of the phrase in this document and inversely proportional to the number of occurrences of the these terms across the whole corpus.

Holistic Understanding of the DisMax Query

So now since weve broken down the DisMax query and taken a look at all the little pieces, lets build it back up and look once more at the DisMax query as a whole.

+(  ( DisjunctionMaxQuery((Body:captain | Title:captain^1.5))    DisjunctionMaxQuery((Body:james | Title:james^1.5))    DisjunctionMaxQuery((Body:kirk | Title:kirk^1.5)) )~2)DisjunctionMaxQuery((Body:"captain james kirk"))DisjunctionMaxQuery((Title:"captain james kirk"))

In English:

  • Were looking first and foremost for documents that contain at least two of the three terms: captain, james, and kirk. This has to match!
    • We are searching only in the Body and Title fields, and documents dont get a higher score for matching any one of these terms in both fields.
      • Oh… and Title is really important, so its score is multiplied by 1.5.
  • Next if any of the documents from the last bullet also have the phrase “captain james kirk” in the Body, then well give them a higher score.
  • Again, if any of the documents from the last bullet also have the phrase “captain james kirk” in the Title, then well give them a higher score.

And thats it! Its a little confusing at first, but once you understand whats going on, youll be better able to control the relevancy of your search results. Next up, take a look at a related post where I talk about finally killing Sea Biscuit!


Check out my LinkedIn Follow me on Twitter