Getting Dissed by Dismax – Why your incorrect assumptions about dismax/edismax are hurting search relevancy

July 2, 2013 Doug Turnbull
In the competition between field scores, there's little fairness.

When you learn about the dismax query parser for the first time, one of the first things you learn about is qf, pf, and friends. To use (e)dismax, you take a Google-like query such as “laws about hats” and search it against the fields listed in qf.

What we next learn is that if we apply boosts to those fields, we can carefully tune how much impact each field has on scoring. If we take my favorite open-source law search project State Decoded as an example, we might boost fields accordingly:

qf=text^1 catch_line^5 tags^3
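For concreteness, here is a minimal sketch of how these parameters might be assembled into a request query string in Python. The parameter names defType, q, and qf are standard Solr request parameters; the helper name build_edismax_params is hypothetical, and nothing is actually sent to a server.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_boosts):
    """Build query-string parameters for an edismax search.

    field_boosts maps field name -> boost, e.g. {"text": 1, "catch_line": 5}.
    (Hypothetical helper, for illustration only.)
    """
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}


params = build_edismax_params(
    "laws about hats", {"text": 1, "catch_line": 5, "tags": 3}
)
print(params["qf"])       # text^1 catch_line^5 tags^3
print(urlencode(params))  # URL-encoded form, ready to append to a select URL
```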

We've made some estimate of how text, catch_line (the law title), and tags should be weighted. The naive user looks at this and thinks holistically: catch_line has been defined as five times as important as text, so catch_line matches will get a gentle nudge with respect to other matches. As if this:

qf=text catch_line tags

meant all matches would be treated fairly.

This thinking is largely a fallacy. Why? Two reasons:

Lucene’s scores are field-relative and
Dismax does not apply “weights”. It chooses the max of a field score: causing either-or scoring behavior.

Different Scoring Universes

In Lucene, each field is its own little scoring universe. Usually (but not always) the laws of physics are basically the same in these universes – TF-IDF. However, the fundamental constants that feed these formulas differ per field.

TF-IDF tends to be the predominant scoring model in use. At its core, a document's score is proportional to the number of times the query terms occur in that document and inversely proportional to the number of documents in the corpus that contain those terms. Effectively, TF-IDF scoring serves to boost the documents that are most densely filled with the user's query terms, especially the rare ones.

Solr/Lucene scores on a per-field basis. The factors that feed into TF-IDF can change dramatically between fields. For example, the scoring calculations that happen in a short title field, such as catch_line, are going to be very different from those in a free-text field, such as text.

Given these two Lucene queries:

q=catch_line:police
q=text:police

The document frequency for the term “police” will be radically different between the short catch_line field and the larger, free-form text field. Fields may also have configuration options enabled that influence scoring, such as omitNorms or omitTermFreqAndPositions. A given field might even be configured to use some kind of custom scoring, rendering scores almost completely unrelated.
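To make the “different constants” point concrete, here is a simplified sketch of the classic Lucene TF-IDF factors computed for the same term in two fields. The formulas follow Lucene's DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(field length)), but the sketch ignores queryNorm, coord, and norm quantization. Every statistic below (corpus size, frequencies, field lengths) is made up for illustration and is not taken from State Decoded.

```python
import math


def idf(num_docs, doc_freq):
    # Classic Lucene idf: the rarer the term in this field, the higher the value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_score(freq, num_docs, doc_freq, field_length):
    # Simplified single-term score: tf * idf^2 * lengthNorm.
    tf = math.sqrt(freq)
    length_norm = 1.0 / math.sqrt(field_length)
    return tf * idf(num_docs, doc_freq) ** 2 * length_norm


NUM_DOCS = 20000  # hypothetical corpus size

# Hypothetical statistics for the term "police" in each field.
catch_line_score = field_score(freq=1, num_docs=NUM_DOCS, doc_freq=40, field_length=4)
text_score = field_score(freq=4, num_docs=NUM_DOCS, doc_freq=3000, field_length=900)

print(f"catch_line: {catch_line_score:.2f}")
print(f"text:       {text_score:.2f}")
```

Even though the term occurs four times in the text field and only once in catch_line, the short field with the rarer term comes out far ahead under these assumed numbers. The same word is simply being measured with different constants.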

So when we switch to dismax, and search State Decoded with qf=text catch_line, what will the scoring look like? To figure that out, we need to search separately first with qf=text to get text's best matches, then with qf=catch_line to get catch_line's best matches. Let's search for both “insurance” and “police”.

Best Matches for catch_line

3.4429686 = (MATCH) sum of:
    3.4429686 = (MATCH) weight(catch_line:polic in 1768) [DefaultSimilarity], result of: (OMITTED)
3.4429686 = (MATCH) sum of:
    3.4429686 = (MATCH) weight(catch_line:polic in 7674) [DefaultSimilarity], result of: (OMITTED)
5.2182307 = (MATCH) sum of:
    5.2182307 = (MATCH) weight(catch_line:insur in 7902) [DefaultSimilarity], result of: (OMITTED)
3.3893402 = (MATCH) sum of:
    3.3893402 = (MATCH) weight(catch_line:insur in 894) [DefaultSimilarity], result of: (OMITTED)

Best Matches for text:

1.8226589 = (MATCH) sum of:
    1.8226589 = (MATCH) weight(text:polic in 13273) [DefaultSimilarity], result of: (OMITTED)
        1.8226589 = score(doc=13273,freq=4.0 = termFreq=4.0
1.7184192 = (MATCH) sum of:
    1.7184192 = (MATCH) weight(text:polic in 11078) [DefaultSimilarity], result of: (OMITTED)
1.5556965 = (MATCH) sum of:
    1.5556965 = (MATCH) weight(text:insur in 6536) [DefaultSimilarity], result of: (OMITTED)
1.5397402 = (MATCH) sum of:
    1.5397402 = (MATCH) weight(text:insur in 13116) [DefaultSimilarity], result of: (OMITTED)

Notice how the numbers line up. Scoring, being a field-relative calculation, puts the best catch_line matches somewhere between 3.4 and 5.2. The text field, by contrast, seems happy to report “good” at around 1.8.

Well, we say “good,” but we don't really know whether 1.8 is a truly amazing result for that field or a truly abysmal one. Maybe it turns out that because of the characteristics of these fields, 5 is a terrible score for catch_line while 1.8 is a truly off-the-charts amazing score for text.

Take a second to let this sink in. Our inclination is often to give title fields a big “boost” due to importance. But they may already get a pretty nice boost relative to other fields just by the nature of the scoring universe created for that field.

In short, the scores and their relative scales are completely specific to each field and unrelated to each other. This is a crucial bit of information as we consider the most important feature of a dismax:

Dismax causes either-or scoring behavior

Dismax takes the maximum of multiple fields' scores. As we've seen, field scores come from independent measurement universes, rendering this not much better than ranking college applicants by taking the maximum of their SAT score and GPA.

Because of this, dismax can create a winner-takes-all scenario where one field’s score dominates the final ranking. All the top results could be scored best simply because one field’s scores tend to be higher by default. According to dismax, students always get sorted by SAT score, not by GPA because we can pretty much guarantee that: max(SAT(student), GPA(student)) == SAT(student).
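The SAT/GPA analogy can be sketched in a few lines of Python. The applicants and their numbers are invented for illustration; the point is only that taking the max of two measurements on unrelated scales silently discards the smaller-scaled one.

```python
# Hypothetical applicants: (name, SAT score, GPA).
students = [
    ("Avery", 1150, 4.0),
    ("Blake", 1480, 2.9),
    ("Casey", 1320, 3.6),
]


def dismax_style_score(sat, gpa):
    # Take the max of two measurements that live on unrelated scales.
    return max(sat, gpa)


by_max = sorted(students, key=lambda s: dismax_style_score(s[1], s[2]), reverse=True)
by_sat = sorted(students, key=lambda s: s[1], reverse=True)

print([name for name, _, _ in by_max])  # ['Blake', 'Casey', 'Avery']
print(by_max == by_sat)                 # True: GPA never affects the ranking
```

Avery's perfect GPA has no influence at all: the ranking is identical to sorting by SAT alone.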

The same thing can happen when scoring fields. In the example above, catch_line matches just happened to be higher than text matches out-of-the-gate, so our results could be hundreds of good catch_line matches first, followed by the good text matches.

Here's an example that illustrates how destructive this behavior can be. Say we have a tags field, and sometimes our query matches a tag, and sometimes it doesn't. If the tags field's scores are very high, then when a tag does match, it might completely overwhelm the value of the other fields' scores. Consider what happens if we query for “car insurance” and the tags field matches strongly on “insurance”:

Search results for “car insurance”

Law about car insurance (tag score = 100, text score = 1.9)
Law about life insurance (tag score = 99, text score = 1.6)
Law about travel insurance (tag score = 98, text score = 1.6)
Law about health insurance (tag score = 97, text score = 1.5)

Laws we SHOULD have

Law about car insurance (text score = 1.9)
Law about kids with car insurance (text score = 1.9)
Laws about dogs with car insurance (text score = 1.8)

Suddenly the mere presence of a match on this field causes our results to look rather odd. Dismax's “either-or,” winner-take-all behavior has preferred the match on tags over the other fields, causing us to blow away relevant matches from other fields. Effectively, dismax moves the large block of good tags matches up to the top and disregards other potentially valuable matches.
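The effect can be reproduced with a small sketch that ranks documents by a plain max over field scores (that is, dismax with no tie). The scores are the illustrative ones from the lists above; treating the “kids” and “dogs” laws as having no tags match at all is an assumption made for this example.

```python
# Per-field scores; a missing field means the query did not match there.
docs = {
    "Law about car insurance": {"tags": 100, "text": 1.9},
    "Law about life insurance": {"tags": 99, "text": 1.6},
    "Law about travel insurance": {"tags": 98, "text": 1.6},
    "Law about health insurance": {"tags": 97, "text": 1.5},
    "Law about kids with car insurance": {"text": 1.9},
    "Laws about dogs with car insurance": {"text": 1.8},
}


def dismax(field_scores):
    # Pure dismax: the best-scoring field is the document's score.
    return max(field_scores.values())


ranked = sorted(docs, key=lambda title: dismax(docs[title]), reverse=True)
text_only = sorted(docs, key=lambda title: docs[title].get("text", 0.0), reverse=True)

print("dismax ranking:")
for title in ranked:
    print(f"  {dismax(docs[title]):>6}  {title}")

print("ranking by text score alone:")
for title in text_only:
    print(f"  {docs[title]['text']:>6}  {title}")
```

Under dismax, every document with a tags match outranks every document without one, regardless of how well the text field matched.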

Boosting, tie, and other mitigation options

In our previous example, the tags field dominated the results. Does downboosting tags help? What if we downboosted to roughly the range of the other scores? If tags scores tend to go from 10 (bad tags match) to 100 (good tags match), we could boost by adding ^0.01 to tags in qf.

While this does help, it's not perfect. It assumes that scores will have identical distributions through the scoring space. You'll still have winner-take-all situations occasionally. It also doesn't work well if, instead of a scale of 10 to 100, we have 99 as a terrible score and 100 as a good score for a field.
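A quick sketch shows why the shape of the score distribution matters. It treats a qf boost as a simple multiplier on the field's score, which is a simplification of what Lucene actually does, and the score ranges are the hypothetical ones from the paragraphs above.

```python
def boosted(score, boost):
    # Simplifying assumption: a qf boost acts as a plain multiplier.
    return score * boost


TAGS_BOOST = 0.01

# Field whose scores span 10 (bad match) to 100 (good match).
wide = [boosted(s, TAGS_BOOST) for s in (10, 100)]
# Field where 99 is a terrible match and 100 is a good one.
narrow = [boosted(s, TAGS_BOOST) for s in (99, 100)]

wide_spread = wide[1] - wide[0]
narrow_spread = narrow[1] - narrow[0]

print(f"wide field:   bad={wide[0]:.2f} good={wide[1]:.2f} spread={wide_spread:.2f}")
print(f"narrow field: bad={narrow[0]:.2f} good={narrow[1]:.2f} spread={narrow_spread:.2f}")
```

After the boost, the wide field's good and bad matches are still clearly separated, but in the narrow field a terrible match and a good match end up nearly indistinguishable. A single multiplier rescales a field; it cannot reshape its distribution.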

Does tie help? The tie parameter lets you layer in other field matches into your score. Instead of the maximum of the score you get:

score = max(scores) + tie * sum(otherscores)

A tie of 1.0 effectively turns the score into something of a “DisSum” score: the sum of all the matching fields' scores becomes the overall score. This might help, but if one field's good scores tend to be in the 100s while another's tend to be in the single digits, scoring is still going to be a “winner takes all” scenario, as the sum is dominated by the larger score. Normalizing scores via boosting first, though, may make the tie parameter more valuable.
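The tie formula is short enough to sketch directly. The function below is an illustrative reimplementation of the formula above, not Lucene's code, and the field scores are hypothetical.

```python
def dismax_score(field_scores, tie=0.0):
    """score = max(scores) + tie * sum(other scores)."""
    ordered = sorted(field_scores, reverse=True)
    best, others = ordered[0], ordered[1:]
    return best + tie * sum(others)


# Hypothetical [tags, text] scores for two documents.
tag_heavy = [100.0, 1.5]   # strong tags match, mediocre text match
text_heavy = [10.0, 1.9]   # weak tags match, strong text match

for tie in (0.0, 0.5, 1.0):
    a = dismax_score(tag_heavy, tie)
    b = dismax_score(text_heavy, tie)
    print(f"tie={tie}: tag_heavy={a:.2f} text_heavy={b:.2f}")
```

With tie=0.0 the score is the plain max, and with tie=1.0 it is the plain sum. In both cases the tag-heavy document wins by a wide margin, because the tags field's scale swamps whatever the text field contributes.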

Relevancy is hard, let’s go shopping

Dismax is a great solution for a needle-in-a-haystack problem. You have many fields and it will be relatively rare that query terms match in more than one field. Dismax breaks down when we’re searching for hay in a haystack. When matches are common and more fields are brought into the dismax equation, it becomes increasingly hard to balance out the diverse measurements. Scoring can fall apart as the dismax equation gets increasingly hard to balance for more-and-more use cases.

Carefully picking the right parameters to make Google-like search meaningful and relevant is hard. It's also likely never going to be perfect, especially as we add new fields. You can get pretty close, but trying to eliminate every case of weird dismax behavior quickly runs into diminishing returns.

What do you think? Do you have a tough relevancy problem? Tell us about it. We’d love to help!