Why is Multi-term synonym mapping so hard in Solr?
There is a very common need for multi-term synonyms. We’ve actually run across several use cases among our recent clients. Consider the following examples:
- Ecommerce: If a customer searches for “weed whacker”, but the more canonical name is “string trimmer”, then you need synonyms, otherwise you’re going to lose a sale.
- Law: Consider a layperson attempting to find a section of legal code pertaining to their “truck”. If the law only talks about “motor vehicles”, then, without synonyms, this individual will go away uninformed.
- Medicine: When a doctor is looking up recent publications on “heart attack”, synonyms make sure that they also find documents that happen to only mention “myocardial infarction”.
One would hope that working with synonyms should be as simple as tossing a set of synonyms into the synonyms.txt file and just having Solr “do the right thing.”™ And when we’re talking about simple, single-term synonyms (e.g. TV = televisions), synonyms really are just that straightforward. Unfortunately, as you get into more complex uses of synonyms, such as multi-term synonyms, there are several gotchas. Sometimes there are workarounds. And sometimes, for now at least, you’ll just have to make do with what you can currently achieve using Solr! In this post we’ll provide a quick intro to synonyms in Solr, walk through some of the pain points, and then propose possible resolutions.
Solr Synonyms – Back to Basics
In principle, synonym mapping is not that complex. Solr’s SynonymFilter searches through a stream of tokens and compares the contents to what it finds in the synonyms.txt file. The synonyms.txt file contains a series of line-delimited entries that look something like this:
spiderman, spider man
television => TV
The SynonymFilter takes an “expand” parameter that can be set to either true or false. If set to false, then “spider man television” will be converted to
| TOKEN 1 | TOKEN 2   | TOKEN 3 |
+---------+-----------+---------+
|         | spiderman | TV      |
Notice that the first token is blank. This is because spiderman is a single token while spider man is two. In this way, Solr preserves the position information should it be useful for something like a phrase query. If you set expand to true, then the results are a little different. “spider man television” becomes
| TOKEN 1 | TOKEN 2   | TOKEN 3 |
+---------+-----------+---------+
| spider  | man       |         |
|         | spiderman | TV      |
There are a few things to note here. First, because we expanded the synonyms, we now have both spiderman and spider man. Second, TV did not get expanded to both TV and television; this is because the “=>” notation in the synonyms.txt file overrides the “expand” parameter. Finally, you might be surprised to see that Solr can actually have several tokens in the same position. In this case, man and spiderman are both in token position 2.
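To make the interplay between “expand” and “=>” concrete, here is a rough Python sketch of how the two kinds of entries could be interpreted. It is a toy model and not Solr’s actual implementation; the function name parse_rules and the dictionary layout are made up for illustration, and token positions are ignored entirely.

```python
def parse_rules(lines, expand):
    """Toy model: turn synonyms.txt-style lines into
    {source tokens: [replacement tokens, ...]}. Not Solr's real parser."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: the left side is replaced by the right side,
            # whatever the expand setting says.
            left, right = line.split("=>", 1)
            sources = [tuple(s.split()) for s in left.split(",") if s.strip()]
            targets = [tuple(t.split()) for t in right.split(",") if t.strip()]
        else:
            # Equivalence list: with expand=True every entry maps to all
            # entries; with expand=False every entry maps to the first one.
            sources = [tuple(s.split()) for s in line.split(",") if s.strip()]
            targets = sources if expand else sources[:1]
        for source in sources:
            rules[source] = list(targets)
    return rules


LINES = ["spiderman, spider man", "television => TV"]
collapsed = parse_rules(LINES, expand=False)
expanded = parse_rules(LINES, expand=True)

print(collapsed[("spider", "man")])  # only the first entry of the list
print(expanded[("spider", "man")])   # every entry of the list
print(expanded[("television",)])     # "=>" rule, unaffected by expand
```

With expand set to false, spider man collapses to spiderman; with expand set to true it maps to both forms, while television maps to TV either way.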
The final aspect is whether to do the synonym mapping at index time, at query time, or both. The benefit of performing the mapping only at query time is that you avoid index-time synonym expansion, which can bloat your index considerably. Furthermore, if you are using index-time synonym mapping, then upon any change to the synonym file, you must reindex everything! On the other hand, index-time synonym mapping will allow your term document frequencies to be more accurate, and some of the gotchas I’ll present in a moment make index-time synonyms actually start to look more appealing. There are also plausibly times when it makes sense to use synonym mapping both at query and index time, for instance when doing complicated things with hyponyms and hypernyms (which are basically synonyms, but more specific or less specific, respectively – canine -> dog -> poodle). But these details are beyond the scope of this post.
Why are Multi-Term Synonyms so Hard?
Basically, single-term synonyms work just about exactly as you would hope they work. But you start running into problems instantly when dealing with multi-term synonyms. Let’s say that your synonyms.txt file looks like this:
spider man => spiderman
Further let’s say that you have documents in your index with the text:
- the adventures of spiderman
- what hath man wrought?
- spiders eat insects
And let’s say that synonym mapping only happens at query time. When a user searches for spider man, which documents do you think will match? The obvious and desired answer is that document 1 and only document 1 should match, right? Instead, very non-intuitively, documents 2 and 3 match, but document 1 does not! Why!?
Unfortunately, Solr’s query parser splits queries on whitespace before passing the terms to analysis. This means that in the search for spider man, the term spider and the term man go through analysis independently. And since the SynonymFilter never sees “spider man”, the rule “spider man => spiderman” is never matched. (For more information on this issue, refer to the LUCENE-2605 Jira issue.)
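The effect of splitting on whitespace first can be mimicked with a small Python sketch. This is only a simulation of the behavior described above, not Solr code: apply_synonyms is a hypothetical stand-in for the SynonymFilter, and the rule table is hard-coded.

```python
# Hypothetical stand-in for the rule "spider man => spiderman".
RULES = {("spider", "man"): ["spiderman"]}
MAX_LEN = max(len(source) for source in RULES)


def apply_synonyms(tokens):
    """Replace the longest matching run of tokens at each position."""
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in RULES:
                out.extend(RULES[key])
                i += n
                break
        else:
            # No rule starts at this position; keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out


query = "spider man"

# What we would like: the filter sees the whole token stream.
whole = apply_synonyms(query.split())

# What happens when each whitespace-separated term is analyzed on its own.
per_term = [tok for term in query.split() for tok in apply_synonyms([term])]

print(whole)     # the multi-term rule fires
print(per_term)  # the rule never sees both tokens together
```

In the second case the filter is only ever handed one token at a time, so a rule whose left-hand side has two tokens has no chance of matching.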
How to Deal with Multi-Term Synonyms
For now, the simplest resolution to the problem outlined above is to perform synonym mapping at index time rather than query time, and make sure that synonyms are expanded. For instance, in the case above we could replace the line in synonyms.txt with the following:
spider man, spiderman
In this case the first document gets indexed with both spider man and spiderman and a search for spider man will match document 1.
However, this resolution is also flawed. When someone searches for spider man, they most likely mean the one and only Spider-Man. Unfortunately, the above query for spider man will match documents 2 and 3 as well as document 1. While there is no simple way to get around this, you can at least improve the situation by incorporating phrase matching behavior into your search. Most likely, you are going to be using the edismax query parser to handle user queries. By turning on the phrase field parameters (pf, pf2, and pf3) you can boost documents that contain phrase matches above documents that don’t. This helps ensure that document 1 will be at the top of search results for spider man.
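As a rough illustration, the request parameters for such a query might be assembled like this in Python. The field name title and the boost values are assumptions picked for the example, not recommendations; defType, qf, pf, pf2, and pf3 are the edismax parameters mentioned above.

```python
from urllib.parse import urlencode

# "title" and the boost factors are hypothetical; tune them for your schema.
params = {
    "defType": "edismax",
    "q": "spider man",
    "qf": "title",      # field(s) searched term by term
    "pf": "title^10",   # boost when the whole query appears as a phrase
    "pf2": "title^5",   # boost for matching word pairs
    "pf3": "title^3",   # boost for matching word triples
}

query_string = urlencode(params)
print("/select?" + query_string)
```

The resulting string would be appended to the select handler of whichever core or collection you are querying.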
The best resolution to the multi-term synonym problem, it seems, has yet to be invented. Some have built Solr modules that perform synonym mapping before sending the query string to the query parser, and we’ve made good use of this solution. But it does come with the unfortunate side effect that synonym configuration, which rightly belongs in schema.xml, must be copied over to solrconfig.xml (which is intended to describe Solr’s behavior).
I’ve got in mind the twinklings of a query parser that might just resolve these issues, but that’s for a future time and later blog post. Till then…