Why is Multi-term synonym mapping so hard in Solr?

There is a very common need for multi-term synonyms. We’ve actually run across several use cases among our recent clients. Consider the following examples:

  • Ecommerce: If a customer searches for “weed whacker”, but the more canonical name is “string trimmer”, then you need synonyms, otherwise you’re going to lose a sale.
  • Law: Consider a layperson attempting to find a section of legal code pertaining to their “truck”. If the law only talks about “motor vehicles”, then, without synonyms, this individual will go away uninformed.
  • Medicine: When a doctor is looking up recent publications on “heart attack”, synonyms make sure that he also finds documents that happen to only mention “myocardial infarction”.

One would hope that working with synonyms would be as simple as tossing a set of synonyms into the synonyms.txt file and having Solr “do the right thing.”™ And when we’re talking about simple, single-term synonyms (e.g. TV = television), synonyms really are that straightforward. Unfortunately, as you get into more complex uses of synonyms, such as multi-term synonyms, there are several gotchas. Sometimes there are workarounds. And sometimes, for now at least, you’ll just have to make do with what you can currently achieve using Solr! In this post we’ll provide a quick intro to synonyms in Solr, walk through some of the pain points, and then propose possible resolutions.

Solr Synonyms – Back to Basics

In principle, synonym mapping is not that complex. Solr’s SynonymFilter scans a stream of tokens and compares them against the entries in the synonyms.txt file. The synonyms.txt file contains a series of line-delimited entries that look something like this:

spiderman, spider man
television => TV

The SynonymFilter takes an “expand” parameter that can be set to either true or false. If set to false, every term in a comma-separated list is mapped to the first term in that list, so “spider man television” will be converted to

| TOKEN 1 | TOKEN 2   | TOKEN 3 |
+---------+-----------+---------+
|         | spiderman | TV      |

Notice that the first token is blank. This is because spiderman is a single token, whereas spider man was two. In this way, Solr preserves the position information in case it is useful for something like a phrase query. If you set expand to true, then the results are a little different. “spider man television” becomes

| TOKEN 1 | TOKEN 2   | TOKEN 3 |
+---------+-----------+---------+
| spider  | man       |         |
|         | spiderman | TV      |

There are a few things to note here. First, because we expanded the synonyms, we now have both spiderman and spider man. Second, TV did not get expanded to both TV and television; this is because the “=>” notation in the synonyms.txt file overrides the “expand” parameter. Finally, you might be surprised to see that Solr can actually have several tokens in the same position. In this case, man and spiderman are both in token position 2.
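To make the two tables above concrete, here is a toy model in Python. It is only a sketch of the behavior described in this post, not Solr’s actual SynonymFilter: the function names (parse_synonyms, find_match, apply_synonyms) are made up for illustration, and the positioning rule (a replacement ends at the position of the last token it matched) is an assumption chosen to reproduce the tables.

```python
SYNONYMS_TXT = """
spiderman, spider man
television => TV
"""


def parse_synonyms(text):
    """Parse synonyms.txt-style lines into (explicit, sources, targets) rules."""
    rules = []
    for line in text.strip().splitlines():
        if "=>" in line:
            # Explicit mapping: anything on the left becomes what is on the right.
            left, right = line.split("=>")
            sources = [s.split() for s in left.split(",")]
            targets = [t.split() for t in right.split(",")]
            rules.append((True, sources, targets))
        else:
            # Equivalent terms: every entry is both a source and a target.
            group = [s.split() for s in line.split(",")]
            rules.append((False, group, group))
    return rules


def find_match(tokens, i, rules):
    """Return (explicit, matched_source, targets) for a rule matching at index i."""
    for explicit, sources, targets in rules:
        for src in sources:
            if tokens[i:i + len(src)] == src:
                return explicit, src, targets
    return None


def apply_synonyms(tokens, rules, expand):
    """Return {position: set_of_terms}, with positions starting at 1.

    Toy model only: it handles the cases shown in the tables and does not
    try to cover every situation a real filter must deal with.
    """
    positions = {}
    i = 0
    while i < len(tokens):
        match = find_match(tokens, i, rules)
        if match is None:
            positions.setdefault(i + 1, set()).add(tokens[i])
            i += 1
            continue
        explicit, src, targets = match
        # "=>" ignores expand; otherwise expand=False keeps only the first entry.
        emit = targets if (explicit or expand) else targets[:1]
        last = i + len(src)  # 1-based position of the last matched token
        for phrase in emit:
            start = last - len(phrase) + 1  # right-align on the last position
            for k, term in enumerate(phrase):
                positions.setdefault(start + k, set()).add(term)
        i += len(src)
    return positions


rules = parse_synonyms(SYNONYMS_TXT)
tokens = "spider man television".split()
for expand in (False, True):
    print(expand, sorted(apply_synonyms(tokens, rules, expand).items()))
```

With expand set to False the model leaves position 1 empty; with expand set to True it puts both man and spiderman in position 2, as in the second table.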

The final question is whether to do the synonym mapping at index time, at query time, or both. The benefit of performing the mapping only at query time is that you avoid index-time synonym expansion, which will bloat your index considerably. Furthermore, if you are using index-time synonym mapping, then upon any change to the synonym file, you must reindex everything! On the other hand, index-time synonym mapping will allow your term document frequencies to be more accurate, and some of the gotchas I’ll present in a moment make index-time synonyms start to look more appealing. There are also plausibly times when it makes sense to use synonym mapping at both query and index time, for instance when doing complicated things with hyponyms and hypernyms (which are basically synonyms, but more specific or less specific, respectively: canine -> dog -> poodle). But these details are beyond the scope of this post.

Why are Multi-Term Synonyms so Hard?

Basically, single-term synonyms work just about exactly as you would hope. But you start running into problems as soon as you deal with multi-term synonyms. Let’s say that your synonyms.txt file looks like this:

spider man => spiderman

Further let’s say that you have documents in your index with the text:

  1. the adventures of spiderman
  2. what hath man wrought?
  3. spiders eat insects

And let’s say that synonym mapping only happens at query time. When a user searches for spider man, which documents do you think will match? The obvious and desired answer is that document 1 and only document 1 should match, right? Instead, very non-intuitively, documents 2 and 3 match (document 3 assuming the field is stemmed, so that spiders matches spider), but document 1 does not! Why!?

Unfortunately, Solr’s query parser splits queries on whitespace before passing the terms to analysis. This means that in the search for spider man, the term spider and the term man go through analysis independently. And since the SynonymFilter never sees “spider man”, the rule “spider man => spiderman” is never matched. (For more information on this issue, refer to the LUCENE-2605 Jira issue.)
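The effect of splitting on whitespace before analysis can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Solr code: the stem function is a naive stand-in for a real stemmer, a document counts as a match if it shares any term with the query (OR semantics), and all of the names here are hypothetical.

```python
# Query-time rule from synonyms.txt: spider man => spiderman
RULES = {("spider", "man"): ["spiderman"]}

DOCS = {
    1: "the adventures of spiderman",
    2: "what hath man wrought?",
    3: "spiders eat insects",
}


def stem(term):
    """Naive stand-in for a stemmer: strip punctuation and a trailing 's'."""
    term = term.lower().strip("?.,!")
    return term[:-1] if len(term) > 3 and term.endswith("s") else term


def analyze(text, use_synonyms):
    """Tokenize on whitespace, optionally apply the synonym rule, then stem."""
    tokens = [t.lower() for t in text.split()]
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if use_synonyms and pair in RULES:
            out.extend(RULES[pair])
            i += 2
        else:
            out.append(stem(tokens[i]))
            i += 1
    return out


def search(query, split_first):
    """Return the ids of documents that share at least one term with the query."""
    if split_first:
        # The behavior described above: each whitespace-separated chunk is
        # analyzed alone, so the two-token rule can never fire.
        terms = [t for chunk in query.split() for t in analyze(chunk, True)]
    else:
        # What we would like: the analyzer sees the whole query string.
        terms = analyze(query, True)
    hits = []
    for doc_id, text in DOCS.items():
        indexed = set(analyze(text, False))  # no index-time synonyms
        if indexed & set(terms):
            hits.append(doc_id)
    return sorted(hits)


print(search("spider man", split_first=True))   # [2, 3]
print(search("spider man", split_first=False))  # [1]
```

When the query is split first, the synonym rule only ever sees spider and man one at a time, which is why documents 2 and 3 come back and document 1 does not.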

How to Deal with Multi-Term Synonyms

For now, the simplest resolution to the problem outlined above is to perform synonym mapping at index time rather than query time, and make sure that synonyms are expanded. For instance, in the case above we could replace the line in synonyms.txt with the following:

spider man, spiderman

In this case the first document gets indexed with both spider man and spiderman and a search for spider man will match document 1.
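As a rough illustration of what index-time expansion does to document 1, here is a small Python sketch. It is an assumption-laden toy: it ignores token positions entirely, and GROUP and index_tokens are hypothetical names, not anything from Solr.

```python
# synonyms.txt line: "spider man, spiderman" (equivalent terms, expanded)
GROUP = [("spider", "man"), ("spiderman",)]


def index_tokens(text):
    """Terms stored for a document when every member of GROUP is expanded."""
    tokens = text.lower().split()
    stored = set(tokens)
    for phrase in GROUP:
        n = len(phrase)
        # Does this phrase occur as adjacent tokens in the document?
        found = any(
            tuple(tokens[i:i + n]) == phrase
            for i in range(len(tokens) - n + 1)
        )
        if found:
            # Store the terms of every equivalent phrase alongside the original.
            for other in GROUP:
                stored.update(other)
    return stored


doc1 = index_tokens("the adventures of spiderman")
query_terms = "spider man".split()  # split on whitespace, no query-time synonyms

print(sorted(doc1))
print(all(term in doc1 for term in query_terms))  # True
```

Because spider and man are now stored with document 1, the query terms find it even though they were analyzed independently.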

However, this resolution is also flawed. When someone searches for spider man, they most likely mean the one and only Spider-Man. Unfortunately, the above query for spider man will match documents 2 and 3 as well as document 1. While there is no simple way to get around this, you can at least improve the situation by incorporating phrase matching behavior into your search. Most likely, you are going to be using the edismax query parser to handle user queries. By turning on the phrase field parameters (pf, pf2, and pf3) you can boost documents with phrase matches higher than documents without phrase matches. This can ensure that document 1 will be at the top of search results for spider man.
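For reference, a request that turns on the phrase field parameters might be assembled like this. The defType, q, qf, pf, pf2, and pf3 parameters are real edismax parameters; the host, the core name collection1, the field name title, and the boost values are all assumptions made up for this sketch, so substitute your own.

```python
from urllib.parse import urlencode


def build_query_url(user_query,
                    base="http://localhost:8983/solr/collection1/select"):
    """Build an edismax request that boosts phrase matches on a 'title' field."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "title",       # field searched term by term (assumed name)
        "pf": "title^10",    # boost when the whole query appears as a phrase
        "pf2": "title^5",    # boost for each adjacent word pair found as a phrase
        "pf3": "title^3",    # boost for each adjacent word triple found as a phrase
    }
    return base + "?" + urlencode(params)


print(build_query_url("spider man"))
```

The boosts do not stop documents 2 and 3 from matching; they only push documents where spider and man appear together toward the top.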

The best resolution to the multi-term synonym problem, it seems, has yet to be invented. Some have built Solr modules that perform synonym mapping before sending the query string to the query parser, and we’ve made good use of this solution. But it does come with an unfortunate side effect: synonym configuration, which rightly belongs in schema.xml, must be copied over to solrconfig.xml (which is intended to describe Solr’s behavior).
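The idea of mapping synonyms before the query parser ever sees the string can be sketched outside of Solr as well. The Python below is a hypothetical client-side preprocessor, not one of the Solr modules mentioned above; REWRITES and rewrite_query are names invented for this example.

```python
import re

# Multi-term phrases to collapse before the query reaches the parser.
REWRITES = {"spider man": "spiderman"}


def rewrite_query(query):
    """Replace each known multi-term phrase with its single-term synonym."""
    for phrase, replacement in REWRITES.items():
        # Match the words of the phrase separated by any run of whitespace.
        words = [re.escape(word) for word in phrase.split()]
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        query = re.sub(pattern, replacement, query, flags=re.IGNORECASE)
    return query


print(rewrite_query("Spider  Man comics"))                 # spiderman comics
print(rewrite_query("the man could climb like a spider"))  # unchanged
```

Note that this sketch has the same drawback described above: the synonym list now lives outside the schema and has to be kept in sync by hand.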

I’ve got in mind the twinklings of a query parser that might just resolve these issues, but that’s for a future time and later blog post. Till then…



3 comments on “Why is Multi-term synonym mapping so hard in Solr?”

  1. One case which you do not discuss is where your document contains spider man but not spiderman and you do index time expansion. In that case, searching for spiderman will come up empty since spiderman was not found during indexing. Will adding query time expansion (and index time) solve this problem?

  2. @Nachum Yes. Query time expansion will solve this problem… but then introduce another problem. It solves the problem because now the tokens spider and man in the index will match the tokens spider and man in your now expanded query. But unfortunately your query will now also match documents that just have spider or man or both spider and man, but not necessarily the one and only Spider Man! So… you can sorta fix this by making sure to look for the query as a phrase too. (If you’re using edismax, look into the pf2 and pf3 parameters.) This will at least boost documents that contain spider man together in the right order.

  3. For the spiderman problem could you just remap “spider man” to “spiderman” on insert and query time?

    It seems the main problem is that a query like “the man could climb like a spider” would not match.

    Perhaps this couldn’t be done in Solr at query time and would need a preprocessor, but it’s not too complicated.
