Improve search relevancy by telling Solr exactly what you want

To be successful, (e)dismax relies on avoiding a tricky problem with its scoring strategy. As we’ve discussed, dismax scores a document by taking the maximum score of all the fields that match a query. This is problematic because one field’s scores can’t easily be compared with another’s. A good “text” match might have a score of 2, while a bad “title” match might score 10. Dismax has no notion that “10” is bad for title; it only knows that 10 > 2, so title matches dominate the final search results.
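To make the problem concrete, here is a toy sketch of max-based scoring. The per-field scores are hypothetical numbers taken from the example above, and the model ignores dismax’s optional tie parameter; it is an illustration, not Lucene’s actual scoring code.

```python
# Toy model of dismax scoring: a document's score is the maximum of its
# per-field scores (the optional "tie" parameter is ignored here).
def dismax_score(field_scores):
    return max(field_scores.values())

# Hypothetical per-field scores mirroring the numbers in the text.
good_text_match = {"text": 2.0}                 # a strong match for "text"
bad_title_match = {"title": 10.0, "text": 0.3}  # a weak match for "title"

ranked = sorted(
    [("good_text_match", good_text_match), ("bad_title_match", bad_title_match)],
    key=lambda item: dismax_score(item[1]),
    reverse=True,
)

# The weak title match ranks first simply because 10 > 2.
print([name for name, _ in ranked])
```

Nothing in the max tells the scorer that 10 is a poor score by title standards, which is exactly why the weaker document wins.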

Please find my piece of hay!

The best case for dismax is that there’s only one field that matches a query, so the resulting scoring reflects the consistency within that field. In short, dismax thrives with needle-in-a-haystack problems and does poorly with hay-in-a-haystack problems.

We need a different strategy for documents whose fields overlap heavily. Here we’re trying to tell the difference between very similar pieces of hay. The task is similar to finding a good candidate for a job. If we query a search index of job candidates for “Solr Java Developer”, we’ll clearly match many different sections of our candidates’ resumes. Because of the problems with dismax, we may end up with search results heavily sorted on the “objective” field. Our top scoring result might have something like:

Goal: Work with Solr some day!

Clearly not what we want! We need the hardcore experienced folks!

I’ve switched to a different strategy for search relevancy in these kinds of cases. Start with rudimentary but predictable scoring that avoids the wild swings of dismax. Once this is in place, give Solr a list of additive queries (via bq/bf) that describe the ideal document. Tune the multiplier on each qualification through testing and experimentation.
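As a rough mental model of this strategy, think of the final score as a base score plus a weighted sum of boosts. The sketch below is a simplification under that assumption (real Lucene scoring involves additional normalization), and every name, weight, and score in it is hypothetical.

```python
def boosted_score(base, boosts, weights):
    """Base score plus a weighted sum of boost scores (a simplification)."""
    return base + sum(weights[name] * score for name, score in boosts.items())

# Hypothetical multipliers; in practice these are what you tune by testing.
weights = {"skills": 2.0, "goal": 0.5, "reputation": 1.0}

# Two candidates with the same base score but different qualifications.
experienced = boosted_score(
    1.0, {"skills": 1.5, "goal": 0.0, "reputation": 3.0}, weights
)
hopeful = boosted_score(
    1.0, {"skills": 0.0, "goal": 1.2, "reputation": 0.0}, weights
)

print(experienced, hopeful)
```

Because each qualification contributes its own term, changing one weight shifts the ranking in a predictable direction instead of letting a single field take over.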

Simple Base Scoring

Instead of relying on qf/pf to search and take the best of multiple fields, I’ll create a grab-bag field. I’ll use Solr’s copyField directives to copy all text I want to match on into this field in the schema:

<copyField source="resume_goal" dest="text_all"/>
<copyField source="resume_experience" dest="text_all"/>
<copyField source="resume_skills" dest="text_all"/>

The field “text_all” becomes what Solr initially searches. The assumption here is that it’s appropriate to tokenize everything that goes into text_all the same way. In this kind of setup, you might also want to consider omitTermFreqAndPositions for text_all; otherwise your scoring will be heavily biased toward the field that contributes the most tokens to text_all. Keep in mind that omitting positions also means phrase queries won’t work against that field.
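To see where that bias comes from, here is a small sketch that mimics the copy into text_all with a made-up resume. Whitespace splitting stands in for a real analysis chain, and the field contents are invented for illustration.

```python
from collections import Counter

# A made-up resume; field names follow the copyField directives above.
resume = {
    "resume_goal": "work with solr",
    "resume_experience": (
        "built solr search at scale tuned solr relevancy ran solr clusters"
    ),
    "resume_skills": "solr java",
}

# Mimic the copy into text_all: every source field's tokens land in one bag.
text_all_tokens = [tok for text in resume.values() for tok in text.split()]

# How often "solr" appears in each source field versus the combined field.
tf_per_field = {
    name: Counter(text.split())["solr"] for name, text in resume.items()
}
tf_text_all = Counter(text_all_tokens)["solr"]

# With term frequencies omitted, only presence counts.
present_in_text_all = int(tf_text_all > 0)

print(tf_per_field, tf_text_all, present_in_text_all)
```

Most of the term frequency in text_all comes from the longest field, so with frequencies kept, that field effectively decides the base score. With frequencies omitted, the base score only records whether the term matched at all.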

Now we can set

qf=text_all

and start searching!

Describe job qualifications to Solr

Once there’s baseline, predictable scoring in place, let’s describe our ideal candidate by passing Solr multiple boost queries that help bubble up the best documents for the problem we’re trying to solve:

  1. The candidate has at least 75% of the required skills

    bq={!edismax qf=resume_skills mm=75% v=$q bq=}

  2. The candidate wants to work with the technology

    bq={!edismax qf=resume_goal v=$q bq=}

  3. The candidate has a high StackOverflow reputation

    bf=log(resume_stackoverflow_reputation)

Each of these queries lets Solr layer in an extra factor into the sorting. Notice how in the bq we set v=$q. We’re using Solr’s local param syntax to reprocess the original query against a new set of criteria. We’re also making an assumption in the first bq that resume_skills will utilize an analysis chain that will filter out tokens that are non-job skills through a combination of synonyms and filtering. It’s also important to note that this wouldn’t be the finished product. Each boost needs to be carefully tuned through testing, tweaking its impact with the ^(multiplier) syntax.
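Putting the pieces together, a request might be assembled roughly like this. The host, the core name “resumes”, and the explicit defType=edismax are assumptions made for illustration; the qf, bq, and bf values are the ones described above, with the goal field named as in the copyField directives.

```python
from urllib.parse import urlencode, parse_qs

user_query = "Solr Java Developer"

# Repeated keys (bq) are passed as separate tuples so each one is sent.
params = [
    ("defType", "edismax"),  # assumption: edismax is the main query parser
    ("q", user_query),
    ("qf", "text_all"),      # simple, predictable base scoring
    ("bq", "{!edismax qf=resume_skills mm=75% v=$q bq=}"),
    ("bq", "{!edismax qf=resume_goal v=$q bq=}"),
    ("bf", "log(resume_stackoverflow_reputation)"),
]

query_string = urlencode(params)

# Hypothetical host and core name; adjust to your own Solr instance.
url = "http://localhost:8983/solr/resumes/select?" + query_string
print(url)
```

Building the parameters as a list keeps each boost on its own line, which makes it easier to add, remove, or reweight one qualification at a time while testing.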

Which one of you is the perfect document for this query?

One nice thing about this strategy is that we’re directly telling Solr exactly what we want in an awesome candidate. It’s a bit like using Solr as a fuzzy sorter: we explicitly feed it pieces of criteria we think are “good”, tune those criteria, then use it to find the answers that match as many of the criteria as possible. It’s also easy to decide later that we want to layer on additional criteria (does the candidate have code on GitHub that uses skills in the query? How much code? How recent is it?). We could even apply additional queries based on criteria like salary requirements. It’s a pretty exciting strategy. John Berryman and I have even been wondering whether this might help get at his multiple objective scoring ideas. In any case, I hope to be using it more!

Let us know what you think of this strategy! If you’ve got a tough relevancy problem, let us know. We’ve got this and plenty of other relevancy tricks up our sleeves, and we’d love to talk with you!

2 comments on “Improve search relevancy by telling Solr exactly what you want”

  1. Interesting. (How) have you quantified/measured whether this approach leads to more relevant results and happier end users?

    Do you have a real web site with a solid number of users and queries where you could verify this?

  2. @Otis, on a couple of projects, we work with content experts to help generate measurements of how good a document is for a query. We have some home-grown tools that help us make relevancy improvements on targeted queries while proving we don’t break existing queries. This helps us iterate rather quickly on relevancy. So, sadly, no, we don’t run a major careers site, but we do have some cool things that we do to help gauge overall search results quality for our projects.

    I’d love to show you how we do this some time. It may be a Lucene Rev talk I throw in.

    -Doug
