Searching with hierarchical fields using Solr

In our recent and continuing effort to make the world a better place, we have been working with the illustrious Waldo Jaquith on a project called StateDecoded. Basically, we’re making laws easily searchable and accessible to the layperson. Here, check out the predecessor project, Virginia Decoded. StateDecoded will be similar to the predecessor but with extended and matured search capabilities. And instead of covering just Virginia state law, any municipality will be able to download the open source StateDecoded project, index its own laws, and give its citizens better visibility into the rules that govern them.

For this post, though, I want to focus on one of the good Solr riddles that we encountered, related to the hierarchical nature of the documents being indexed. Laws are divided into sections, chapters, and paragraphs, and we have documents at every level. In our Solr schema, this hierarchy is captured in a field labeled “section”. So, for instance, here are three examples of this section field:

  • <field name="section">30</field> – A document that contains information specific to section 30.
  • <field name="section">30.4</field> – A document that contains information specific to section 30 chapter 4.
  • <field name="section">30.4.15</field> – A document that contains information specific to section 30 chapter 4 paragraph 15.

Our goal for this field is that if anyone searches for a particular section of law, they will be given the law most specific to their request, followed by the laws that are less specific. For instance, if a user searches for “30.4”, then the results should contain the documents for section 30, section 30.4, section 30.4.15, section 30.4.16, etc., and the first result should be for 30.4. Other documents such as 40.4 should not be returned.

Initially we tackled the problem with the PathHierarchyTokenizerFactory.

<field name="section" type="path" indexed="true" stored="true"/>

and

<fieldType name="path" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
  </analyzer>
</fieldType>

Using this analyzer, here’s how several example documents get tokenized:

30       -->  <30>
30.4     -->  <30>    <30.4>    
30.4.15  -->  <30>    <30.4>    <30.4.15>    
30.4.16  -->  <30>    <30.4>    <30.4.16>
30.5     -->  <30>    <30.5>
40.5     -->  <40>    <40.5>
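
To sanity-check the analysis chain, you can hit Solr’s field analysis handler, which the stock solrconfig.xml registers at /analysis/field (the core name collection1 below is just the example default, so substitute your own):

http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=path&analysis.fieldvalue=30.4.15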

And when anyone searches for a law, the same tokenization occurs. So if someone wants to look at law “30.4”, the query gets tokenized to <30> and <30.4>. In this case the first five documents match. In particular, documents 30.4, 30.4.15, and 30.4.16 all have two matching tokens, while 30 and 30.5 have one matching token each. Additionally, because of field length normalization, a match in a shorter field (say 30.4, with two tokens) gets boosted higher than a match in a longer field (such as 30.4.15 or 30.4.16, with three tokens).

All things considered, the resulting documents should come back in the following order:

  • 30.4
  • 30.4.15
  • 30.4.16
  • 30
  • 30.5

BUT, as fate would have it, our perfect plan has a hole in it somewhere, and the documents actually come back in this order:

  • 30.4.15
  • 30.4
  • 30.4.16
  • 30
  • 30.5

So… what’s up?! (Solr experts reading this, do you know where our theory breaks down?)

Our theory is actually sound, but the problem lies in the practical matter of storing the field norm. Each field of every document is stored with an additional piece of information called the field norm. Ideally, the field norm would hold very precise information about how long a field is, but the more precise the field norm, the more space you need to store it. In the end, the Lucene developers decided to store the field norm as a single byte. This means that the field norm can take on exactly 256 different values. And if you look at the Lucene TFIDFSimilarity class (the class responsible for dealing with field norms and scoring documents), you can see exactly where these byte values get translated into floating point values. There’s even a comment that gives a hint about what our problem might be:

The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7×10^9 to 2×10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.

Because of this lack of precision, the field norm for fields of length 2 and length 3 comes out the same! And that’s why we get the weird ordering: the first three documents have the same score and are therefore returned in index order.
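
To see the precision loss concretely, here is a minimal probe (our own sketch, not from the StateDecoded code base, and the class name is made up) that assumes Lucene’s SmallFloat helper, the class behind this one-byte norm encoding in the default similarity, is on the classpath:

import org.apache.lucene.util.SmallFloat;

// Round-trip the classic length norm (1/sqrt(numTerms)) through the
// single-byte encoding to see which field lengths collapse onto the
// same stored value and thus become indistinguishable to the scorer.
public class NormPrecisionProbe {
  public static void main(String[] args) {
    for (int numTerms = 1; numTerms <= 5; numTerms++) {
      float norm = (float) (1.0 / Math.sqrt(numTerms));
      byte encoded = SmallFloat.floatToByte315(norm);
      float decoded = SmallFloat.byte315ToFloat(encoded);
      System.out.printf("%d tokens: norm=%.4f  byte=%d  decoded=%.4f%n",
          numTerms, norm, encoded, decoded);
    }
  }
}

Any two lengths that print the same decoded value score identically as far as length normalization is concerned.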

So, how do we fix this? Well, one way would be to implement our own Similarity class that uses a different byte-to-float conversion than the default. But one of the main goals for this particular project is to keep the Solr installation as simple as possible so that anyone can set it up. We therefore found that an easier solution was a slightly different analysis technique. Consider the following schema fragments:

<field name="section_descendent" type="descendent_path" indexed="true" stored="true"/>
<field name="section_ancestor" type="ancestor_path" indexed="true" stored="true"/>

and

<fieldType name="descendent_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

<fieldType name="ancestor_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
  </analyzer>
</fieldType>

In this case we’re splitting section up into two fields, section_descendent and section_ancestor, which are tokenized slightly differently than before. The difference is that in the previous example the PathHierarchyTokenizer was used at both index and query time, whereas here descendent_path uses the PathHierarchyTokenizer only at index time and ancestor_path uses it only at query time. (The KeywordTokenizer treats the entire field value as a single token.) As a result, a query against section_descendent matches only the documents at or below the queried section, so a query for 30.4 will match 30.4, 30.4.15, and 30.4.16, but not 30. Similarly, a query against section_ancestor matches only the documents at or above the queried section, so a query for 30.4 will match 30.4 and 30, but not 30.4.15.
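
Both fields need to be fed the same raw section value at index time. The wiring isn’t shown in our fragments above, but one standard way to populate them (illustrative, assuming the original section field is still defined in the schema) is a pair of copyField directives:

<copyField source="section" dest="section_descendent"/>
<copyField source="section" dest="section_ancestor"/>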

The final piece of the puzzle is the request handler. Here we simply use the edismax query parser with qf (query fields) set to include both section_descendent and section_ancestor. Now, whenever someone queries for 30.4, the descendent field matches 30.4, 30.4.15, and 30.4.16, while the ancestor field matches 30.4 and 30. The only document that matches in both fields is 30.4, so it gets an extra boost and appears at the top of the search results, followed by the other documents. Problem solved!
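
For concreteness, here’s a minimal sketch of such a handler; the handler name /lawsearch is illustrative rather than taken from the actual StateDecoded configuration:

<requestHandler name="/lawsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">section_descendent section_ancestor</str>
  </lst>
</requestHandler>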

It is interesting to note that we’ve lost sibling documents (for instance, 30.5 in this example). Personally, that seems fine to me, but if we really wanted them back, all we would have to do is also include the original section field, path-tokenized at both index and query time, alongside the other two fields.
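
In request handler terms, that’s just one more entry in qf (again illustrative, and assuming section keeps the path field type from our first attempt):

<str name="qf">section section_descendent section_ancestor</str>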



4 comments on “Searching with hierarchical fields using Solr”

  1. I’m sure this has a simple answer, but what if you want results in the order in which they’re installed as laws? For example, if you’re looking at legal code, it would go like this:

    • Section 30 TITLE
    • Section 30.1 SUBHEADING 1 TEXT
    • Section 30.2 SUBHEADING 2 TEXT

    • Section 30.33 SUBHEADING 33 TEXT

    When searching for Section 30, I’d probably want to see the results as

    Section 30
    Section 30.1

    Section 30.33

    Do the results come back this way as you’ve set them up, or do you need to do additional sorting/ranking?

  2. If you search for nothing other than “30”, then this is exactly what happens. Title 30 matches on both the ancestor_path and descendent_path and gets the double boost, so it’s at the top. All subsections match in the descendent_path and are presented in the order they were indexed, which I suppose had better be in numerical order!

  3. To support multiple categories (I know it’s probably not relevant in the example per se), would you use a dynamic field?

  4. I guess that depends upon what you mean here. Do you just mean having two different taxonomies applied to the same document? I don’t know why I would use dynamic fields for that, though. Want to give a little more detail?
