
Indexing Millions of Documents using Tika and Atomic Update

On a recent engagement, we were posed with the problem of sorting through 6.5 million foreign patent documents and indexing them into Solr. This totaled about 1 TB of XML text data alone. The full corpus included an additional 5 TB of images to incorporate into the index; this blog post will only cover the text metadata.

Streaming large volumes of data into Solr is nothing new, but this dataset posed a unique challenge: each patent document's translation resided in a separate file, and the location of each translation file was unknown at runtime. This meant that as we processed each document, we would not know where its matching translation was. Furthermore, the translations would arrive in batches, to be added as they came. Lastly, the project needed to remain open to different languages and different file formats in the future.

Our options for dealing with the inconsistent data came down to two: clean and organize all of the data before processing, or build an ingester robust enough to handle the different situations.

We opted for the latter and built an ingester that would process each file individually and index the documents with an atomic update (new in Solr 4). To detect and extract the text metadata we chose Apache Tika. Tika is a document-detection and content-extraction tool useful for parsing information from many different formats.

On the surface Tika offers a simple interface to retrieve data from many sources. Our use case, however, required a deeper extraction of specific data. Using the built-in SAX parser allowed us to push Tika beyond its normal limits, and analyze XML content according to the type of information it contained.

First, we define the schema to use in Solr.

Our two copy fields, text_en and text_cn, will contain all of the text that is searchable by default, and we will match each document by id. We also set up two separate field types for the different languages.
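A minimal sketch of what that part of the schema might look like is below; the field names and analysis chains here are assumptions, not our production schema.

    <!-- Illustrative schema.xml excerpt -->
    <fields>
      <field name="id"      type="string"  indexed="true" stored="true" required="true"/>
      <field name="text_en" type="text_en" indexed="true" stored="false" multiValued="true"/>
      <field name="text_cn" type="text_cn" indexed="true" stored="false" multiValued="true"/>
    </fields>

    <uniqueKey>id</uniqueKey>

    <!-- Every *_en and *_cn metadata field is copied into the default search fields -->
    <copyField source="*_en" dest="text_en"/>
    <copyField source="*_cn" dest="text_cn"/>

    <types>
      <fieldType name="string" class="solr.StrField"/>
      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
      </fieldType>
      <fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
        <!-- SmartChineseAnalyzer ships in the Solr analysis-extras contrib -->
        <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
      </fieldType>
    </types>

Note that the copy-field destinations are left unstored; with atomic updates the source fields must be stored, while copy-field targets should not be.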

Next, we specify the actual patent metadata fields, first in English and then in Chinese.

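Continuing the sketch, the metadata field definitions might look roughly like this; the field names are illustrative and the full list is abbreviated:

    <!-- English metadata -->
    <field name="invention_title_en" type="text_en" indexed="true" stored="true"/>
    <field name="abstract_en"        type="text_en" indexed="true" stored="true"/>
    <field name="applicant_en"       type="text_en" indexed="true" stored="true"/>
    <!-- ... additional English bibliographic fields ... -->

    <!-- Chinese metadata -->
    <field name="invention_title_cn" type="text_cn" indexed="true" stored="true"/>
    <field name="abstract_cn"        type="text_cn" indexed="true" stored="true"/>
    <field name="applicant_cn"       type="text_cn" indexed="true" stored="true"/>
    <!-- ... additional Chinese bibliographic fields ... -->

Every metadata field is stored so that Solr can re-read the document during an atomic update.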

Now that we have our schema written, we can start writing content handlers to parse the XML.

First, we need to create our own custom mimetype and register the parser with Tika at runtime.

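One way to do this is with a Tika custom mimetype definition dropped onto the classpath as org/apache/tika/mime/custom-mimetypes.xml; the type name and glob below are assumptions, and detection keys off the document's root element:

    <mime-info>
      <mime-type type="application/x-chinese-patent+xml">
        <_comment>Chinese Patent</_comment>
        <!-- Detect by the root element of the XML rather than by file name -->
        <root-XML localName="cn-patent"/>
        <glob pattern="*.xml"/>
      </mime-type>
    </mime-info>

The parser that claims this mimetype can then be registered through Tika's service-loader mechanism (a META-INF/services/org.apache.tika.parser.Parser entry on the classpath).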

Next, we can take advantage of the fact that both the English and Chinese XML are validated against the same DTD and use a single event handler to capture the various data pieces.

    return new ElementPathHandler(metadata, "invention_title_" + la, "cn-patent/cn-bibliographic-data/invention-title");
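ElementPathHandler is our own class; a simplified sketch of the idea (a SAX handler that tracks the current element path and copies any matching text into the Tika Metadata under a given field name) could look like this, with the real handler covering more cases:

    // Hypothetical sketch of a path-matching SAX handler.
    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.tika.metadata.Metadata;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class ElementPathHandler extends DefaultHandler {
        private final Metadata metadata;
        private final String fieldName;   // e.g. "invention_title_cn"
        private final String targetPath;  // e.g. "cn-patent/cn-bibliographic-data/invention-title"

        private final Deque<String> path = new ArrayDeque<String>();
        private final StringBuilder buffer = new StringBuilder();

        public ElementPathHandler(Metadata metadata, String fieldName, String targetPath) {
            this.metadata = metadata;
            this.fieldName = fieldName;
            this.targetPath = targetPath;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            path.addLast(localName.isEmpty() ? qName : localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only collect text while we are inside the element we care about.
            if (currentPath().equals(targetPath)) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (currentPath().equals(targetPath) && buffer.length() > 0) {
                metadata.add(fieldName, buffer.toString().trim());
                buffer.setLength(0);
            }
            path.removeLast();
        }

        private String currentPath() {
            // Joins the open elements into a path like "cn-patent/cn-bibliographic-data/invention-title"
            StringBuilder sb = new StringBuilder();
            for (String element : path) {
                if (sb.length() > 0) {
                    sb.append('/');
                }
                sb.append(element);
            }
            return sb.toString();
        }
    }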

Now we can pass in any file, regardless of language, and extract the important data.
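Driving the extraction looks roughly like the following sketch, assuming the custom parser and mimetype definition are on the classpath so that AutoDetectParser can pick them up:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class PatentExtractor {
        public static Metadata extract(String path) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            InputStream in = new FileInputStream(path);
            try {
                // The handler is a stand-in; the interesting output lands in metadata.
                parser.parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
            } finally {
                in.close();
            }
            // metadata now holds values such as invention_title_en / invention_title_cn
            return metadata;
        }
    }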

The extracted data is then rendered into a binary key-value structure that we stream into Solr as an atomic update. If the patent does not exist in the Solr index, a new document is written. If the patent's translation or original Chinese text is already present, the new data is simply appended to that document.
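In SolrJ terms, the atomic update looks roughly like this sketch; the core URL, document id, and field values are placeholders:

    import java.util.Collections;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicUpdateExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/patents");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "CN-0000001");

            // The "set" modifier tells Solr 4 to merge this value into the existing
            // document rather than overwrite it; if no document with this id exists,
            // Solr creates one.
            doc.addField("invention_title_en",
                    Collections.singletonMap("set", "Example translated title"));

            solr.add(doc);
            solr.commit();
            solr.shutdown();
        }
    }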

While this robustness simplifies working with messy data, there are a couple of drawbacks to our approach.

Indexing performance: Atomic updates are not really updates. Solr reads all of the stored information for the existing document, merges in the new data, and then re-indexes the whole document. This means that every document is effectively indexed about 1.5 times.

Solr performance: Because Solr has to re-read the document, all fields must be stored, which bloats the index.

Overall, the tradeoffs were worth it: merging the documents outside of Solr would have added unnecessary complexity, and the larger index can be addressed through sharding and distributed storage. Building a robust ingester allowed us to handle an incomplete dataset and match up the pieces during processing.