Blog

  • Introducing OSCAR! Search your notes no matter where they live!

    In our daily adventures on our computers and the Internet, we often find or create notes and documents that we would like to keep for later and that we attempt to organize. We put them in our email, in Google Docs, in Dropbox, in Gist, and countless other places. Then when we want to find them later, it’s [...]

    solr
  • How does a search engine work? An educational trek through a Lucene Postings Format

    A new feature of Lucene 4 – pluggable codecs – allows for the modification of Lucene’s underlying storage engine. Working with codecs and examining their output yields fascinating insights into how exactly Lucene’s search works in its most fundamental form. The centerpiece of a Lucene codec is its postings format. Postings are a commonly thrown [...]

    solr
  • Does FoundationDB beat the CAP conjecture? Hackathon on Friday!

    We’re cohosting a hackathon with FoundationDB on Friday. For those of you not in the know, FoundationDB is a pretty exciting addition to the NoSQL space. It brings flexible, arbitrary transactionality (including atomic cross-row joins in a distributed system) to NoSQL. My colleague Doug Turnbull got very excited about the technology a few months ago [...]

  • Search is Eating The World | Recap of Lucene Revolution

    Much of the crew just got back from Lucene Revolution. It was an incredible experience to hang out with the cream of the crop of the Lucene/Solr community. It continues to be clear that modern applications of all stripes are increasingly driven by search as the primary UI component. Users of these applications expect rich interactivity. And because [...]

    solr
  • Indexing Millions of Documents using Tika and Atomic Update

    On a recent engagement, we were faced with the problem of sorting through 6.5 million foreign patent documents and indexing them into Solr. This totaled about 1 TB of XML text data alone. The full corpus included an additional 5 TB of images to incorporate into the index; this blog post will only cover the [...]

    solr
  • Understanding Solr Soft Commits and Data Durability

    I ran into an interesting problem today. I was working on the first project where we legitimately needed Solr soft commits, and in testing my configuration I wanted to prove to myself that the soft commits were performing as expected. Namely, I expected soft commits to flush all added documents to an in-RAM index so [...]

    solr