Semantic Search with Solr and Python Numpy

Built upon Lucene, Solr provides fast, highly scalable, and easily maintainable full-text search capabilities. However, under the hood, Solr is really just a sophisticated token-matching engine. What’s missing? Semantic Search!

Consider three somewhat silly documents:

  1. Yellow banana peels.
  2. A banana is a long yellow fruit.
  3. This mystery fruit is long and yellow and has a peel.

Now what happens if you search for the term “banana”? Under normal circumstances you only get back the first and second documents. But why shouldn’t you also get back the third document? It’s obviously talking about bananas!

Semantic Search via Collaborative Filtering

Colleague Doug Turnbull and I recently set about to right this wrong with help from a machine learning technique called collaborative filtering. Collaborative filtering is most often used as a basis for recommendation algorithms. For example, collaborative filtering algorithms were the central focus of the now-famous Netflix Prize, which awarded $1 million to the team that could build the best movie recommendation engine. When dealing with recommendations, collaborative filtering works by mathematically identifying commonalities in groups of users based upon the movies that they enjoyed. Then, if you appear to fall in one of those groups, the recommendation engine will point you towards a movie that a) you haven’t watched and b) you are likely to enjoy.

So what does this have to do with Semantic Search? Everything! In just the same way that certain users gravitate towards certain movies, certain words commonly co-occur in the same documents. When working with Semantic Search, rather than matching users to movies that they would likely enjoy, we are going to identify words that are likely to belong in a given document, whether or not they actually occurred there. The math is exactly the same!

Here’s how the process works:

  • First we identify a text field of interest in our documents and extract the associated term-document matrix for external processing. Each element of this term-document matrix indicates the strength of a particular term within a particular document (where strength can be anything, but will likely be either term frequency or TF*IDF).
  • Next, collaborative filtering is applied to the term-document matrix which effectively generates a pseudo-term-document matrix. This pseudo-term-document matrix is the same size and shape as the original term-document matrix and references the same terms and documents, but the numbers are slightly different. These new values indicate the strength that a particular term should have in a particular document once noisy data is removed.
  • Finally, the high-scoring values in the pseudo-term-document matrix are mapped back to the associated terms. These terms are then injected back into Solr in a new field which can be used for Semantic Search.
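The steps above can be sketched with a toy example using Numpy and the three banana documents from earlier. Here a truncated SVD plays the role of the collaborative filtering step; the matrix, the rank, and the variable names are my own illustration, not the actual code from the repo:

```python
import numpy as np

# Toy term-document matrix for the three banana documents above
# (rows = terms, columns = documents, values = term frequency).
terms = ["banana", "yellow", "long", "fruit", "peel"]
A = np.array([
    [1, 1, 0],   # banana: absent from doc 3
    [1, 1, 1],   # yellow
    [0, 1, 1],   # long
    [0, 1, 1],   # fruit
    [1, 0, 1],   # peel
], dtype=float)

# Truncated SVD: keep the top k singular values and reconstruct a
# same-shaped pseudo-term-document matrix with the noise removed.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pseudo = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# "banana" now has a positive strength in document 3, even though
# the term never actually occurs there.
print(round(A_pseudo[terms.index("banana"), 2], 2))  # → 0.5
```

Thresholding `A_pseudo` and mapping the surviving entries back to their terms gives exactly the new field described in the final step.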

Demo Time!

So let’s consider an example case. As in plenty of our previous posts, we will be using the Science Fiction Stack Exchange. Why? Because we’re all nerds, and with such a familiar topic, we can quickly intuit whether or not a search is returning relevant results. In this data set, the field of interest is the Body field because it contains the contents of all questions and answers.

So now that we’ve decided upon our demo dataset, we’re ready to run the analysis. If you’d like to follow along, then please take a look at our git repo. This repo contains the example SciFi data set, the Semantic Search code, and a README to get you going. Here, though, I’m going to execute everything from within Python:

>>> from SemanticAnalyzer import *
>>> stvc = SolrTermVectorCollector(field='Body',feature='tf',batchSize=1000)
>>> tdc = TermDocCollection(source=stvc,numTopics=150) 

That last line takes a few minutes. If it’s the AM where you are, grab a coffee. If it’s the PM, grab a beer. Once that line completes, we will have successfully extracted the term-document matrix from Solr. Now let’s play with it for a bit. One of the cool side effects of this analysis is the ability to quickly find words that commonly occur together. Let’s give it an easy test; here are the 30 words most highly correlated with the word ‘vader’ (as in Darth Vader).

>>> tdc.getRelatedTerms('vader',30)

Did you notice that pause when you called the function? That was the collaborative filtering taking place. The results of that process have now been saved, so additional calls will return quite quickly.

vader luke emperor darth palpatin anakin sith skywalk sidiou apprentic empir luca side star son forc turn kill death rule suit father question jedi command obi tarkin dark wan plan

Hey, not bad! Everything here seems very reasonably connected with Mr. Vader. You may notice some odd spellings; that’s because these are the indexed terms, and have therefore been stemmed. Let’s try again with a different term; this time, everyone’s favorite wizard:

>>> tdc.getRelatedTerms('potter',30)

harri potter voldemort wizard snape death magic jame love spell time rowl lili eater travel seri hous hand hogwart three find wormtail kill slytherin hallow secret deathli muggl order lord

Again, pretty good! One last try, and we’ll make it a little more challenging – a vague adjective:

>>> tdc.getRelatedTerms('dark',30)

dark side jedi sith eater lord death mark snape magic curs evil forc luke mercuri cave yoda jame palpatin dagobah anakin black call wizard slytherin live light siriu matter voldemort

Indeed, most of these terms are like a hall of fame of dark things from Star Wars and Harry Potter.

Now that the word correlations have proven themselves out, it’s time to generate the pseudo-terms and post them back to Solr.

>>> SolrBlurredTermUpdater(tdc,blurredField="BodyBlurred").pushToSolr(0.1)

This line will probably see you to the end of your coffee or beer (it takes about 10 minutes on my machine). But once it’s done, you can start issuing searches to Solr.
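To give a feel for what `pushToSolr(0.1)` is doing, here is a hedged sketch of the thresholding step: every term whose pseudo-strength clears the 0.1 cutoff gets written into the BodyBlurred field of a Solr atomic-update document. The function name and payload format here are my own illustration; the repo’s actual code may differ:

```python
import numpy as np

def blurred_docs(A_pseudo, terms, doc_ids, threshold=0.1):
    """Build Solr atomic-update documents whose BodyBlurred field holds
    every term with pseudo-strength above `threshold`."""
    docs = []
    for j, doc_id in enumerate(doc_ids):
        # Terms whose pseudo-strength in document j clears the cutoff.
        keep = [terms[i] for i in np.nonzero(A_pseudo[:, j] > threshold)[0]]
        docs.append({"id": doc_id, "BodyBlurred": {"set": " ".join(keep)}})
    return docs

# Tiny demo: two terms, two documents.
docs = blurred_docs(np.array([[1.0, 0.5], [0.05, 0.9]]),
                    ["banana", "peel"], ["d1", "d2"])
print(docs[0])  # → {'id': 'd1', 'BodyBlurred': {'set': 'banana'}}
```

Documents shaped like this would then be POSTed as JSON to Solr’s update handler.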

Solr Results

Here’s an example of Semantic Search using Solr:

http://localhost:8983/solr/select/?q=-Body:dark +BodyBlurred:dark

The Body field contains the original text while BodyBlurred contains the pseudo-terms. So this query finds all documents that do not include the term dark but presumably contain dark content. Take a look at the documents that come back:

{
Body: " In the John Carter movie (2012), he shows off some of his powers, like 
jumping abnormally high, but I have difficulty evaluating his strength. On the one 
side, he shows great strength, as when he kills a thark warrior with one hand, but 
he is also quite mistreated by them. He also seems helpless when he is strangled 
by Tars Tarkas. Why does the strength he shows seem so inconsistent? ",
BodyBlurred: "tv great movi control kill consid hand dark side power long mutant 
fight machin light abil sauron wormtail hulk"
},
{
Body: " In the movies, the Nazgul ride black horses with armour. I was wondering 
if that is all they are, or do they have some sort of magic? Are they evil? ",
BodyBlurred: "movi black magic dark demon engin hous aveng slytherin"
},
{
Body: " The remaining Black Brother from the prologue of A Game of Thrones is 
apparently the deserter who is beheaded in the beginning of the book. But how did 
he manage to get to Winterfell from the other side of The Wall? Or did the show 
throw me off track and in the book there weren't any survivors, so the deserter is 
someone else? ",
BodyBlurred: "book watch black hole dark side plai long game demon engin light 
turn district"
},
{
Body: " Was this ever discussed in any episode, or as a side-plot somewhere? ",
BodyBlurred: "episod dark side light"
}

Not bad – most of those topics are rather… dark. Though check out that last result. So… maybe there are still some improvements we can make! But remember that we’re dealing with word correlations here, and I can only guess that somewhere else in the corpus, dark side-plots and dark episodes were surely discussed.

Speaking of word correlations, check out this gem:

{
Body: " You're correct, Enterprise is the only Star Trek that fits into both the 
original and the new 2009 movie timelines. From the perspective of the Enterprise 
characters, both are possible futures, given the over-arcing conceit of the show 
was a Temporal Cold War, so its future is in flux and could line up with either of 
the timelines we're familiar with, or with an entirely different future. ",
BodyBlurred: "answer charact place klingon star trek design travel crew watch work 
movi happen enterpris featur futur exist origin 2009 chang altern timelin war to 
version event captain gener pictur tng creat iii galaxi theori return alter voyag 
entir fry turn kirk paradox biff doc marti feder 1955 starship 2015 class hero 
centuri tempor uss phoenix mirror river 800 ncc 1701 simon conner skynet alisha"
}

The original document involves Star Trek and time travel. And appropriately, the pseudo-terms include Star Trek things and time-travel terms… but do you see anything funny? That’s right: Biff, Doc, and Marty made their way into the pseudo-terms – likely because of their roles in the popular time-travel film “Back to the Future.”

Speaking of the future …

Future Work

Semantic Search with Solr is hot right now. In the upcoming Dublin LuceneRevolution, I know of at least three related talks that have been submitted (one of them my own); I have heard that MapR is working on a Solr Semantic Search/Recommendation engine built atop their Hadoop offering; and I suspect that, with Cloudera’s recent foray into Solr with Mark Miller, they will also be working on the same thing.

What’s next for our work? Recommendations! (Remember, that’s how we started this conversation.) E-commerce recommendation is a simple extension of the work presented above. Given an inventory catalogue (e.g. product titles, descriptions, etc.) and a history of user purchases, we can build a search-aware recommendation engine. That is, when customers search for a particular item, they will receive results as usual, except that the results will be boosted with items that they are more likely to purchase. How? Because we know what type of customer they are and what products that type of customer is more likely to buy!

Do you have a good case for Solr Semantic Search and Recommendation? We’d love to hear about it!



2 comments on “Semantic Search with Solr and Python Numpy”

  1. John,

    The technique that you describe here is called “Latent Semantic Indexing” (LSI), patented in 1988 (before TF*IDF even!). AFAIK, LSI is synonymous with “Latent Semantic Analysis” (LSA), so in the literature you often see both used interchangeably. I strongly encourage you to read the Wikipedia entry. I don’t think it took off back when it was first thought of because it wasn’t scalable at that time. But nowadays it is, because of big-data distributed computing and more research on computing SVD faster. So I think it is quite worthwhile to invest in this technology (as you have done). It can be used as the basis for document and search-result clustering, document classification, and cross-language search, as well as improving recall.

    FYI, I’m smarter on these topics than when I first read your blog entry, thanks to the fantastic Coursera course “Intro to Recommender Systems”. I am almost done with it. I couldn’t recommend it more highly to those in the search field.

  2. @David Yep. LSI and the technique I describe here are closely related, though I’m not sure they’re quite the same. In LSI you map the documents into a lower-dimensional space using the same Singular Value Decomposition and Rank Reduction that I discuss here. But in LSI, that’s where the documents stay: comparisons of documents are made in that lower-dimensional subspace. In the method I describe above, I go one step further to map the documents back to the token space, and then I stick all the tokens back into Solr with their corresponding documents. This way I can make use of Lucene’s inverted indices. Perhaps you could consider my method to be roughly equivalent to LSI.
