Easy k-NN Document Classification with Solr and Python

You’ve got a problem: You have 1 buzzillion documents that must all be classified. Naturally, tagging them by hand is completely infeasible. However, you are fortunate enough to have several thousand documents that have already been tagged. So why not…

Build a k-Nearest Neighbors Classifier!

The concept of a k-NN document classifier is actually quite simple: given a new document, find the k most similar documents within the tagged collection, retrieve the tags from those documents, and declare the input document to have whichever tag was most common among the similar documents. Now, taking a page from Taming Text (page 189 to be precise), do you know of any open-source products that are really good at similarity-based document retrieval? That’s right, Solr! Given a new input document, all we have to do is scoop out the “statistically interesting” terms, submit a search composed of these terms, and count the tags that come back. It even turns out that Solr takes care of identifying the “statistically interesting” terms; all we have to do is submit the document to the Solr MoreLikeThis handler. MoreLikeThis then scans through the document and extracts “Goldilocks” terms – those terms that are not too long, not too short, not too common, and not too rare… they’re all just right.
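To make the voting step concrete, here’s a minimal sketch of what “count the tags and take the most common” means. The neighbor tag lists below are made up purely for illustration; the real Solr-backed implementation comes later in the post.

from collections import Counter

#tags of the k most similar documents, as they might come back from Solr
neighborTags = [["star-trek", "holodeck"],
                ["star-trek", "star-trek-tng"],
                ["star-trek", "holodeck"]]

#majority vote across all of the neighbors' tags
votes = Counter(tag for tags in neighborTags for tag in tags)
print(votes.most_common(1)[0][0])   # star-trek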

Finding a Data Set and Prepping Solr

For this little experiment we need a good example data set. Let’s use the same Sci-Fi StackExchange dataset that we’ve used several times in the past. This is a great dataset; it contains roughly 18,000 question and answer documents that cover every corner of sci-fi imaginable. It’s nerdy. I love it. And importantly for the task at hand, of the 18,000 documents, roughly 6,000 are questions which come with a Tags field. This is precisely what we need to build our Solr/Python-powered document classifier. To give you a little idea of what you’ll find in the dataset, consider document 8723 which we will use further down as our example:

  • Title: Would Star Trek holodecks physically affect you once you exit the Holodeck?
  • Body: Would Star Trek holodecks physically affect you once you exit the Holodeck? Meaning, if someone programmed a holodeck to dump a bucket of water over you head, would you have wet hair (outside later on)?
  • Tags: [star-trek] [holodeck]

… I warned you that this is deep nerdery!

If you wish to follow along, pull down the Solr Sci-Fi git repo. In the README you’ll find the directions for indexing the documents. The only additional setup required is to enable the MoreLikeThis requestHandler within solrconfig.xml:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
</requestHandler>
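Optionally, you can also bake default parameters into the handler definition so that every request needn’t repeat them. Here’s a sketch; these particular defaults are just illustrative, not required:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <!-- illustrative defaults; override per-request as needed -->
    <str name="mlt.fl">Title Body</str>
    <str name="mlt.interestingTerms">list</str>
  </lst>
</requestHandler>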

Using Solr’s MoreLikeThis Handler

Now (after having restarted Solr) you can issue MoreLikeThis queries like this:

http://localhost:8983/solr/mlt      #Use MoreLikeThis
    ?q=Id:8723                      #Classify document 8723
    &mlt.fl=Title Body              #Use these text fields for classification
    &fl=Tags                        #Only display the Tags field
    &rows=10                        #Get the 10 most similar documents
    &fq=Tags:*                      #Only retrieve documents that have Tags
    &mlt.interestingTerms=list      #Display statistically interesting words

This query basically identifies which document we intend to classify, specifies which fields to use for classification, and displays all the tags for the 10 most similar documents. Here’s what the results look like:

<response>
<result name="match" numFound="1" start="0">
  <doc>
    <str name="Tags">star-trek holodeck</str></doc>
</result>
<result name="response" numFound="840" start="0">
  <doc>
    <str name="Tags">star-trek star-trek-tng holodeck</str></doc>
  <doc>
    <str name="Tags">star-trek star-trek-tng holodeck</str></doc>
  <doc>
    <str name="Tags">star-trek holodeck</str></doc>
  <doc>
    <str name="Tags">star-trek</str></doc>
  <doc>
    <str name="Tags">star-trek star-trek-tng technology holodeck</str></doc>
  <doc>
    <str name="Tags">star-trek holodeck</str></doc>
  <doc>
    <str name="Tags">star-trek holodeck</str></doc>
  <doc>
    <str name="Tags">star-trek time-travel</str></doc>
  <doc>
    <str name="Tags">harry-potter diagon-alley magical-transportation</str></doc>
  <doc>
    <str name="Tags">star-trek technology weapon</str></doc>
</result>
<arr name="interestingTerms">
  <str>holodeck</str>
  <str>exit</str>
  <str>affect</str>
  <str>physic</str>
  <str>trek</str>
  <str>star</str>
</arr>
</response>

The first section, named “match”, contains the Tags field of the document that we’re searching for. Obviously, in practice our documents aren’t going to go into the tagger already pre-tagged (duh), but when building a simple classifier for the purpose of testing the concept, this is the easiest way to go. In production you would instead post the document’s text to the MoreLikeThis handler as a content stream, as mentioned in the Solr wiki.
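For instance, posting raw text might look something like the sketch below. The text and parameter values are hypothetical, and this assumes your Solr instance accepts the stream.body parameter:

import requests

text = "Would a holodeck program physically affect you after you exit?"
params = {"stream.body": text,       #the raw document text to classify
          "mlt.fl": "Title Body",    #fields whose statistics drive term extraction
          "fl": "Tags",
          "fq": "Tags:*",
          "rows": 10,
          "wt": "json"}
resp = requests.post("http://localhost:8983/solr/mlt", params=params)
print(resp.json()["response"]["docs"])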

The next section, named “response”, lists the Tags field of the 10 most similar documents. And just as we’d hoped, the most common tag is [star-trek], and the second most common tag is [holodeck]! Now obviously, this isn’t doing any classification just yet. That’s what we’re going to use Python for soon.

The final section lists all of the statistically interesting terms. Though not strictly necessary, this provides insights into how MoreLikeThis is working and how it may be tuned for better precision or recall. For more information on tuning MoreLikeThis, please again refer to the Solr wiki.
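For instance, here are a few of the query-time knobs you can turn; the values shown are illustrative, not recommendations:

http://localhost:8983/solr/mlt
    ?q=Id:8723
    &mlt.fl=Title Body
    &mlt.mintf=2                    #Ignore terms occurring fewer than 2 times in the doc
    &mlt.mindf=5                    #Ignore terms found in fewer than 5 documents
    &mlt.maxqt=25                   #Build the query from at most 25 interesting terms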

Building and Demonstrating the Actual Classifier

Now for the fun part! Having indexed the documents, and demonstrated the functionality of the MoreLikeThis handler, it’s conceptually straightforward to put it all together. While you can find all of the code required to create the classifier in my iPython notebook, the crux of it is here:

import requests
from collections import defaultdict

sess = requests.Session()  #shared HTTP session, created once and reused

def classifyDoc(self, docId,
                method="best"  # or "sorted" or "details"
                ):
    #send the MLT query to Solr
    params = {"q": self.idField + ":" + docId,
              "mlt.fl": self.mltFields,
              "fl": self.tagField,
              "fq": self.filterQuery,
              "rows": self.numNearest,
              "wt": "json"
              }
    resp = sess.get(url=self.mltUrl, params=params)

    #Perform error checking
    if resp.status_code != 200:
        raise IOError("HTTP Status " + str(resp.status_code))
    json = resp.json()
    if int(json["match"]["numFound"]) == 0:
        raise RuntimeError("no document with that id")
    if int(json["response"]["numFound"]) == 0:
        raise RuntimeError("no interesting terms in document")

    #If no errors, then collect and count tags for each similar document
    tagDict = defaultdict(int)
    for tagList in json["response"]["docs"]:
        for tag in tagList[self.tagField].split(' '):
            tagDict[tag] += 1

    #Return the best tag, all of the tags sorted best
    #to worst, or the dict of tags and their counts
    if method == "best":
        return max(tagDict, key=tagDict.get)
    elif method == "sorted":
        return sorted(tagDict, key=lambda x: tagDict[x], reverse=True)
    elif method == "details":
        return tagDict

Upon quickly reading through the code, you’ll see that there’s really not much to it. It pulls down the same MoreLikeThis results as listed above, counts up the tags, and returns the most common tag.
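For completeness, here’s a sketch of how such a classifier class might be wired together. The class name, constructor arguments, and defaults below are my own illustrative guesses; the real class lives in the notebook:

class SolrSciFiClassifier(object):
    #Hypothetical wrapper class; the notebook's version differs in detail.
    def __init__(self, mltUrl="http://localhost:8983/solr/mlt",
                 idField="Id", mltFields="Title Body", tagField="Tags",
                 filterQuery="Tags:*", numNearest=10):
        self.mltUrl = mltUrl            #MoreLikeThis endpoint
        self.idField = idField          #unique-key field used to look up documents
        self.mltFields = mltFields      #fields mined for interesting terms
        self.tagField = tagField        #field holding the class labels
        self.filterQuery = filterQuery  #restrict neighbors to documents with tags
        self.numNearest = numNearest    #the "k" in k-NN

    classifyDoc = classifyDoc  #attach the method defined above

c = SolrSciFiClassifier()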

But the question remains… does it work? To test this, I first instantiated our Sci-Fi classifier and checked it against our example document above:

print c.classifyDoc("8723",method="sorted")

Sure enough, the first two items in the list are [star-trek] and [holodeck]. Great, now let’s test this against a larger set. First, though, we need to clearly define what a correct classification entails so that we know how to interpret our results. Given a classifier tag and a list of true tags, we define a correct classification to have occurred whenever the classifier tag is contained within the true tags. (Referring to the iPython notebook, you can see the classifierTester function in which this is all defined.)
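A minimal sketch of that test (the isCorrect helper is my own illustration, not the notebook’s exact code):

def isCorrect(classifierTag, trueTags):
    #correct whenever the predicted tag appears among the true tags
    return classifierTag in trueTags

print(isCorrect("star-trek", ["star-trek", "holodeck"]))   # True

Running our classifier over all tagged documents, we see the following printout: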

4041 out of 5805 correct. That's 69.6124031008%

70%: not bad, especially when you consider the vagaries of some of these tags, and the relatively high possibility that some of these questions are themselves tagged incorrectly to begin with. To get an idea of how accurate the classifier can be, I performed several tests on more constrained document sets. I found that if we only look at the questions for more popular topics (Harry Potter, story identification, Star Trek, and Star Wars), then the accuracy jumps up considerably:

2136 out of 2344 correct. That's 91.1262798635%

Obviously, we can’t know a priori that a given document is in this more popular subset, but this does bring up an important point: k-NN classification will be most accurate for documents that are representative of the training data set.

Conclusions

Easy, wasn’t it? But there are still several things that you can do to improve the classifier for your purposes. There is further information available in the data that we’re not yet making use of! In particular, it shouldn’t be difficult to extend the current classifier so that it not only classifies the document but also conveys the certainty of its prediction. This would be especially useful in building a human-in-the-loop classifier: you let the computer classify all the documents that are “easy” and then have a human review all the documents that the classifier is less certain about.
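As a starting point, the tag counts returned by method="details" already contain the raw material for a confidence score. Here’s one illustrative way to compute one (this helper is my own sketch, not part of the notebook):

def classifyWithConfidence(classifier, docId):
    #confidence = fraction of neighbor votes won by the top tag
    counts = classifier.classifyDoc(docId, method="details")
    bestTag = max(counts, key=counts.get)
    confidence = float(counts[bestTag]) / sum(counts.values())
    return bestTag, confidence

tag, confidence = classifyWithConfidence(c, "8723")
print("%s (%.0f%% of votes)" % (tag, 100 * confidence))   #route low-confidence docs to a human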

Tell us what you think! Do you have a classification problem that you’re trying to solve? What is it? Have you found this article helpful? We want to hear about it!

