Blog

Using Solr Join to find the best time to ask questions on StackOverflow

So lets say that you have an important tech question that simply must be answered:

“What’s the difference between JavaScript and Java?”

Normally you would post it on StackOverflow and add a hefty bounty to get it answered fast. But, youve posted a bounty on the past 10 questions and now your Stack Overflow reputation is 4.

Dont fret, perhaps if you just time your question correctly you can catch all those Java/JavaScript programmers right when theyre answering important questions like yours. And how do you figure out just when that magic time is? Simple, you index the entire StackOverflow data dump into Solr and treat Solr as a StackOverflow analytics engine. (Hey, you may not know the difference between Java and JavaScript, but youre nobodys fool when it comes to Solr!)

So heres what this looks like. The post.xml file in the StackOverflow data dump contains all the questions and answers on the site. Posts contain the following fields:

  • Id – Unique id for a question or answer.
  • ParentId – If this post is an answer, ParentId refers to the corresponding question.
  • PostTypeId – 1 for a question, 2 for an answer.
  • CreationDate – In Greenwich Mean Time.
  • Body – The contents of the post.
  • Title – You guessed it.
  • Tags – A list of topics for this question.

In order to slice and dice the data to find the best time of year, day of week, or time of day to answer a question its a good idea to break up the CreationDate into a set of related fields:

  • CreationMonth – 1 through 12.
  • CreationHour – 0 through 23.
  • CreationMinute – 0 through 60.
  • CreationDayOfWeek – 0 (Monday) through 6.
  • CreationDayOfYear – 1 through 365.

Now all you have to do to find out that golden time for asking a question is to find the times when most people are answering questions about Java AND Javascript.

http://localhost:8983/solr/collection1/select?q=Tags:(java AND javascript)&fq=PostTypeId:2&facet=on&facet.field=CreationDayOfYear&f.CreationDayOfYear.facet.limit=365&facet.field=CreationDayOfWeek&facet.field=CreationHour&facet.sort=index

In words, the query q is for all questions tagged with both java and javascript. These results are filtered fq so that only answers are returned. The remainder of the parameters turn on sorted facet lists for times of the year, week, and day. So, as soon as you get query Solr, youll know the best times of the year, week, and day to ask your questions. You press enter and SNAP no results! What gives?!

After a little research it turns out that only questions (PostTypeId=1) have the Tags field – so obviously you can not get a count of the answers tagged with Java AND JavaScript. So are you sunk? Is there no way to find out when the Java/JavaScript questions are getting all the attention? Are you going to have to do some crazy MapReduce indexing job to associate answers with their corresponding tags? It turns out no!

Solr Join to the Rescue

Thats right, Solrs Join functionality is a perfect fit for this particular problem. Lets take a look at how this would work:

http://localhost:8983/solr/collection1/selectq={!join from=Id to=ParentId}Tags:(java AND javascript)&facet=on&fq=PostTypeId:2&facet.field=CreationDayOfYear&f.CreationDayOfYear.facet.limit=365&facet.field=CreationDayOfWeek&facet.field=CreationDayOHour&facet.sort=index

As you can see, the only difference here is strange notation at the front of the q parameter.

{!join from=Id to=ParentId}

That is Solrs local parameter notation, and heres what its telling Solr to do: First you have join this is actually syntactic sugar for the first parameter only. Its the same thing as saying type=join. This means that instead of using the lucene or dismax query mode, we will be using the join query mode. Next we have from=Id. To put this in SQL terms, this means that we will be be using Id as the primary key. Finally we have to=ParentId which, as you might have guessed, implies that ParentId will be used as the foreign key.

When we issue the query, Solr first retrieves a list of documents matching the query Tags:(java AND javascript). Then, for every document in that result set, Solr retrieves the set of documents that have a ParentId corresponding to the Ids in the original set.

In SQL world, this query would look like this:

SELECT *FROM collection1WHERE ParentId IN (SELECT Id FROM collection1 where Tag = "(Java and Javascript")

And now as soon as you you issue the query, you get the following Solr response:

    2    4    12    5    10    8    2        9    3    4    5    63    44    122    65    120    48    62        29    63    434    371    390    383    422    369    266    272

You can imagine how this data could easily be used to build visualizations of the best times to query Stack Overflow for your particular topic. And actually, we are in the process of building such a visualization capability right now. See Patricias new post for an example.

Also, if youre interested in playing with this yourself, check the repo on GitHub.


Check out my LinkedIn Follow me on Twitter