Using Cassandra to Build a Naive Bayes Classifier of Users Based Upon Behavior

In our last post, we found out how simple it is to use Cassandra to estimate ad conversion. It’s easy, because effectively all you have to do is accumulate counts – and Cassandra is quite good at counting. As we demonstrated in that post, Cassandra can be used as a giant, distributed, redundant, “infinitely” scalable counting framework. During this post will take the online ad company example just a bit further by creating a Cassandra-backed Naive Bayes Classifier. Again, we see that the “secret sauce” is simply keeping track of the appropriate counts.

In the previous post, we helped equip your online ad company with the ability to track ad conversion rates. But competition is steep and we’ll need to do a little better than ad conversion rates if your company is to stay on top. Recently, suspicions have arisen that ads are often being shown to unlikely customers. A quick look at the logs confirms this concern. For instance, there was a case of one internet user that clicked almost every single ad that he was shown – so long as it related to the camping gear. Several times, he went on to make purchases: a tent, a lantern, and a sleeping bag. But despite this users obvious interest in outdoor sporting goods, your logs indicated that fully 90% of the ads he was shown were for women’s apparel. Of these ads, this user clicked none of them.

Let’s attack this problem by creating a classifier. Fortunately for us, your company specializes in two main genres, fashion, and outdoors sporting goods. If we can determine which type of user we’re dealing with, then we can improve our conversion rates considerably by simply showing users the appropriate ads.

Naive Bayes Classifiers

With this goal in mind, let’s look at the theory behind Naive Bayes Classifiers so that we can build our own. The purpose of a classifier is to identify which group a sample belongs to based upon the given evidence. In this case, our “sample”, is an individual user, and based upon the evidence of which ads she clicks, we wish to identify which group she belongs to: fashion or outdoors. To put some math to the problem, consider the following question:

What is the probability that user is from group G given the fact that this user has clicked on ads A, B, and C?

To put this into equation form, we can write:

\displaystyle P(G|A,B,C)

This function returns a probably, a number from 0 to 1, representing how likely it is that this user is from a particular group based upon the fact that they have clicked on these ads. The goal, then, is to evaluate this equation with each group and then find which group leads to a bigger result. But how do you evaluate this equation? Fortunately for us, Thomas Bayes, a clergyman from the 18th century provided an answer in the form of Bayes’s equation:

\displaystyle P(G|A,B,C,\cdots) = \frac{P(G)\times P(A,B,C,\cdots|G)}{P(A,B,C,\cdots)}

Here’s we’ve turned one probability into the function of three separate probabilities:

  • P(G) – the probability, in the absence of any evidence, that a user is from a particular group – this is called the prior
  • P(A,B,C,\cdots|G) – the probability that a user from group G will have clicked ads A, B, C, etc.
  • P(A,B,C,\cdots) – the probability that a user from any group will click ads A, B, C, etc.

This looks a little confusing, but bear with me a moment and we’ll see how this allows us to solve our classification problem. Let’s first look at the probability P(G). We happen to know that both the fashion group and the outdoor group are about equally strong, so for simplicity sake, we assume that P(\textrm{fashion}) = P(\textrm{outdoors}) = 0.5. But remember, we ultimately intend to identify the group which maximizes this equation. Since P(G) = 0.5 in both cases, it does not affect the outcome and can safely be disregarded. Next up, P(A,B,C,\cdots). We could estimate the probability that users click on particular groups of ads, but here again we’re looking for the group that maximizes this above equation, and since the value of P(A,B,C,\cdots) is not a function of the group in consideration, this component remains constant across all groups and can also be safely be disregarded.

The only piece left is P(A,B,C,\cdots|G), the probability that a user from group G will click on ads A, B, C, etc. Since this piece is a function of G, we can not disregard it, so we must somehow compute it. And our goal, again, is to find the group G which maximizes this probability. … But we have a problem. This particular probability, as stated, can not be computed. It’s intractable. It’s mathematically infeasible to gather enough information to estimate the probability that a member of group G will click on any particular set of ads. So we do what any good applied mathematician will do when hitting a wall like this, we’ll make a simplifying assumption. If we assume that ad clicks are completely independent from one another, then we can deal with them each separately. Thus:

P(A,B,C,\cdots|G) \approx P(A|G)\times P(B|G)\times P(C|G)\times \cdots

Here, each piece, P(A|G), P(B|G), P(C|G), etc., is actually quite simple to estimate. And though this assumption might be a bit naive — this is, after all, reason that this classifier is called the Naive Bayes Classifier — the resulting classifier has empirically been show to work quite well across a wide range of applications and even in certain cases where this assumption is not only naive, but actually quite wrong.

Finally, we have arrived at something we can deal with. Let’s take a moment to recap: In order to find the most likely group G that a user belongs to based upon their ad clicks A, B, C, etc., we must find which group maximizes the equation:

P(G|A,B,C,\cdots)

But after using Bayes’s equation and discarding some unnecessary pieces we recognize that we can determine the most likely group by finding the group which maximizes this function:

P(A,B,C,\cdots|G)

And finally, after making the simplifying assumption regarding the independence of clicks, we see that determining the most likely group for the user based upon ad clicks is as simple as finding the group which maximizes this equation

P(A|G)\times P(B|G)\times P(C|G)\times \cdots

.

Building the Classifier using Cassandra

Based upon the equation above, for each ad A and for both of the groups G we need to keep estimates of the probabilities P(A|G). To make a definitive determination of which group a user belongs to, we wait until a user makes a purchase. Based upon which of our customers he buys a product from, for instance Zappos vs. REI, it’s easy to determine whether this user is in the fashion group or the outdoor group. As an example, consider the user that we alluded to earlier — the person that purchased the tent, lantern, and sleeping bag. He’s obviously an outdoors person. What’s more, since we can track each user’s click history, we know which ads that he has clicked in the past. And we can use counts of these ads to estimate the probability, that an outdoors person will click on an ad for hiking boots – P(\textrm{HikingBootsAd}|\textrm{Outdoor}).

Here’s how we do this using Cassandra. First, we create a table:

CREATE TABLE probability_of_view_given_group ( 
  ad text,
  group text,
  count counter,
  PRIMARY KEY(ad,group) 
);

Then, whenever a user makes a new purchase, we establish their group, fashion or outdoors, based upon the customer they purchased from. Next, we scoop together all of the ads that this user has clicked and for each ad we make two updates. Let’s demonstrate this with HikingBootsAd for the user above:

UPDATE probability_of_view_given_group
  SET count=count+1
  WHERE ad='HikingBootsAd'
  AND group='ANY_GROUP';

UPDATE probability_of_view_given_group
  SET count=count+1
  WHERE ad='HikingBootsAd'
  AND group='HikingBootsAd';

The first update keeps track of the number of ad clicks no matter the group; you make this update whether or not the user in question is an outdoors enthusiast or a fashion aficionado. The second update is the group specific count. Once you have these two values, then you have all you need to estimate the requisite probabilities. Again in the case of the HikingBootsAd, the probability that someone who clicks on this ad is a outdoors person is roughly equal to the number of outdoor people who have clicked the HikingBootsAd in the past divided by the total number of times that any product purchaser has clicked the HikingBootsAd. In equation form:

\displaystyle P(\textrm{HikingBootAd}|\textrm{Outdoor}) \approx \frac{N(\textrm{Outdoor purchasers who click HikingBootAd})}{N(\textrm{Any purchasers who click HikingBootAd})}

Putting it all together

Let’s see it in action. A user has just opened a web page that belongs to one of your ad affiliates and it’s time to serve up an ad. A quick check reveals that this user has looked at three ads in the past couple of days: LittleBlackDressAd, CuteHighHeelsAd, and most recently HikingBootAd. Here’s how we automatically determine which group this user is in so that we can serve them an ad appropriate to their tastes:

First we grab all available data for the ads in question:

SELECT * FROM probability_of_view_given_group
  WHERE ad='LittleBlackDressAd';

         ad         |   group   | count
--------------------+-----------+-------
 LittleBlackDressAd | ANY_GROUP | 11843 
 LittleBlackDressAd | Outdoors  | 142
 LittleBlackDressAd | Fashion   | 11701

SELECT * FROM probability_of_view_given_group 
  WHERE ad='CuteHighHeelsAd';

       ad        |   group   | count
-----------------+-----------+------- 
 CuteHighHeelsAd | ANY_GROUP | 54127 
 CuteHighHeelsAd | Outdoors  | 53 
 CuteHighHeelsAd | Fashion   | 54074

SELECT * FROM probability_of_view_given_group
  WHERE ad='HikingBootAd';

      ad      |   group   | count
--------------+-----------+-------
 HikingBootAd | ANY_GROUP | 71534
 HikingBootAd | Outdoors  | 63241
 HikingBootAd | Fashion   | 8293

Next we calculate the requisite probabilities by dividing the Outdoors and Fashion counts by the ANY_GROUP counts.

P(\textrm{LittleBlackDressAd}|\textrm{Outdoors}) = 142/11843 = 0.01199 P(\textrm{LittleBlackDressAd}|\textrm{Fashion}) = 11701/11843 = 0.98801

P(\textrm{CuteHighHeelsAd}|\textrm{Outdoors}) = 53/54127 = 0.00098 P(\textrm{CuteHighHeelsAd}|\textrm{Fashion}) = 54074/54127 = 0.99902

P(\textrm{HikingBootAd}|\textrm{Outdoors}) = 63241/71534 = 0.88407 P(\textrm{HikingBootAd}|\textrm{Fashion}) = 8293/71534 = 0.11593

As is often the case when you start tracking metrics like this, you find other interesting facts as a byproduct of your work. In this case we see that the HikingBootAd is surprisingly popular among the fashionistas.

Finally, recall once more that the classification we choose is based upon whichever group maximizes the equation P(A|G)\times P(B|G)\times P(C|G)\times \cdots. Therefore we apply this equation to both groups:

P(\textrm{LittleBlackDressAd}|\textrm{Outdoors}) \quad \times P(\textrm{CuteHighHeelsAd}|\textrm{Outdoors}) \quad \times P(\textrm{HikingBootAd}|\textrm{Outdoors}) \quad = 0.01199 \times 0.00098 \times 0.88407 \quad = 0.00001

P(\textrm{LittleBlackDressAd}|\textrm{Fashion}) \quad \times P(\textrm{CuteHighHeelsAd}|\textrm{Fashion}) \quad \times P(\textrm{HikingBootAd}|\textrm{Fashion}) \quad = 0.98801 \times 0.99902 \times 0.11593 \quad = 0.11443

Based upon these calculations, it is roughly 11,000 times more likely that this person belongs to the fashion group than to the outdoor group. Based upon this classification, we confidently serve up an ad for a cute pink designer purse. As it turns out, this user goes on to purchase the purse. Because she made a purchase from one of our fashion clients, we now know that this user is in the fashion group, and dutifully, we complete the circle of analysis, by collecting the ads that she has viewed and using them to update our counts in Cassandra. Note that this does include another count in favor of that HikingBootAd being attractive for members of the fashion group.

Conclusion

Now that I’ve concluded my sketch for building a Cassandra-based user classification engine, should you go out and pull this all into production? Nah… probably not. As a matter of fact, there are a few issues with the application as I’ve built it up here. For one, just because a person makes one outdoor purchase, this doesn’t mean that forever and always they will be the quintessential outdoorsman. Indeed, in our example here we have identified a particular pair of hiking boots that the fashion folks seem to fancy. Then on the Cassandra side of things, we are required to make several queries before serving up a each individual ad. While Cassandra is quite performant with reads, Cassandra most excels when it comes to quick writes and the example use case outlined in this post is a bit read heavy.

However, Cassandra is still going to be quite performant for accumulating the various counts necessary to implement the Naive Bayes Classifier. There are countless possible applications for such a classifier, and so long as your particular use case does not require an overwhelming number of queries, Cassandra might just be a great tool for your application.


Check out my LinkedIn Follow me on Twitter

solr

post-type:post