Detecting Reddit Voting Rings Using This Weird Little Data Trick

A wolf in sheep’s clothing

Wow, so apparently the Reddit community has outed (and as of 15 minutes ago, ousted) colleague Doug Turnbull as the vile spammer that he is! Apparently he has been shamelessly promoting his own blog posts! And among his tactics, the use of voting rings. Yes, and even some of my own colleagues succumbed to his pressure to upvote his sales pitches – including this one masked as a post about test driven search relevancy. (Though I must admit… it is an interesting read… and Doug is an eloquent writer)

But not I. No, I’m as white as the wind driven snow, as pure as a mountain stream, as innocent as a newborn babe. And as a matter of fact, this has all got me thinking: How can I help prevent things like this from happening in the future? In particular, how can I help Reddit prevent voting rings from forming in the future? And then it occurred to me: the HyperLogLog counter.

What is the HyperLogLog counter?

The HyperLogLog counter is a really neat probabilistic data structure that estimates the count of unique elements in a list. For instance, let’s say you have a really popular web page and you want to perfectly count the number of unique visitors. To keep a count of uniques, you have to store every IP address that you ever see. And upon receiving a new IP address, you have to first check that the new IP address has not been run across before, and only then do you increment the site counter. Under the best of situations, the storage and the computation probably scale as O(log(n)). (What would you use? Tries? Skip-lists?)

With the HyperLogLog counter it’s all different. The size of the data structure is determined a priori (thus O(1)) and is a miniscule fraction of whatever data structure you would use from in the example of the previous paragraph. Data insertion and count tallying are super fast and also scales as O(1). But here’s the catch, you can’t actually know the exact count of unique views. Rather, you may only have an estimate of the current count. I encourage you to read more. This article has been the best resource I’ve found. And make sure you try their cool JavaScript demo which really helps to understand how it all works.

But how can the HyperLogLog counter help beat voting rings??

Let’s turn back to the sad example of disgraced colleague Doug Turnbull. Usually when Doug coerces others to upvote his posts he uses forms of extortion or even threats of physical harm. Naturally I’ve never succumbed to this, but those that do go to Doug’s most recent post (many of which can be found here) and upvote.

But consider what would happen if we created a HyperLogLog counter for every user on Reddit, and any time that a user receives an upvote, we update the corresponding HyperLogLog counter with the id of the user who did the upvoting. Given this setup, here’s how we detect the voting ring. First, for a given user, retrieve the estimated count of unique upvotes from that user’s HyperLogLog counter and let’s label this quantity as estimated_unique_upvotes. Next, tally up the total number of actual upvotes…


these things… Screen Shot 2013-10-15 at 12.23.05 AM
across all of that user’s posts. Let’s call this quantity as

global_total_upvotes. The probability that a given user is involved in a voting ring will correlate highly with the ratio of these numbers. In other words we can define

voting_honesty  =  estimated_unique_upvotes / global_total_upvotes

Think about it. If a user is not in a voting ring, then their upvotes are going to be organically generated from anyone one the world-wide-web that thinks their links are interesting. Because the internet is a big place, this user’s estimated_unique_upvotes and global_total_upvotes will tend toward the same number and their voting_honesty “probability” will tend toward 1. On the other hand, a user wrapped up in seedy underworld of Reddit voting rings will have many more global_total_upvotes across all posts than they have estimated_unique_upvotes. For this user, the voting_honesty will tend toward 0.

Conclusion

Now we’re talking 1′s and 0′s. Aren’t computer’s supposed to be good with those? Yes! So the method I’ve outlined above should be a scalable way to automatically detect and expel those voting charlatans! Are there issues with this strategy? Sure. But I do think that the estimated_unique_upvotes to global_total_upvotes ratio is interesting. Help me think; where could this be used outside of Reddit voting rings?

So what now? Well it’s obvious, right? Upvote this article!

Update Idea: Throwing Away 90% of Reddit’s Infrastructure and Using HLL Approximations Instead.

Hey, I got another idea! I wonder what reddit’s infrastructure looks like? It’s gotta be hell to scale everything up. The most obviously complex thing is tracking all those up and down votes. Furthermore, the most important function of the infrastructure is in making sure that no one votes for a post more than once – otherwise colleague, Doug Turnbull, would cramp to death setting at the computer and click-upvoting his own posts! Since it’s so hard to scale, why not just rip all that stuff out and replace it with HLL approximations instead.

Here’s how: Each post gets 2 HLL counters, one positive votes and one negative. When a user upvotes or downvotes a post, their id is submitted to the corresponding HLL counter. There is never a chance that a vote can be counted twice, and there’s no need for any of the infrastructure that ensures that no one votes twice.

But there are a couple of obvious drawbacks to this approach. One – the vote is now just an estimate – a guess, and while one average this would work out fine, there would certainly be crappy posts getting promoted and great posts getting penalized. Secondly, a side effect of all this click tracking infrastructure is that you lose the ability to trace your own history! And if you again built infrastructure so that you could see your history, you might as well return back to the way Reddit currently does it. But at the very least this idea is worth remembering when building and scaling a website that has, but doesn’t feature, votes.


Check out my LinkedIn Follow me on Twitter

Analytics, Big Data

post-type:post

4 comments on “Detecting Reddit Voting Rings Using This Weird Little Data Trick

  1. John,
    Interesting idea indeed. I think you would have to look at the actual data for reddit upvotes to see how well this works in practice. The biggest issue I see is that for extremely random upvoters this value is 1 and for tight knit spam networks this value is 1/# of posts. My guess is that interesting posts on reddit are often upvoted by a smallish community of people, say /r/programming so that “good” posts don’t have a value close to 1 but more in the middle. We would have to see if this “middle of goodness” is close to the 1/# of posts value of spam networks. My guess is the error and area of detectability is small, but you never know. Cool idea!

  2. Hey Matt, thanks for your feedback. Yeah… I basically agree. There are issues with the idea as stated. You might want to check out my addition at the bottom. It’s another really cool idea… that’s just not quite there. There’s gotta be something cool to do with this data structure!

  3. One issue with this, any group of followers will look like a voting ring. That is, given any publisher’s unique focus, they will attract a set of followers who like to read their posts. That group won’t mimic the behavior of the global set of reddit users as they are following (and upvoting) a based on what they like… So they will appear to be a voting ring even though they aren’t organized in that fashion.

    Worse yet, as a publisher you want to attract a set of passionate followers who follow you and enjoy the content you create. This algorithm would punish that by making that seem nefarious.

    So, it’s a neat idea and would definitely show a “voting ring”. However, the difference between a voting ring and passionate followers is completely opaque to this algorithm…

  4. @Joe I think your points are quite correct. Though it would be interesting to represent the voting_honesty ratio as a Cauchy distribution (which presumes that both the numerator and denominator our Gaussian). And then you could identify any outliers. … Granted, though, outliers might just be a crazy fan following. :-/

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>