
Quick Start with Neo4J using YOUR Twitter Data

When learning a new technology, it's best to have a toy problem in mind so that you're not just reimplementing another glorified “Hello World” project. Also, if you need lots of data, it's best to pull in a fun data set that you already have some familiarity with. This allows you to lean on your established intuition about the data set so that you can make use of the technology more quickly. (As an aside, this is just why we so regularly use the StackExchange SciFi data set when presenting our new ideas about Solr.)

When approaching a graph database technology like Neo4J, if you're as avid a Twitter user as I am, then POOF, you already have the best possible data set for becoming familiar with the technology: your own social network. This blog post will help you download and set up Neo4J, set up a Twitter app (needed to access the Twitter API), and pull down your social network as well as any other social network you might be interested in. At that point we'll interrogate the network using Neo4J and the Cypher query language. Let's go!

Installing and setting up Neo4J

Since we're not setting Neo4J up for production use, this part's real easy. Just go to the Neo4J download page, click on that giant blue download button, and 36.1 MB later you'll have your very own copy of Neo4J. Unzip it to some reasonable place on your machine, cd into that directory, and simply issue the command bin/neo4j start. (Once you're finished, a bin/neo4j stop will shut Neo4J down.) Now if you point your browser at http://localhost:7474 and see stuff (rather than lack of stuff), then you're ready to start shoveling data into Neo4J.
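If you'd rather check for "stuff" from a script than from a browser, here is a minimal sketch using only the Python standard library. The function name neo4j_is_up and the 2 second timeout are my own choices for illustration; the only thing taken from the steps above is the http://localhost:7474 address.

```python
import urllib.error
import urllib.request


def neo4j_is_up(url="http://localhost:7474/", timeout=2.0):
    """Return True if something answers HTTP requests at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status (for example 401).
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    print("Neo4J reachable:", neo4j_is_up())
```

Note that this only tells you that something is listening on that port, not that it is definitely Neo4J, so treat it as a quick sanity check.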

Prepping Twitter

You'll need to create a Twitter app before you can start pulling down your connections because you need the app's credentials in order to access Twitter's API. But don't sweat it, this literally takes less than a minute. Just go to the Twitter developer apps page, sign in, and there will be yet another big blue button, this time labeled “Create a new application” – click it! After filling out a really short form, checking the “I blindly agree to whatever is included in this legal contract” checkbox, entering a CAPTCHA string, and clicking the “Create your own Twitter application” button, you will indeed have your very own Twitter app. You'll be taken to a screen that contains the details for your new app, most importantly the OAuth credentials. Initially, you won't have the access tokens, but you can click the “Create access tokens” button at the bottom, and the next time you refresh the page (wait a few seconds) you'll see that the access tokens are available. Keep track of the credentials here because you'll need to refer to them soon.

Scraping Your Social Circles from Twitter

Check out my Python TwitterScraper script. Though it's not yet the most beautiful code, it doesn't really matter, because there's not much here! Let's take a moment to walk through it. The first section is where you set up Twitter and Neo4J. Naturally you'll need to pip install the Tweepy and Py2Neo libraries, but they don't have any weird dependencies, so this shouldn't be a problem. Also notice that this is where all the access keys for your Twitter app are used. Go ahead and copy and paste your credentials there. Now you should be ready to go.
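The script expects you to paste the credentials straight into it. If you'd rather keep them out of the source file, here is one possible alternative, sketched under a few assumptions: the environment variable names are made up for this example, and the Tweepy calls (OAuthHandler, set_access_token, API) are the OAuth 1 entry points in the Tweepy versions I'm aware of, so check them against the version you installed. This is not how the TwitterScraper script itself is written.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=os.environ):
    """Collect the four OAuth values, failing loudly if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing Twitter credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


def make_twitter_api(creds):
    """Build a Tweepy client from the credentials returned above."""
    import tweepy  # imported lazily so the loader works without Tweepy

    auth = tweepy.OAuthHandler(
        creds["TWITTER_CONSUMER_KEY"], creds["TWITTER_CONSUMER_SECRET"]
    )
    auth.set_access_token(
        creds["TWITTER_ACCESS_TOKEN"], creds["TWITTER_ACCESS_TOKEN_SECRET"]
    )
    return tweepy.API(auth)
```

With the four variables exported in your shell, make_twitter_api(load_twitter_credentials()) should hand back a client object.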

The remaining code includes two functions. The first, create_or_get_node, creates or gets a node (in this case a Twitter user) from Neo4J by id_str, and if it's creating the node for the first time, it also inserts all of the relevant user metadata into Neo4J. It optionally takes a list of labels that will later be used to group certain users together. The second function, insert_user_with_friends, takes a Twitter user (via their screen name), pulls all relevant metadata for that user from the Twitter API, and inserts it into Neo4J. This function then does the same thing for all the individuals that this Twitter user follows. Finally, insert_user_with_friends establishes a FOLLOWS relationship linking the source Twitter user to those that she follows. Here again, insert_user_with_friends takes an optional list of labels that can be used to group the seed nodes (those that are followed do not get labeled).
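To make the shape of those two functions concrete, here is a toy version that swaps Neo4J for a plain in-memory structure and takes already-fetched user dicts instead of calling the Twitter API. Everything about it is an assumption made for illustration (the TinyGraph class, the made-up id_str values, and the choice to add labels even when the node already exists); the real script talks to Neo4J through Py2Neo.

```python
class TinyGraph:
    """In-memory stand-in for Neo4J: nodes keyed by id_str plus FOLLOWS edges."""

    def __init__(self):
        self.nodes = {}       # id_str -> {"props": dict, "labels": set}
        self.follows = set()  # (follower id_str, followed id_str)


def create_or_get_node(graph, user, labels=()):
    """Return the node for `user`, creating it (with metadata) on first sight."""
    node = graph.nodes.get(user["id_str"])
    if node is None:
        node = {"props": dict(user), "labels": set()}
        graph.nodes[user["id_str"]] = node
    node["labels"].update(labels)
    return node


def insert_user_with_friends(graph, user, friends, labels=()):
    """Store a seed user, everyone they follow, and the FOLLOWS edges."""
    create_or_get_node(graph, user, labels)  # only the seed gets labels
    for friend in friends:
        create_or_get_node(graph, friend)    # followed users stay unlabeled
        graph.follows.add((user["id_str"], friend["id_str"]))


graph = TinyGraph()
me = {"id_str": "1", "screen_name": "JnBrymn"}      # id_str is made up
friend = {"id_str": "2", "screen_name": "mesirii"}  # id_str is made up
insert_user_with_friends(graph, me, [friend], labels=["SeedNode"])
insert_user_with_friends(graph, me, [friend], labels=["SeedNode"])  # no duplicates
print(len(graph.nodes), len(graph.follows))
```

Running the insert twice still leaves two nodes and one edge, which is the whole point of "create or get": you can rerun the scraper without piling up duplicate users.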

The last bit of the script is the fun part. This is where you programmatically lay out the social networks and individuals that you want to stalk… er, uh… observe. For your convenience, I've added all of the OpenSource Connections team, as well as several notable individuals from the Neo4J community. I've also included grouping labels that I thought were pretty reasonable descriptors for these individuals and groups. As that last comment in the code states, make sure to add several people that you follow as well. Remember, the goal here is to create a data set that you are eminently familiar with. Once you're happy with the data set, then run it: python TwitterScraper.py. It will pull down Twitter users 200 at a time and insert them into Neo4J as fast as possible. Soon the program will hit Twitter's rate limit cutoff, at which point the script will wait until the rate limit has been lifted and then continue pulling down the rest of the data. All together, you can plan on getting around 200 updates per minute.
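The "200 at a time, then wait out the rate limit" behavior can be sketched in a few lines. This is not the script's actual code: the RateLimited exception is a hypothetical stand-in for whatever error your Twitter client raises, the 60 second wait is an arbitrary placeholder, and fetch is any function you supply that turns a batch of ids into user records.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the error a Twitter client raises."""


def batches(items, size=200):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_with_backoff(fetch, batch, wait_seconds=60, sleep=time.sleep):
    """Call fetch(batch); on RateLimited, sleep and retry the same batch."""
    while True:
        try:
            return fetch(batch)
        except RateLimited:
            sleep(wait_seconds)


def fetch_all(ids, fetch, sleep=time.sleep):
    """Fetch every id in batches of 200, waiting out rate limits on the way."""
    users = []
    for batch in batches(ids):
        users.extend(call_with_backoff(fetch, batch, sleep=sleep))
    return users
```

Passing sleep in as a parameter is just a convenience so you can try the loop with a fake fetch function and no real waiting.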

Start Infiltrating the Social Network!

Now for the fun part: let's start putting some queries together and pulling back interesting data. In all of the examples below, we will be using the default Neo4J browser, which you'll still find at http://localhost:7474/. Here we're using the Cypher query language. This blog post won't go into too much detail about Cypher syntax itself, but feel free to look at the very rich Neo4J documentation. Also, I'll be using my own Twitter screen name “JnBrymn” as an example, so feel free to replace my screen name with your own and try the queries for yourself.

First off, let's make sure the data we've ingested seems reasonable. The most obvious thing to do is to make sure we're actually in the data set:

MATCH (n {screen_name:"JnBrymn"})
RETURN n

Up pops an orange node representing me. And if I click on the node, I see a list of all my metadata.

[Screenshot: node metadata shown in the Neo4J browser]

I wonder just how many users we have indexed now?

MATCH (n)
RETURN count(*)

7098 users, not bad. How many am I following?

MATCH (n {screen_name:"JnBrymn"})-[:FOLLOWS]->(o)
RETURN count(*)

371 – yep, that looks right. And check out how easy Cypher is: you're basically drawing ASCII art of the node connections. So it's easy to ask the next obvious question: How many are following me? Here I just switch the direction of the relationship arrow:

MATCH (n {screen_name:"JnBrymn"})<-[:FOLLOWS]-(o)
RETURN count(*)

Hmm… only 10 followers. Am I really that unpopular? (Checking Twitter now.) No, it says I've got 460 followers. Oh, that's right: if you'll remember, we're only collecting outbound FOLLOWS relationships from our seed users (labeled as SeedNode). The reason is that some people, Justin Bieber for example, are followed by millions of Twitter users! And we certainly don't want to keep track of that for now.
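If the undercount still seems odd, a toy example may help. The edges below are invented (only my screen name is real), and plain Python sets stand in for the graph; the point is simply that throwing away every edge that doesn't start at a seed user also throws away most of the arrows pointing at you.

```python
# Each pair is (follower, followed). Names other than JnBrymn are made up.
true_follows = {
    ("JnBrymn", "seed_a"),
    ("seed_a", "JnBrymn"),
    ("stranger_1", "JnBrymn"),
    ("stranger_2", "JnBrymn"),
}
seeds = {"JnBrymn", "seed_a"}

# The scraper only records edges that start at a seed user.
collected = {(src, dst) for src, dst in true_follows if src in seeds}


def follower_count(edges, user):
    """Count edges pointing at `user`, like flipping the arrow in Cypher."""
    return sum(1 for _, dst in edges if dst == user)


print(follower_count(true_follows, "JnBrymn"))  # followers on Twitter
print(follower_count(collected, "JnBrymn"))     # followers visible in the graph
```

This prints 3 and then 1: the two strangers follow me on Twitter, but since they aren't seed users, their edges never make it into the graph.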

But all this makes me think, of the seed users that I follow, who does not follow me back?

MATCH (n {screen_name:"JnBrymn"})-[:FOLLOWS]->(o:SeedNode)
WHERE NOT (o)-[:FOLLOWS]->(n)
RETURN o.screen_name

This returns a single name: mesirii. This is Michael Hunger, one of the Neo4J hot shots. If he's not following me back, then I'm definitely not doing a good job of infiltrating the Neo4J community yet. No matter… I bet he's a @justinbieber follower anyway… let's check:

MATCH (n:SeedNode)-[:FOLLOWS]->(o {screen_name:"justinbieber"})
RETURN n.screen_name

Sadly… no one on our list follows Justin Bieber… I was sure I would have some good blackmail fodder there! (But hey, maybe you'll discover some Beliebers in your own data set 😛 )
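As an aside, the "who doesn't follow me back" query from a moment ago boils down to a set difference, which you can see by writing the same logic over a plain set of (follower, followed) pairs. This is only an illustration of what the pattern means, with invented edges and a helper name I made up; it says nothing about how Neo4J evaluates the query internally.

```python
def not_following_back(edges, me, seeds):
    """Seed users that `me` follows but who do not follow `me` in return."""
    i_follow = {dst for src, dst in edges if src == me}
    follow_me = {src for src, dst in edges if dst == me}
    return sorted((i_follow & seeds) - follow_me)


# Toy data; only the shape matters, the edges are invented.
edges = {
    ("JnBrymn", "seed_a"),
    ("seed_a", "JnBrymn"),
    ("JnBrymn", "seed_b"),
    ("JnBrymn", "not_a_seed"),
}
print(not_following_back(edges, "JnBrymn", {"seed_a", "seed_b"}))
```

Here only seed_b comes back: seed_a follows me in return, and not_a_seed is filtered out the same way the :SeedNode label filters the Cypher pattern.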

Hmm… well, if I'm going to break into the Neo4J community, I need to find my likely vectors. Let's create a list of all the Neo4J people who follow me and order them by the number of Neo4J people that they follow. Maybe I can get introductions through these friends:

MATCH (n:Neo)-[:FOLLOWS]->(m:SeedNode {screen_name:"JnBrymn"}),
      (n)-[:FOLLOWS]->(o:Neo)
RETURN count(*), n.screen_name
ORDER BY count(*) desc
LIMIT 10

This returns:

count(*) |  n.screen_name
---------+---------------
13       |  wefreema
11       |  technige

Sweet, so my friends wefreema and technige look like my gatekeepers to the Neo4J community. The only thing left to determine is what people I need to connect to.

MATCH (n:Neo)-[:FOLLOWS]->(o)
RETURN count(*), o.screen_name
ORDER BY count(*) desc
LIMIT 10

This query enumerates the most popular people among the Neo4J community based upon who my Neo seed nodes are following. And the results of this query look like this:

count(*) |  o.screen_name
---------+---------------
13       |  mesirii
12       |  emileifrem
12       |  jimwebber
12       |  digitalstain
11       |  apcj
11       |  cleishm
11       |  pandamonial
11       |  iansrobinson
11       |  p3rnilla
11       |  neo4j

As expected, plenty of these people are SeedNodes that I selected because I already knew them to be leaders in the community: mesirii, emileifrem, jimwebber, p3rnilla, neo4j. But who are these guys: digitalstain, apcj, cleishm, pandamonial, iansrobinson? After quickly looking them up on Twitter, I think we've discovered some new, key players in the Neo4J space.
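If the count(*) … ORDER BY … LIMIT combination feels opaque, it may help to see the same idea in plain Python: group the edges by who is being followed, count, and take the top few. The edges and names below are invented, and most_followed is a helper I made up for this sketch rather than anything from the scraper.

```python
from collections import Counter


def most_followed(edges, group, limit=10):
    """Rank accounts by how many members of `group` follow them."""
    counts = Counter(dst for src, dst in edges if src in group)
    return counts.most_common(limit)


# Invented edges: three community members and who they follow.
edges = {
    ("neo_a", "hub"), ("neo_b", "hub"), ("neo_c", "hub"),
    ("neo_a", "niche"), ("neo_b", "niche"),
    ("outsider", "hub"),
}
print(most_followed(edges, {"neo_a", "neo_b", "neo_c"}))
```

The outsider's follow doesn't count toward hub's total, just as the (n:Neo) label keeps non-community followers out of the Cypher result.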

Conclusion

This is only an intro to Neo4J. There are plenty of things that we could have talked about here: I could have gone into much more detail about the Cypher query syntax, I could have added indexes to speed up query times, and I could have put together some even crazier Cypher queries that make use of the broader Cypher syntax. But this is a good start. I think that you'll agree: by looking at your own Twitter social graph, you'll immediately think of questions that you want to ask and you'll get a better understanding of what possibilities are out there.

Want to learn more about Cypher? Well, I might just be co-authoring a book on that very subject! Stay tuned.

Update – Crowdsourcing a Collection of Key Community Figures

Apparently some people are already using this post to search through their own communities of interest. Let's help each other out. If you're tracking a community, then comment below with the Twitter screen names of the key figures from the community. I'll edit the comments later to coalesce clean lists.

