Finally! A Hadoop Hello World that isn’t a Lame Word Count!

So I got bored of the old WordCount Hello World, and being a fairly mathy person, I decided to make my own Hello World in which I coaxed Hadoop into transposing a matrix!

What? What’s that you say? You think that a matrix transpose MapReduce is way more lame than a word count? Well I didn’t say that we were going to be saving the world with this MapReduce job, just flexing our mental muscles a little more. Typically, when you run the WordCount example, you don’t even look at the java code. You just pat yourself on the back when the word “the” invariably revealed to be the most popular word in the English language.

The goal of this exercise is to present a new challenge and a simple challenge so that we can practice thinking about solving BIG problems under the sometimes unintuitive constraints of MapReduce. Ultimately I intend to follow this post up with exceedingly more difficult MapReduce problems to challenge you and encourage you to tackle your own problems.

So, without further adieu:

The Matrix Transpose Problem

The matrix transpose is a pretty simple concept. Let’s say that you have some matrix M. Here are the rows of M (preceded by the row number):

0    7 9 3 6
1    4 2 9 8
2    4 6 6 1

You transpose this matrix by “flipping” the values about the diagonal. Here is transpose(M):

0    7 4 4
1    9 2 6
2    3 9 6
3    6 8 1

For tiny matrices like this, transpose is trivial, but for giant, super-jumbo Big Data matrices this can be challenging to do in the constraints of one machine’s RAM. Thus, the matrix transpose is a good candidate for MapReduce.

Outline of a Simple MapReduce Job

As you may know by now, MapReduce is a two phase process composed of a mapping phase and a reducing phase. Simplistically, in the mapping phase, the mapper is given the contents of a Hadoop directory one bit at at time as a key-value pair. In this example, let the key be the rowIndex and the value be the values associated with that row.

{0: [7 9 3 6]}
{1: [4 2 9 8]}
{2: [4 6 6 1]}

It is the goal of the mapper to consume this information, process it, and then emit a number of other key-value pairs K and V. The keys and values can be of any type that you like, and for every iteration of the map function you can emit as many {K: V} pairs as you wish.

Between the mapping phase and the reducing phase, all of these values V are grouped according according to their associated keys K. Then, one at a time, the reducer is given a key K, along with all of the corresponding values V[]. The goal of the reducer is then to consume this information, process it, and emit more key-value pairs. In the case of our example, this new set of key-value pairs will represent the transpose of the original matrix. The keys will be transposeRowIndexs and the associated values will be the associated elements in that row which we refer to as transposeValuess. So, for the example of this post, the reducer will return:

{0: [7 4 4]}
{1: [9 2 6]}
{2: [3 9 6]}
{3: [6 8 1]}

The goal of the game, then, is simply to write two functions, map and reduce, which achieve all of the above:

map(rowIndex,values) :
  #DO STUFF
  return {K:V}

reduce(K,V[]) :
  #DO STUFF
  return {transposeRowIndex: transposeValues}

So, how do you implement this? Think about it! No peeking.

My Solution

Here is my solution

map(rowIndex,values) :
  for(columnIndex = 1 to length(values)) :
    K = columnIndex
    V = {rowIndex: values[columnIndex]}
  return {K:V}

reduce(K,V[]) :
  transposeValues = []
  for(x in V[]) :
    transposeValues[x.key] = x.value
  transposeRowIndex = K
  return {transposeRowIndex: transposeValues}

And if we run our MapReduce algorithm, during the mapping phase we get

map(0,[7 9 3 6]) #=> {0:{0,7}},  {1:{0,9}},  {2:{0,3}},  {3:{0,6}}
map(1,[4 2 9 8]) #=> {0:{1,4}},  {1:{1,2}},  {2:{1,9}},  {3:{1,8}}
map(2,[4 6 6 1]) #=> {0:{2,4}},  {1:{2,6}},  {2:{2,6}},  {3:{2,1}}

In the “shuffle and sort” time before the reduce phase we cluster the values together according to key

{0:   [{0,7},  {1,4},  {2,4}]}
{1:   [{0,9},  {1,2},  {2,6}]}
{2:   [{0,3},  {1,9},  {2,6}]}
{3:   [{0,6},  {1,8},  {2,1}]}

And finally, during the reducing phase we get

reduce({0: [{0,7},{1,4},{2,4}]})  #=>  {0: [7 4 4]}
reduce({1: [{0,9},{1,2},{2,6}]})  #=>  {1: [9 2 6]}
reduce({2: [{0,3},{1,9},{2,6}]})  #=>  {2: [3 9 6]}
reduce({3: [{0,6},{1,8},{2,1}]})  #=>  {3: [6 8 1]}

And now since we’ve tested out the algorithm, it’s time to transpose something a bit larger…

In Practice

I’m putting together a collection of Hadoop tutorials and I’m calling it Hadoopadoop. Check it out! Once you download it, you can quickly run my MatrixTranspose example by issuing the command ./build.sh MatrixTranspose. But please don’t let this just be like running the typical WordCount example. Instead, take a look at the build script and see what it’s doing to build the project. I’m intentionally keeping everything as lightweight as possible. No extra libraries, no Maven, just some java files, some data zips, and a little bit of bash wrangling.

The actual MatrixTranspose/MatrixTranspose.java code is pretty simple. You’ll find that I’m not using any bells and whistles to get the job done. Besides a little extra code to parse and assemble the input and output text, I’m basically stating in code the same thing as I have stated above. However, there are still plenty of things to improve upon. This follow-up article describes such improvements as using a creating a custom combiner to reduce the network traffic during the shuffle/sort phase.

While you’re looking at my Hadoopadoop, take a glance at the WordCount tutorial that I’ve adapted from the WordCount example that ships with Hadoop. Lame though it may be, it’s still remains a great, and simple example of MapReduce.


Check out my LinkedIn Follow me on Twitter

solr

post-type:post