Getting Started with Cassandra: Overview

June 25, 2013 OpenSource Connections
Category: Uncategorized

Over the next couple of weeks, we will be posting a 4-part series introduction to Cassandra. This series will take you through all of the basic components of setting up a Cassandra cluster.

This is the first of the 4-part series.

Overview

Cassandra has been in the news a lot recently: Many companies are turning to NoSQL solutions to address their growing data needs.

With any new technology comes fear, uncertainty, and the never-ending barrage of SQL-vs-NoSQL flame wars. These posts seek to address these topics in a way to give you a good sense of what Cassandra is, and why you might want to implement a new technology in your stack.

What is it good for?

How do I set up my data model?

Is Cassandra the silver bullet for all of my data storage needs?

Why is my data model wrong?

Cassandra is not a single-serve solution – it’s highly-specialized for certain use cases. To understand what Cassandra is best at, we need to take a look at why a new database was created in the first place.

History

Cassandra began at Facebook in 2007 as a joint effort between Avinash Lakshman and Prashant Malik to power the new company’s massively-growing inbox search platform. Lakshman, originally from Amazon, had co-invented Amazon’s Dynamo platform for storing distributed data. Thus, Cassandra borrows heavily from the Dynamo model, Google’s BigTable system, and many other distributed systems.

Facebook needed to handle a very high throughput – up to billions of writes per day – while replicating their data across geo-graphically distributed systems to keep search latency down. The system also needed to be able to scale with the growth of users, as Facebook was adding many, many users per day.

Fast, Available, Eventually Consistent

From this came Cassandra, a geographically-distributed, very available, highly-scalable columnar storage system.

Fig 1: A 3-node Cassandra cluster.

It’s constantly available, even with server failures.

Data is replicated across many different nodes: If one goes down, there are many more to take its place.

It handles fast writes.

Because any node can respond to a request, the entire system can handle throughput of tens of thousands of writes per second.

Read speed is configurable, making the data eventually consistent.

Data consistency is adjustable, meaning you decide to focus on speed or consistency.

Performance scales linearly.

To get more performance, add another node. Cassandra was built to handle hundreds of terabytes of data across hundreds and hundreds of nodes.

No Silver Bullet: The Tradeoffs of Using Cassandra

A common misconception is that Cassandra simply replaces extant relational databases. Cassandra is not a relational database. In fact, this is so important, I’ll repeat it: Cassandra is not a relational database.

This means a number of things:

Cassandra doesn’t value ACIDity, it holds an eventually consistent model, so data may be outdated when a query is sent.
Deletes are hard in append-only systems. Data must be continually compacted to ensure tombstones are properly propagate throughout the entire system.
And, of course, JOINs do not exist in Cassandra; operations like this are simply too expensive.

Instead, Cassandra column families (tables) are modeled around the queries you intend to ask.

An Example

To clarify the image, let’s look at a quick example. Say you run a social networking site, and you want to display all of a user’s followers, as well as everyone that they follow.

In a relational world, we set this up using two tables: a User table, and a User_Follower_Link table.

We set up our schema thusly:

  User  user_id     name  1           Patricia  2           John  3           Scott  4           Doug  User_Follower_Link (userA follows userB)  id      userA       userB  x           1           2  x           1           3  x           4           1  x           1           4  x           3           1  x           4           2  x           3           4

So, to find the list of John’s followers, we’d run:

  select * from User  join User_Follower_Link on user_id = userB  where name = 'John';  Finding out who each user follows is similar:  select * from User  join User_Follower_Link on user_id = userA  where name = 'John';

This is fine for small datasets, but what happens when you grow to millions of users? Your simple JOIN will take a very long time to scan through and compare each row, or your database could even time out from the strain.

Modeling this problem in Cassandra is quite simple, and requires two separate column families.

Each user composes a new row, each containing multiple columns. Note that the actual columns need not match.

Users_Followers

  patricia: ['Scott'], ['Doug']  john: ['Scott'], ['Patricia']  scott: ['Patricia']  doug: ['Patricia'], ['Scott']

Users_Following

  patricia: ['John'], ['Doug'], ['Scott']  john:  scott: ['Patricia'], ['Doug']  doug: ['John'], ['Patricia']

In reality, each column comprises a key-pair, a timestamp, and an optional TTL, though it’s been simplified for our model.

Retrieving the data is quite similar using CQL (Cassandra Query Language).

To get a list of John’s followers:

  select * from Users_Followers where name = 'john';

To get a list of who John is following:

  select * from Users_Following where name = 'john';

Notice how the actual data is duplicated across different column families. This is encouraged – data should be optimized for the application, and not the other way around.

Conclusion

Cassandra works very well for a specific set of problem areas. If you have a use case that you think warrants Cassandra, let us know! What are you considering at your company? What tradeoffs are you concerned about?

Next week we will discuss the Cassandra architecture, and go over how each component fits together.