Getting Started with Cassandra: Data Partitioning

September 12, 2013 OpenSource Connections
Category: Uncategorized

Today well talk about how Cassandra partitions data.

When data enters Cassandra, the partition key (row key) is hashed with a hashing algorithm, and the row is sent to its nodes by the value of the partition key hash.

The Old Method

To understand how data is distributed amongst the nodes in a cluster, its best to start with the old method of separating data. Each node was assigned to a range of values, so that the entire cluster covered the possible hashed values from 0 – 2^127-1.

This proved problematic in a couple of ways:

Adding and removing nodes meant that the entire scope of range assignments changed and needed to be reshuffled.
`Hot spots can occur across ranges of data

Node without virtual nodes

Virtual Nodes

Cassandra 1.2 introduced the option of virtual nodes. Now, instead of segmenting data into token ranges to assign to one node, the range of possible data is split up into many, many smaller tokens. Now, each real node gets many smaller tokens, not fewer large tokens.

This minimizes the chances of accidental hotspots, and removes the need for reshuffling upon scale. New nodes are simply assigned a set of tokens from the extant token space, and other nodes re-shuffle smaller pieces to that new node.

Node with virtual nodes

Enabling virtual nodes in your Cassandra cluster is simple – make sure you are running Cassandra > 1.2, and edit your cassandra.yaml:

# This defines the number of tokens randomly assigned to this node on the ring# The more tokens, relative to other nodes, the larger the proportion of data# that this node will store. You probably want all nodes to have the same number# of tokens assuming they have equal hardware capability.#num_tokens: 256# If blank, Cassandra will request a token bisecting the range of# the heaviest-loaded existing node.  If there is no load information# available, such as is the case with a new cluster, it will pick# a random token, which will lead to hot spots.initial_token:

Adjust num_tokens to reflect the capability of the given machine: Heartier machines are assigned more tokens, less-capable ones are given less of a load.

Leave initial_token blank; this lets Cassandra choose the first token. Again, hot spots are not a problem, because virtual nodes minimizes each token space, which minimizes the chance of creating a hot spot.

Next time well talk about the read and write paths of Cassandra.