Cassandra – Data model – Basic Goals

These are the two high-level goals for your data model:

  1. Spread data evenly around the cluster
  2. Minimize the number of partitions read

Rule 1: Spread Data Evenly Around the Cluster

You want every node in the cluster to have roughly the same amount of data. Cassandra makes this easy, but it’s not a given. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key. I’ll explain how to do this in a bit.
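As an illustrative sketch (table and column names are hypothetical), compare a high-cardinality partition key, which hashes rows evenly across the cluster, with a low-cardinality one, which concentrates data on a few partitions:

```cql
-- Good: the high-cardinality user_id is the partition key,
-- so rows hash to essentially random nodes across the cluster.
CREATE TABLE users (
    user_id  uuid PRIMARY KEY,
    name     text,
    country  text
);

-- Risky: country is the partition key, so every user from a
-- populous country lands in the same partition, concentrating
-- data on one set of replicas.
CREATE TABLE users_by_country (
    country  text,
    user_id  uuid,
    name     text,
    PRIMARY KEY (country, user_id)
);
```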

Rule 2: Minimize the Number of Partitions Read

Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.

Why is this important? Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.
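A sketch of a single-partition read, using a hypothetical table of posts: making user_id the partition key and posted_at a clustering column keeps all of one user's posts together, so a typical query touches exactly one partition:

```cql
CREATE TABLE posts_by_user (
    user_id    uuid,
    posted_at  timeuuid,
    body       text,
    PRIMARY KEY (user_id, posted_at)
) WITH CLUSTERING ORDER BY (posted_at DESC);

-- All of this user's posts live in one partition, so the read
-- goes to a single set of replicas and scans contiguous rows.
SELECT * FROM posts_by_user WHERE user_id = ? LIMIT 20;
```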

Conflicting Rules?

If it’s good to minimize the number of partitions that you read from, why not put everything in a single big partition? You would end up violating Rule #1, which is to spread data evenly around the cluster.

The point is, these two goals often conflict, so you’ll need to try to balance them.

Cassandra – More Writes, Data Duplication, and Denormalization

It's Okay to Perform More Writes

Writes in Cassandra aren’t free, but they’re awfully cheap. Cassandra is optimized for high write throughput, and almost all writes are equally efficient [1]. If you can perform extra writes to improve the efficiency of your read queries, it’s almost always a good tradeoff, because reads tend to be more expensive and are much more difficult to tune.

Denormalization and Data Duplication Are Normal

Denormalization and duplication of data are a fact of life with Cassandra. Don’t be afraid of them. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPS, or network), and Cassandra is architected around that fact. To get the most efficient reads, you often need to duplicate data.
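A common illustration (table and column names are hypothetical): to look users up by either username or email, duplicate the user data into two tables, one per query pattern, and write to both:

```cql
CREATE TABLE users_by_username (
    username  text PRIMARY KEY,
    email     text,
    name      text
);

CREATE TABLE users_by_email (
    email     text PRIMARY KEY,
    username  text,
    name      text
);

-- The application writes each user to both tables, trading cheap
-- extra writes (and disk space) for single-partition reads on
-- either lookup key.
```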