These are the two high-level goals for your data model:
- Spread data evenly around the cluster
- Minimize the number of partitions read
Rule 1: Spread Data Evenly Around the Cluster
You want every node in the cluster to have roughly the same amount of data. Cassandra makes this easy, but it’s not a given. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key. I’ll explain how to do this in a bit.
Rule 2: Minimize the Number of Partitions Read
Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.
Why is this important? Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.
Conflicting Rules?
If it’s good to minimize the number of partitions that you read from, why not put everything in a single big partition? You would end up violating Rule #1, which is to spread data evenly around the cluster.
The point is, these two goals often conflict, so you’ll need to try to balance them.