cassandra – column type as collection

Defining a column 

You assign each column a type when you create the table. Columns are declared as a parenthesized, comma-separated list of column name and type pairs; a collection-type column additionally names its element type (or its key and value types, for a map) in angle brackets.

This example shows how to create a table that includes collection-type columns: map, set, and list.

CREATE TABLE users (
  userid text PRIMARY KEY,
  first_name text,
  last_name text,
  emails set<text>,
  top_scores list<int>,
  todo map<timestamp, text>
);
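
As a minimal sketch of how these collection columns might be written to, the statements below insert one row and then add an element to each collection. The row values (the user 'frodo', the email addresses, the scores, and the to-do entries) are made up for illustration.

```cql
-- Set literals use braces, list literals use brackets,
-- and map literals use braces with key: value pairs.
INSERT INTO users (userid, first_name, last_name, emails, top_scores, todo)
VALUES ('frodo', 'Frodo', 'Baggins',
        {'f@baggins.com', 'baggins@gmail.com'},
        [100, 95],
        {'2012-09-24 00:00:00+0000': 'enter mordor'});

-- Add one element to the set.
UPDATE users SET emails = emails + {'fb@friendsofmordor.org'}
 WHERE userid = 'frodo';

-- Append one element to the list.
UPDATE users SET top_scores = top_scores + [87]
 WHERE userid = 'frodo';

-- Set one map entry by key.
UPDATE users SET todo['2012-10-02 12:00:00+0000'] = 'throw my precious into mount doom'
 WHERE userid = 'frodo';
```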

cassandra – composite partition key

Using a composite partition key 

A composite partition key is a partition key consisting of multiple columns. You use an extra set of parentheses to enclose columns that make up the composite partition key. The columns within the primary key definition but outside the nested parentheses are clustering columns. These columns form logical sets inside a partition to facilitate retrieval.

CREATE TABLE Cats (
  block_id uuid,
  breed text,
  color text,
  short_hair boolean,
  PRIMARY KEY ((block_id, breed), color, short_hair)
);

In this example, the composite partition key consists of block_id and breed. The clustering columns, color and short_hair, determine the clustering order of the data within each partition. Rows with the same block_id but a different breed belong to different partitions, which Cassandra will generally store on different nodes; rows with the same block_id and breed share a partition and are stored on the same node.
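
A short sketch of how the Cats table might be queried: both partition key columns are restricted so that Cassandra can locate the partition, and the clustering columns then narrow the result inside it. The uuid and the breed and color values are hypothetical.

```cql
-- Both partition key columns are given, so a single partition is read.
SELECT * FROM Cats
 WHERE block_id = 62c36092-82a1-3a00-93d1-46196ee77204
   AND breed = 'siamese';

-- Inside that partition, the first clustering column narrows the rows returned.
SELECT * FROM Cats
 WHERE block_id = 62c36092-82a1-3a00-93d1-46196ee77204
   AND breed = 'siamese'
   AND color = 'seal';
```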

cassandra – composite primary key

Defining a primary key column 

The only schema information that must be defined for a table is the primary key and its associated data type. Unlike earlier versions, CQL 3 does not require a column in the table that is not part of the primary key. A primary key can have any number (1 or more) of component columns.

If the primary key consists of only one column, you can place the keywords PRIMARY KEY after the column definition:

CREATE TABLE users (
  user_name varchar PRIMARY KEY,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint
);

Alternatively, you can declare the primary key consisting of only one column in the same way as you declare a compound primary key. Do not use a counter column for a key.
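
For comparison, the same users table could be sketched with its single-column primary key declared in the compound style, as a separate PRIMARY KEY clause at the end of the column list:

```cql
CREATE TABLE users (
  user_name varchar,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint,
  PRIMARY KEY (user_name)
);
```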

Using a compound primary key 

A compound primary key consists of more than one column. Cassandra treats the first column declared in the PRIMARY KEY definition as the partition key; the remaining columns are clustering columns. To create a compound primary key, use the keywords PRIMARY KEY followed by a comma-separated list of column names enclosed in parentheses.

CREATE TABLE emp (
  empID int,
  deptID int,
  first_name varchar,
  last_name varchar,
  PRIMARY KEY (empID, deptID)
);
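
A minimal sketch of reading from the emp table, assuming one illustrative row (the employee values are made up). Restricting empID selects the partition; adding deptID selects a single row within it.

```cql
INSERT INTO emp (empID, deptID, first_name, last_name)
VALUES (104, 15, 'jane', 'smith');

-- All rows in the partition for employee 104.
SELECT * FROM emp WHERE empID = 104;

-- One specific row: partition key plus clustering column.
SELECT * FROM emp WHERE empID = 104 AND deptID = 15;
```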

cassandra – create table

Define a new table.

Synopsis 

CREATE TABLE keyspace_name.table_name 
( column_definition, column_definition, ...)
WITH property AND property ...

column_definition is:

column_name cql_type
| column_name cql_type PRIMARY KEY
| PRIMARY KEY ( partition_key )
| column_name collection_type

cql_type is any type other than a collection or a counter type; the CQL data types reference lists the available types. Exceptions: ADD supports a collection type and also, if the table is a counter table, a counter type.

partition_key is:

column_name
| ( column_name1, column_name2, column_name3 ... )
| ((column_name1*, column_name2*), column_name3*, column_name4* ... )

column_name1 is the partition key.

column_name2, column_name3 … are clustering columns.

column_name1*, column_name2* together form the composite partition key.

column_name3*, column_name4* … are clustering columns.

collection_type is:

LIST <cql_type>
| SET <cql_type>
| MAP <cql_type, cql_type>

property is one of the CQL table properties, with string values enclosed in single quotation marks, or one of these directives:

  • COMPACT STORAGE
  • CLUSTERING ORDER followed by the clustering order specification.
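
To make the synopsis concrete, here is a hedged sketch of a complete statement that combines a compound primary key, the CLUSTERING ORDER directive, and one string-valued table property. The keyspace demo_ks (assumed to exist already), the table name, and the columns are hypothetical.

```cql
CREATE TABLE demo_ks.sensor_readings (
  sensor_id uuid,
  reading_time timestamp,
  reading double,
  PRIMARY KEY (sensor_id, reading_time)
)
WITH CLUSTERING ORDER BY (reading_time DESC)
 AND comment = 'Newest readings first within each sensor partition';
```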

cassandra – writes

About writes

To manage and access data in Cassandra, it is important to understand how Cassandra writes and reads data, the hinted handoff feature, and the areas of conformance and non-conformance to the ACID (atomic, consistent, isolated, durable) database properties. In Cassandra, consistency refers to how up-to-date and synchronized a row of data is on all of its replicas.

Cassandra includes client utilities and application programming interfaces (APIs) for developing applications for data storage and retrieval.

cassandra – managing data

Managing data

Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database that uses a B-tree. The storage engine writes sequentially to disk in append mode and stores data contiguously. Operations are parallel across nodes and within an individual machine. Because Cassandra does not use a B-tree, concurrency control is unnecessary: existing data does not need to be updated in place when writing.
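
One consequence of append-style writes is visible at the CQL level: a write does not read the existing row first, so writing the same primary key twice simply records a second, newer value, and reads return the value with the latest write timestamp. The sketch below assumes the users table keyed by user_name; the user and state values are made up.

```cql
-- Both statements are plain writes; neither checks whether the row already exists.
INSERT INTO users (user_name, state) VALUES ('jsmith', 'TX');
INSERT INTO users (user_name, state) VALUES ('jsmith', 'CA');

-- WRITETIME shows the timestamp of the write that supplied the current value.
SELECT state, WRITETIME(state) FROM users WHERE user_name = 'jsmith';
```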

Cassandra accommodates modern solid-state disks (SSDs) extremely well. Inexpensive, consumer SSDs are fine for use with Cassandra because Cassandra minimizes wear and tear on an SSD. The disk I/O performed by Cassandra is minimal.

cassandra – atomicity and isolation at row-level

Cassandra supports atomicity and isolation at the row level, but trades transactional isolation and atomicity for high availability and fast write performance. Cassandra writes are durable.
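
As an illustration of the row-level guarantee, a single statement that changes several columns of one row is treated as one write, so a reader of that row should see either both new values or neither. The sketch assumes the users table keyed by user_name; the values are hypothetical.

```cql
-- One write to one row: password and session_token change together.
UPDATE users
   SET password = 'n3wpass', session_token = 'f3ab9c'
 WHERE user_name = 'jsmith';
```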

Cassandra – Data model – Basic Goals

These are the two high-level goals for your data model:

  1. Spread data evenly around the cluster
  2. Minimize the number of partitions read

Rule 1: Spread Data Evenly Around the Cluster

You want every node in the cluster to have roughly the same amount of data. Cassandra makes this easy, but it’s not a given. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key. I’ll explain how to do this in a bit.
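
To see the hash that drives placement, the CQL token function can be applied to the partition key. A small sketch, assuming the emp table (partition key empID) from the compound primary key notes:

```cql
-- Each row's token is computed from its partition key;
-- rows with the same empID always share a token, and therefore a partition.
SELECT empID, deptID, token(empID) FROM emp LIMIT 10;
```

Keys whose tokens are spread widely across the token range end up spread across nodes, which is what an even distribution relies on.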

Rule 2: Minimize the Number of Partitions Read

Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.

Why is this important? Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.
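
The difference can be sketched with two reads against the emp table (partition key empID); the employee ids are made up. The first query touches one partition, while the second names three partition keys and so may require the coordinator to contact several nodes.

```cql
-- Reads a single partition.
SELECT * FROM emp WHERE empID = 104;

-- Reads three partitions, which may live on three different nodes.
SELECT * FROM emp WHERE empID IN (104, 105, 106);
```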

Conflicting Rules?

If it’s good to minimize the number of partitions that you read from, why not put everything in a single big partition? You would end up violating Rule #1, which is to spread data evenly around the cluster.

The point is, these two goals often conflict, so you’ll need to try to balance them.

cassandra – more writes, data duplication and denormalization

The number of writes can be higher

Writes in Cassandra aren’t free, but they’re awfully cheap.

Cassandra is optimized for high write throughput, and almost all writes are equally efficient [1].

If you can perform extra writes to improve the efficiency of your read queries, it’s almost always a good tradeoff.

Reads tend to be more expensive and are much more difficult to tune.

Denormalization and Data Duplication are Normal

Denormalization and duplication of data is a fact of life with Cassandra. Don’t be afraid of it. Disk space is generally the cheapest resource (compared to CPU, memory, disk IOPS, or network), and Cassandra is architected around that fact. In order to get the most efficient reads, you often need to duplicate data.
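
A minimal sketch of what this duplication can look like, assuming a hypothetical requirement to look users up both by username and by email. The table names, columns, and values are illustrative only. Each lookup gets its own table, and every new user is written twice so that each read hits exactly one partition.

```cql
CREATE TABLE users_by_username (
  username text PRIMARY KEY,
  email text,
  age int
);

CREATE TABLE users_by_email (
  email text PRIMARY KEY,
  username text,
  age int
);

-- Two cheap writes, one per table.
INSERT INTO users_by_username (username, email, age)
VALUES ('jdoe', 'jdoe@example.com', 34);
INSERT INTO users_by_email (email, username, age)
VALUES ('jdoe@example.com', 'jdoe', 34);

-- Each read is served from a single partition.
SELECT * FROM users_by_username WHERE username = 'jdoe';
SELECT * FROM users_by_email WHERE email = 'jdoe@example.com';
```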