NoSQL – Cassandra, HBase
cassandra – table properties
Setting a table property
Using the optional WITH clause and keyword arguments, you can configure caching, compaction, and a number of other operations that Cassandra performs on a new table. The WITH clause accepts any of the properties listed in CQL table properties. Enclose a string property in single quotation marks. For example:
CREATE TABLE MonkeyTypes (
  block_id uuid,
  species text,
  alias text,
  population varint,
  PRIMARY KEY (block_id)
)
WITH comment = 'Important biological records'
AND read_repair_chance = 1.0;

CREATE TABLE DogTypes (
  block_id uuid,
  species text,
  alias text,
  population varint,
  PRIMARY KEY (block_id)
)
WITH compression =
  { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
AND compaction =
  { 'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 6 };
cassandra – column type as collection
Defining a column
You assign columns a type during table creation. Column types, other than collection-type columns, are specified as a parenthesized, comma-separated list of column name and type pairs.
This example shows how to create a table that includes collection-type columns: map, set, and list.
CREATE TABLE users (
  userid text PRIMARY KEY,
  first_name text,
  last_name text,
  emails set<text>,
  top_scores list<int>,
  todo map<timestamp, text>
);
cassandra – composite partition key
Using a composite partition key
A composite partition key is a partition key consisting of multiple columns. You use an extra set of parentheses to enclose columns that make up the composite partition key. The columns within the primary key definition but outside the nested parentheses are clustering columns. These columns form logical sets inside a partition to facilitate retrieval.
CREATE TABLE Cats (
  block_id uuid,
  breed text,
  color text,
  short_hair boolean,
  PRIMARY KEY ((block_id, breed), color, short_hair)
);
In this example, the composite partition key consists of block_id and breed. The clustering columns, color and short_hair, determine the clustering order of the data. Generally, Cassandra will store rows having the same block_id but a different breed on different nodes, and rows having the same block_id and breed on the same node.
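The placement rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the four node names are hypothetical, MD5 stands in for Cassandra's real partitioner, and a plain modulo replaces token ranges. What it does show is that every component of the composite partition key feeds the hash, so rows sharing both block_id and breed always land together.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical 4-node cluster


def node_for(partition_key: tuple) -> str:
    """Pick a node by hashing all components of the partition key together.

    Assumption: MD5 plus modulo is a stand-in for the real partitioner.
    """
    raw = "|".join(str(part) for part in partition_key).encode("utf-8")
    token = int.from_bytes(hashlib.md5(raw).digest()[:8], "big")
    return NODES[token % len(NODES)]


# Rows of the Cats table as (block_id, breed, color, short_hair).
rows = [
    ("b1", "siamese", "cream", True),
    ("b1", "siamese", "seal", True),
    ("b1", "persian", "white", False),
]

# Group rows by composite partition key (block_id, breed).
placement = {}
for block_id, breed, color, short_hair in rows:
    placement.setdefault((block_id, breed), []).append((color, short_hair))

for key, clustered in placement.items():
    # Inside a partition, rows are ordered by the clustering columns.
    print(key, "->", node_for(key), sorted(clustered))
```

The two siamese rows share one partition and therefore one node. The persian row belongs to a different partition, which may or may not hash to the same node in this toy model.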
cassandra – composite primary key
Defining a primary key column
The only schema information that must be defined for a table is the primary key and its associated data type. Unlike earlier versions, CQL 3 does not require a column in the table that is not part of the primary key. A primary key can have any number (1 or more) of component columns.
If the primary key consists of only one column, you can use the keywords PRIMARY KEY after the column definition:
CREATE TABLE users (
  user_name varchar PRIMARY KEY,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint
);
Alternatively, you can declare the primary key consisting of only one column in the same way as you declare a compound primary key. Do not use a counter column for a key.
Using a compound primary key
A compound primary key consists of more than one column. Cassandra treats the first column declared in the PRIMARY KEY definition as the partition key and the remaining columns as clustering columns. To create a compound primary key, use the keywords PRIMARY KEY followed by a comma-separated list of column names enclosed in parentheses.
CREATE TABLE emp (
  empID int,
  deptID int,
  first_name varchar,
  last_name varchar,
  PRIMARY KEY (empID, deptID)
);
cassandra – create table
Define a new table.
Synopsis
CREATE TABLE keyspace_name.table_name ( column_definition, column_definition, ...) WITH property AND property ...
column_definition is:
column_name cql_type | column_name cql_type PRIMARY KEY | PRIMARY KEY ( partition_key ) | column_name collection_type
cql_type is a type, other than a collection or a counter type. CQL data types lists the types. Exceptions: ADD supports a collection type and also, if the table is a counter, a counter type.
partition_key is:
column_name
| ( column_name1, column_name2, column_name3 ... )
| ( ( column_name1*, column_name2* ), column_name3*, column_name4* ... )
In the second form, column_name1 is the partition key, and column_name2, column_name3 ... are clustering columns.
In the third form, column_name1* and column_name2* together form the composite partition key, and column_name3*, column_name4* ... are clustering columns.
collection_type is:
LIST <cql_type> | SET <cql_type> | MAP <cql_type, cql_type>
property is one of the CQL table properties, with string values enclosed in single quotation marks, or one of these directives:
- COMPACT STORAGE
- CLUSTERING ORDER followed by the clustering order specification.
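To make the three partition_key forms in the synopsis concrete, here is a small Python sketch that splits a PRIMARY KEY specification into partition columns and clustering columns. The function name split_primary_key and the tuple encoding of the parentheses are illustrative assumptions, not part of CQL or of any driver.

```python
def split_primary_key(primary_key):
    """Return (partition_columns, clustering_columns) for a PRIMARY KEY spec.

    Hypothetical encoding that mirrors the CQL parentheses:
    - a plain string is a single-column primary key;
    - a tuple lists the key columns in order;
    - a nested tuple in first position is a composite partition key.
    """
    if isinstance(primary_key, str):
        return [primary_key], []
    first, *rest = primary_key
    partition = list(first) if isinstance(first, tuple) else [first]
    return partition, list(rest)


# PRIMARY KEY (user_name)
print(split_primary_key("user_name"))
# PRIMARY KEY (empID, deptID)
print(split_primary_key(("empID", "deptID")))
# PRIMARY KEY ((block_id, breed), color, short_hair)
print(split_primary_key((("block_id", "breed"), "color", "short_hair")))
```

The nested tuple plays the role of the extra set of parentheses: everything inside it is hashed to choose the partition, and everything after it orders rows within that partition.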
cassandra – how to choose correct partition key
A good partition column is:
- a functional identifier
- high cardinality (many distinct values)
Role of clustering column(s):
- simulate a 1-N relationship
- sort data (logically and on disk)
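The two roles of clustering columns listed above can be sketched with a toy in-memory table in Python. The ToyTable class and the user/post data are hypothetical, and this is not how Cassandra stores data internally. It only shows one partition key owning many rows (the 1-N relationship) that stay sorted by clustering key.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps the partition ordered, so reads need no extra sort.
        bisect.insort(self._partitions[partition_key], (clustering_key, value))

    def read_partition(self, partition_key):
        return list(self._partitions.get(partition_key, []))


posts = ToyTable()
# One user (partition key) has many posts (clustering key = day number).
posts.insert("alice", 3, "third post")
posts.insert("alice", 1, "first post")
posts.insert("alice", 2, "second post")
posts.insert("bob", 1, "hello")

print(posts.read_partition("alice"))
```

Rows come back ordered by the clustering key regardless of insertion order, which is why range queries within a single partition are cheap.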
cassandra – managing data
Managing data
Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database that uses a B-tree. The storage engine writes sequentially to disk in append mode and stores data contiguously. Operations run in parallel across nodes and within an individual machine. Because Cassandra does not update data in place as a B-tree does, the write path needs no concurrency control: nothing existing has to be updated when writing.
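As a rough illustration of the append-only idea, the following Python sketch models a tiny log-structured store. The TinyLSM class, its memtable_limit parameter, and the flush policy are assumptions made for the example. The real storage engine also has a commit log, compaction, and timestamps, none of which are modeled here.

```python
class TinyLSM:
    """Minimal log-structured store: writes never modify flushed data."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # recent writes, held in memory
        self.sstables = []          # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Append a new sorted, immutable run; older runs are left untouched.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None


db = TinyLSM()
db.write("a", 1)
db.write("b", 2)   # reaches the limit and flushes the first run
db.write("a", 3)   # newer version of "a"; the flushed run is not modified
print(db.read("a"), db.sstables)
```

The old value of "a" still sits in the first run. The write only added a newer version, and the read path resolves which version wins.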
Cassandra accommodates modern solid-state disks (SSDs) extremely well. Inexpensive, consumer SSDs are fine for use with Cassandra because Cassandra minimizes wear and tear on an SSD. The disk I/O performed by Cassandra is minimal.
- Throughput and latency
Throughput and latency are key factors affecting Cassandra performance in managing data on disk.
- Separate table directories
Cassandra provides fine-grained control of table storage on disk.
cassandra – atomicity and isolation at row-level
Cassandra supports atomicity and isolation at the row level, but trades transactional isolation and atomicity for high availability and fast write performance. Cassandra writes are durable.
- Atomicity
A brief description about atomicity in Cassandra.
- Tunable consistency
Cassandra can be tuned to give you strong consistency in the CAP sense where data is made consistent across all the nodes in a distributed database cluster.
- Isolation
About row-level isolation in Cassandra.
- Durability
About durable writes in Cassandra.
Cassandra – Data model – Basic Goals
These are the two high-level goals for your data model:
- Spread data evenly around the cluster
- Minimize the number of partitions read
Rule 1: Spread Data Evenly Around the Cluster
You want every node in the cluster to have roughly the same amount of data. Cassandra makes this easy, but it’s not a given. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key. I’ll explain how to do this in a bit.
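To see why the choice of partition key matters for Rule 1, here is a small Python simulation. It assumes a hypothetical 4-node cluster and uses MD5 with a modulo as a stand-in for the real partitioner. The user_id and country columns are made up for the example.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(partition_key) -> int:
    """Map a partition key to a node index (MD5 + modulo is an assumption)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def rows_per_node(partition_keys):
    """Count how many rows each node would hold for the given keys."""
    return Counter(node_for(key) for key in partition_keys)


# 1000 rows with a high-cardinality column and a low-cardinality column.
rows = [{"user_id": i, "country": ["FR", "US", "JP"][i % 3]} for i in range(1000)]

high = rows_per_node(r["user_id"] for r in rows)   # 1000 distinct keys
low = rows_per_node(r["country"] for r in rows)    # only 3 distinct keys

print("partitioned by user_id:", sorted(high.items()))
print("partitioned by country:", sorted(low.items()))
```

With only three distinct country values there are only three partitions, so at most three of the four nodes can hold any data. The 1000 distinct user_id values give the hash enough keys to spread rows across every node.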
Rule 2: Minimize the Number of Partitions Read
Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.
Why is this important? Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.
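Rule 2 can be illustrated by counting how many partitions the same query touches under two table designs. The Python sketch below uses made-up post data and a hypothetical helper, partitions_read. It only counts distinct partition key values and does not model nodes or the coordinator.

```python
# Ten hypothetical posts: even ids belong to alice, odd ids to bob.
posts = [{"post_id": i, "user": "alice" if i % 2 == 0 else "bob"} for i in range(10)]


def partitions_read(rows, partition_column, predicate):
    """Return the set of partitions that hold at least one matching row."""
    return {row[partition_column] for row in rows if predicate(row)}


def wants_alice(row):
    return row["user"] == "alice"


# Design 1: partition by user, so all of alice's posts share one partition.
by_user = partitions_read(posts, "user", wants_alice)
# Design 2: partition by post_id, so every post is its own partition.
by_post = partitions_read(posts, "post_id", wants_alice)

print(len(by_user), len(by_post))  # prints "1 5"
```

Fetching alice's posts reads one partition in the first design and five in the second. Each extra partition is potentially another node for the coordinator to contact.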
Conflicting Rules?
If it’s good to minimize the number of partitions that you read from, why not put everything in a single big partition? You would end up violating Rule #1, which is to spread data evenly around the cluster.
The point is, these two goals often conflict, so you’ll need to try to balance them.