cassandra – counter type

November 18, 2016 corerootz - Ravi Kiran Krovvidi NoSQL - Cassandra, HBase cassandra counter type

To use counter types, see the DataStax blog post about counters and the "Using a counter" documentation page. Do not assign the counter type to a column that is part of the primary key. Also, do not use the counter type in a table that contains anything other than counter columns and the primary key columns. To generate unique values for surrogate keys, use the timeuuid type instead of the counter type. You cannot create an index on a counter column.

cassandra – table properties

Setting a table property

Using the optional WITH clause and keyword arguments, you can configure caching, compaction, and a number of other operations that Cassandra performs on a new table. The WITH clause accepts any of the properties listed under CQL table properties. Enclose a string property in single quotation marks. For example:

CREATE TABLE MonkeyTypes (
  block_id uuid,
  species text,
  alias text,
  population varint,
  PRIMARY KEY (block_id)
)
WITH comment='Important biological records'
AND read_repair_chance = 1.0;

CREATE TABLE DogTypes (
  block_id uuid,
  species text,
  alias text,
  population varint,
  PRIMARY KEY (block_id)
) WITH compression =
    { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
  AND compaction =
    { 'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 6 };

cassandra – column type as collection

Defining a column

You assign columns a type during table creation. Column types, other than collection-type columns, are specified as a parenthesized, comma-separated list of column name and type pairs.

This example shows how to create a table that includes collection-type columns: map, set, and list.

CREATE TABLE users (
  userid text PRIMARY KEY,
  first_name text,
  last_name text,
  emails set<text>,
  top_scores list<int>,
  todo map<timestamp, text>
);

cassandra – composite partition key

Using a composite partition key

A composite partition key is a partition key consisting of multiple columns. You use an extra set of parentheses to enclose columns that make up the composite partition key. The columns within the primary key definition but outside the nested parentheses are clustering columns. These columns form logical sets inside a partition to facilitate retrieval.

CREATE TABLE Cats (
  block_id uuid,
  breed text,
  color text,
  short_hair boolean,
  PRIMARY KEY ((block_id, breed), color, short_hair)
);

In this example, the composite partition key consists of block_id and breed. The clustering columns, color and short_hair, determine the clustering order of the data within each partition. Generally, Cassandra will store rows having the same block_id but a different breed on different nodes, and rows having the same block_id and breed on the same node.
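The placement rule above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `NODES`, `node_for`, and the sample rows are hypothetical, SHA-256 stands in for Cassandra's partitioner, and a simple modulo stands in for the token ring. The point it illustrates is that both partition key columns are hashed together, so only rows that agree on block_id and breed are guaranteed to land in the same place.

```python
import hashlib

NODES = 4  # hypothetical cluster size, for illustration only


def node_for(block_id: str, breed: str) -> int:
    """Map a composite partition key to a node index.

    Simplified stand-in: a real cluster hashes the key to a token
    and looks the token up on a ring; here we just take a modulo.
    """
    key = f"{block_id}:{breed}".encode("utf-8")
    token = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return token % NODES


# (block_id, breed, color, short_hair), mirroring the Cats table
rows = [
    ("b1", "siamese", "cream", True),
    ("b1", "siamese", "seal", True),
    ("b1", "persian", "white", False),
]

# Group rows by the node their partition key maps to.
placement = {}
for block_id, breed, color, short_hair in rows:
    placement.setdefault(node_for(block_id, breed), []).append((breed, color))

for node, stored in sorted(placement.items()):
    print(node, stored)
```

The two siamese rows always end up together because they share the full partition key. The persian row may or may not share their node; nothing ties it to them.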

cassandra – composite primary key

Defining a primary key column

The only schema information that must be defined for a table is the primary key and its associated data type. Unlike earlier versions, CQL 3 does not require a column in the table that is not part of the primary key. A primary key can have any number (1 or more) of component columns.

If the primary key consists of only one column, you can use the keywords, PRIMARY KEY, after the column definition:

CREATE TABLE users (
  user_name varchar PRIMARY KEY,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint
);

Alternatively, you can declare the primary key consisting of only one column in the same way as you declare a compound primary key. Do not use a counter column for a key.

Using a compound primary key

A compound primary key consists of more than one column. Cassandra treats the first column listed in the PRIMARY KEY definition as the partition key and the remaining columns as clustering columns. To create a compound primary key, use the keywords PRIMARY KEY followed by a comma-separated list of column names enclosed in parentheses.

CREATE TABLE emp (
  empID int,
  deptID int,
  first_name varchar,
  last_name varchar,
  PRIMARY KEY (empID, deptID)
);

cassandra – create table

Define a new table.

Synopsis

CREATE TABLE keyspace_name.table_name 
( column_definition, column_definition, ...)
WITH property AND property ...

column_definition is:

column_name cql_type
| column_name cql_type PRIMARY KEY
| PRIMARY KEY ( partition_key )
| column_name collection_type

cql_type is a type other than a collection or a counter type. The CQL data types reference lists the types. Exceptions: the ADD clause of ALTER TABLE supports a collection type and also, if the table is a counter table, a counter type.

partition_key is:

column_name
| ( column_name1, column_name2, column_name3 ... )
| ( (column_name1*, column_name2*), column_name3*, column_name4* ... )

column_name1 is the partition key.

column_name2, column_name3 … are clustering columns.

column_name1*, column_name2* together form the composite partition key.

column_name3*, column_name4* … are clustering columns.

collection_type is:

LIST <cql_type>
| SET <cql_type>
| MAP <cql_type, cql_type>

property is one of the CQL table properties, with string values enclosed in single quotation marks, or one of these directives:

COMPACT STORAGE
CLUSTERING ORDER followed by the clustering order specification.

cassandra – how to choose correct partition key

A good partition column is:

a functional identifier
high cardinality (many distinct values)

Role of the clustering column(s):

simulate a 1-N relationship
sort data (logically and on disk)
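As a rough illustration of those two roles, the Python sketch below models a partition as a list of rows kept sorted by a clustering value. Everything here is hypothetical (the `partitions` structure, the `insert` and `range_query` helpers, the sample user activity); it is a toy model of the idea, not how Cassandra lays out data internally.

```python
import bisect
from collections import defaultdict

# partition key -> list of (clustering value, payload), kept sorted
partitions = defaultdict(list)


def insert(partition_key, clustering_key, payload):
    """Add a row to its partition, keeping rows ordered by clustering key."""
    bisect.insort(partitions[partition_key], (clustering_key, payload))


def range_query(partition_key, start, end):
    """Return the rows of one partition whose clustering key is in [start, end]."""
    rows = partitions[partition_key]
    keys = [k for k, _ in rows]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return rows[lo:hi]


# One user (the "1" side) owns many events (the "N" side).
insert("user-1", "2016-11-18", "login")
insert("user-1", "2016-11-16", "signup")
insert("user-1", "2016-11-17", "purchase")
insert("user-2", "2016-11-18", "signup")

print(partitions["user-1"])
print(range_query("user-1", "2016-11-17", "2016-11-18"))
```

Because the rows of a partition are already in clustering order, a range over the clustering value is a contiguous slice rather than a scan of unrelated data.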

cassandra – managing data

Managing data

Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database that uses a B-Tree. The storage engine writes sequentially to disk in append mode and stores data contiguously. Operations run in parallel across nodes and within an individual machine. Because Cassandra does not use a B-tree, a write does not have to locate and update existing data on disk, so the concurrency control that in-place updates require is unnecessary.
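The append-only idea can be sketched with a toy log-structured store in Python. The `TinyLSM` class and its `memtable_limit` parameter are invented for this illustration and leave out almost everything a real engine does (commit log, compaction, tombstones, on-disk files). It only shows that writes go to an in-memory table, flushes produce immutable sorted segments, and a newer value shadows an older one instead of overwriting it.

```python
class TinyLSM:
    """Toy log-structured store: in-memory table plus immutable segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # recent writes, held in memory
        self.segments = []          # flushed, immutable, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write never touches an existing segment.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Append the sorted memtable contents as a new immutable segment.
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then segments newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None


store = TinyLSM()
store.put("a", 1)
store.put("b", 2)   # memtable is full, so it is flushed as a segment
store.put("a", 3)   # newer value for "a"; the old segment is left alone
print(store.get("a"), store.get("b"), store.segments)
```

The old value of "a" is still present in the first segment after the second write. Reads resolve the conflict by looking at newer data first, which is why writing requires no update of what is already stored.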

Cassandra accommodates modern solid-state disks (SSDs) extremely well. Inexpensive, consumer SSDs are fine for use with Cassandra because Cassandra minimizes wear and tear on an SSD. The disk I/O performed by Cassandra is minimal.

Throughput and latency: key factors affecting Cassandra performance in managing data on disk.
Separate table directories: Cassandra provides fine-grained control of table storage on disk.

cassandra – atomicity and isolation at row-level

Cassandra supports atomicity and isolation at the row level, but trades transactional isolation and atomicity across rows for high availability and fast write performance. Cassandra writes are durable.

Atomicity: a brief description of atomicity in Cassandra.
Tunable consistency: Cassandra can be tuned to give you strong consistency in the CAP sense, where data is made consistent across all the nodes in a distributed database cluster.
Isolation: about row-level isolation in Cassandra.
Durability: about durable writes in Cassandra.

Cassandra – Data model – Basic Goals

These are the two high-level goals for your data model:

Spread data evenly around the cluster
Minimize the number of partitions read

Rule 1: Spread Data Evenly Around the Cluster

You want every node in the cluster to have roughly the same amount of data. Cassandra makes this easy, but it’s not a given. Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY. So, the key to spreading data evenly is this: pick a good primary key, and in particular a good partition key.
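To see why the choice of partition key matters for balance, here is a small Python simulation. It is a sketch under assumptions: `NODES`, `node_for`, and `spread` are hypothetical names, SHA-256 with a modulo stands in for Cassandra's partitioner and token ring, and the key values are made up. It compares a key with many distinct values against one with only two.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size, for illustration only


def node_for(partition_key: str) -> int:
    """Simplified placement: hash the partition key, take it modulo NODES."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def spread(partition_keys):
    """Count how many rows each node would receive."""
    return Counter(node_for(k) for k in partition_keys)


# 1000 rows keyed by a high-cardinality value (one key per user)
high_cardinality = spread(f"user-{i}" for i in range(1000))

# 1000 rows keyed by a low-cardinality value (only two distinct keys)
low_cardinality = spread(("active", "inactive")[i % 2] for i in range(1000))

print("per-user key:", sorted(high_cardinality.items()))
print("status key:  ", sorted(low_cardinality.items()))
```

With only two distinct key values, at most two nodes can ever hold data, however many nodes the cluster has. The per-user key gives the hash enough distinct inputs to use every node.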

Rule 2: Minimize the Number of Partitions Read

Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.

Why is this important? Each partition may reside on a different node. The coordinator will generally need to issue separate commands to separate nodes for each partition you request. This adds a lot of overhead and increases the variation in latency. Furthermore, even on a single node, it’s more expensive to read from multiple partitions than from a single one due to the way rows are stored.
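A toy Python model makes the cost visible by counting how many partitions a query has to visit under two different partition key choices. The `build` and `partitions_read` helpers and the sample events are hypothetical, and the predicate scan is only a way of counting, not how a real read is executed.

```python
from collections import defaultdict


def build(rows, partition_key):
    """Group rows into partitions using the given partition key function."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return partitions


def partitions_read(partitions, predicate):
    """Count the partitions holding at least one row the query needs."""
    return sum(1 for rows in partitions.values() if any(predicate(r) for r in rows))


events = [
    {"user": "ann", "day": 1},
    {"user": "ann", "day": 2},
    {"user": "bob", "day": 1},
    {"user": "ann", "day": 3},
]

by_user = build(events, lambda r: r["user"])
by_user_day = build(events, lambda r: (r["user"], r["day"]))


def is_ann(row):
    return row["user"] == "ann"


print(partitions_read(by_user, is_ann))      # partitions visited when keyed by user
print(partitions_read(by_user_day, is_ann))  # partitions visited when keyed by (user, day)
```

Fetching everything for one user touches a single partition when the table is partitioned by user, but one partition per day when it is partitioned by user and day. The finer key spreads data more widely at the price of more partitions per read.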

Conflicting Rules?

If it’s good to minimize the number of partitions that you read from, why not put everything in a single big partition? You would end up violating Rule #1, which is to spread data evenly around the cluster.

The point is, these two goals often conflict, so you’ll need to try to balance them.
