cassandra – yaml & cassandra-rackdc.properties – Initializing a multiple node cluster (single datacenter)

Prerequisites

Each node must be correctly configured before starting the cluster. Before you begin, determine the name of the cluster, the IP address of each node, which nodes will serve as seed nodes, and the snitch the cluster will use.

This example describes installing a six-node cluster spanning two racks in a single datacenter. Each node is configured to use the GossipingPropertyFileSnitch and 256 virtual nodes (vnodes).
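A minimal sketch of the two files named in the title, assuming the datacenter is called DC1 and the racks RAC1 and RAC2 (the names are assumptions, not part of the original procedure):

  # cassandra.yaml (excerpt) -- identical on every node in this example
  num_tokens: 256
  endpoint_snitch: GossipingPropertyFileSnitch

  # cassandra-rackdc.properties -- set per node; the rack depends on placement
  dc=DC1
  rack=RAC1

Each node reads its own cassandra-rackdc.properties, so nodes in the second rack would carry rack=RAC2 instead.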

Procedure

  1. Suppose you install Cassandra on these nodes:
    node0 110.82.155.0 (seed1)
    node1 110.82.155.1
    node2 110.82.155.2
    node3 110.82.156.3 (seed2)
    node4 110.82.156.4
    node5 110.82.156.5
    Note: It is a best practice to have more than one seed node per datacenter. (A sample seed_provider entry for these two seeds appears after this procedure.)
  2. If you have a firewall running in your cluster, you must open certain ports for communication between the nodes. See Configuring firewall port access. (A hypothetical iptables sketch appears after this procedure.)
  3. If Cassandra is running, you must stop the server and clear the data:

    Doing this removes the default cluster_name (Test Cluster) from the system table. All nodes must use the same cluster name.

    Package installations:

    1. Stop Cassandra:
      $ sudo service cassandra stop
    2. Clear the data:
      $ sudo rm -rf /var/lib/cassandra/data/system/*
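With the data cleared, set the cluster-level properties in each node's cassandra.yaml. A sketch for node0, reusing the addresses from step 1 (the cluster name is an assumption; any name works as long as every node uses the same one):

  cluster_name: 'MyCassandraCluster'
  listen_address: 110.82.155.0
  seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
        - seeds: "110.82.155.0,110.82.156.3"

Every node lists the same two seeds; only listen_address changes from node to node.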
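If the firewall from step 2 is iptables, a hypothetical rule set assuming Cassandra's default ports (an illustration only; see Configuring firewall port access for the authoritative list):

  $ sudo iptables -A INPUT -p tcp --dport 7000 -j ACCEPT   # inter-node communication
  $ sudo iptables -A INPUT -p tcp --dport 7001 -j ACCEPT   # TLS inter-node, if enabled
  $ sudo iptables -A INPUT -p tcp --dport 7199 -j ACCEPT   # JMX monitoring
  $ sudo iptables -A INPUT -p tcp --dport 9042 -j ACCEPT   # CQL native transport clients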
      
      

cassandra – yaml – quick start properties – cluster name

cluster_name 
(Default: Test Cluster) The name of the cluster. This setting prevents nodes in one logical cluster from joining another. All nodes in a cluster must have the same value.
listen_address 
(Default: localhost) The IP address or hostname that Cassandra binds to for connecting to other Cassandra nodes. Set this parameter or listen_interface, not both. You must change the default setting for multiple nodes to communicate (see the sketch after these properties):

  • Generally, leave it empty: if the node is properly configured (host name, name resolution, and so on), Cassandra uses InetAddress.getLocalHost() to get the local address from the system.
  • For a single node cluster, you can use the default setting (localhost).
  • If Cassandra can’t find the correct address, you must specify the IP address or host name.
  • Never specify 0.0.0.0; it is always wrong.
listen_interface 
(Default: eth0) The interface that Cassandra binds to for connecting to other Cassandra nodes. Interfaces must correspond to a single address; IP aliasing is not supported. See listen_address.
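A quick-start sketch combining these properties, reusing an address from the procedure above (the cluster name is an assumption). Only one of the two listen settings may be active:

  cluster_name: 'MyCassandraCluster'
  listen_address: 110.82.155.1
  # listen_interface: eth0   # alternative to listen_address; set one, not both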

cassandra – yaml – conf properties groups

  • Quick start

    The minimal properties needed for configuring a cluster.

  • Commonly used

    Properties most frequently used when configuring Cassandra.

  • Performance tuning

    Tuning performance and system resource utilization, including commit log, compaction, memory, disk I/O, CPU, reads, and writes.

  • Advanced

    Properties for advanced users or properties that are less commonly used.

  • Security

    Server and client security settings.

cassandra – partitioned row store db

Cassandra is a partitioned row store database, where rows are organized into tables with a required primary key.
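Because every table requires a primary key, even a minimal schema declares one. A hypothetical CQL sketch (the keyspace, table, and column names are all assumptions):

  CREATE KEYSPACE app_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

  CREATE TABLE app_ks.users (
    user_id uuid,   -- required primary key: determines which nodes store the row
    email   text,
    PRIMARY KEY (user_id)
  );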

Cassandra’s architecture allows any authorized user to connect to any node in any datacenter and access data using CQL (the Cassandra Query Language).

Typically, a cluster has one keyspace per application composed of many different tables.

cassandra – commit log, memtable, sstables

A sequentially written commit log on each node captures write activity to ensure data durability.

Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache.

Each time the memory structure is full, the data is written to disk in an SSTable data file.

All writes are automatically partitioned and replicated throughout the cluster.
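The flush from memtable to SSTable normally happens automatically once the memtable fills, but it can also be forced with nodetool, which makes the write path visible on disk (a sketch, not part of the original text):

  $ nodetool flush   # writes the current memtables to disk as SSTables

The resulting data files appear under the node's data directory, for example /var/lib/cassandra/data in the package installation above.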

Spark – locations to configure system

Spark provides three locations to configure the system:

  • Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
  • Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node (see the sketch after this list).
  • Logging can be configured through log4j.properties.
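For example, the per-machine IP address mentioned in the second bullet can be set in conf/spark-env.sh (the address itself is an assumption):

  # conf/spark-env.sh -- read on each machine when Spark programs run
  SPARK_LOCAL_IP=110.82.155.1   # IP address for Spark to bind to on this machine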