cassandra – yaml & cassandra-rackdc.properties – Initializing a multiple node cluster (single datacenter)

Prerequisites

Each node must be correctly configured before starting the cluster. Before you begin, determine the name of the cluster, the IP address of each node, which nodes will serve as seed nodes, and the snitch the cluster will use.

This example describes installing a six-node cluster spanning two racks in a single datacenter. Each node is configured to use the GossipingPropertyFileSnitch and 256 virtual nodes (vnodes).
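A minimal sketch of the two files named in the title, assuming the datacenter is called DC1 and the racks RAC1 and RAC2 (the names are assumptions, not part of the original procedure):

  # cassandra.yaml (excerpt) -- identical on every node in this example
  num_tokens: 256
  endpoint_snitch: GossipingPropertyFileSnitch

  # cassandra-rackdc.properties -- set per node; the rack depends on placement
  dc=DC1
  rack=RAC1

Each node reads its own cassandra-rackdc.properties, so nodes in the second rack would carry rack=RAC2 instead.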

Procedure

  1. Suppose you install Cassandra on these nodes:
    node0 110.82.155.0 (seed1)
    node1 110.82.155.1
    node2 110.82.155.2
    node3 110.82.156.3 (seed2)
    node4 110.82.156.4
    node5 110.82.156.5
    Note: It is a best practice to have more than one seed node per datacenter. (A sample seed_provider entry for these two seeds appears after this procedure.)
  2. If you have a firewall running in your cluster, you must open certain ports for communication between the nodes. See Configuring firewall port access. (A hypothetical iptables sketch appears after this procedure.)
  3. If Cassandra is running, you must stop the server and clear the data:

    Doing this removes the default cluster_name (Test Cluster) from the system table. All nodes must use the same cluster name.

    Package installations:

    1. Stop Cassandra:
      $ sudo service cassandra stop
    2. Clear the data:
      $ sudo rm -rf /var/lib/cassandra/data/system/*
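With the data cleared, set the cluster-level properties in each node's cassandra.yaml. A sketch for node0, reusing the addresses from step 1 (the cluster name is an assumption; any name works as long as every node uses the same one):

  cluster_name: 'MyCassandraCluster'
  listen_address: 110.82.155.0
  seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
        - seeds: "110.82.155.0,110.82.156.3"

Every node lists the same two seeds; only listen_address changes from node to node.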
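If the firewall from step 2 is iptables, a hypothetical rule set assuming Cassandra's default ports (an illustration only; see Configuring firewall port access for the authoritative list):

  $ sudo iptables -A INPUT -p tcp --dport 7000 -j ACCEPT   # inter-node communication
  $ sudo iptables -A INPUT -p tcp --dport 7001 -j ACCEPT   # TLS inter-node, if enabled
  $ sudo iptables -A INPUT -p tcp --dport 7199 -j ACCEPT   # JMX monitoring
  $ sudo iptables -A INPUT -p tcp --dport 9042 -j ACCEPT   # CQL native transport clients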
      
      

cassandra – yaml – quick start properties – cluster name

cluster_name 
(Default: Test Cluster) The name of the cluster. This setting prevents nodes in one logical cluster from joining another. All nodes in a cluster must have the same value.
listen_address 
(Default: localhost) The IP address or hostname that Cassandra binds to for connecting to other Cassandra nodes. Set this parameter or listen_interface, not both. You must change the default setting for multiple nodes to communicate (see the sketch after these properties):

  • Generally, leave it empty: if the node is properly configured (host name, name resolution, and so on), Cassandra uses InetAddress.getLocalHost() to get the local address from the system.
  • For a single node cluster, you can use the default setting (localhost).
  • If Cassandra can’t find the correct address, you must specify the IP address or host name.
  • Never specify 0.0.0.0; it is always wrong.
listen_interface 
(Default: eth0) The interface that Cassandra binds to for connecting to other Cassandra nodes. Interfaces must correspond to a single address; IP aliasing is not supported. See listen_address.
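A quick-start sketch combining these properties, reusing an address from the procedure above (the cluster name is an assumption). Only one of the two listen settings may be active:

  cluster_name: 'MyCassandraCluster'
  listen_address: 110.82.155.1
  # listen_interface: eth0   # alternative to listen_address; set one, not both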

cassandra – yaml – conf properties groups

  • Quick start

    The minimal properties needed for configuring a cluster.

  • Commonly used

    Properties most frequently used when configuring Cassandra.

  • Performance tuning

    Tuning performance and system resource utilization, including commit log, compaction, memory, disk I/O, CPU, reads, and writes.

  • Advanced

    Properties for advanced users or properties that are less commonly used.

  • Security

    Server and client security settings.

cassandra – partitioned row store db

Cassandra is a partitioned row store database, where rows are organized into tables with a required primary key.
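Because every table requires a primary key, even a minimal schema declares one. A hypothetical CQL sketch (the keyspace, table, and column names are all assumptions):

  CREATE KEYSPACE app_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

  CREATE TABLE app_ks.users (
    user_id uuid,   -- required primary key: determines which nodes store the row
    email   text,
    PRIMARY KEY (user_id)
  );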

Cassandra’s architecture allows any authorized user to connect to any node in any datacenter and access data using CQL (the Cassandra Query Language).

Typically, a cluster has one keyspace per application composed of many different tables.

cassandra – commit log, memtable, sstables

A sequentially written commit log on each node captures write activity to ensure data durability.

Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache.

Each time the memory structure is full, the data is written to disk in an SSTable data file.

All writes are automatically partitioned and replicated throughout the cluster.
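The flush from memtable to SSTable normally happens automatically once the memtable fills, but it can also be forced with nodetool, which makes the write path visible on disk (a sketch, not part of the original text):

  $ nodetool flush   # writes the current memtables to disk as SSTables

The resulting data files appear under the node's data directory, for example /var/lib/cassandra/data in the package installation above.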

Spark – locations to configure system

Spark provides three locations to configure the system:

  • Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
  • Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node (see the sketch after this list).
  • Logging can be configured through log4j.properties.
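For example, the per-machine IP address mentioned in the second bullet can be set in conf/spark-env.sh (the address itself is an assumption):

  # conf/spark-env.sh -- read on each machine when Spark programs run
  SPARK_LOCAL_IP=110.82.155.1   # IP address for Spark to bind to on this machine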