
11/14/2019

Reading time: 5 min

Build a distributed NoSQL database with Apache Cassandra

by John Doe


Recently, I got a rush request to get a three-node Apache Cassandra cluster with a replication factor of two working for a development job. I had little idea what that meant but needed to figure it out quickly—a typical day in a sysadmin's job.

Here's how to set up a basic three-node Cassandra cluster from scratch with some extra bits for replication and future node expansion.

Basic nodes needed

To start, you need some basic Linux machines. For a production install, you would likely put physical machines into racks, data centers, and diverse locations. For development, you just need something sized for the scale of your work. I used three CentOS 7 virtual machines on VMware, each with a 20GB thin-provisioned disk, two processors, and 4GB of RAM. The three machines are CS1 (192.168.0.110), CS2 (192.168.0.120), and CS3 (192.168.0.130).

First, do a minimal install of CentOS 7 on each machine. To run this in production on CentOS, you would configure firewalld and SELinux appropriately; since this cluster is only for initial development, I turned them off.
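For reference, on a throwaway development VM the two can be switched off like this (a rough sketch of the standard CentOS 7 commands; in production you would instead open Cassandra's ports, such as 7000 for inter-node traffic and 9042 for CQL clients):

$ sudo systemctl stop firewalld && sudo systemctl disable firewalld
$ sudo setenforce 0                 # SELinux permissive until the next reboot
$ sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config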

The only other requirement is an OpenJDK 1.8 installation, which is available from the CentOS repository.
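The package name below is the stock CentOS 7 one; installing and confirming the Java version looks like this:

$ sudo yum install -y java-1.8.0-openjdk
$ java -version                     # should report openjdk version "1.8.0_..."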

Installation

Create a cass user account on each machine. To ensure no variation between nodes, force the same UID on each install:

$ useradd --create-home \
--uid 1099 cass
$ passwd cass
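If the nodes are reachable over SSH, a quick loop confirms the account looks the same everywhere; each node should report uid=1099(cass). This assumes root SSH access to each address:

$ for ip in 192.168.0.110 192.168.0.120 192.168.0.130; do ssh root@$ip id cass; done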

Download the current version of Apache Cassandra (3.11.4 as I'm writing this). Extract the Cassandra archive in the cass home directory like this:

$ tar xzvf apache-cassandra-3.11.4-bin.tar.gz

The complete software is contained in ~cass/apache-cassandra-3.11.4. For a quick development trial, this is fine. The data files are there, and the conf/ directory has the important bits needed to tune these nodes into a real cluster.
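If you would rather script the download, older releases are kept in the Apache archive; a sketch, assuming the archive URL layout is unchanged (verify the checksum against the value published alongside the release):

$ wget https://archive.apache.org/dist/cassandra/3.11.4/apache-cassandra-3.11.4-bin.tar.gz
$ sha256sum apache-cassandra-3.11.4-bin.tar.gz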

Configuration

Out of the box, Cassandra runs as a localhost one-node cluster. That is convenient for a quick look, but the goal here is a real cluster that external clients can access and that provides the option to add additional nodes when development and tests need to broaden. The two configuration files to look at are conf/cassandra.yaml and conf/cassandra-rackdc.properties.

First, edit conf/cassandra.yaml to set the cluster name, network, and remote procedure call (RPC) interfaces; define peers; and change the strategy for routing requests and replication.

Edit conf/cassandra.yaml on each of the cluster nodes.

Change the cluster name to be the same on each node: 

cluster_name: 'DevClust'

Change the following two entries to match the primary IP address of the node you are working on:

listen_address: 192.168.0.110
rpc_address: 192.168.0.110

Find the seed_provider entry and look for the - seeds: configuration line. Edit each node to include all your nodes:

        - seeds: "192.168.0.110, 192.168.0.120, 192.168.0.130"

This enables the local Cassandra instance to see all its peers (including itself).

Look for the endpoint_snitch setting and change it to:

endpoint_snitch: GossipingPropertyFileSnitch

The endpoint_snitch setting gives you flexibility later if new nodes need to join. The Cassandra documentation indicates that GossipingPropertyFileSnitch is the preferred setting for production use; it is also required for the NetworkTopologyStrategy replication strategy used below.

Save and close the cassandra.yaml file.
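If you would rather script these edits than make them by hand, substitutions against the stock 3.11.4 defaults look roughly like this (a sketch; set NODE_IP per node and double-check the resulting file, since any drift from the defaults makes a substitution silently miss):

$ NODE_IP=192.168.0.110
$ cd ~/apache-cassandra-3.11.4/conf
$ sed -i "s/^cluster_name: 'Test Cluster'/cluster_name: 'DevClust'/" cassandra.yaml
$ sed -i "s/^listen_address: localhost/listen_address: $NODE_IP/" cassandra.yaml
$ sed -i "s/^rpc_address: localhost/rpc_address: $NODE_IP/" cassandra.yaml
$ sed -i 's/- seeds: "127.0.0.1"/- seeds: "192.168.0.110, 192.168.0.120, 192.168.0.130"/' cassandra.yaml
$ sed -i 's/^endpoint_snitch: SimpleSnitch/endpoint_snitch: GossipingPropertyFileSnitch/' cassandra.yaml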

Open the conf/cassandra-rackdc.properties file and change the default values for dc= and rack=. They can be anything that is unique and does not conflict with other local installs. For production, you would put more thought into how to organize your racks and data centers. For this example, I used generic names like:

dc=NJDC
rack=rack001
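Scripted, the same file can simply be overwritten (a sketch; note that the stock file also carries commented examples, which this discards):

$ cat > ~/apache-cassandra-3.11.4/conf/cassandra-rackdc.properties <<EOF
dc=NJDC
rack=rack001
EOF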

Start the cluster

On each node, log into the account where Cassandra is installed (cass in this example), change into apache-cassandra-3.11.4/bin, and run ./cassandra. A long list of messages prints to the terminal, and the Java process keeps running in the background.
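In practice, I start each node and then watch its log until it reports that it is listening for clients; a sketch:

$ cd ~/apache-cassandra-3.11.4/bin
$ ./cassandra                       # detaches and keeps running in the background
$ tail -f ../logs/system.log        # wait for "Starting listening for CQL clients"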

Confirm the cluster

While logged into the Cassandra user account, go to the bin directory and run ./nodetool status. If everything went well, you should see something like:

$ ./nodetool status
INFO  [main] 2019-08-04 15:14:18,361 Gossiper.java:1715 - No gossip backlog; proceeding
Datacenter: NJDC
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.0.110  195.26 KiB  256          69.2%             0abc7ad5-6409-4fe3-a4e5-c0a31bd73349  rack001
UN  192.168.0.120  195.18 KiB  256          63.0%             b7ae87e5-1eab-4eb9-bcf7-4d07e4d5bd71  rack001
UN  192.168.0.130  117.96 KiB  256          67.8%             b36bb943-8ba1-4f2e-a5f9-de1a54f8d703  rack001

This shows that the cluster sees all three nodes, along with some useful detail about each.

Note that if cassandra.yaml uses the default endpoint_snitch: SimpleSnitch, the nodetool command above reports the default location as Datacenter: datacenter1 and the rack as rack1. In the example output above, the cassandra-rackdc.properties values are evident.
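For a quick scripted health check, counting the UN (Up/Normal) lines is enough; a small sketch:

$ ./nodetool status | grep -c '^UN'
3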

Run some CQL

This is where the replication factor setting comes in.

Create a keyspace with a replication factor of two. From any one of the cluster nodes, go to the bin directory and run ./cqlsh 192.168.0.130 (substitute the appropriate cluster node IP address). You can see the default administrative keyspaces with the following:

cqlsh> SELECT * FROM system_schema.keyspaces;

 keyspace_name      | durable_writes | replication
--------------------+----------------+-------------------------------------------------------------------------------------
        system_auth |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
      system_schema |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
 system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
             system |           True |                             {'class': 'org.apache.cassandra.locator.LocalStrategy'}
      system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

Create a new keyspace with replication factor two, insert some rows, then recall some data:

cqlsh> CREATE KEYSPACE TestSpace WITH replication = {'class': 'NetworkTopologyStrategy', 'NJDC' : 2};
cqlsh> select * from system_schema.keyspaces where keyspace_name='testspace';

 keyspace_name | durable_writes | replication
---------------+----------------+--------------------------------------------------------------------------------
     testspace |           True | {'NJDC': '2', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
cqlsh> use testspace;
cqlsh:testspace> create table users ( userid int PRIMARY KEY, email text, name text );
cqlsh:testspace> insert into users (userid, email, name) VALUES (1, 'jd@somedomain.com', 'John Doe');
cqlsh:testspace> select * from users;

 userid | email             | name
--------+-------------------+----------
      1 | jd@somedomain.com | John Doe

Now you have a basic three-node Cassandra cluster running and ready for some development and testing work. The CQL syntax is similar to standard SQL, as you can see from the familiar commands to create a table, insert, and query data.
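Because the keyspace has a replication factor of two, every row is stored on two of the three nodes, so the data stays readable if a single node goes down. A quick way to convince yourself (a sketch; cqlsh reads at consistency ONE by default):

$ ./nodetool stopdaemon                                        # on one node (say CS3), stop Cassandra
$ ./cqlsh 192.168.0.110 -e "select * from testspace.users;"    # from a live node; the row still comes back
$ ./cassandra                                                  # on CS3, restart Cassandra when finished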

Conclusion

Apache Cassandra seems like an interesting NoSQL clustered database, and I'm looking forward to diving deeper into its use. This simple setup only scratches the surface of the options available. I hope this three-node primer helps you get started with it, too.
