
Setting Up a Cassandra Cluster in AWS - DZone Cloud

by John Doe

10/24/2018

Apache Cassandra is a NoSQL database that allows for easy horizontal scaling using a consistent hashing mechanism. Seven years ago, I tried it and decided not to use it for a side project of mine because it was too new. But things are different now: Cassandra is well-established, there’s a company behind it (DataStax), and there are a lot more tools, documentation, and community support. So once again, I decided to try Cassandra.

This time, I need it to run in a cluster on AWS, so I went on to set up such a cluster. Googling how to do it gives several interesting results, like this, this, and this, but they are incomplete, outdated, or have too many irrelevant details. So they are only of moderate help.

My goal is to use CloudFormation (or potentially Terraform) to launch a stack with a Cassandra auto-scaling group (in a single region) that can grow simply by increasing the number of nodes in the group.

Also, in order to have the web application connect to Cassandra without hardcoding the node IPs, I wanted to have a load balancer in front of all Cassandra nodes to do the round-robin for me. The alternative to that would be to have a client-side round-robin, but that would mean some extra complexity on the client, which seems avoidable with a load balancer in front of the Cassandra auto-scaling group.

The relevant bits from my CloudFormation JSON can be seen here. Here's what it does (a trimmed sketch of how the resources fit together follows the list):

  • Sets up three private subnets (1 per availability zone in the eu-west region)
  • Creates a security group that opens the ports Cassandra needs: 9042 for accepting client connections and 7000/7001 for gossip between nodes. Note that the ports are only accessible from within the VPC; no external connection is allowed. SSH goes only through a bastion host.
  • Defines a TCP load balancer for port 9042 where all clients will connect. The load balancer requires a so-called “Target group,” which is defined as well.
  • Configures an auto-scaling group with a pre-configured number of nodes. The auto-scaling group has a reference to the “target group” so that the load balancer always sees all nodes in the auto-scaling group.
  • Each node in the auto-scaling group is identical, based on a launch configuration. The launch configuration runs a few scripts on initialization. These scripts run for every node: on initial provisioning, when a node dies and another one is spawned in its place, or when the cluster has to grow. The scripts are fetched from S3, where you can publish them (and version them) either manually or with an automated process.
  • Note: This does not configure specific EBS volumes and, in reality, you may need to configure and attach them if the instance storage is insufficient. Don’t worry about nodes dying, though, as data is safely replicated.
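
To make the moving parts concrete, here is a heavily trimmed, illustrative sketch of how those resources relate to each other, written as a Python dict mirroring the CloudFormation JSON. The resource names, subnet references, sizes, and CIDR range are placeholders rather than the actual template, and the listener and launch configuration are omitted:

```python
import json

# Illustrative fragment only: names, CIDR range, subnets and sizes are placeholders.
resources = {
    "CassandraSecurityGroup": {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": "Cassandra CQL (9042) and gossip (7000/7001), VPC-internal only",
            "VpcId": {"Ref": "VpcId"},
            "SecurityGroupIngress": [
                {"IpProtocol": "tcp", "FromPort": 9042, "ToPort": 9042, "CidrIp": "10.0.0.0/16"},
                {"IpProtocol": "tcp", "FromPort": 7000, "ToPort": 7001, "CidrIp": "10.0.0.0/16"},
            ],
        },
    },
    "CassandraTargetGroup": {
        "Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
        "Properties": {"Port": 9042, "Protocol": "TCP", "VpcId": {"Ref": "VpcId"}},
    },
    "CassandraLoadBalancer": {
        "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
        "Properties": {
            "Type": "network",
            "Scheme": "internal",
            "Subnets": [{"Ref": "SubnetA"}, {"Ref": "SubnetB"}, {"Ref": "SubnetC"}],
        },
    },
    "CassandraAutoScalingGroup": {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "MinSize": "3",
            "MaxSize": "6",
            "DesiredCapacity": "3",
            "LaunchConfigurationName": {"Ref": "CassandraLaunchConfiguration"},
            # Registering the group with the target group keeps the load balancer in sync
            # with whatever nodes the auto-scaling group currently contains.
            "TargetGroupARNs": [{"Ref": "CassandraTargetGroup"}],
            "VPCZoneIdentifier": [{"Ref": "SubnetA"}, {"Ref": "SubnetB"}, {"Ref": "SubnetC"}],
        },
    },
}

print(json.dumps({"Resources": resources}, indent=2))
```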

That was the easy part – a bunch of AWS resources and port configurations. The Cassandra-specific setup is a bit harder, as it requires understanding of how Cassandra functions.

The two scripts are setup-cassandra.sh and update-cassandra-cluster-config.py, so bash and Python: bash for setting up the machine, and Python for the Cassandra-specific configuration. Instead of the bash script, one could use a pre-built AMI (image), e.g. built with Packer, but since only two pieces of software are installed, I thought it was a bit of an overhead to support AMIs.

The bash script can be seen here, and simply installs Java 8 and the latest Cassandra, runs the Python script, runs the Cassandra services, and creates (if needed) a keyspace with proper replication configuration. A few notes here – the cassandra.yaml.template could be supplied via the CloudFormation script instead of having it fetched via bash (and having to pass the bucket name); you could also have it fetched in the Python script itself – it’s a matter of preference.

Cassandra is not configured to use SSL here, which is generally a bad idea, but the SSL configuration is out of the scope of this basic setup. Finally, the script waits for the Cassandra process to start (using a while/sleep loop) and then creates the keyspace if needed. The keyspace (=database) has to be created with a NetworkTopologyStrategy, and the number of replicas for the particular datacenter (=AWS region) has to be configured. The value is 3, for the 3 availability zones where we’ll have nodes. That means there’s a copy in each AZ (which is treated as a “rack”, although it isn’t exactly that).
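
The original script does this with cqlsh, but the same wait-then-create logic, sketched with the Python driver, looks roughly like this. The keyspace name is a placeholder, and the 'eu-west' datacenter name is the one Ec2Snitch derives from the region:

```python
import time
from cassandra.cluster import Cluster, NoHostAvailable  # pip install cassandra-driver

# Poll until the local Cassandra process accepts CQL connections on 9042.
session = None
while session is None:
    try:
        session = Cluster(["127.0.0.1"]).connect()
    except NoHostAvailable:
        time.sleep(5)

# NetworkTopologyStrategy with 3 replicas in the local datacenter (= AWS region),
# i.e. one copy per availability zone. 'my_keyspace' is an illustrative name.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': 3}
""")
```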

The Python script performs some very important configuration – without it, the cluster won’t work. (I don’t work with Python normally, so feel free to criticize my Python code.) The script does the following:

  • Gets the current auto-scaling group details (using AWS EC2 APIs).
  • Sorts the instances by launch time.
  • Fetches the first instance in the group in order to assign it as seed node.
  • Sets the seed node in the configuration file (by replacing a placeholder).
  • Sets the listen_address (and therefore rpc_address) to the private IP of the node in order to allow Cassandra to listen for incoming connections.

Designating the seed node is important, as all cluster nodes have to join the cluster by specifying at least one seed. You can get the first two nodes instead of just one, but it shouldn’t matter. Note that the seed node is not always fixed – it’s just the oldest node in the cluster. If at some point the oldest node is terminated, each new node will use the second oldest as seed.
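
A minimal sketch of that discovery step, assuming boto3 and a hypothetical auto-scaling group name, could look like this:

```python
import urllib.request

import boto3

ASG_NAME = "cassandra-asg"  # hypothetical auto-scaling group name

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

# Get the instances that currently belong to the auto-scaling group.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
instance_ids = [i["InstanceId"] for i in group["Instances"]]

# Look up launch time and private IP for each instance.
reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
instances = [i for r in reservations for i in r["Instances"]]

# Sort by launch time; the oldest instance acts as the seed node.
instances.sort(key=lambda i: i["LaunchTime"])
seeds = instances[0]["PrivateIpAddress"]

# This node's own private IP comes from the EC2 instance metadata service.
private_ip = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/local-ipv4").read().decode()
```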

What I haven’t shown is the cassandra.yaml.template file. It is basically a copy of the cassandra.yaml file from a standard Cassandra installation, with a few changes:

  • cluster_name is modified to match your application name. This is just for human-readable purposes, so it doesn’t matter what you set it to.
  • allocate_tokens_for_keyspace: your_keyspace is uncommented and the keyspace is set to match your main keyspace. This enables the new token distribution algorithm in Cassandra 3.0. It allows for evenly distributing the data across nodes.
  • endpoint_snitch: Ec2Snitch is set instead of the SimpleSnitch to make use of AWS metadata APIs. Note that this setup is in a single region. For multi-region, there’s another snitch and some additional complications of exposing ports and changing the broadcast address.
  • As mentioned above, the ${private_ip} and ${seeds} placeholders are put in the appropriate places (listen_address and rpc_address for the IP) so that the Python script can substitute them; a sketch of that substitution follows below.
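
The substitution itself can be done with Python's built-in string.Template, which understands the ${...} syntax used in the template. The paths here are illustrative, and seeds/private_ip are the values computed by the discovery code sketched earlier:

```python
from string import Template

TEMPLATE_PATH = "/etc/cassandra/cassandra.yaml.template"  # illustrative path
CONFIG_PATH = "/etc/cassandra/cassandra.yaml"

with open(TEMPLATE_PATH) as f:
    template = Template(f.read())

# safe_substitute leaves any other '$' occurrences in the file untouched.
rendered = template.safe_substitute(seeds=seeds, private_ip=private_ip)

with open(CONFIG_PATH, "w") as f:
    f.write(rendered)
```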

This lets you run a Cassandra cluster as part of your AWS stack, which is auto-scalable and doesn’t require any manual intervention – neither on setup nor on scaling up. Well, allegedly – there may be issues that have to be resolved once you hit real-world use cases. For clients to connect to the cluster, simply use the load balancer DNS name (you can write it to a config file on each application node).
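
For example, with the Python driver, connecting through the load balancer could look like this (the DNS name and keyspace are placeholders):

```python
from cassandra.cluster import Cluster

# Placeholder: use the DNS name that CloudFormation reports for the load balancer.
LB_DNS = "cassandra-lb.example.eu-west-1.elb.amazonaws.com"

cluster = Cluster([LB_DNS], port=9042)
session = cluster.connect("my_keyspace")  # keyspace name is illustrative
row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)
```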
