5/4/2020

Reading time: 3 min

Setting Up Cassandra Cluster for Production with Python

by Mustafa İleri


Cassandra is a NoSQL database that provides high availability and scalability. It is one of the most important solutions to look at when you need extreme performance.

We use Cassandra to store the clickstream data in our scenario.

How did we decide to use Cassandra among other alternatives?
Cassandra is not the only option for storing clickstream data effectively. There are alternatives such as MongoDB, Apache HBase, or Amazon DynamoDB.

We chose Cassandra because:

It is easy to install and manage.
It is easy to use (especially compared with Apache HBase).
It is well documented.
We code in Python, and the DataStax Python driver for Cassandra ships with a great object mapper. That is a big plus. (https://datastax.github.io/python-driver/object_mapper.html)
We use PySpark to analyze data, so we need a native connector for the database, as we would have for MySQL or PostgreSQL. We found what we were looking for (https://github.com/datastax/spark-cassandra-connector).
I prefer not to talk about performance, because I have not benchmarked Cassandra against HBase. Both of them claim to provide the best performance :)

Benchmark Apache HBase vs Apache Cassandra on SSD in a Cloud Environment - Hortonworks

As more and more workloads are being brought onto modern hardware in the cloud, it's important for us to understand how…

hortonworks.com

NoSQL Comparison Benchmarks

Benchmarking NoSQL Databases: Cassandra vs. MongoDB vs. HBase vs. Couchbase Understanding the performance behavior of a…

www.datastax.com

Let’s start playing with Cassandra.

In this post, I won’t cover how to install Cassandra; you can learn that from http://cassandra.apache.org/download/. I will try to describe the configuration of Cassandra.

Cassandra’s main config file is typically located at “/etc/cassandra/cassandra.yaml” (the exact path depends on how Cassandra was installed). You can change many settings in this file, such as cluster_name, authentication settings, permissions, and data and log paths.

You can easily configure a cluster via this config file.

In my case, I need two nodes that replicate to each other. (Cassandra has no master node; every node is a peer.) So I changed a few parameters for clustering.

cluster_name: This one is very clear :) The name of the cluster. It must be the same on every node.

seed_provider > parameters > seeds: A comma-separated list of the IP addresses of the seed nodes, which other nodes contact to join and discover the cluster.

listen_address: The address that other Cassandra nodes use to connect to this node. (Client connections use rpc_address instead.)

endpoint_snitch: This is more important than the others, because it determines which data center and rack each node belongs to. It is too hard to change your snitch type later. Every snitch provides different features. You can get more info from https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archSnitchesAbout.html
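To make these four settings concrete, here is a minimal Python sketch that prints the cluster-related part of cassandra.yaml for each node. The IP addresses, the cluster name, and the choice of GossipingPropertyFileSnitch are assumptions for illustration, not values taken from the original setup; only the keys are the ones described above.

```python
def render_cluster_settings(cluster_name, seeds, listen_address,
                            snitch="GossipingPropertyFileSnitch"):
    """Return the cluster-related part of cassandra.yaml as text."""
    # In cassandra.yaml the seeds are a single comma-separated string.
    seed_list = ",".join(seeds)
    return "\n".join([
        f"cluster_name: '{cluster_name}'",
        "seed_provider:",
        "    - class_name: org.apache.cassandra.locator.SimpleSeedProvider",
        "      parameters:",
        f'          - seeds: "{seed_list}"',
        f"listen_address: {listen_address}",
        f"endpoint_snitch: {snitch}",
    ])


# Hypothetical private IPs for the two nodes (assumption, not from the article).
NODES = ["10.0.0.1", "10.0.0.2"]

for node_ip in NODES:
    # Only listen_address differs between the nodes; the rest is shared.
    print(f"# --- settings for {node_ip} ---")
    print(render_cluster_settings("clickstream_cluster", NODES, node_ip))
```

Both nodes are listed as seeds here only because the cluster has two nodes; the cluster name and seed list are identical on every node, while listen_address is specific to each one.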

Results:

I’m not sure whether this counts as big data. Everybody claims to use big data, so I will just call it data :) Our Cassandra cluster stores 1 million rows per day. We run two t2.large EC2 instances, and I haven’t had a problem in the 2 months since we started using it.

Let’s look at my example configuration:

I have a sample setup: two Cassandra nodes that work active-active with replication. You can check out my sample repo here.

mustafaileri/cassandra-sandbox

Contribute to mustafaileri/cassandra-sandbox development by creating an account on GitHub.

github.com

You can use “cqlsh” as a Cassandra client. You can install it locally via “pip” (pip install cqlsh), or you can run “cqlsh” inside a Docker container.

You can get more info about “cqlsh” from: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlsh.html

1. Create a keyspace named “test_keyspace” on node_0
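As a rough sketch of this step from Python, the keyspace can also be created through the DataStax driver instead of cqlsh. The replication strategy (SimpleStrategy), the replication factor of 2, and the contact point in the usage note are assumptions; the original command is not shown in this post.

```python
def create_keyspace_cql(keyspace, replication_factor):
    """Build a CREATE KEYSPACE statement that uses SimpleStrategy."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}}"
    )


def create_keyspace(contact_points, keyspace="test_keyspace", replication_factor=2):
    """Connect to the cluster and create the keyspace if it is missing."""
    # Imported here so the CQL helper above works without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points)
    try:
        session = cluster.connect()
        session.execute(create_keyspace_cql(keyspace, replication_factor))
    finally:
        cluster.shutdown()
```

Calling create_keyspace(["10.0.0.1"]) with a reachable node address (the IP is hypothetical) would create the keyspace with two replicas, so each of the two nodes holds a full copy. SimpleStrategy is enough for a single data center sandbox; NetworkTopologyStrategy is generally recommended for production.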

2. Let’s create a table from Python using “cassandra-driver”.
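One possible way to do this is with the driver's object mapper (cqlengine). The sketch below is an assumption about what such a table could look like for clickstream data: the table name click_events and its three columns are hypothetical, and the keyspace is assumed to exist already.

```python
def create_click_events_table(contact_points, keyspace="test_keyspace"):
    """Define a cqlengine model and create its table if it does not exist."""
    # Imported lazily so this module loads even where the driver is absent.
    from cassandra.cqlengine import columns, connection
    from cassandra.cqlengine.management import sync_table
    from cassandra.cqlengine.models import Model

    class ClickEvent(Model):
        __keyspace__ = keyspace
        __table_name__ = "click_events"  # hypothetical table name
        # One partition per user, newest events first inside the partition.
        user_id = columns.UUID(partition_key=True)
        event_time = columns.DateTime(primary_key=True, clustering_order="DESC")
        url = columns.Text()

    # Register the default connection, then create the table from the model.
    connection.setup(contact_points, keyspace)
    sync_table(ClickEvent)
    return ClickEvent
```

sync_table only creates or extends the table; it does not drop columns. The driver may warn about schema management unless the CQLENG_ALLOW_SCHEMA_MANAGEMENT environment variable is set.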

Now a table should exist on your Cassandra nodes. You can check by connecting to each node and inspecting the keyspace.

You will see the same table on all of your nodes.
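The same check can be sketched in Python by asking each node for its own schema metadata. This assumes Cassandra 3.0 or later (where the system_schema keyspace exists) and a hypothetical table named click_events; the whitelist policy pins every query to the node being checked.

```python
TABLE_LOOKUP_CQL = (
    "SELECT table_name FROM system_schema.tables "
    "WHERE keyspace_name = %s AND table_name = %s"
)


def table_exists(session, keyspace, table):
    """Return True if the connected node knows about keyspace.table."""
    rows = session.execute(TABLE_LOOKUP_CQL, (keyspace, table))
    return rows.one() is not None


def check_table_on_nodes(node_ips, keyspace="test_keyspace", table="click_events"):
    """Check the table on every node, talking to one node at a time."""
    from cassandra.cluster import EXEC_PROFILE_DEFAULT, Cluster, ExecutionProfile
    from cassandra.policies import WhiteListRoundRobinPolicy

    results = {}
    for ip in node_ips:
        # Restrict the driver to this node so another node cannot answer for it.
        profile = ExecutionProfile(
            load_balancing_policy=WhiteListRoundRobinPolicy([ip])
        )
        cluster = Cluster([ip], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
        try:
            results[ip] = table_exists(cluster.connect(), keyspace, table)
        finally:
            cluster.shutdown()
    return results
```

check_table_on_nodes returns a dict that maps each node address to True or False, so a missing table on one node is easy to spot.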

3. Write data to Cassandra:
Next, write some sample data to the new table.
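A minimal sketch of writing sample rows with a prepared statement follows. It assumes a hypothetical table test_keyspace.click_events with the columns user_id (uuid), event_time (timestamp), and url (text); adjust the statement to your own schema.

```python
import uuid
from datetime import datetime, timedelta, timezone

INSERT_CQL = (
    "INSERT INTO test_keyspace.click_events (user_id, event_time, url) "
    "VALUES (?, ?, ?)"
)


def sample_clicks(count, start=None):
    """Generate `count` fake click rows for a single random user."""
    start = start or datetime.now(timezone.utc)
    user_id = uuid.uuid4()
    return [
        (user_id, start + timedelta(seconds=i), f"/page/{i}")
        for i in range(count)
    ]


def write_clicks(contact_points, rows):
    """Insert the rows one by one using a prepared statement."""
    # Imported here so sample_clicks works without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points)
    try:
        session = cluster.connect()
        insert = session.prepare(INSERT_CQL)  # parsed once, reused per row
        for row in rows:
            session.execute(insert, row)
    finally:
        cluster.shutdown()
```

For example, write_clicks(["10.0.0.1"], sample_clicks(10)) would insert ten rows through the node at that (hypothetical) address.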

Now check your nodes to see whether the data exists on all of them.
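One hedged way to do this check from Python is to read at consistency level ALL, which requires every replica of the queried data to answer. The sketch assumes the hypothetical table test_keyspace.click_events and a replication factor equal to the number of nodes, so that every node is a replica. A full-table COUNT is only reasonable on a small sandbox table.

```python
COUNT_CQL = "SELECT COUNT(*) FROM test_keyspace.click_events"


def count_clicks(session, statement):
    """Run a COUNT query and return the single number it produces."""
    return session.execute(statement).one()[0]


def count_on_all_replicas(contact_points):
    """Count rows with a read that every replica must acknowledge."""
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # At ALL, the read fails if any replica is down or does not respond.
    statement = SimpleStatement(COUNT_CQL, consistency_level=ConsistencyLevel.ALL)
    cluster = Cluster(contact_points)
    try:
        return count_clicks(cluster.connect(), statement)
    finally:
        cluster.shutdown()
```

If the call returns the number of rows you wrote, every replica took part in the read; an Unavailable or timeout error points to a node that is down or unreachable.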
