Cassandra.Link

5/4/2020

Setting Up Cassandra Cluster for Production with Python

by Mustafa İleri

Cassandra is a NoSQL database that provides high availability and scalability. It is one of the most important options to look at when you need extreme performance.

We use Cassandra to store clickstream data in our scenario.

How did we decide to use Cassandra among the alternatives?
Cassandra is not the only option for storing clickstream data effectively. There are alternatives such as MongoDB, Apache HBase, and Amazon DynamoDB.

We chose Cassandra because:

  1. It is easy to install and manage.
  2. It is easy to use (especially compared to Apache HBase).
  3. It is well documented.
  4. We code in Python, and the Python driver for Cassandra provides a great object mapper. It’s a big plus. (https://datastax.github.io/python-driver/object_mapper.html)
  5. And of course, we use PySpark to analyze data, so we need a native connector for the database, like the ones for MySQL or PostgreSQL. We found what we were looking for (https://github.com/datastax/spark-cassandra-connector).
  6. I’d rather not talk about performance, because I didn’t run a benchmark between Cassandra and HBase. Both of them claim to provide the best performance :) For published comparisons, see “Benchmark Apache HBase vs Apache Cassandra on SSD in a Cloud Environment” (hortonworks.com) and “Benchmarking NoSQL Databases: Cassandra vs. MongoDB vs. HBase vs. Couchbase” (www.datastax.com).

Let’s start playing with Cassandra.

In this post, I won’t cover how to install Cassandra; you can learn that from http://cassandra.apache.org/download/. I will try to describe the configuration of Cassandra.

Cassandra’s main config file is located at “/etc/cassandra/cassandra.yaml”. You can change many settings in this file, such as cluster_name, authentication settings, permissions, and data and log paths.

You can easily configure a cluster via this config file.

In my case, I need 2 nodes that both accept reads and writes and replicate data between them (Cassandra has no single master). So I changed a few parameters for clustering.

cluster_name: This one is very clear :) The name of the cluster. It must be the same on every node.

seed_provider > parameters > seeds: The IP addresses of the seed nodes, which other nodes contact to discover the rest of the cluster.

listen_address: The address that this node binds to and that the other Cassandra nodes use to connect to it. (Client connections use rpc_address instead.)

endpoint_snitch: This is more important than the others, because it determines which data centers and racks the nodes belong to. It is too hard to switch to a different snitch type later. Every snitch provides different features. You can get more info from https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archSnitchesAbout.html
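
As a rough illustration of the four settings above, here is a small Python sketch that prints the corresponding cassandra.yaml lines for each node. The cluster name “clickstream”, the two private IP addresses, and the choice of GossipingPropertyFileSnitch are assumptions made for the example, not values from my setup.

```python
def render_cluster_overrides(cluster_name, seeds, listen_address,
                             endpoint_snitch="GossipingPropertyFileSnitch"):
    """Return the cassandra.yaml lines for the clustering settings of one node."""
    seed_list = ",".join(seeds)
    lines = [
        f"cluster_name: '{cluster_name}'",
        "seed_provider:",
        "    - class_name: org.apache.cassandra.locator.SimpleSeedProvider",
        "      parameters:",
        f'          - seeds: "{seed_list}"',
        # listen_address is the only value here that differs per node.
        f"listen_address: {listen_address}",
        f"endpoint_snitch: {endpoint_snitch}",
    ]
    return "\n".join(lines)


# Hypothetical private addresses of the two nodes; both act as seeds.
NODES = ["10.0.0.11", "10.0.0.12"]

if __name__ == "__main__":
    for address in NODES:
        print(f"# overrides for {address}")
        print(render_cluster_overrides("clickstream", NODES, address))
        print()
```

The output is meant to be compared against (or merged by hand into) the existing cassandra.yaml on each node; it does not replace the whole file.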

Results:

I’m not sure this counts as big data. Everybody claims to use big data, so I will just call it data :) Our Cassandra cluster stores 1 million rows per day. We run 2 t2.large EC2 instances, and I haven’t had a problem since we started using it (2 months ago).

Let’s look at my example configuration:

I have a sample setup: 2 Cassandra nodes that work as active-active with replication. You can check out my sample repo, mustafaileri/cassandra-sandbox, on GitHub.

You can use “cqlsh” as a Cassandra client. You can install it locally via “pip” (the package is named cqlsh), or you can run “cqlsh” inside a Docker container.

You can get more info about “cqlsh” from: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlsh.html

1. Create a keyspace named “test_keyspace” on node_0
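
A minimal sketch of this step with the Python driver could look like the following. The SimpleStrategy class and the replication factor of 2 (one copy per node) are assumptions for a two-node sandbox; for a multi-datacenter production cluster you would more likely use NetworkTopologyStrategy. The contact point address is hypothetical.

```python
def create_keyspace_cql(keyspace, replication_factor):
    """Build the CQL statement; identifiers cannot be bound as query parameters."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}}"
    )


def create_keyspace(contact_points, keyspace="test_keyspace", replication_factor=2):
    """Connect to the cluster and create the keyspace if it does not exist yet."""
    # Imported here so the CQL helper above can be used without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points)
    try:
        session = cluster.connect()
        session.execute(create_keyspace_cql(keyspace, replication_factor))
    finally:
        cluster.shutdown()


if __name__ == "__main__":
    # Only prints the statement; call create_keyspace(["10.0.0.11"]) to run it.
    print(create_keyspace_cql("test_keyspace", 2))
```

The same statement can also be pasted into cqlsh directly.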

2. Let’s create a table with “cassandra-driver” in Python.
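
Here is one way this could be done with plain CQL through the driver. The table name “clicks” and its columns (user_id, event_time, url) are a hypothetical clickstream schema chosen for illustration, and the contact point address is an assumption as well.

```python
def create_table_cql(keyspace, table="clicks"):
    """Build a CREATE TABLE statement for a simple clickstream table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} ("
        "user_id text, "
        "event_time timestamp, "
        "url text, "
        # user_id is the partition key; event_time orders rows inside a partition.
        "PRIMARY KEY ((user_id), event_time)"
        ") WITH CLUSTERING ORDER BY (event_time DESC)"
    )


def create_table(contact_points, keyspace="test_keyspace", table="clicks"):
    """Connect to the cluster and create the table if it does not exist yet."""
    # Imported here so the CQL helper above can be used without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points)
    try:
        session = cluster.connect()
        session.execute(create_table_cql(keyspace, table))
    finally:
        cluster.shutdown()


if __name__ == "__main__":
    # Only prints the statement; call create_table(["10.0.0.11"]) to run it.
    print(create_table_cql("test_keyspace"))
```

Partitioning by user keeps one user’s clicks together, which suits per-user queries; a different access pattern would call for a different primary key.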

The table should now exist on your Cassandra nodes. You can check by connecting to each node and listing the tables in the keyspace.

You will see the same table on all of your nodes.
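
If you prefer to do this check from Python rather than from cqlsh, a sketch like the one below connects to each node separately and reads the table list from the system_schema keyspace (available in Cassandra 3.0 and later). The node addresses and the table name “clicks” are hypothetical.

```python
def hosts_missing_table(table, tables_per_host):
    """Return the hosts (sorted) whose table list does not contain `table`."""
    return sorted(host for host, tables in tables_per_host.items()
                  if table not in tables)


def list_tables(host, keyspace):
    """Ask one specific node which tables it knows in `keyspace`."""
    # Imported here so the helper above can be used without the driver installed.
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import WhiteListRoundRobinPolicy

    # Pin all requests to this host so the answer really comes from it.
    profile = ExecutionProfile(
        load_balancing_policy=WhiteListRoundRobinPolicy([host]))
    cluster = Cluster([host], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    try:
        session = cluster.connect()
        rows = session.execute(
            "SELECT table_name FROM system_schema.tables WHERE keyspace_name = %s",
            (keyspace,),
        )
        return sorted(row.table_name for row in rows)
    finally:
        cluster.shutdown()


def check_nodes(hosts, keyspace="test_keyspace", table="clicks"):
    """Return the hosts on which the table is not visible."""
    tables_per_host = {host: list_tables(host, keyspace) for host in hosts}
    return hosts_missing_table(table, tables_per_host)
```

Calling check_nodes(["10.0.0.11", "10.0.0.12"]) should return an empty list once the schema has propagated to both nodes.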

3. Write data to Cassandra:
You can write sample data to Cassandra from Python.
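
As a sketch, the snippet below generates a few sample rows and inserts them with a prepared statement. It assumes a hypothetical table test_keyspace.clicks with the columns user_id, event_time, and url; the values and the contact point address are made up for the example.

```python
from datetime import datetime, timedelta


def sample_rows(count, start=None):
    """Generate (user_id, event_time, url) tuples for testing."""
    # Naive datetimes are treated as UTC by the driver.
    start = start or datetime(2020, 1, 1)
    return [
        (f"user_{i % 3}", start + timedelta(seconds=i), f"/page/{i}")
        for i in range(count)
    ]


def write_rows(contact_points, rows, keyspace="test_keyspace", table="clicks"):
    """Insert the rows using a prepared statement."""
    # Imported here so sample_rows() can be used without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points)
    try:
        session = cluster.connect()
        insert = session.prepare(
            f"INSERT INTO {keyspace}.{table} (user_id, event_time, url) "
            "VALUES (?, ?, ?)"
        )
        for row in rows:
            session.execute(insert, row)
    finally:
        cluster.shutdown()


if __name__ == "__main__":
    # Only prints the rows; call write_rows(["10.0.0.11"], sample_rows(5)) to insert.
    for row in sample_rows(5):
        print(row)
```

Preparing the statement once and reusing it avoids re-parsing the query for every row, which matters more as the row count grows.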

Now check your nodes to see whether the data exists on all of them.
