This tutorial gives you just enough information to get up and running quickly with Apache Cassandra and its Python driver.
It will help you learn how to install the driver, connect to a Cassandra cluster, create a session and execute some basic CQL statements.
By the end of this step-by-step guide you will have covered some basic theory behind Apache Cassandra, the key differences from a relational database (RDBMS), installing the required packages on Ubuntu, the Cassandra driver for Python and, finally, basic examples of CRUD operations.
So… let’s cover some details around Apache Cassandra.
What is Apache Cassandra
Apache Cassandra is an open source, distributed NoSQL database management system. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
It offers strong support for clusters that span various data centres, with its asynchronous masterless replication allowing low latency operations for all clients.
Cassandra from ten thousand feet
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
[Source Apache Cassandra]
Features:
- It supports replication and multiple data centre replication.
- It has immense scalability.
- It is fault-tolerant.
- It is decentralised.
- It has tunable consistency.
- It provides MapReduce support.
- It supports Cassandra Query Language (CQL) as an alternative to the Structured Query Language (SQL).
The list of companies using Cassandra is vast and constantly growing. This list includes:
- Twitter is using Cassandra for analytics.
- Mahalo uses it for its primary near-time data store.
- Facebook still uses it for inbox search, though they are using a proprietary fork.
- Digg uses it for its primary near-time data store.
- Rackspace uses it for its cloud service, monitoring, and logging.
- Reddit uses it as a persistent cache.
- Cloudkick uses it for monitoring statistics and analytics.
- Ooyala uses it to store and serve near real-time video analytics data.
- SimpleGeo uses it as the main data store for its real-time location infrastructure.
- Onespot uses it for a subset of its main data store.
Users can interact with Cassandra in multiple ways.
- Command Line Interface (CLI) – The latest version is 1.2.4. For the purpose of learning, we worked on this tutorial in CLI.
- Cassandra Query Language (CQL) – It supports a subset of SQL features. Standard DDL and DML commands can be used. Some SQL features are missing or restricted: there are no joins, and GROUP BY and ORDER BY are available only in limited forms (for example, ORDER BY works only on clustering columns). The latest major version is 3.
- DataStax Community Package – A software package which supports both the CLI and CQL together and is a quick way to interface with Cassandra.
Data Model – Bottom-up approach
For people coming from a traditional RDBMS, the Cassandra data model can be strange, confusing and maybe even a bit difficult to understand.
Some terms, such as keyspace, are completely new in Cassandra, and others, such as column, do not match their meaning in an RDBMS.
Before we dig into the key data model concepts in Cassandra following a bottom-up approach, we would like to illustrate how the Cassandra data model can be mapped to an RDBMS.
The above analogy helps make the transition from the relational to the non-relational world.
But don’t use this analogy while designing Cassandra column families. Instead, think of the Cassandra column family as a map of a map: an outer map keyed by a row key, and an inner map keyed by a column key. Both maps are sorted.
SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
A nested sorted map is a more accurate analogy than a relational table and will help you make the right decisions about your Cassandra data model.
How?
- A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we can use row keys and column keys to do efficient lookups and range scans.
- The number of column keys is unbounded. In other words, you can have wide rows.
- A key can itself hold a value. In other words, you can have a valueless column.
A range scan on row keys is possible only when data is partitioned in a cluster using the Order Preserving Partitioner (OPP). OPP is almost never used. So, you can think of the outer map as unsorted:
Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
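To make the map-of-maps picture concrete, here is a minimal Python sketch. It models one column family as a plain dict of dicts and sorts the inner column keys when reading. The row keys, column names and the column_slice helper are made up for illustration and are not part of any Cassandra API.

```python
from bisect import bisect_left, bisect_right

# Outer map: row key -> inner map of column key -> column value.
# Rows may have different columns, and a row can grow very wide.
column_family = {
    "user:1": {"city": "Dubai", "ename": "LyubovK", "sal": 2555},
    "user:2": {"city": "Toronto", "ename": "JiriK"},
}


def column_slice(row_key, start, end):
    """Return (column_key, value) pairs with start <= column_key <= end, in key order."""
    columns = column_family[row_key]   # efficient lookup by row key
    keys = sorted(columns)             # the inner map is treated as sorted
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, end)
    return [(key, columns[key]) for key in keys[lo:hi]]


# Range scan over the column keys of a single row
print(column_slice("user:1", "d", "f"))
```

The outer dict is unsorted, which mirrors the point above: lookups by row key are cheap, while ordered scans only make sense over the column keys inside one row.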
As mentioned earlier, there is something called a “Super Column” in Cassandra. Think of this as a grouping of columns, which turns our two nested maps into three nested maps as follows:
Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>
Notes:
- You need to pass the timestamp with each column value, for Cassandra to use internally for conflict resolution. However, the timestamp can be safely ignored during modelling. Also, do not plan to use timestamps as data in your application. They’re not for you, and they do not define new versions of your data (unlike in HBase).
[ Read More… Source: Data Modeling Practise ]
Active development is done in the Apache repository. If you wish to follow the repo for future releases and issues, bookmark the git address below.
git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
Installation from Debian packages
(Source – Details are taken from the official link )
Step 1 – Add the Apache repository of Cassandra to /etc/apt/sources.list.d/cassandra.sources.list, for example for the 3.11 version:
echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
Step 2 – Add the Apache Cassandra repository keys:
curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
Step 3 – Update the repositories:
sudo apt-get update
If you encounter this error:
GPG error: http://www.apache.org 311x InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A278B781FE4B2BDA
Then add the public key A278B781FE4B2BDA as follows, and then repeat the update:
sudo apt-key adv --keyserver pool.sks-keyservers.net --recv-key A278B781FE4B2BDA
sudo apt-get update
The actual key may be different; take it from the error message itself.
For a full list of Apache contributors public keys, you can refer to https://www.apache.org/dist/cassandra/KEYS.
Step 4 – Install Cassandra:
sudo apt-get install cassandra
Key Notes:
- You can start Cassandra with sudo service cassandra start and stop it with sudo service cassandra stop. However, normally the service will start automatically. For this reason be sure to stop it if you need to make any configuration changes.
- Verify that Cassandra is running by invoking nodetool status from the command line.
- The default location of configuration files is /etc/cassandra.
- The default location of log and data directories is /var/log/cassandra/ and /var/lib/cassandra.
- Start-up options (heap size, etc) can be configured in /etc/default/cassandra.
If everything goes well, you will see output like the following:
techfossguru@techfossguru:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  103.66 KiB  256     100.0%            e3c08fec-a280-4447-8f7f-a7656cce1d50  rack1
techfossguru@techfossguru:~$
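Besides nodetool status, it can be handy to check from Python that the node accepts client connections before you try the driver. The sketch below only tests whether a TCP connection can be opened. It assumes the default native transport port 9042 on the local machine, and the helper name cassandra_port_open is made up for this example.

```python
import socket


def cassandra_port_open(host="127.0.0.1", port=9042, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, host unreachable, ...
        return False


print("native transport reachable:", cassandra_port_open())
```

A True result only means that something is listening on the port. It does not prove the node is healthy, so nodetool status remains the more informative check.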
Python client driver for Apache Cassandra
This driver works exclusively with the Cassandra Query Language v3 (CQL3) and Cassandra’s native protocol. Cassandra 2.1+ is supported.
This driver is open source under the Apache v2 License. The source code for this driver can be found on GitHub.
Python 2.6, 2.7, 3.3, and 3.4 are supported. Both CPython (the standard Python implementation) and PyPy are supported and tested.
Linux, OSX, and Windows are supported.
Installation through pip
pip is the suggested tool for installing packages. It will handle installing all Python dependencies for the driver at the same time as the driver itself. To install the driver*:
pip install cassandra-driver
*Note: if intending to use optional extensions, install the dependencies first. The driver may need to be reinstalled if dependencies are added after the initial installation.
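After the install, a quick way to confirm that the driver is visible to the interpreter you plan to use is to ask Python for the installed version. This is a minimal sketch using the standard library. The helper name installed_version is an assumption for this example, and importlib.metadata needs Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(distribution="cassandra-driver"):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("cassandra-driver is not installed in this environment")
else:
    print("cassandra-driver", version)
```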
Cassandra CQLsh
Cassandra CQLsh stands for Cassandra CQL shell. It is the command-line shell used to run CQL commands against Cassandra. cqlsh provides a number of options, which you can see in the following table:
Options | Usage |
--help | Shows help topics about the cqlsh command-line options. |
--version | Shows the version of cqlsh you are using. |
--color | Forces coloured output. |
--debug | Shows additional debugging information. |
--execute | Directs the shell to accept and execute a CQL command. |
--file="filename" | Executes the commands in the given file and then exits. |
--no-color | Directs cqlsh not to use coloured output. |
-u "username" | Authenticates as the given user. The default username is cassandra. |
-p "password" | Authenticates with the given password. The default password is cassandra. |
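If you want to drive cqlsh from a script, the options in the table above can be assembled into an argument list. The following is a minimal sketch, not an official wrapper: build_cqlsh_command and its parameter names are invented for this example, and it only builds the command, it does not run it.

```python
import shlex


def build_cqlsh_command(statement=None, cql_file=None, username=None,
                        password=None, color=None):
    """Build an argument list for cqlsh from a few of its command-line options."""
    args = ["cqlsh"]
    if username is not None:
        args += ["-u", username]
    if password is not None:
        args += ["-p", password]
    if color is True:
        args.append("--color")
    elif color is False:
        args.append("--no-color")
    if statement is not None:
        args += ["--execute", statement]
    if cql_file is not None:
        args.append("--file=" + cql_file)
    return args


cmd = build_cqlsh_command(statement="DESCRIBE KEYSPACES;",
                          username="cassandra", password="cassandra",
                          color=False)
# shlex.join shows the command as you would type it in a shell
print(shlex.join(cmd))
```

On a machine where cqlsh is installed, the list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems with the CQL statement.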
In this part of the tutorial, we focus on the Cassandra Query Language (CQL). Cassandra 3.11 requires Java 8, so we first have to make sure the JAVA_HOME environment variable points to the right directory.
Connect to Test Cluster at localhost – Type cqlsh at the command prompt. This connects you to the Test Cluster running on your local machine.
techfossguru@techfossguru:~$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
Create Keyspace in Cassandra- The command below creates the ‘techfossguru’ keyspace with replication factor 3.
CREATE KEYSPACE techfossguru WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
Use it as the active keyspace-
USE techfossguru;
Create a new table-
CREATE TABLE users (user_id int PRIMARY KEY, first_name text, last_name text );
Insert some data into the table-
INSERT INTO users (user_id, first_name, last_name) VALUES (1, 'Asthana', 'Shiva');
Update some data- for example, change the first name of the row inserted above
UPDATE users SET first_name='Kanchan' WHERE user_id = 1;
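The same statements can be sent from Python. The sketch below keeps the CQL from the walkthrough above but passes the values as bound parameters (the driver uses %s placeholders for simple, non-prepared statements). run_statements and demo are hypothetical helper names. demo assumes a node on 127.0.0.1 plus the techfossguru keyspace and users table created above.

```python
# CQL from the walkthrough above; %s marks a bound parameter.
STATEMENTS = [
    ("INSERT INTO users (user_id, first_name, last_name) VALUES (%s, %s, %s)",
     (1, "Asthana", "Shiva")),
    ("UPDATE users SET first_name = %s WHERE user_id = %s",
     ("Kanchan", 1)),
    ("SELECT user_id, first_name, last_name FROM users WHERE user_id = %s",
     (1,)),
]


def run_statements(session, statements=STATEMENTS):
    """Execute each (query, parameters) pair and return the results in order."""
    return [session.execute(query, params) for query, params in statements]


def demo():
    """Run the statements against a local node (needs cassandra-driver installed)."""
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    try:
        session = cluster.connect("techfossguru")
        # The last result belongs to the SELECT statement
        for row in run_statements(session)[-1]:
            print(row.user_id, row.first_name, row.last_name)
    finally:
        cluster.shutdown()
```

Call demo() once the keyspace and table exist. Binding parameters instead of formatting values into the query string avoids quoting mistakes such as the curly quotes that often sneak into copied CQL.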
Please note – If you know basic SQL syntax, the key concepts are not difficult to understand. So… better try your own examples.
Let me know! Did all queries succeed? What was the query that gave the error? What error did you get?
The cheatsheet below can be used as a reference for basic syntax.
OK, so far so good. The stage is now set to work with Cassandra from Python.
Let’s move on to actual manipulation of data using a Python program. These examples are simple and for introduction purposes only. They are not optimised to work in a real-time production environment!
You can read more details about the Python driver for Cassandra here.
In the next section, two examples will be covered
- Python CRUD Example with Cassandra – To show the basic usage of the Python driver
- Another Example using Flask-CQLAlchemy – This will guide how Flask-CQLAlchemy interacts with Flask in a real-world example
"""
Python by Techfossguru
Copyright (C) 2017 Satish Prasad
"""
import logging

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement


class PythonCassandraExample:

    def __init__(self):
        self.cluster = None
        self.session = None
        self.keyspace = None
        self.log = None

    def __del__(self):
        # createsession() may never have been called, so guard against None
        if self.cluster is not None:
            self.cluster.shutdown()

    def createsession(self):
        self.cluster = Cluster(['localhost'])
        self.session = self.cluster.connect(self.keyspace)

    def getsession(self):
        return self.session

    # Add some log output to see what went wrong
    def setlogger(self):
        log = logging.getLogger()
        log.setLevel('INFO')
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        log.addHandler(handler)
        self.log = log

    # Create a keyspace with the given name
    def createkeyspace(self, keyspace):
        """
        :param keyspace: The name of the keyspace to be created
        :return:
        """
        # If the keyspace already exists, drop it before creating a new one
        rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
        if keyspace in [row[0] for row in rows]:
            self.log.info("dropping existing keyspace...")
            self.session.execute("DROP KEYSPACE " + keyspace)

        self.log.info("creating keyspace...")
        self.session.execute("""
            CREATE KEYSPACE %s
            WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
            """ % keyspace)

        self.log.info("setting keyspace...")
        self.session.set_keyspace(keyspace)

    def create_table(self):
        c_sql = """
            CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
                                                 ename varchar,
                                                 sal double,
                                                 city varchar);
            """
        self.session.execute(c_sql)
        self.log.info("Employee Table Created !!!")

    # Insert a few rows in one batch, using a prepared statement
    def insert_data(self):
        insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename, sal, city) VALUES (?,?,?,?)")
        batch = BatchStatement()
        batch.add(insert_sql, (1, 'LyubovK', 2555, 'Dubai'))
        batch.add(insert_sql, (2, 'JiriK', 5660, 'Toronto'))
        batch.add(insert_sql, (3, 'IvanH', 2547, 'Mumbai'))
        batch.add(insert_sql, (4, 'YuliaT', 2547, 'Seattle'))
        self.session.execute(batch)
        self.log.info('Batch Insert Completed')

    def select_data(self):
        rows = self.session.execute('select * from employee limit 5;')
        for row in rows:
            print(row.ename, row.sal)

    # Minimal sketch: change the city of one employee, addressed by primary key
    def update_data(self, emp_id, city):
        update_sql = self.session.prepare("UPDATE employee SET city = ? WHERE emp_id = ?")
        self.session.execute(update_sql, (city, emp_id))
        self.log.info('Update Completed')

    # Minimal sketch: delete one employee, addressed by primary key
    def delete_data(self, emp_id):
        delete_sql = self.session.prepare("DELETE FROM employee WHERE emp_id = ?")
        self.session.execute(delete_sql, (emp_id,))
        self.log.info('Delete Completed')


if __name__ == '__main__':
    example1 = PythonCassandraExample()
    example1.createsession()
    example1.setlogger()
    example1.createkeyspace('techfossguru')
    example1.create_table()
    example1.insert_data()
    example1.select_data()
If everything goes well, you will see output like the following:
/home/techfossguru/anaconda3/bin/python /home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py
2018-04-30 22:57:52,560 [INFO] root: creating keyspace...
2018-04-30 22:57:53,029 [INFO] root: setting keyspace...
2018-04-30 22:57:54,795 [INFO] root: Employee Table Created !!!
2018-04-30 22:57:54,818 [INFO] root: Batch Insert Completed
LyubovK 2555.0
JiriK 5660.0
YuliaT 2547.0
IvanH 2547.0

Process finished with exit code 0
Sometimes you might hit an error like the one below. It usually comes from a client-side request timeout; here the DROP KEYSPACE statement took longer than the driver was willing to wait.
/home/techfossguru/anaconda3/bin/python /home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py
2018-04-30 22:55:21,436 [INFO] root: dropping existing keyspace...
Traceback (most recent call last):
  File "/home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py", line 97, in <module>
    example1.createkeyspace('techfossguru')
  File "/home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py", line 49, in createkeyspace
    self.session.execute("DROP KEYSPACE " + keyspace)
  File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute_async'}, last_host=127.0.0.1

Process finished with exit code 1
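One way to cope with this kind of timeout is to give slow schema statements a longer client-side timeout through the timeout argument of Session.execute, and to retry a few times. The helper below is a sketch under those assumptions. execute_with_retry, its defaults and its retry policy are invented for this example and are not part of the driver.

```python
import time


def execute_with_retry(session, statement, timeout=30.0, attempts=3,
                       delay=1.0, retry_on=(Exception,)):
    """Run session.execute with a client-side timeout, retrying on the given errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return session.execute(statement, timeout=timeout)
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next try
    raise last_error


# Possible usage with the real driver (assumes an open session):
#   from cassandra import OperationTimedOut
#   execute_with_retry(session, "DROP KEYSPACE IF EXISTS techfossguru",
#                      retry_on=(OperationTimedOut,))
```

Only retry statements that are safe to repeat. DROP KEYSPACE IF EXISTS is used in the usage comment because a plain DROP KEYSPACE can fail on the second attempt if the first one actually went through after the client gave up waiting.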
Creating a Sample Application with Flask and Flask-CQLAlchemy
Manual –
Flask-CQLAlchemy handles connections to Cassandra clusters and provides a Flask-SQLAlchemy-like interface to declare models and their columns in a Flask app: http://thegeorgeous.com/flask-cqlalch…
As such Flask-CQLAlchemy depends only on the cassandra-driver. It is assumed that you already have flask installed.
Flask-CQLAlchemy has been tested with all minor versions greater than 2.6 of cassandra-driver. All previous versions of Flask-CQLAlchemy are deprecated.
Read more at Official Source – https://flask-cqlalchemy.readthedocs.io/en/latest/
Prerequisites-
At this stage of the tutorial, I assume that you are familiar with Flask and Cassandra and have them installed and running on your local machine.
We will use pip to install Flask-CQLAlchemy, which provides the integration with cqlengine.
techfossguru@techfossguru:~$ pip install flask-cqlalchemy
Collecting flask-cqlalchemy
  Downloading https://files.pythonhosted.org/packages/3b/13/eb0af8b7284f8ad4b9829b180ee892b1612fb4263832bcfb9b52ded1a514/Flask_CQLAlchemy-1.2.0-py2.py3-none-any.whl
Requirement already satisfied: cassandra-driver>=2.6 in ./anaconda3/lib/python3.6/site-packages (from flask-cqlalchemy) (3.14.0)
Collecting blist (from flask-cqlalchemy)
  Downloading https://files.pythonhosted.org/packages/6b/a8/dca5224abe81ccf8db81f8a2ca3d63e7a5fa7a86adc198d4e268c67ce884/blist-1.3.6.tar.gz (122kB)
    100% |████████████████████████████████| 122kB 422kB/s
Requirement already satisfied: six>=1.9 in ./anaconda3/lib/python3.6/site-packages (from cassandra-driver>=2.6->flask-cqlalchemy) (1.11.0)
Building wheels for collected packages: blist
  Running setup.py bdist_wheel for blist ... done
  Stored in directory: /home/techfossguru/.cache/pip/wheels/d2/c2/9e/eab393d3d33d7ba0302e2b85bceca711e01a36055b9fbf2fda
Successfully built blist
Installing collected packages: blist, flask-cqlalchemy
Successfully installed blist-1.3.6 flask-cqlalchemy-1.2.0
techfossguru@techfossguru:~$
Starter App –
from flask import Flask
from flask_cqlalchemy import CQLAlchemy
import uuid

app = Flask(__name__)
app.config['CASSANDRA_HOSTS'] = ['127.0.0.1']
app.config['CASSANDRA_KEYSPACE'] = "techfossguru"
db = CQLAlchemy(app)


class Persons(db.Model):
    __keyspace__ = 'techfossguru'
    uid = db.columns.UUID(primary_key=True, default=uuid.uuid4)
    name = db.columns.Text(primary_key=True)
    addr = db.columns.Text()


@app.route('/')
def show_user():
    # first() returns None when no row exists yet
    person = Persons.objects().first()
    if person is None:
        return 'No persons found'
    return person.name


if __name__ == '__main__':
    app.run()
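Once you run the above code and open your app at 127.0.0.1:5000, it returns the name of the first Persons record, for example John Doe. This assumes the table already exists and holds such a row: Flask-CQLAlchemy provides db.sync_db() to create the table, and a call such as Persons.create(name='John Doe') adds a record.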