Cassandra Query Language

CQL vs SQL


When Apache Cassandra was originally released, it featured a command line interface for dealing with thrift. Manipulating data this way was cumbersome and required learning the details of the API. This introduced an additional hurdle for new users coming from a relational database background. Exposing this low level interface also meant that language-specific drivers often needed to make significant updates from version to version. Clearly, an abstraction was needed.

The Cassandra Query Language (CQL) was created to provide the necessary abstraction. CQL is purposefully similar to Structured Query Language (SQL) used in relational databases like MySQL and Postgres. This similarity lowers the barrier of entry for users familiar with relational databases. Many queries are very similar between the two. In fact, a lot of basic things are even exactly the same. The following is both valid CQL and SQL:

Valid CQL and SQL

USE myDatabase;
/* Creating Tables */
CREATE TABLE IF NOT EXISTS myTable (id INT PRIMARY KEY);
/* Altering Tables /*
ALTER TABLE myTable ADD myField INT;
/* Creating Indexes */
CREATE INDEX myIndex ON myTable (myField);
/* Inserting Data */
INSERT INTO myTable (id, myField) VALUES (1, 7);
/* Selecting Data */
SELECT * FROM myTable WHERE myField = 7;
/* Counting Data */
SELECT COUNT(*) FROM myTable;
/* Deleting Data */
DELETE FROM myTable WHERE myField = 7;

From this, its easy to think that the two languages work just the same. But there are quite a few differences once you get beyond the basics.

Differences

Cassandra is a non-relational database, and so uses different concepts to store and retrieve data. Simplistically, a Cassandra keyspace is a SQL database, and a Cassandra column family is a SQL table (CQL allows you to interchange the words “TABLE” and “COLUMNFAMILY” for convenience). This difference necessitates a different syntax for creating and manipulating data:

Creating databases:

/* Create a new keyspace in CQL */
CREATE KEYSPACE myDatabase WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': 1};/* Create a new database in SQL */
CREATE DATABASE myDatabase;

Cassandra’s keyspaces require more specifications than a standard relational database. Note that the example above is the simplest form. In a production environment, likely spread across several data-centers, a keyspace would be created with a different strategy and replication factor.

Organizing Data

None of the standard relational stuff is going to work in CQL. There is no support for things like JOIN, GROUP BY, or FOREIGN KEY. Leaving these features out is important because it makes writing and retrieving data from Cassandra much more efficient.

But sets of data tend to have relationships with one another. So without the relational tools of SQL, we need another strategy to represent these relationships. The problem is that reads in Cassandra tend to be more expensive than you might be used to when dealing with a relational database. But there is a filp side to this. And if you take only one thing away from this post, let it be this:

Writes are cheap. Write everything the way you want to read it.

In other words, any query you plan to do in the future should already be organized into a column family. You want to look up addresses that correspond to a person? Then make a column family that consists of people and addresses. You can still store them separately, but you should also store them together. Having a column family to represent every query makes reads much more efficient. You’re not just going to have to de-normalize, you’re going to have to forget about normalization all together.

Inserts vs Updates

The concept of cheap writes extends to updating data. This is because, unlike in SQL, a read is not performed during the update. The syntax is the same in CQL and SQL.

/* Updating data */
UPDATE
myTable SET myField = 2 WHERE id = 6;

However, if the row does not exist, it will still get created. Similarly as unintuitive, an INSERT statement will actually replace data if it exists. Because again, CQL does not perform a read while inserting. Without a read, there is no way to know if the data being inserted is replacing an existing record. This means that both inserts and updates are extremely fast.

Time to Live

CQL enables you to set a TTL on a row. Meaning that you can set a row to expire 24 hours from the time it gets created. This is accomplished with the USING TTL command (values are in seconds).

/* Expiring Data in 24 Hours */
INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400;

Twenty-four hours after that query is executed, the data will be deleted. Well, actually…

Deletions

Executing a DELETE in CQL does not actually delete data. But deletions are a topic of their own. DataStax has a great introduction to Cassandra deletions that I would definitely recommend reading.

Performance Issues

Scaling Cassandra can be daunting the first time around. But when you inevitably run into performance issues, there are a few things you can look at, in addition to the things mentioned above.

Tracing Queries

There is a command in CQL that allows you to trace queries across nodes and data centers, showing some useful debugging information.

/* Enable query traces */
TRACING ON;
/* Disable query traces */
TRACING OFF;

This will show the full network path the query takes, along with the latency at each step. This is helpful for verifying assertions about the replication factor or the connectivity between nodes and data centers. Note that tracing is expensive and should be used sparingly. You definitely do not want it on for all your production queries.

Filtering

If you attempt to run a SELECT across a large range of values, you might end up with an error message.

/* Select Data within a range */
SELECT * FROM myTable WHERE myField > 5000 AND myField < 100000;
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.

I think the error message here speaks for itself. If you’re allowing filtering on your CQL queries, think above whether or not you really need to do that.

Conclusion

The similarity between CQL and SQL is really a double edged sword. While you should enjoy the easy-to-learn syntax, take the time to understand the differences. There is a lot more to Cassandra than I could possibly cover here. If you’re writing production queries for the first time, I would recommend reading up on the query you’re writing, even if it is something as simple as a SELECT. The DataStax CQL documentation is particularly helpful in calling out caveats and providing helpful examples.

Further Reading