CQL Under the Hood
At this point, most users should be aware that CQL has replaced Thrift as the standard (and therefore recommended) interface for working with Cassandra. Yet it remains largely misunderstood, as its resemblance to common SQL has left both Thrift veterans and Cassandra newcomers confused about how it translates to the underlying storage layer. This fog must be lifted if you hope to create data models that scale, perform, and insure availability.
First, let’s define some terms:
- Thrift: a legacy RPC protocol combined with a code generation tool. Deprecated in favor of the native protocol.
- Native protocol: replacement for Thrift that only supports CQL.
- Storage rows: keys and columns as stored on disk.
- CQL rows: an abstraction layer on top of the storage rows.
It is important to understand that the CQL data representation does not always match the underlying storage structure. This can be challenging for those accustomed to Thrift-based operations, as those were performed directly against the storage layer. But CQL introduces an abstraction on top of the storage rows, and only maps directly in the simplest of schemas.
If you want to be successful at modeling and querying data in Cassandra, keep in mind that while CQL improves the learning curve, it is not SQL. You must understand what’s happening under the covers, or you will end up with data models that are poorly suited to Cassandra.
So let’s pull back the curtain and look at what our CQL statements translate to at the storage layer, starting with a simple table.
Single Primary Key
The first model we will examine is a straightforward table, which we’ll call books with a single primary key, title:
We can then insert some data, as follows:
And finally we can read our newly inserted rows:
What we’ve done so far looks a lot like ANSI SQL, and in fact these statements would have been valid when run against most modern relational systems. But we know that something very different is happening under the hood.
To see what this looks like at the storage layer, we can use the old command line interface, cassandra-cli, which allows us to interact directly with storage rows. This CLI is deprecated and will likely disappear in the 3.0 release, but for now we can use it to inspect the books table we created using CQL. Listing the contents of our table produces the following output:
RowKey: Without Remorse
As you can see, this is nearly a direct mapping to the CQL rows, except that we have an empty column at the beginning of each row (which is not a mistake; it is used internally by Cassandra).
Let’s point out a couple of important features of this data. First, remember that the row key is distributed randomly using a hash algorithm, so the results are returned in no particular order. By contrast, columns are stored in sorted order by name, using the natural ordering of the type. In this case, “author” comes before “year” lexicographically, so it appears first in the list. These are critical points, as they are central to effective data modeling.
Now let’s look at a slightly more complex example, one which uses a compound key. In this case, we’ll create a new table, authors, with a compound key using name, year, and title:
And this is what our data looks like after inserting two CQL rows:
This is where CQL can begin to cause confusion for those who are unfamiliar with what’s happening at the storage layer. To make sense of this, it’s important to understand the difference between partition keys and clustering columns.
When declaring a primary key, the first field in the list is always the partition key. This translates directly to the storage row key, which is randomly distributed in the cluster via the hash algorithm. In general, you must provide the partition key when issuing queries, so that Cassandra will know which nodes contain the requested data.
The remaining fields in the primary key declaration are called clustering columns, and these determine the ordering of the data on disk. They are not, however, part of the partition key, so they do not help determine the distribution of data in the cluster. But they play a key role in determining the kinds of queries you can run against your data, as we will see in the remainder of this section.
Now that you know the difference, it’s time to see what our authors table looks like in its storage layer representation (with the timestamp and marker columns omitted for clarity):
You will note that our two CQL rows translated to a single storage row, because both of our inserts used the same partition key. But perhaps more interesting is the location of our year and title column values. They are stored as parts of the column name, rather than column values!
Those who are experienced with Thrift-based data models will recognize this structure, which is referred to as composite columns. You can also observe that the rows are sorted first by year, then by title, which is the way we specified them in our primary key declaration. It is also possible to reverse the stored sort order by adding the WITH CLUSTERING ORDER BY clause, as follows:
Then, when selecting our rows, we can see that the ordering starts with the latest year and ends with the earliest:
While this may seem to be a trivial point, it can matter a great deal depending on the types of queries you intend to run on your data. We will examine these implications later in this post when we discuss queries.
Composite Partition Keys
In the previous examples we demonstrated the use of a single partition key with multiple clustering columns. But it’s also possible to create a multi-part (or “composite”) partition key. The most common reason for doing this is to improve data distribution characteristics. A prime example of this is the use of time buckets as keys when modeling time-series data. We will cover this in detail later on.
For now, let’s see what it looks like to create a composite partition key:
The difference, in case it’s not obvious, is the addition of parentheses around the name and year columns, which specifies that these two columns should form the composite partition key. This leaves title as the only remaining clustering column.
At the storage layer, this has the effect of moving the year from a component of the column name to a component of the row key, as follows:
Why This Matters
You may be wondering why it matters how the data is stored internally. In fact it matters a great deal, for several important reasons:
- Your queries must respect the underlying storage. Cassandra doesn’t allow ad hoc queries of the sort that you can perform using SQL on a relational system. If you don’t understand how the data is stored, at best you will be constantly frustrated by the error messages you receive when you try to query your data, and at worst you will suffer poor performance.
- You must choose your partition key carefully, because it must be known at query time, and must also distribute well across the cluster.
- Because of its log-structured storage, Cassandra handles range queries very well. A range query simply means that you select a range of columns for a given key, in the order they are stored.
- You have to carefully order your clustering columns, because the order affects the sort order of your data on disk and therefore determines the kinds of queries you can perform.
Proper data modeling in Cassandra requires you to structure your data in terms of your queries. This is backward compared to the approach taken in most relational models, where normalization is typically the objective. With Cassandra you must consider your queries first.
With these principles in mind, let’s examine what happens when you run different kinds of queries, so you can better understand how to structure your data.
In order to make sense of the various types of queries, we will start with a common data model to be used across the following examples. For this data model, we will return to the authors table, with name as the partition key, followed by year and title as clustering columns. We’ll also sort the year in descending order. This table can be created as follows:
Also, for purposes of these examples, we will assume a replication factor of three and consistency level of QUORUM.
Query by Key
We’ll start with a basic query by key:
For this simple select, the query makes the request to the coordinator node, which owns a replica for our key. The coordinator then retrieves the row from another replica node to satisfy the quorum. Thus, we need a total of two nodes to satisfy the query:
At the storage layer, this query first locates the partition key, then scans all the columns in order, as follows:
So even though this appears to be a simple query by key, at the storage layer it actually translates to a range query!
If this basic query results in a range query, let’s see what happens when we specifically request a range, like this:
In this case we’re still selecting a single partition, so the query must only check with two nodes as in the previous example. The difference is that in this case, Cassandra simply scans the columns until it finds one that fails the query predicate:
Once it finds the year 1991, Cassandra knows there are no more records to scan. Therefore, this query is efficient because it must only read the required number of columns, plus one.
To recap, there are three key points you should take from this discussion:
- Sequential queries are fast, because they take advantage of Cassandra’s natural sort order at the storage layer.
- Queries by key and combination of key plus clustering column are sequential at the storage layer, which of course means they are optimal.
- Write your data the way you intend to read it. Or, put another way, model your data in terms of your queries, not the other way around. Following this rule will help you avoid the most common data modeling pitfalls that plague those who are transitioning from a relational database.
Now that we’ve covered the basics of how to build data models that make optimal use of the storage layer, let’s look at one of Cassandra’s newer features: collections.
The introduction of collections to CQL addresses some of the concerns that frequently arose regarding Cassandra’s primitive data model. They add richer capabilities that give developers more flexibility when modeling certain types of data.
Cassandra supports three collection types: sets, lists, and maps. In this section we will examine each of these and take a look at how they’re stored under the hood. But first, it’s important to understand some basic rules regarding collections:
- Each item in a collection must not be more than 64 KB
- A maximum of 64,000 items may be stored in a single collection
- Querying a collection always returns the entire collection
- Collections are best used for relatively small, bounded data sets
With those rules in mind, we can examine each type of collection in detail, starting with sets.
A set in CQL is very similar to a set in your favorite programming language. It is a unique collection of items, meaning it does not allow for duplicates. In most languages sets have no specific ordering; Cassandra, however, stores them in their natural sort order, as you might expect.
Here is an example of a table of authors which contains a set of books:
We can then insert some values as follows:
Cassandra also supports removing items from a set using the UPDATE statement:
At the storage layer, set values are stored as column names, with the values left blank. This guarantees uniqueness, as any attempt to rewrite the same item would simply result in overwriting the old column name. The storage representation of the books set would look like this:
Set Name Item
You can see that the name of the set is stored as the first component of the composite column name, with the item as the second component. Unfortunately Cassandra does not support a contains operation, so you must retrieve the entire set and perform this on the client. But sets can be quite useful as a container for unique items in a variety of data models.
At the CQL level, lists look very similar to sets. In the following table, we substitute the set of books from the previous example for a list:
Insertion is also similar to the set syntax, except that the curly braces are traded for brackets:
And since lists are ordered, CQL supports prepend and append operations, which involve simply placing the item as either the first (prepend) or second (append) operands, as follows:
To delete an item, you can refer to it by name:
Unlike the set, the list structure at the storage layer places the list item in the column value, and the column name instead contains a UUID for ordering purposes. Here’s what it looks like:
Lastly, maps are a highly useful structure, as they can offer similar flexibility to the old dynamic column names many grew accustomed to in the Thrift days, as long as the total number of columns is kept to a reasonable number.
For example, we can use a map to store not only the book title, but the year as well. Here is what that would look like:
To insert or update an entire map, use the following syntax:
You can also insert or update a single key using array-like syntax, as follows:
Specific values can be also be removed by using a DELETE statement:
At the storage layer, maps look very similar to lists, except the ordering ID is replaced by the map key:
As you can see, all these collection types make use of composite columns, in the same manner as clustering columns.
And now for one of the most common query errors, the IN clause, where we ask for multiple partition keys in a single query. Let’s recall the authors schema we introduced earlier:
Using this schema, let’s say we want to retrieve a number of books from a list of known authors. Obviously we could write a separate query for each author, but Cassandra also provides a familiar SQL-style syntax for specifying multiple partition keys using the IN clause:
The question is how will Cassandra fulfill this request? The system will hash the partition key—name in this case—and assign replicas to nodes based on the hash. Using the three authors in our query as examples, we will end up with a distribution resembling the following:
The important characteristic to note in this distribution is that the keys are dispersed throughout the cluster. If we also remember that a QUORUM read requires consulting with at least two out of three replicas, it is easy to see how this query will result in consulting many nodes. In the following diagram, our client makes a request to one of the nodes, which will act as coordinator. The coordinator must then make requests to at least two replicas for each key in the query:
The end result is that we required five out of six nodes to fulfill this query! If any one of these calls fails, the entire query will fail. It is easy to see how a query with many keys could require participation from every node in the cluster.
When using the IN clause, it’s best to keep the number of keys small. There are valid use cases for this clause, such as querying across time buckets for time-series models, but in such cases you should try to size your buckets such that you only need at most two in order to fulfill the request.
In fact, it is often advisable to issue multiple queries in parallel as opposed to utilizing the IN clause. While the IN clause may save you from multiple network requests to Cassandra, the coordinator must do more work. You can often reduce overall latency and workload with several token-aware queries, as you’ll be talking directly to the nodes that contain the data.
If range queries can be considered optimal for Cassandra’s storage engine, queries based on a secondary index fall at the other end of the spectrum. Secondary indices have been part of Cassandra since the 0.7 release, and they are certainly an alluring feature. In fact, for those who are accustomed to modeling data in relational databases, creating an index is often a go-to strategy to achieve better query performance. However, as with most aspects of the transition to Cassandra, this strategy translates poorly.
To start, let’s get familiar with what secondary indices are and how they work. First off, secondary indices are the only type of index that Cassandra will manage for you, so the terms “index” and “secondary index” actually refer to the same mechanism. The purpose of an index is to allow query-by-value functionality, which is not supported naturally. This should be a clue as to the potential danger involved in relying on the index functionality.
As an example, suppose we want to be able to query authors for a given publisher. Using our earlier authors table, remember that the publisher column has no special properties. It is a simple text column, meaning that by default we cannot filter based on its value. We can take a look at what happens when attempting to do so, as in the following query:
Running this query results in the following error message, indicating that we’re trying to query by the value of a non-indexed column:
The obvious remedy is to simply create an index on publisher, as follows:
Now we can filter on publisher, so our problems are solved, right? Not exactly! Let’s look closely at what Cassandra does to make this work.
Secondary Indices Under the Hood
At the storage layer, a secondary index is simply another column family, where the key is the value of the indexed column, and the columns contain the row keys of the indexed table. This can be a bit confusing to describe, so let’s visualize it.
Imagine our authors table contains the following CQL rows:
An index on publisher would then look like this at the storage layer:
So a query filtering on publisher will use the index to each author name, then query all the authors by key. This is similar to using the IN clause, since we must query replicas for every key with an entry in the index.
But it’s actually even worse than the IN clause, because of a very important difference between indices and standard tables. Cassandra co-locates index entries with their associated original table keys. In other words, you will end up with a key for “Random House” in author_publishers on every node that has keys for “Anne Rice” or “Charles Dickens” in authors.
To make this a bit clearer, the following diagram shows how our co-located authors table and author_publisher index might be distributed across a four-node cluster:
The objective in using this approach is to be able to determine which nodes own indexed keys, as well as to obtain the keys themselves in a single request. But the problem is we have no idea which token ranges contain indexed keys until we ask each range. So now we end up with a query pattern like this:
Obviously the use of secondary indices has an enormous impact on both performance and availability, since so many nodes must participate in fulfilling the query. For this reason it’s best to avoid using them in favor of writing your own indices or choosing another data model entirely.
If you decide to use a secondary index for a use case where performance and availability are not critical, make sure to only index on low cardinality values, as high cardinality indices do not scale well. But don’t go so low that your index is rendered useless. For example, booleans are bad, as are UUIDs, but birth year could be a reasonable column to index.
Deleting Immutable Data
We have established that Cassandra employs a log-structured storage engine, where all writes are immutable appends to the log. The implication is that data cannot actually be deleted at the time a DELETE statement is issued. Cassandra solves this by writing a marker, called a tombstone, with a timestamp greater than the previous value. This has the effect of overwriting the previous value with an empty one, which will then be compiled in subsequent queries for that column in the same manner as any other update. The actual value of the tombstone is set to the time of deletion, which is used to determine when the tombstone can be removed.
Of course you can explicitly delete a column using the DELETE statement, but you may be surprised that a tombstone will be generated for every affected storage layer column. To make this clear, let’s remind ourselves about the structure of a single CQL row as represented at the storage layer:
To this point we have been using a simplified version of the storage row representation. In fact there is a third column used as an internal marker, which has been omitted for clarity. So then, let’s remove the “Patriot Games” entry, as follows:
Using cqlsh with tracing turned on, we then attempt to read the newly deleted record:
If you carefully examine the resulting trace, you will notice a line resembling the following:
So what happened? Our query that returned zero records actually had to read three tombstones to produce the results! The important point to remember is that tombstones cover single storage layer columns, so deleting a CQL row with many columns results in many tombstones.
The Problem with Tombstones
You may be wondering why we’re so concerned about tombstones. The last example should provide a hint as to the reason. When a query requires reading tombstones, Cassandra must perform many additional reads to return your results.
In addition, a query for a key in an sstable that has only tombstones associated with it will still pass through the bloom filter, because the system must reconcile tombstones with other replicas. Since the bloom filter is designed to prevent unnecessary reads for missing data, this means Cassandra will perform extra reads after data has been deleted.
Now that you understand the basics of deletes and the problems associated with them, it’s important to point out the other ways that deletes can be generated—sometimes in ways you would not expect.
Cassandra offers us a handy feature for purging old data through setting an expiration time, called a TTL, at the column level. There are many valid reasons to set TTL values, and they can help to avoid unbounded data accumulation over time. Setting a TTL on a column is straightforward, and can be accomplished using either an INSERT or UPDATEstatement as follows (note that TTL values are in seconds):
This can be useful when dealing with ephemeral data, but you must take care when employing this strategy, because an expired column results in a tombstone as in any other form of delete.
How NOT to use TTLs
A common reason to expire columns is in the case of time-series data. Imagine we want to display a feed of comments associated with a news article, where the newest post appears on top. To avoid holding onto them indefinitely, we set them to expire after a few hours.
So we end up with a model that resembles the following:
It’s important to note that this model is perfectly acceptable so far. Where we can run into problems is when we naively attempt to query for the latest values. It can be tempting to assume that we can simply query everything for a given articleID, with the expectation that old columns will simply disappear. In other words, we perform a query like this:
In some ways this expectation is correct. Old values will disappear from the result set, and for a period of time this query will perform perfectly well. But gradually we will accumulate tombstones as columns reach their expiration time, and this query requires that we read all columns in the storage row. Eventually, we will reach a point where Cassandra will be reading more tombstones than real values!
The solution is simple. We must add a range filter on timestamp, which will tell Cassandra to stop scanning columns at approximately as far back in time as the tombstones will start. In this case, we don’t want to read any columns older than three hours, so our new query looks like this:
Note that you will have to calculate the timestamp in your application, as CQL does not currently support arithmetic operations.
To sum up, expiring columns can be highly useful as long as you do so wisely. Make sure your usage pattern avoids reading excessive numbers of tombstones. Often you can use range filters to accomplish this goal.
When Null Does Not Mean Empty
There is an even subtler (and more insidious) way to inadvertently create tombstones: by inserting null values. Let’s take a look at how we might cause this situation unwittingly.
We know that Cassandra stores columns sparsely, meaning that unspecified values simply aren’t written. So it would seem logical that setting a column to null would result in a missing column. In fact, writing a null is the same thing as explicitly deleting a column, and therefore a tombstone is written for that column!
There is a simple reason why this is the case. While Cassandra supports separate INSERT and UPDATE statements, all writes are fundamentally the same under the covers. And because all writes are simply append operations, there is no way for the system to know whether a previous value exists for the column. Therefore Cassandra must actually write a tombstone in order to guarantee any old values are deleted.
While it may seem as though this would be easy to avoid—by just not writing null values—it is fairly easy to mistakenly allow this to happen when using prepared statements. Imagine a data model that includes many sparsely populated columns. It is tempting to create a single prepared statement with all potential columns, then set the unused columns tonull. It is also possible that callers of an insert method might pass in null values. If this condition is not checked, it is easy to see how tombstones could be accumulated without realizing this is happening.
So to wrap things up, remember two things:
- Thrift is dead. Switch to CQL as soon as possible.
- But make sure you understand your data models and queries. It’s not SQL.
Robbie Strickland, Director of Software Development at The Weather Channel
Robbie works for The Weather Channel’s digital division, as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and contributed in a variety of ways over the years; this includes work on drivers for Scala and C#, Hadoop integration, leading the Atlanta Cassandra Users Group, and answering lots of questions on StackOverflow. This post is adapted from Robbie Strickland’s book, Cassandra High Availability, available for purchase here.