In my previous blogpost “Analysis of the impact of new use cases on a Cassandra data model and operations in code” I looked at the risks of deploying Cassandra. In Cassandra we face the challenges of data modelling and designing the CRUD operations (create, read, update, delete), because in Cassandra it is best practice to denormalize the data models. With static columns (available from Cassandra 2.0.6) we are able to restrain some of these challenges for one-to-many relationships.

To explain the approach of using static columns we first need to understand how the models for one-to-many relationships look like in Cassandra. With that explanation I also show the operations on the data models. Next, I will explain what the static columns are exactly. I will also elaborate on the operations on these models and mention some technical limitation of static columns. Next we will take a deeper dive into the read operations for the one side of the one-to-many relation. To conclude I will summarize my findings.

In Cassandra it is best practice to denormalize the data models. We denormalize because the goal in Cassandra is to execute the read operations of the use case with one query or, more preferably, one disk I/O. To illustrate the use case, the data model and the operations I will use an example about teams and team members of Formula 1 racing teams.

In the example we have two entities; teams and team members. A single team can have multiple members and a given team member is member of only a single team. The first use case is that we should be able to create, read, update and delete teams. The second use case is that we should be able to create, read, update, and delete team members of a given team. Obviously we should also be able to read all members for a given team and read the team that a team member is member of.

Assume that for a team we wish to register: the name, the manager and the location. For a team member we wish to register: the name, the nationality and the position within the team. In Cassandra we would store all these fields in one table. In that table the team name will be the partition key and the member name will be the clustering key. The table is created with the following CQL statement.

CREATE TABLE teammember_by_team (
  teamname text,
  manager text,
  location text,
  membername text,
  nationality text,
  position text,
  PRIMARY KEY ((teamname), membername)
);

Insert operation

We can insert a row with the following CQL statement:

INSERT INTO teammember_by_team (teamname, manager, location, membername, nationality, position)
VALUES (‘Toro Rosso’, ‘Franz Tost’, ‘Faenza’, ‘Verstappen’, ‘Dutch’, ‘test driver’);

This statement results in a new row in the table. When we wish to add another team member to the team we need to add values for the team’s fields, again.

Update operation

Updating a team member is relatively easy. For example with the following CQL statement:

UPDATE teammember_by_team SET position = ‘driver’
WHERE teamname = ‘Toro Rosso’ AND membername = ‘Verstappen’;

But when we wish to update a team we need to execute the update for all team members. This is one of the risks and challenges of this model. This update operation may cost many queries, for example the following queries which result in the following table with data.

UPDATE teammember_by_team SET manager = ‘Christian Horner’
WHERE teamname = ‘Red Bull’ AND membername = ‘Ricciardo’;
UPDATE teammember_by_team SET manager = ‘Christian Horner’
WHERE teamname = ‘Red Bull’ AND membername = ‘Kvyat’;
teamname   | membername | location      | manager          | nationality | position
-----------+------------+---------------+------------------+-------------+----------
  Red Bull |  Ricciardo | Milton Keynes | Christian Horner |  Australian |   driver
  Red Bull |      Kvyat | Milton Keynes | Christian Horner |     Russian |   driver
Toro Rosso | Verstappen |        Faenza |       Franz Tost |       Dutch |   driver

A static column is defined upon creation of the table in CQL using the keyword “static”. A normal column is stored within the clustering key, within the partition. This column has a value for every combination of partitioning key and clustering key (in our example team name and team member name). A static column resides one level higher, directly within the partition key (in our example team name).

All fields belonging to the one side of the one-to-many relation may be static columns. In our example this means that the team manager and team location can be static columns. Review the following CQL statement:

CREATE TABLE teammember_by_team (
  teamname text,
  manager text static,
  location text static,
  membername text,
  nationality text,
  position text,
  PRIMARY KEY ((teamname), membername)
);

Because the static columns are not stored within the clustering key, but are still associated to the partitioning key, we are able to insert team data without inserting team member data. When we read this table, we get back the team with all other field having “null”.

INSERT INTO teammember_by_team (teamname, manager, location)
VALUES (‘Red Bull’, ‘Christian Horner’, ‘<unknown>’);
teamname  | membername | location | manager          | nationality | position
----------+------------+----------+------------------+-------------+----------
 Red Bull |       null | <unkown> | Christian Horner |        null |     null

Inserting team members is much easier using static columns, because team data no longer has to be duplicated.

INSERT INTO teammember_by_team (teamname, membername, nationality, position)
VALUES (‘Red Bull’, ‘Ricciardo’, ‘Australian’, ‘driver’);
INSERT INTO teammember_by_team (teamname, membername, nationality, position)
VALUES (‘Red Bull’, ‘Kvyat’, ‘Russian’, ‘driver’);

When we read the table after the preceding statements, the result is as expected.

teamname  | membername | location  | manager          | nationality | position
----------+------------+-----------+------------------+-------------+----------
 Red Bull |  Ricciardo | <unknown> | Christian Horner |  Australian |   driver
 Red Bull |      Kvyat | <unknown> | Christian Horner |     Russian |   driver

When we now wish to update a team we are able to do so with just one query.

UPDATE teammember_by_team SET location = ‘Milton Keynes’
WHERE teamname = ‘Red Bull’;
teamname  | membername | location      | manager          | nationality | position
----------+------------+---------------+------------------+-------------+----------
 Red Bull |  Ricciardo | Milton Keynes | Christian Horner |  Australian |   driver
 Red Bull |      Kvyat | Milton Keynes | Christian Horner |     Russian |   driver

Technical limitations

There are however a few limitations to using static columns. A table can only contain static columns when the table has at least one clustering key. This actually makes sense because if the table has no clustering key then all fields are stored within the partition. Next to this limitation, static columns cannot be used on tables using the COMPACT STORAGE option.

When we wish to read fields from only the one side of the one-to-many relation this requires a different approach compared to reading fields on a table without static columns. Review the following query:

SELECT teamname, manager, location
FROM teammember_by_team
WHERE teamname = ‘Red Bull’;

Without static columns these fields may be inconsistent and we will get back two rows. With static columns these values are always consistent but still we get back two rows. We get back two rows because there are two team members. If we wish to get back one row per team then all we have to do is add the DISTINCT keyword:

SELECT DISTINCT teamname, manager, location
FROM teammember_by_team
WHERE teamname = ‘Red Bull’;

Cassandra understands that the distinct keyword requires a special treatment because all fields queried are static (besides the partition key). The data can be returned faster compared to not using static columns because in the latter case the values actually have to be compared.

Static columns provide benefits for one-to-many relations. These benefits constrain a couple of great risks as described in the previous blogpost. The insert operations become easier and more flexible at the one side on the one-to-many relation. The update performed on the entity side of the one-to-many relation is reduced from many queries to just one query. Also select queries at the single entity side of the one-to-many relation are improved.

References

[1] Analysis of the impact of new use cases on a Cassandra data model and operations in code

[2] http://www.datastax.com/dev/blog/cql-in-2-0-6

[3] https://issues.apache.org/jira/browse/CASSANDRA-6561

[4] http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html

[5] http://www.datastax.com/documentation/cql/3.1/cql/cql_using/use-batch-static.html

[6] http://cassandra.apache.org/doc/cql3/CQL.html#createTablepartitionClustering

[7] https://books.google.nl/books?id=n1nTBgAAQBAJ