Introduction
I have a database server that has these features:
- Highly available by design.
- Can be globally distributed.
- Allows applications to write to any node anywhere, anytime.
- Linearly scalable by simply adding more nodes to the cluster.
- Automatic workload and data balancing.
- A query language that looks a lot like SQL.
With the list of features above, why don’t we all use Cassandra for all our database needs? This is the hype I hear at conferences and from some commercial entities pushing their version of Cassandra. Unfortunately, some people believe it, especially now that many users of proprietary database technologies like Oracle and SQL Server are looking to get out of massive license fees. The (apparent) low cost of open source, in combination with the list of features above, makes Cassandra very attractive to many corporate CTOs and CFOs. What they overlook is that core features they assume every database has are missing from Cassandra.
I am a database architect and consultant. I have been working with Cassandra since version 0.7 came out in 2010.
I like Cassandra and often promote it to my customers, for the right use cases.
Unfortunately, I often find myself being asked to help after the choice has already been made and it has turned out to be a poor use case for Cassandra, or after the team has made some poor data modeling choices.
In this blog post I am going to discuss some of the pitfalls to avoid, suggest a few good use cases for Cassandra and offer just a bit of data modeling advice.
Where Cassandra users go wrong
Cassandra projects tend to fail as a result of one or more of these reasons:
- The wrong Cassandra features were used.
- The use case was totally wrong for Cassandra.
- The data modeling was not done properly.
Wrong Features
To be honest, it doesn’t help that Cassandra has a bunch of features that probably shouldn’t be there. These features lead one to believe you can do some of the things everyone expects a relational database to do:
- Secondary indexes: They have their uses but not as an alternative access path into a table.
- Counters: They work most of the time, but they are very expensive and should not be used very often.
- Lightweight transactions: They are neither transactions nor lightweight.
- Batches: Sending a bunch of operations to the server at one time is usually good because it saves network time, right? Well, in the case of Cassandra, not so much.
- Materialized views: I got taken in on this one. It looked like it made so much sense. Because of course it does. But then you look at how it has to work, and you go…Oh no!
- CQL: Looks like SQL which confuses people into thinking it is SQL.
Using any of the above features the way you would expect them to work in a traditional database is certain to result in serious performance problems and in some cases a broken database.
Get your data model right
Another major mistake developers make in building a Cassandra database is making a poor choice for partition keys.
Cassandra is distributed. This means you need to have a way to distribute the data across multiple nodes. Cassandra does this by hashing a part of every table’s primary key, called the partition key, and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider the following rules when choosing your partition keys:
- There should be enough partition key values to spread the data for each table evenly across all the nodes in the cluster.
- Keep data you want to retrieve in a single read within a single partition.
- Don’t let partitions get too big. Cassandra can handle large partitions (over 100 megabytes), but it’s not very efficient. Besides, if you are getting partitions that large, it’s unlikely your data distribution will be even.
- Ideally all partitions would be roughly the same size. It almost never happens.
Typical real-world partition keys are user id, device id, account number, etc. To manage partition size, a time modifier such as year, or year and month, is often added to the partition key.
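To make the time modifier idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function name partition_key, the device ids and the "YYYY-MM" bucket format are hypothetical, and the table definition in the comment is one possible CQL layout, not a recommendation for any particular workload.

```python
from datetime import datetime, timezone

# Hypothetical CQL table this sketch has in mind (names are illustrative):
#   CREATE TABLE readings (
#       device_id text, month text, ts timestamp, value double,
#       PRIMARY KEY ((device_id, month), ts)
#   );


def partition_key(device_id, ts):
    """Build a compound partition key: device id plus a year-month bucket."""
    return (device_id, ts.strftime("%Y-%m"))


# Sample readings: (device id, time of the reading).
readings = [
    ("dev-1", datetime(2023, 1, 15, tzinfo=timezone.utc)),
    ("dev-1", datetime(2023, 1, 31, tzinfo=timezone.utc)),
    ("dev-1", datetime(2023, 2, 1, tzinfo=timezone.utc)),
    ("dev-2", datetime(2023, 1, 15, tzinfo=timezone.utc)),
]

# Group the readings by partition key to see how many partitions result.
partitions = {}
for device_id, ts in readings:
    partitions.setdefault(partition_key(device_id, ts), []).append(ts)

for key, rows in sorted(partitions.items()):
    print(key, len(rows))
```

Without the month in the key, every reading for dev-1 would land in one partition that grows forever. With it, each device gets a new partition every month, and a query for one device and one month still touches a single partition.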
If you get this wrong, you will suffer greatly. I should probably point out that this is true in one way or another of all distributed databases. The key word here is distributed.
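The effect of partition key choice on distribution can be sketched with a toy token ring in Python. This is a simplified model, not Cassandra's implementation: MD5 from hashlib stands in for Cassandra's real partitioner, each node gets a single evenly spaced token, and all names (token, build_ring, owner, the node names and keys) are hypothetical.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2 ** 64  # size of the toy token space (an assumption)


def token(partition_key):
    """Hash a partition key to a token. MD5 is a stand-in partitioner."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes):
    """Give each node one evenly spaced token; returns (tokens, owners)."""
    step = RING_SIZE // len(nodes)
    return [i * step for i in range(len(nodes))], list(nodes)


def owner(ring, partition_key):
    """Return the node holding the first token >= the key's token, wrapping."""
    tokens, owners = ring
    idx = bisect.bisect_left(tokens, token(partition_key)) % len(tokens)
    return owners[idx]


ring = build_ring(["node-a", "node-b", "node-c", "node-d"])

# High-cardinality key: 10,000 distinct user ids.
by_user = Counter(owner(ring, f"user-{i}") for i in range(10_000))

# Low-cardinality key: 10,000 rows, but only two distinct key values.
by_region = Counter(owner(ring, region) for region in ["east", "west"] * 5_000)

print("by user id:", dict(by_user))
print("by region: ", dict(by_region))
```

With user ids as the key, the rows spread over all four nodes. With only two distinct key values, the same number of rows can never occupy more than two nodes, no matter how many nodes you add.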
Wrong Use Cases for Cassandra
If you have a database where you depend on any of the following things, Cassandra is wrong for your use case. Please don’t even consider Cassandra. You will be unhappy.
- Tables have multiple access paths. Example: lots of secondary indexes.
- The application depends on identifying rows with sequential values. Examples: MySQL autoincrement or Oracle sequences.
- ACID: Cassandra does not do ACID (LSD, sulphuric or any other kind). If you think you need it, go elsewhere. Many times people think they need it when they don’t.
- Aggregates: Cassandra has only minimal support for aggregates. If you need to do a lot of them, think about another database.
- Joins: You may be able to data model yourself out of this one, but take care.
- Locks: Honestly, Cassandra does not support locking. There is a good reason for this. Don’t try to implement them yourself. I have seen the end result of people trying to do locks using Cassandra and the results were not pretty.
- Updates: Cassandra is very good at writes, okay with reads. Updates and deletes are implemented as special cases of writes and that has consequences that are not immediately obvious.
- Transactions: CQL has no begin/commit transaction syntax. If you think you need it then Cassandra is a poor choice for you. Don’t try to simulate it. The results won’t be pretty.
If you are thinking about using Cassandra with any of the above requirements, you likely don’t have an appropriate use case. Please think about using another database technology that might better meet your needs.
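The point about updates and deletes being special cases of writes is easier to see in a small model. The Python sketch below is a deliberately simplified, hypothetical stand-in (the class AppendOnlyTable and its methods are my own names, not a Cassandra API): every change is appended with a timestamp, a delete appends a marker usually called a tombstone, and a read picks the newest version.

```python
TOMBSTONE = object()  # marker written in place of a value on delete


class AppendOnlyTable:
    """Toy store where updates and deletes are just more writes."""

    def __init__(self):
        self.log = []  # list of (key, timestamp, value)

    def write(self, key, value, ts):
        self.log.append((key, ts, value))

    def delete(self, key, ts):
        self.log.append((key, ts, TOMBSTONE))

    def read(self, key):
        """Scan every stored version of the key; the newest timestamp wins."""
        versions = [(ts, value) for k, ts, value in self.log if k == key]
        if not versions:
            return None
        _, value = max(versions, key=lambda v: v[0])
        return None if value is TOMBSTONE else value

    def compact(self):
        """Keep only the newest version per key and drop tombstones.

        Simplification: this sketch drops tombstones immediately, while
        Cassandra keeps them for a configurable grace period first.
        """
        latest = {}
        for key, ts, value in self.log:
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)
        self.log = [(k, ts, v) for k, (ts, v) in latest.items()
                    if v is not TOMBSTONE]


table = AppendOnlyTable()
table.write("order-1", "placed", ts=1)
table.write("order-1", "shipped", ts=2)  # an "update" is just another write
table.delete("order-1", ts=3)            # a "delete" writes a tombstone
table.write("order-2", "b", ts=5)
table.write("order-2", "a", ts=4)        # arrives late with an older timestamp

print(table.read("order-1"))  # None: the tombstone is the newest version
print(table.read("order-2"))  # b: newest timestamp wins, not arrival order
print(len(table.log))         # 5: every version is still stored
table.compact()
print(len(table.log))         # 1: only the live row survives compaction
```

The non-obvious consequence is that a heavily updated or deleted table makes reads do more work, because old versions and tombstones stay on disk and must be skipped until compaction removes them.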
When you should think about using Cassandra
Every database server ever designed was built to meet specific design criteria. Those design criteria define the use cases where the database will fit well and the use cases where it will not.
Cassandra’s design criteria are the following:
- Distributed: Runs on more than one server node.
- Scale linearly: By adding nodes, not more hardware on existing nodes.
- Work globally: A cluster may be geographically distributed.
- Favor writes over reads: Writes are an order of magnitude faster than reads.
- Democratic peer to peer architecture: No master/slave.
- Favor partition tolerance and availability over consistency: Eventually consistent (see the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem).
- Support fast targeted reads by primary key: The focus is on primary key reads; alternative access paths are very sub-optimal.
- Support data with a defined lifetime: Data in a Cassandra database can be given a defined lifetime. There is no need to delete it; after the lifetime expires, the data goes away.
There is nothing in the list about ACID, support for relational operations or aggregates. At this point you might well say, “what is it going to be good for?” ACID, relational operations and aggregates are critical to the use of all databases. No ACID means no atomicity, and without atomic operations, how do you make sure anything ever happens correctly, meaning consistently? The answer is you don’t. If you were thinking of using Cassandra to keep track of account balances at a bank, you probably should look at alternatives.
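One of the design criteria above, data with a defined lifetime, can be illustrated with a small Python model. It is only a sketch of the idea: the class TtlStore and its methods are hypothetical names, time is passed in explicitly so the example is deterministic, and it says nothing about how Cassandra physically removes expired data. In CQL, the lifetime is typically set per write with a USING TTL clause.

```python
class TtlStore:
    """Toy key-value store in which every row carries an expiry time."""

    def __init__(self):
        self._rows = {}  # key -> (value, expires_at)

    def put(self, key, value, now, ttl_seconds):
        """Store a value that should disappear ttl_seconds after now."""
        self._rows[key] = (value, now + ttl_seconds)

    def get(self, key, now):
        """Return the value if it has not expired yet, otherwise None."""
        row = self._rows.get(key)
        if row is None:
            return None
        value, expires_at = row
        return value if now < expires_at else None


store = TtlStore()
store.put("session-42", "logged-in", now=0, ttl_seconds=3600)

print(store.get("session-42", now=1800))  # logged-in: still within lifetime
print(store.get("session-42", now=3600))  # None: lifetime has expired
```

The application never issues a delete. It only decides, at write time, how long the data is worth keeping.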
Ideal Cassandra Use Cases
It turns out that Cassandra is really very good for some applications.
The ideal Cassandra application has the following characteristics:
- Writes exceed reads by a large margin.
- Data is rarely updated and when updates are made they are idempotent.
- Read access is by a known primary key.
- Data can be partitioned via a key that allows the database to be spread evenly across multiple nodes.
- There is no need for joins or aggregates.
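Why idempotent updates matter is worth a short illustration. In a distributed system, a write that times out may or may not have been applied, so clients usually retry it. The Python sketch below is a hypothetical model of that situation (the function names and the dictionary standing in for a row are my own): the same operation is applied once and then three times, as a retrying client might do.

```python
def apply_with_retry(operation, state, attempts):
    """Apply the same operation several times, as a retrying client would."""
    for _ in range(attempts):
        operation(state)
    return state


def set_status(state):
    """Idempotent: applying it again leaves the same result."""
    state["status"] = "shipped"


def increment_views(state):
    """Not idempotent: every extra application changes the result."""
    state["views"] = state.get("views", 0) + 1


print(apply_with_retry(set_status, {}, attempts=1))       # {'status': 'shipped'}
print(apply_with_retry(set_status, {}, attempts=3))       # {'status': 'shipped'}
print(apply_with_retry(increment_views, {}, attempts=1))  # {'views': 1}
print(apply_with_retry(increment_views, {}, attempts=3))  # {'views': 3}
```

Setting a column to a value is safe to retry. Anything that depends on the previous value, such as a counter increment, is not, which is one reason such updates are a poor fit.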
Some of my favorite examples of good use cases for Cassandra are:
- Transaction logging: Purchases, test scores, movies watched and the latest position in a movie.
- Storing time series data (as long as you do your own aggregates).
- Tracking pretty much anything including order status, packages etc.
- Storing health tracker data.
- Weather service history.
- Internet of Things status and event history.
- Telematics: IoT for cars and trucks.
- Email envelopes—not the contents.
Conclusion
Frequently, executives and developers look at the feature set of a technology without understanding the underlying design criteria and the methods used to implement those features. When dealing with distributed databases, it’s also very important to recognize how the data and workload will be distributed. Without understanding the design criteria, implementation and distribution plan, any attempt to use a distributed database like Cassandra is going to fail. Usually in a spectacular fashion.