Over the last few years, we’ve spent a lot of time reviewing clusters, answering questions on the Cassandra User mailing list, and helping folks out on IRC. During this time, we’ve seen the same issues come up again and again. While we regularly go into detail on individual topics, and encourage folks to read up on the fine print later, sometimes teams just need a starting point to help them get going. This is our basic tuning checklist for those teams who just want to get a cluster up and running and avoid some early potholes.
Adjust The Number of Tokens Per Node
When the initial work on vnodes in Cassandra was started on the mailing list, 256 tokens was chosen in order to ensure even distribution. This carried over into CASSANDRA-4119 and became part of the implementation. Unfortunately, there were several unforeseen (and unpredictable) consequences of choosing a number this high. Without spending a lot of time on the subject, 256 can cause issues with bootstrapping new nodes (lots of SSTables), repair takes longer, and CPU usage is overall higher. Since Cassandra 3.0, we’ve had the ability to allocate tokens in a more predictable manner that avoids hotspots and leverages the advantage of vnodes. We thus recommend using a value of 4 here, which gives a significant improvement in token allocation. In order to use this please ensure you’re using the
allocate_tokens_for_keyspace setting in cassandra.yaml file. It’s important to do this up front as there’s not an easy way to change this without setting up a new data center and doing migration.
Note: Using 4 tokens won’t allow you to add a single node to a cluster of 100 nodes and have each node get an additional 1% capacity. In practice there’s not much benefit from doing that, as you would always want to expand a cluster by roughly 25% in order to have a noticeable improvement, and 4 tokens allows for that.
Configure Racks, Snitch, and Replication
Cassandra has the ability to place data around the ring in a manner that ensures we can survive losing a rack, an availability zone, or entire data center. In order to do this correctly, it needs to know where each node is placed, so that it can place copies of data in a fault-tolerant manner according to the replication strategy. Production workloads should ALWAYS use the
NetworkTopologyStrategy, which takes the racks and data centers into account when writing data. We recommend using
GossipingPropertyFileSnitch as a default. If you’re planning on staying within a cloud provider it’s probably easier to use the dedicated snitch, such as the
EC2Snitch, as they figure out their rack and data center automatically. These should all be set up before using the cluster for production as changing it is extremely difficult.
Set up Internode Encryption
We discuss how to do this in our guide on setting up internode encryption. I won’t go deep into the details here since Nate did a good job of that in his post. This falls in the “do it up front or you’ll probably never do it” list with #1 and #2.
Set up client authentication.
Using authentication for your database is a good standard practice, and pretty easy to set up initially. We recommend disabling the Cassandra user altogether once auth is set up, and increasing the replication factor (RF) of the
system_auth keyspace to a few nodes per rack. For example, if you have 3 racks, use RF=9 for
system_auth. It’s common to read folks advising increasing the
system_auth to match the cluster size no matter how big it is, I’ve found this to be unnecessary. It’s also problematic when you increase your cluster size. If you’re using 1000 nodes per data center, you’ll need a higher setting, but it’s very unlikely your first cluster will be that big. Be sure to bump up the
credentials_validity_in_ms settings (try 30 seconds) to avoid hammering those nodes with auth requests.
Using authentication is a great start, but adding a layer of authorization to specific tables is a good practice as well. Lots of teams use a single Cassandra cluster for multiple purposes, which isn’t great in the long term but is fine for starting out. Authorization lets you limit what each user can do on each keyspace and that’s a good thing to get right initially. Cassandra 4.0 will even have the means of restricting users to specific data centers which can help segregate different workloads and reduce human error.
The Security section of the documentation is worth reading for more details.
Disable Dynamic Snitch
Dynamic snitch is a feature that was intended to improve the performance of reads by preferring nodes which are performing better. Logically, it makes quite a bit of sense, but unfortunately it doesn’t behave that way in practice. In reality, we suffer from a bit of the Observer Effect, where the observation of the situation affects the outcome. As a result, the dynamic snitch generates quite a bit of garbage, so much in fact that using it makes everything perform significantly worse. By disabling it, we make the cluster more stable overall and end up with a net reduction in performance related problems. A fix for this is actively being worked on for Cassandra 4.0 in CASSANDRA-14459.
Set up client encryption
If you’re going to set up authentication, it’s a good idea to set up client encryption as well, or else everything (including authentication credentials) is sent over cleartext.
Again, the Security section of the documentation is worth reading for more details on this, I won’t rehash what’s there.
Increase counter cache (if using counters)
Counters do a read before write operation. The counter cache allows us to skip reading the value off disk. In counter heavy clusters we’ve seen a significant performance improvement by increasing the counter cache. We’ll dig deeper into this in a future post and include some performance statistics as well as graphs.
Set up sub range repair (with Reaper)
Incremental repair is a great idea but unfortunately has been the cause of countless issues, which we discussed in our blog previously. Cassandra 4.0 should fix the remaining issues with incremental repair by changing the way anti compaction works.
Read up on the details in CASSANDRA-9143
Without good monitoring it’s just not possible to make good decisions on a single server, and the problem is compounded when dealing with an entire cluster. There’s a number of reasonable monitoring solutions available. At the moment we’re fans of Prometheus, if you chose to self host, and Datadog if you prefer hosted, but this isn’t meant to be an exhaustive list. We recommend aggregating metrics from your application and your databases into a single monitoring system.
Once you’ve got all your metrics together, be sure to create useful dashboards that expose:
- Error rate
These metrics must be monitored from all layers to truly understand what’s happening when there’s an issue in production.
Going an extra step or two past normal monitoring is distributed tracing, made popular by the Google Dapper paper. The open source equivalent is Zipkin, which Mick here at The Last Pickle is a contributor to and wrote a bit about some time back.
Cassandra’s fault tolerance makes it easy to (incorrectly) disregard the usual advice of having a solid backup strategy. Just because data is replicated doesn’t mean we might not need a way to recover data later on. This is why we always recommend having a backup strategy in place.
There are hosted solutions like Datos, Cassandra Snapshots, volume snapshots (LVM, EBS Volumes, etc), incremental Cassandra backups, home rolled tools, etc. It’s not possible to recommend a single solution for every use case. The right solution is going to be dependent on your requirements.
Basic GC Tuning
The default Cassandra JVM argument is, for the most part, unoptimized for virtually every workload. We’ve written a post on GC tuning so we won’t rehash here. Most people assume GC tuning is an exercise in premature optimization and are quite surprised when they can get a 2-5x improvement in both throughput (queries per second) and p99 latency.
Disable Materialized Views
When materialized views (MVs) were added to Cassandra 3 everyone, including me, was excited. It was one of the most interesting features added in a long time and the possibility of avoiding manual denormalization was very exciting. Unfortunately, writing such a feature to work correctly turned out to be extremely difficult as well.
Since the release of the feature, materialized views have retroactively been marked as experiemental, and we don’t recommend them for normal use. Their use makes it very difficult (or impossible) to repair and bootstrap new nodes into the cluster. We’ll dig deeper into this issue in a later post. For now, we recommend setting the following in
We wrote a post a while back discussing tuning compression and how it can help improve performance, especially on read heavy workloads.
Dial Back Read Ahead
Generally speaking, whenever data is read off disk, it’s put in the page cache (there are exceptions). Accessing data in the page cache is significantly faster than accessing data off disk, since it’s reading from memory. Read ahead simply reads extra data. The logic is that adding extra data to a read is cheap, since the request is already being made, so we might as well get as much data in the page cache as we can.
There’s a big problem with this.
First, if read ahead is requesting information that’s not used, it’s a waste of resources to pull extra data off disk. Pulling extra data off disk means the disk is doing more work. If we had a small amount of data, that might be fine. The more data we have, the higher the ratio of disk space to memory, and that means it keeps getting less and less likely we’ll actually have the data in memory. In fact, we end up finding it likely we’ll only use some of the data, some of the time.
The second problem has to do with page cache churn. If you have 30GB of page cache available and you’re accessing 3TB of data, you end up pulling a lot of data into your page cache. The old data needs to be evicted. This process is an incredible waste of resources. You can tell if this is happening on your system if you see a
kswapd process taking up a lot of CPU, even if you have swap disabled.
We recommend setting this to 0 or 8 sectors (4KB) on local SSD, and 16 (KB) if you’re using EBS in AWS and you’re performing reads off large partitions.
blockdev --setra 8
You’ll want to experiment with the read ahead setting to find out what works best for your environment.
There’s quite a bit to digest, so I’ll stop here. This isn’t a comprehensive list of eveything that can be adjusted by any means, but it’s a good start if you’re looking to do some basic tuning. We’ll be going deeper into some of the topics in later posts!