Experiences with Tombstones in Apache Cassandra

6/8/2020
Reading time: 4 mins
cassandra

Experiences with Tombstones in Apache Cassandra

Tombstones can be detected in logs and by monitoring specific metrics. Their lifespan is defined by the table setting gc_grace_seconds.

Tune table settings

  • gc_grace_seconds

The default setting for the gc_grace_seconds is 864000 (10 days). For us it made sense to lower that value to 345600 (4 days). In case you decide to change the grace period, make sure to run repairs within that time range.

You can simply alter the value for existing tables or create a new table like this:

CREATE TABLE keyspace.tablename WITH GC_GRACE_SECONDS = 345600;

ALTER TABLE keyspace.tablename WITH GC_GRACE_SECONDS = 345600;

After TTL passes the defined time range, the rows are deleted and the values are turned into tombstones. By default the rows/values do not expire.

While experiencing the tombstone issues we figured out that we needed to increase the time-to-live of our tables. By having this value set to 1 ½ days, failing WRITE jobs had the effect of deleting all the data from a table at once. Due to the missing updates, every row was marked with a tombstone. Setting the value higher allows us to rerun jobs in time and avoid the unnecessary creation of a large amount of tombstones.

  • tombstone_warn_threshold

The tombstone threshold warning is set to 1.000 by default and it will become visible in your logs in case more than 1.000 tombstones are scanned by a single query.

  • tombstone_failure_threshold

The tombstone failure threshold is reached when 100.000 tombstones are scanned by a single query. In this case the query will be aborted, meaning data from the affected table can no longer be queried.

Those threshold settings can be changed, but be aware that if you increase them it can lead to nodes running out of memory.

Run regular repairs

Cassandra comes with high fault tolerance and operates well even if a node is temporarily unavailable. Nevertheless, it is best practice to run nodetool repair after a node failure. Especially if you want to avoid data loss at all costs.

During the repair process, the node that was unavailable will receive the writes that happened during its downtime.

It’s a good approach to run repairs within the gc_grace_seconds and on a regular basis, otherwise you might experience cases where deleted data is resurrected.

There are a couple of ways to run repairs automatically. One of them is Cassandra Reaper. This tool comes with a nice UI that makes it easy to schedule repairs.

By Cassandra Reaper

Configure and run compaction

To finally remove tombstones and free up disk space, compaction needs to be triggered. There are different kinds of compactions. The minor compaction executes compaction over all sstables and runs automatically in Cassandra. It removes all tombstones that are older than the grace period. If you trigger a compaction manually over all sstables it is called major compaction. Keep in mind that compaction needs to be triggered for each node individually. You also have the option to run a user defined compaction just on a group of specified sstables.

Compacting data in C* usually means merging sstables into a single new sstable. When performing reads on multiple rows, the more sstables need to be scanned, the slower the process gets. To prevent this from happening, the compaction needs to run on a regular basis.

To start compactions manually you can use nodetool compact. In case you don’t specify a keyspace or table, the compaction will run on all keyspaces and tables.

nodetool compact keyspace tablename

This command starts a compaction on the specified table. You should figure out which compaction strategy is configured for your tables and compare it with the alternate strategies. If you want to find out if another strategy meets your needs better, you can check out the DataStax documentation.

Monitor table metrics

1. Nodetool

Nodetool is Cassandra’s command line interface. It comes in handy if you want to figure out more about your clusters performance. You can check out the whole list of possible commands provided by DataStax.

Nodetool commands need to be executed for each node individually. We run C* on Kubernetes so we can open a shell for the running Cassandra pod as follows:

kubectl exec -it cassandra-0 -n recommend bash

From that point we can use nodetool to get various information about our tables. The command nodetool tablestats can be performed to print statistics for individual keyspaces or tables:

The values for ‘Average tombstones per slice’ and ‘Maximum tombstones per slice’ will tell you how many tombstones were scanned per query in the last five minutes. If you see this value being very high you might need to run compaction more often for that table.

2. SSTable Tools

Before executing SSTable Tools Cassandra needs to be stopped. The sstableutil command will give you a list of sstable files existing for a certain table. Executing sstablemetadata on these files, will provide you useful information about the specified table, like for example the TTL and an estimate of droppable tombstones.

3. Prometheus and Grafana

Our monitoring for C* is set up with Prometheus and Grafana. After receiving the tombstones threshold warnings we decided to add a panel to our dashboard that gives us an idea of the tombstone behavior for each of our tables.

Cassandra has a table metric called ‘TombstoneScannedHistogram’, which presents the histogram of tombstones that where scanned per query. By showing the 99th percentile we will see if something is not right.