How to Choose which Tombstones to Drop

Author: The Last Pickle

Tombstones are notorious for causing issues in Apache Cassandra. They often become a problem when Cassandra is not able to purge them in a timely fashion. Delays in purging happen because there are a number of conditions that must be met before a tombstone can be dropped.

In this post, we are going to see how to make meeting these conditions more likely. We will achieve this by selecting the specific SSTables Cassandra should include in a compaction, which results in smaller, faster compactions that are more likely to drop the tombstones.

Before we start, there are a few parameters we need to note:

  • We will consider only cases where the unchecked_tombstone_compaction compaction option is turned off. Enabling this option makes Cassandra run compaction with only one SSTable. This achieves a similar result to our process, but in a far less controlled way.
  • We also assume the unsafe_aggressive_sstable_expiration option of TWCS is turned off. This option makes Cassandra drop entire SSTables once they expire without checking if the partitions appear in other SSTables.

How Cassandra Drops Tombstones

When we look at the source, we see Cassandra will consider deleting a tombstone when an SSTable undergoes a compaction. Cassandra can only delete the tombstone if:

  • The tombstone is older than gc_grace_seconds (a table property).
  • There is no other SSTable outside of this compaction that:
    • Contains a fragment of the same partition the tombstone belongs to, and
    • Holds data for that partition with a timestamp older than the tombstone's.

In other words, for a tombstone to be deleted, none of the data it suppresses can sit outside of the planned compaction. Unfortunately, in many cases tombstones are not isolated and overlap with data in other SSTables.
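The rule can be summed up in a toy shell function. This is only a sketch of the logic, not Cassandra's actual implementation; the timestamps are made-up integers, and min_ts_outside stands for the oldest timestamp the partition has in any SSTable outside the compaction:

```shell
# Toy sketch (not Cassandra source) of the two conditions above.
# Arguments: tombstone_ts now gc_grace min_ts_outside ("" if the partition
# appears in no SSTable outside the compaction).
droppable() {
  local tombstone_ts=$1 now=$2 gc_grace=$3 min_ts_outside=$4
  # Condition 1: the tombstone must be older than gc_grace_seconds.
  if [ $(( now - tombstone_ts )) -lt "$gc_grace" ]; then echo no; return; fi
  # Condition 2: no SSTable outside the compaction may hold older data for
  # the same partition -- that is data the tombstone still suppresses.
  if [ -n "$min_ts_outside" ] && [ "$min_ts_outside" -lt "$tombstone_ts" ]; then
    echo no; return
  fi
  echo yes
}

droppable 1000 90000 86400 ""    # past gc_grace, nothing outside -> yes
droppable 1000 90000 86400 500   # older data outside the compaction -> no
```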

The heavyweight solution to this issue is to run a major compaction. A major compaction includes all SSTables in one big compaction, so no SSTable is left out. However, this approach comes with a cost:

  • It can require at least 50% of the disk space to be free.
  • It consumes CPU and disk throughput at the expense of regular traffic.
  • It can take hours to compact TBs of data.

Instead, we will use a more lightweight solution: compact only the SSTables a given partition appears in, and no others.

Step 1 - Identify the Problematic Partition

The first step is to find out which partition is the most problematic. There are various ways of doing this.

First, if we know our data model inside and out, we can tell straight away which partitions are tombstone-heavy. If it’s not obvious by looking at the data model, we can use one of the other options below.

Another option is to consult the Cassandra logs. When Cassandra encounters too many tombstones, it will log a line similar to this:

WARN  [SharedPool-Worker-4] 2019-07-15 09:24:15,971 - Read 2738 live and 6351 tombstone cells in tlp_stress.sensor_data for key: 55d44291-0343-4bb6-9ac6-dd651f543323 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]
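If the warnings are plentiful, we can count them per table and partition key to find the worst offender. A sketch, assuming the default log location (adjust the path for your installation):

```shell
# Count tombstone warnings per table and partition key, worst offenders first.
# /var/log/cassandra/system.log is a common default; adjust LOG as needed.
LOG=${LOG:-/var/log/cassandra/system.log}
if [ -r "$LOG" ]; then
  grep -o 'tombstone cells in [^ ]* for key: [^ ]*' "$LOG" \
    | awk '{print $4, $7}' \
    | sort | uniq -c | sort -rn | head
fi
```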

Finally, we can use a tool like Instaclustr’s ic-purge to give us a detailed overview of the tombstone situation:

|         | Size    |
| Disk    | 36.0 GB |
| Reclaim |  5.5 GB |

Largest reclaimable partitions:
| Key              | Size     | Reclaim  | Generations                  |
|     001.0.361268 |  32.9 MB |  15.5 MB |               [46464, 62651] |
|     001.0.618927 |   3.5 MB |   1.8 MB |               [46268, 36368] |

In the table above, we see which partitions take the most reclaimable space (001.0.361268 takes the most). We also see which SSTables these partitions live in (the Generations column). We have found the SSTables to compact. However, we can take this one step further and ask Cassandra for the absolute paths, not just their generation numbers.

Step 2 - List Relevant SSTables

With the partition key known, we can simply use the nodetool getsstables command. It makes Cassandra tell us the absolute paths of the SSTables a partition lives in:

ccm node1 nodetool "getsstables tlp_stress sensor_data 001.0.361268"
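The compaction trigger in the next step expects these paths as a single comma-separated string, so it is handy to join the lines right away. A sketch with made-up paths standing in for the real getsstables output:

```shell
# Join one-path-per-line output into the comma-separated list that
# forceUserDefinedCompaction expects. The here-doc below is a stand-in for
# the real output of: nodetool getsstables tlp_stress sensor_data 001.0.361268
SSTABLE_LIST=$(paste -s -d, - <<'EOF'
/var/lib/cassandra/data/tlp_stress/sensor_data-1/md-46464-big-Data.db
/var/lib/cassandra/data/tlp_stress/sensor_data-1/md-62651-big-Data.db
EOF
)
echo "${SSTABLE_LIST}"
```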


After we find all the SSTables, the last thing we need to do is to trigger a compaction.

Step 3 - Trigger a Compaction

Triggering a user-defined compaction is something Jon has described in this post. We will proceed the same way and use jmxterm to invoke the forceUserDefinedCompaction operation of the CompactionManager MBean. We need to pass it a comma-separated list of the SSTables we found in the previous step:


JMX_CMD="run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction ${SSTABLE_LIST}"
echo ${JMX_CMD} | java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:7100
#calling operation forceUserDefinedCompaction of mbean org.apache.cassandra.db:type=CompactionManager
#operation returns:

Despite getting a null as the invocation result, the compaction has most likely started. We can watch nodetool compactionstats to follow its progress.

Once the compaction completes, we can repeat the process we used in Step 1 above to validate that the tombstones have been deleted.

Note for LeveledCompactionStrategy: This procedure only works with STCS and TWCS. If a table is using LCS, Cassandra does not allow invoking the forceUserDefinedCompaction MBean. For LCS, we could nudge Cassandra into compacting specific SSTables by resetting their levels. That, however, is complicated enough to deserve its own blog post.


In this post we saw how to trigger a compaction of exactly the SSTables a partition appears in. This is useful because such a compaction has a smaller footprint and is more efficient than a major compaction, while still reliably purging droppable tombstones that can cause issues for Apache Cassandra.
