As explained in extenso by Alain in his installment on how Apache Cassandra deletes data, removing rows or cells from a table is done through a special kind of write called a tombstone. But did you know that inserting a null value into a field from a CQL statement also generates a tombstone? This happens because Cassandra cannot tell whether a null value means we are trying to void a field that previously had a value, or whether we simply do not want to insert a value for that specific field.
In this article, we will show examples of how this can happen, how to detect it, and what you can do to mitigate it.
Aggressive tombstone purging in Cassandra
Before we go into detail on how to avoid generating tombstones from inserting null columns, we will briefly discuss aggressive tombstone purging via single SSTable compactions, as we reference one of these settings in the rest of the article.
Single SSTable compactions can be triggered automatically by Cassandra to clear tombstones from SSTables even when they do not match standard SSTable compaction criteria. This feature is very important in reducing disk space issues and is triggered based on compaction parameters that are not well known to many Cassandra users.
Every compaction strategy has the following three parameters to control the single SSTable compaction process:
- tombstone_threshold: the percentage of the table which must be occupied by tombstones for a single SSTable compaction to occur (default 0.2)
- tombstone_compaction_interval: the number of seconds since the SSTable was created after which it can be considered for single SSTable compaction (default 86,400 which is one day)
- unchecked_tombstone_compaction: when set to true, allows a single SSTable compaction to run without first checking for overlaps with other SSTables (default false)
We won’t be detailing the entire process here, but will instead focus on the tombstone_threshold value and how it impacts cleaning up tombstones generated from null columns.
Using the DataStax Java Driver we’ll illustrate how we can generate tombstones and how they manifest themselves in SSTables.
Consider the following Cassandra table:
CREATE TABLE myks.mytable (
id int PRIMARY KEY,
value1 text,
value2 text,
value3 text
);
We’re going to insert a row using a prepared statement with the DataStax Java driver v3.0.0:
import com.datastax.driver.core.*;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();
PreparedStatement prepStmt = session.prepare("INSERT INTO myks.mytable(id, value1, value2, value3) values(?, ?, ?, ?)");
// value2 is explicitly bound to null, value3 is deliberately left unbound
BoundStatement boundStmt = prepStmt.bind(1, "value1", null);
session.execute(boundStmt);
session.close();
cluster.close();
Here we can see that field value2 has been explicitly set to null, while field value3 has deliberately not been bound to any value.
After flushing the memtable using nodetool flush on a Cassandra 3.0.8 CCM cluster and converting the resulting SSTable to JSON with sstabledump, we get the following output:
[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2016-09-10T02:14:32.215Z" },
        "cells" : [
          { "name" : "value1", "value" : "value1" },
          { "name" : "value2", "deletion_info" : { "local_delete_time" : "2016-09-10T02:14:32Z" } }
        ]
      }
    ]
  }
]
The deletion_info on the value2 field shows that we have a tombstone, while value3 is absent from the output.
The combination of Cassandra 2.2+ and DataStax Java Driver 3.0+ prevents unset bind parameters from generating tombstones, while binding null values actually generates tombstones.
Taking a look at our SSTable’s metadata, we can see that the estimated droppable tombstone ratio is already above the default threshold:
MacBook-Pro:mytable-1f1f9b2076fc11e69df3a5416e6d241c adejanovski$ sstablemetadata mb-1-big-Data.db | grep droppable
Estimated droppable tombstones: 0.25000048441443046
The problem here is that this cell tombstone will not get evicted until the whole row gets its very own tombstone, which will then have to live longer than the configured gc_grace_seconds before it becomes eligible for an actual purge during a subsequent compaction.
Using the default tombstone_threshold will generate significant write amplification here, as the SSTable will be compacted and fully rewritten at each cycle until the row can actually be fully purged (for the sake of simplicity, we won’t cover partition fragmentation over multiple SSTables and timestamp overlaps, which can further delay tombstone purges).
We can observe this behavior by running a major compaction on the table and checking the newly created SSTable’s size and estimated droppable tombstone ratio:
MacBook-Pro:mytable-1f1f9b2076fc11e69df3a5416e6d241c adejanovski$ sstablemetadata mb-2-big-Data.db | grep droppable
Estimated droppable tombstones: 0.25000135458119405
Checking the content with sstabledump shows that the tombstone is indeed still present, as we get the exact same output.
The good news here is that we can avoid creating unnecessary tombstones by leaving parameters unbound instead of setting them to a null value.
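As a minimal sketch of what this looks like with the driver’s named setters (assuming the same myks.mytable schema, local cluster and driver 3.0.0 as above; this is not the article’s original code), only the fields that actually have a value get set, and the others are simply never touched:

import com.datastax.driver.core.*;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();
PreparedStatement prepStmt = session.prepare("INSERT INTO myks.mytable(id, value1, value2, value3) values(?, ?, ?, ?)");
// Start from an empty bound statement and set only the fields we have values for.
BoundStatement boundStmt = prepStmt.bind();
boundStmt.setInt("id", 3);
boundStmt.setString("value1", "value1");
// value2 and value3 are never set: with Cassandra 2.2+ and driver 3.0+ they stay
// unset and generate no tombstone (binding them to null would generate one).
session.execute(boundStmt);
session.close();
cluster.close();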
Running our first example on a 2.1 cluster will give a rather different result though:
Exception in thread "main" java.lang.IllegalStateException: Unset value at index 3. If you want this value to be null, please set it to null explicitly.
at com.datastax.driver.core.BoundStatement.ensureAllSet(BoundStatement.java:1351)
at com.datastax.driver.core.SessionManager.makeRequestMessage(SessionManager.java:572)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:131)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:63)
at com.thelastpickle.java_driver.prepstmttest.App.main(App.java:23)
Indeed, the ability to avoid tombstone creation by leaving specific parameters unbound requires the combination of the DataStax Java Driver 3.0.0+ and Cassandra 2.2+.
Workarounds
If you are still running an earlier version of Cassandra (at the time of this writing, most production clusters may still be running Cassandra 2.0 or 2.1), there are a few options to save nodes from tombstone pollution.
If we can easily predict which fields are likely to be null, it is possible to prepare two or three statements that cover the different combinations of fields. This way we get to choose, based on the values to be bound, the statement that will not generate tombstones, or at least as few as possible, as sketched below.
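Here is a rough sketch of that approach, again against the myks.mytable table from above (the statement variable names and the hard-coded values are purely illustrative):

import com.datastax.driver.core.*;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();

// Prepared once at startup: one statement per expected combination of fields.
PreparedStatement insertAllValues = session.prepare("INSERT INTO myks.mytable(id, value1, value2, value3) values(?, ?, ?, ?)");
PreparedStatement insertValue1Only = session.prepare("INSERT INTO myks.mytable(id, value1) values(?, ?)");

String value1 = "value1";
String value2 = null;
String value3 = null;

// Pick the statement matching the fields we actually have, so no null gets bound.
BoundStatement boundStmt;
if (value2 == null && value3 == null) {
    boundStmt = insertValue1Only.bind(2, value1);
} else {
    boundStmt = insertAllValues.bind(2, value1, value2, value3);
}
session.execute(boundStmt);

session.close();
cluster.close();

With more nullable columns the number of combinations grows quickly, which is exactly when the batch-based approach below becomes preferable.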
If the number of field combinations gets too high, the above technique is not an option, but we can still take advantage of batches to fix the issue.
While big batches, and especially multi-partition batches, are generally not advised because of the pressure they put on coordinator nodes, small single-partition batches are a very efficient way to group queries. In this case, we can generate a prepared statement for each field (or group of fields) and group their execution into batches:
import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.*;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();

String value1 = "value1";
String value2 = null;
String value3 = null;

// One prepared statement per nullable field
PreparedStatement prepStmt1 = session.prepare("INSERT INTO myks.mytable(id, value1) values(?, ?)");
PreparedStatement prepStmt2 = session.prepare("INSERT INTO myks.mytable(id, value2) values(?, ?)");
PreparedStatement prepStmt3 = session.prepare("INSERT INTO myks.mytable(id, value3) values(?, ?)");

// Only bind the statements whose value is not null
List<BoundStatement> boundStatements = new ArrayList<>();
if (value1 != null) {
    boundStatements.add(prepStmt1.bind(2, value1));
}
if (value2 != null) {
    boundStatements.add(prepStmt2.bind(2, value2));
}
if (value3 != null) {
    boundStatements.add(prepStmt3.bind(2, value3));
}

// Group all inserts for the same partition into a single batch
final BatchStatement batch = new BatchStatement();
boundStatements.forEach(stmt -> batch.add(stmt));
session.execute(batch);

session.close();
cluster.close();
After running this code against our 2.1 cluster and flushing the SSTable, we can observe that no tombstones were generated:
MacBook-Pro:mytable-b3a24e50770111e6bab5a9eb710f2b22 adejanovski$ sstablemetadata myks-mytable-ka-1-Data.db | grep droppable
Estimated droppable tombstones: 0.0
Batches are processed atomically on each partition, as a single operation against the database, which makes them as efficient as running a single query while giving us the flexibility, in this specific use case, to build a more tombstone-efficient application.
If changing production code and deploying a new application is not possible in a reasonable amount of time, try to observe at which droppable tombstone ratio the SSTables actually manage to purge at least part of their content.
If the data is TTLed, we should see the droppable tombstone ratio increase as the SSTables get older. Once that ratio has been determined, use an ALTER TABLE statement to change the threshold (adjust the following to match your compaction strategy, in our case TWCS):
ALTER TABLE myks.mytable
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_size': '1',
                   'compaction_window_unit': 'DAYS',
                   'tombstone_threshold': '0.5',
                   'unchecked_tombstone_compaction': true};
We recommend setting unchecked_tombstone_compaction to true, as it allows single SSTable compactions to run more often, even if not all droppable tombstones will necessarily get purged due to SSTable overlaps. In any case, a safety check is performed before any tombstone purge to prevent data loss.
Adjusting the threshold will prevent Cassandra from wasting I/O and CPU on useless compactions while still triggering them to reclaim disk space when possible.
Takeaways
Prepared statements are highly recommended when interacting with Cassandra: they remove some overhead compared to simple statements and effectively allow the driver to use the TokenAwarePolicy, which lightens the coordination footprint on the cluster.
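For reference, a minimal sketch of wiring the TokenAwarePolicy explicitly with the driver 3.0 API (recent driver versions already use a token-aware policy by default, so this is illustrative rather than required):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// With prepared statements the driver knows the routing key of each bound statement,
// so a token-aware policy can send the request directly to a replica and avoid an
// extra coordinator hop.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
        .build();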
With all combinations of Cassandra and Java driver versions, binding null values to prepared statement parameters will generate tombstones, while leaving bound parameters unset is only possible with Cassandra 2.2+ combined with the DataStax Java Driver 3.0.0+.
Those tombstones will force Cassandra to overestimate droppable tombstone ratios and make the use of aggressive tombstone purging strategies more complex to tune.
To efficiently detect this kind of behavior without a full codebase audit, use the sstablemetadata command on newly flushed SSTables: if no deletes are performed and the TTLs are longer than the time between the insert and the SSTable flush, their droppable tombstone ratio should be very close to 0. With this information, you can then tweak the application code and table compaction settings in order to stop generating tombstones and allow the tombstone purge strategy to run efficiently.