If there is one thing to understand about Cassandra, it is that it is optimised for writes. In Cassandra everything is a write, including the logical deletion of data, which results in tombstones – special deletion records. We have noticed that a lack of understanding of tombstones is often the root cause of production issues our clients experience with Cassandra. We have therefore decided to share a compilation of the most common problems involving Cassandra tombstones, along with some practical advice on solving them.
Tombstones are the mechanism that allows Cassandra to write fast, but they come with an operational price. First of all, tombstones are themselves records: they take up space and can substantially increase the amount of storage you require. Secondly, querying tables with a large number of tombstones causes the performance problems explained below.
Large Number of Tombstones Causes Latency and Heap Pressure
The comments in cassandra.yaml explain it perfectly: “When executing a scan, within or across a partition, we need to keep the tombstones seen in memory so we can return them to the coordinator, which will use them to make sure other replicas also know about the deleted rows. With workloads that generate a lot of tombstones, this can cause performance problems and even exhaust the server heap.”
For illustration purposes consider the following table with a composite partition key of 2 columns, 2 clustering columns and 2 text columns:
CREATE TABLE sandbox.row_tests (
    first_key INT,
    second_key INT,
    first_cc INT,
    second_cc INT,
    real_col_1 text,
    real_col_2 text,
    PRIMARY KEY ((first_key, second_key), first_cc, second_cc)
) WITH CLUSTERING ORDER BY (first_cc ASC, second_cc ASC);
A row with all values set in JSON representation is as follows:
{"key": "499646:499646", "cells": [["499646:499646:","",1471333048921002], ["499646:499646:real_col_1","test_value_1",1471333048921002], ["499646:499646:real_col_2","test_value_2",1471333048921002]]}
If you delete real_col_1, the row will have a server timestamp, a client timestamp and a “d” flag which marks the column as deleted:
{"key": "499646:499646", "cells": [["499646:499646:","",1471333933922005], ["499646:499646:real_col_1",1471333933,1471333933922005,"d"], ["499646:499646:real_col_2","test_value_2",1471333933922005]]}
If you delete the entire row, the tombstone will be as follows:
[ {"key": "499646:499646", "metadata": {"deletionInfo": {"markedForDeleteAt":1471335025231530,"localDeletionTime":1471335025}}, "cells": []} ]
This gives you a rough idea of how much extra data Cassandra loads into memory when it reads tombstones along the way while executing a query. For a simple single-partition query, expect roughly a 1.5ms penalty for every 1000 tombstones encountered (performance may vary).
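If you can reproduce a suspect query in cqlsh, enabling tracing is a quick way to observe this overhead in practice: the tracing summary reports how many live and tombstone cells were read (the exact wording varies between Cassandra versions). A minimal example against the table above:

TRACING ON;
SELECT * FROM sandbox.row_tests WHERE first_key = 499646 AND second_key = 499646;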
“How do we get tombstones without deleting anything?”
One common misconception is that tombstones only appear when the client issues DELETE statements to Cassandra. Some developers assume that it is safe to choose a way of operating which relies on Cassandra being completely tombstone free. In reality there are many other things apart from DELETE statements that cause tombstones. Inserting null values, inserting collections and expiring data using TTL are common sources of tombstones. There is also a type of tombstone currently invisible to Cassandra tools and metrics – the column range tombstone – so you may actually have a tombstone problem without realising you have tombstones at all.
Inserting null values
It is worth repeating that Cassandra is optimised for fast writes: it will not read data to check a condition before writing, as that would slow writes down (Lightweight Transactions are one exception – they read before writing at the cost of roughly 4x latency). As a result, even if the first value you ever write to a column is null, Cassandra will not check whether a value already exists; it will simply mark the column with a tombstone for deletion. The statement below will result in a tombstone for value_2 even if it is the first insertion for key=1. The tombstone is visible in the JSON representation (note the “d” flag on value_2):
INSERT INTO a_table (KEY, value_1, value_2) VALUES (1, 'some_value', NULL);
[ {"key": "1", "cells": [["","",1470836653466124], ["value_1","some_value",1470836653466124], ["value_2",1470836653,1470836653466124,"d"]]} ]
To avoid this, do not insert null values; leave columns unset instead. In the previous example, we can leave value_2 unset instead of setting it to null in CQL. This is allowed as long as value_2 is not part of the primary key.
INSERT INTO a_table (KEY, value_1) VALUES (1, 'some_value');
When using the Java driver, you can either omit the column you are not setting when issuing an INSERT or UPDATE statement, or use the unset() method (see documentation) on bound statements to explicitly leave it unset. Note that unset() requires native protocol v4, i.e. Cassandra 2.2 or later:
PreparedStatement statement = session.prepare(
    "INSERT INTO sandbox.a_table (key, value_1, value_2) VALUES (?, ?, ?)");
BoundStatement bs = statement.bind(1, "test_value_1", "test_value_2");
bs.unset("value_2"); // value_2 is explicitly left unset – no tombstone is written
Inserting values into collection columns
Using Cassandra collections inevitably results in tombstones even if you never delete a value. Again, this is the result of Cassandra being optimised for writes. Consider the following table with list, set and map columns:
CREATE TABLE collection_table(
    KEY INT PRIMARY KEY,
    col_1 list<text>,
    col_2 set<text>,
    col_3 map<INT, text>
);
When you insert the first row into this table with collection values set, you will see three tombstones recorded in an SSTable for collection_table:
INSERT INTO collection_table(KEY, col_1, col_2, col_3) VALUES (1, ['one', 'two'], {'three', 'four'}, {5 : 'five', 6 : 'six'});
[ {"key": "1", "cells": [["","",1471340404406403], ["col_1:_","col_1:!",1471340404406402,"t",1471340404], ["col_1:68fd3380639511e6ae93f196435f2a32","6f6e65",1471340404406403], ["col_1:68fd3381639511e6ae93f196435f2a32","74776f",1471340404406403], ["col_2:_","col_2:!",1471340404406402,"t",1471340404], ["col_2:666f7572","",1471340404406403], ["col_2:7468726565","",1471340404406403], ["col_3:_","col_3:!",1471340404406402,"t",1471340404], ["col_3:00000005","66697665",1471340404406403], ["col_3:00000006","736978",1471340404406403]]} ]
Cassandra optimises for writes and does not check whether the collection has changed (or even existed); instead, it immediately deletes it (the “t” range tombstone cells above) before inserting the new one. Be aware of this when choosing to use collections as column types.
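If your workload only ever adds elements, you can often avoid these tombstones by appending to a collection instead of overwriting it. A sketch against collection_table above (the added values are illustrative):

-- Overwriting the whole collection deletes the old one first, writing a range tombstone:
UPDATE collection_table SET col_1 = ['seven'] WHERE KEY = 1;

-- Adding individual elements does not delete the existing collection:
UPDATE collection_table SET col_1 = col_1 + ['seven'] WHERE KEY = 1;
UPDATE collection_table SET col_2 = col_2 + {'eight'} WHERE KEY = 1;
UPDATE collection_table SET col_3 = col_3 + {9 : 'nine'} WHERE KEY = 1;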
Expiring Data with TTL
Expiring data by setting a TTL (Time To Live) is an alternative to deleting data explicitly, but technically it results in the same tombstones being recorded by Cassandra, requiring the same level of attention as other types of tombstones.
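For example, the following insert (the TTL value of one hour is illustrative) writes a cell that expires automatically; once it expires, it is treated as a tombstone and must wait for gc_grace_seconds and compaction just like an explicit delete:

INSERT INTO a_table (KEY, value_1) VALUES (2, 'expiring_value') USING TTL 3600;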
The Invisible Column Range Tombstones
Issuing an explicit delete statement for a range of rows will result in a column range tombstone. Consider the following table:
CREATE TABLE sandbox.range_tests ( KEY INT, c_column INT, VALUE text, PRIMARY KEY (KEY, c_column) ) WITH CLUSTERING ORDER BY (c_column ASC);
Even with no data present, issuing the following delete statement will result in a corresponding range tombstone.
DELETE FROM range_tests WHERE KEY=1 AND c_column=2;
[ {"key": "1", "cells": [["2:_","2:!",1471890568707244,"t",1471890568]]} ]
This is a normal column range tombstone whose format tells you that the whole range of columns for partition key 1 and clustering column 2 has been deleted. The problem with tombstones like this is that they are not accounted for in Cassandra metrics or diagnostic tools. They will not contribute to the number of tombstones in a tracing summary or to the average number of tombstones in nodetool tablestats. At the time of writing this is a known problem and a ticket has been raised for it in the Cassandra JIRA. For example, selecting by key 1 and c_column 2 with tracing enabled will not report any tombstones encountered, even though we know there is one. Note that the tombstone warning / failure alerts set in cassandra.yaml will not detect column range tombstones either; this is also a known issue.
Even though column range tombstones are not currently logged or reported by Cassandra, they have exactly the same implications as other types of tombstones and should be avoided where possible. We suggest using the following tools and approaches to detect problems caused by tombstones.
Look into “Raw” SSTable Content for Answers
If you are unsure what is causing a performance issue and suspect tombstones, or do not know what is generating them, a good starting point is to inspect the contents of the SSTables for the offending table. Use sstable2json (Cassandra < 3.0) or sstabledump (Cassandra >= 3.0) to dump a JSON representation of an SSTable to the console or a file. For example:
tools/bin/sstable2json data/data/sandbox/a_table-f79f7890639111e6ae93f196435f2a32/lb-1-big-Data.db > ~/Documents/a_table.txt
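Note that SSTables only appear on disk once memtables have been flushed, so when experimenting you may need to force a flush first. A sketch, assuming the sandbox keyspace from the examples above (the SSTable path and generation will differ on your system):

nodetool flush sandbox a_table
tools/bin/sstabledump data/data/sandbox/a_table-<table_id>/mc-1-big-Data.db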
This is especially useful for distinguishing between null values for columns and columns which haven’t been set. For example, consider the following insert statements:
INSERT INTO a_table(KEY, col_1, col_2) VALUES (444, 'some_value', NULL);
INSERT INTO a_table(KEY, col_1) VALUES (555, 'some_value');
Statement 1 will generate a tombstone for col_2; statement 2 will not – it will not create column col_2 for key 555 at all. You can see this in the JSON representation of the SSTable:
[ {"key": "444", "cells": [["","",1471347746607634], ["col_1","some_value",1471347746607634], ["col_2",1471347746,1471347746607634,"d"]]}, {"key": "555", "cells": [["","",1471347765127626], ["col_1","some_value",1471347765127626]]} ]
However, the data is displayed identically in CQL format: you will not be able to find tombstones caused by null values by looking at SELECT output:

 KEY | col_1      | col_2
-----+------------+-------
 444 | some_value |  NULL

 KEY | col_1      | col_2
-----+------------+-------
 555 | some_value |  NULL
Inspecting SSTable dumps is also the only way to confirm the existence of column range tombstones, which are not logged by Cassandra and accordingly are not reported by any tools or metrics.
Set the appropriate gc_grace_seconds on tables
Routine repairs must be run on clusters where deletions occur (and they may occur even if you don’t explicitly delete anything, see above) to avoid, among other things, deleted data becoming live again. You must run repairs more often than the chosen gc_grace_seconds (see this article for more details). Make sure you are capable of supporting repairs more frequent than the minimum gc_grace_seconds among all your tables.
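gc_grace_seconds is a per-table setting and can be lowered with a simple ALTER statement once you are confident in your repair schedule; the value below (four days, down from the default of ten days) is purely illustrative:

ALTER TABLE sandbox.a_table WITH gc_grace_seconds = 345600;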
Changing the tombstone warning / failure threshold
Invisible column range tombstones aside, there are two tombstone threshold settings in cassandra.yaml that are helpful for detecting a large number of tombstones affecting performance:
– tombstone_warn_threshold (default: 1000): if the number of tombstones scanned by a query exceeds this number, Cassandra will log a warning (which will likely propagate to your monitoring system and trigger an alert).
– tombstone_failure_threshold (default: 100000): if the number of tombstones scanned by a query exceeds this number, Cassandra will abort the query. This is a mechanism to prevent one or more nodes from running out of memory and crashing.
These values should only be changed upwards if you are really confident about the memory use patterns in your cluster.
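For reference, the relevant lines in cassandra.yaml look like this (defaults shown):

tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000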
Conclusion
Tombstones are among the most misunderstood features of Cassandra and can cause significant performance problems if not investigated, monitored and dealt with in a timely manner. OpenCredo has acquired expertise in detecting and solving tombstone-related problems on various projects by successfully using the tools shared above.