Long ago, Apache Cassandra added collections as a data type. That is, list, set, and map are natively supported data types. Collections have evolved and improved over time, and today there are what are called “frozen” and “non-frozen” collections. A frozen collection is serialized as a single value internally, and that value can only be updated in its entirety. A non-frozen collection is serialized as separate fields internally. If you don’t explicitly say that your collection is frozen, it is non-frozen and its fields can be updated individually.
For the purpose of this post, I will create a simple table:
create table foo.bar (
id int,
name text,
stuff set<text>,
PRIMARY KEY (id)
);
There is one caveat to non-frozen collections. When a write sets the whole collection, Cassandra creates a tombstone (or deletion marker) to clear out any previous contents of the collection. For example:
insert into foo.bar (id, name, stuff) values (1, 'forest', {'maple', 'oak'});
In this case, it will create a tombstone for the stuff collection to clear it out before adding ‘maple’ and ‘oak’. This can be confirmed by running nodetool flush to flush the memtable to disk and running sstabledump on the sstable. For example:
sstabledump $BAR_DATA_DIR/md-1-big-Data.db
[
{
"partition" : {
"key" : [ "1" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 45,
"liveness_info" : { "tstamp" : "2020-10-29T01:45:12.252697Z" },
"cells" : [
{ "name" : "name", "value" : "forest" },
{ "name" : "stuff", "deletion_info" : { "marked_deleted" : "2020-10-29T01:45:12.252696Z", "local_delete_time" : "2020-10-29T01:45:12Z" } },
{ "name" : "stuff", "path" : [ "maple" ], "value" : "" },
{ "name" : "stuff", "path" : [ "oak" ], "value" : "" }
]
}
]
}
]
You can see deletion_info for stuff before it adds ‘maple’ and ‘oak’ to the set. It writes a tombstone to initialize the collection because Cassandra tries to be idempotent for most writes. That is, if you retry that write, it will end up with the same value. That way, it doesn’t have to read before writing.
A lot has been written about tombstones and Cassandra. Without care, tombstones can lead to problems. However, in our case, an initial tombstone to clear out the collection isn’t a huge deal. It will get compacted away after a time. Some may say “if I set stuff again, it will create another tombstone”, but again, its purpose is to blindly update and reset the value of stuff. For example:
update foo.bar set stuff = {'maple', 'oak', 'elm'} where id = 1;
This statement will create a new tombstone to reset the set. If the original tombstone for stuff still exists, the new tombstone will take precedence and the old one can be safely compacted away.
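To make the "blind overwrite" reasoning concrete, here is a minimal Python sketch. It is a toy model, not Cassandra's actual storage code: the SetColumn class and its method names are hypothetical, and the only detail borrowed from the sstabledump output above is that the tombstone carries a timestamp just before the one on the element cells.

```python
from dataclasses import dataclass, field


@dataclass
class SetColumn:
    """Toy model of a non-frozen set (hypothetical, not Cassandra internals)."""
    deleted_before: int = -1  # timestamp of the newest collection tombstone
    elements: dict = field(default_factory=dict)  # element -> write timestamp

    def overwrite(self, values, ts):
        # Blind write: a tombstone at ts - 1 shadows every older element,
        # then the new elements are written at ts. No read is needed.
        self.deleted_before = max(self.deleted_before, ts - 1)
        self.add(values, ts)

    def add(self, values, ts):
        # Append-style write: element cells only, no tombstone.
        for v in values:
            self.elements[v] = max(self.elements.get(v, ts), ts)

    def read(self):
        # An element is live only if written after the newest tombstone.
        return {v for v, ts in self.elements.items() if ts > self.deleted_before}


col = SetColumn()
col.overwrite({"maple", "oak"}, ts=100)
col.overwrite({"maple", "oak"}, ts=100)  # retry of the same write
print(sorted(col.read()))  # ['maple', 'oak']

col.overwrite({"elm"}, ts=200)  # newer tombstone supersedes the older one
print(sorted(col.read()))  # ['elm']
```

In this model, replaying the same overwrite leaves the result unchanged, and a later overwrite hides everything written before it, which is the behavior the tombstone buys without a read.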
These tombstones are not a big deal because of access patterns. Tombstones get you into trouble when they are scanned into server memory as part of a read request. If you scan in loads of tombstones only to return a small number of rows, it negatively affects the server (the default warn threshold for a single read is 1000 tombstones). Typically, however, you just read a single row from the database; in our case we might run select * from foo.bar where id = 1; and get back the full row. In that case we scan over one tombstone. Even when selecting large swaths of data at a time, like in an analytics job, the signal (returned data) to noise (tombstones) ratio remains fixed, so it’s unlikely to cause a problem.
Still, it would be nice to avoid any sort of tombstone. It turns out that you can when you do idempotent updates. Sets and maps lend themselves to idempotent updates because of their nature. Sets don’t contain duplicates, so re-adding ‘maple’ multiple times will do nothing. Maps have explicit keys, so writing the same key with the same value again leaves the map unchanged.
Therefore, instead of inserting, always use an update with sets and maps. For example, instead of an initial insert, do the following:
update foo.bar set name = 'forest', stuff += {'maple', 'oak', 'elm'} where id = 1;
Note the stuff +=, which adds to the set even though the set is not yet initialized. Internally the set will get created with the three values and no tombstone.
For lists, while this method does avoid the tombstone, it will be additive, as list appends are not idempotent. For example, if you add ‘maple’ twice (or, more likely, if the write appears to fail and the application or driver retries), it will be in the list twice: [‘maple’, ‘maple’].
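The difference between the three collection types under retries can be sketched with plain Python containers. This is only an analogy: apply_with_retry is a hypothetical helper that replays a write the way a driver might after a timeout, and Python's set, dict, and list stand in for the CQL types.

```python
def apply_with_retry(apply_once, state, retries=1):
    """Apply a write, then replay it `retries` more times (hypothetical helper)."""
    for _ in range(1 + retries):
        state = apply_once(state)
    return state


# Set add: replaying the write changes nothing.
set_result = apply_with_retry(lambda s: s | {"maple"}, set())

# Map put: the same key and value simply overwrite themselves.
map_result = apply_with_retry(lambda m: {**m, "maple": "tall"}, {})

# List append: each replay adds another copy.
list_result = apply_with_retry(lambda l: l + ["maple"], [])

print(set_result)   # {'maple'}
print(map_result)   # {'maple': 'tall'}
print(list_result)  # ['maple', 'maple']
```

Only the list ends up with a different value after the retry, which is why the update-only approach is safe for sets and maps but needs care with lists.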
While tombstones can be intimidating, the tombstones generated from initializing collections aren’t going to impact most use cases. Further, you can avoid them completely with sets and maps by always updating as described in this post.