This post is the introduction to a series of blog posts about GDPR and Cassandra databases.
As Cassandra consultants, our primary concern is: what does deleting data mean from Cassandra's point of view, and what can we do to be as sure as possible that a user's data stays deleted? As we know, when Cassandra deletes data, it just marks it as deleted (with a tombstone). The actual "deletion" occurs later, during the compaction process.
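As a toy illustration (a simplified model, not Cassandra's actual storage engine; the constant mirrors the real gc_grace_seconds setting), the delete-as-marker behavior might be sketched like this:

```python
import time

# Mirrors Cassandra's gc_grace_seconds (default 864000, i.e. 10 days);
# 0 here so the demo purges immediately.
GC_GRACE_SECONDS = 0

class ToyTable:
    """Toy model: a delete only writes a tombstone; compaction purges it later."""

    def __init__(self):
        self.cells = {}  # key -> (value, tombstone_timestamp_or_None)

    def insert(self, key, value):
        self.cells[key] = (value, None)

    def delete(self, key):
        # The value is NOT erased; it is only shadowed by a tombstone marker.
        value, _ = self.cells[key]
        self.cells[key] = (value, time.time())

    def read(self, key):
        value, tombstone = self.cells.get(key, (None, None))
        return None if tombstone is not None else value

    def compact(self):
        # Only here is data actually dropped, and only for tombstones
        # older than the grace period.
        now = time.time()
        self.cells = {k: (v, t) for k, (v, t) in self.cells.items()
                      if t is None or now - t < GC_GRACE_SECONDS}

table = ToyTable()
table.insert("user-1", "sensitive data")
table.delete("user-1")
print(table.read("user-1"))     # None: the row looks deleted to readers
print("user-1" in table.cells)  # True: the raw value is still stored
table.compact()
print("user-1" in table.cells)  # False: gone only after compaction
```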
Once again: Cassandra, like many other systems, does not physically erase data when it deletes it. But this is in line with the Oxford dictionary's definition of the verb delete.
A similar thing happens in the underlying OS (Linux): when the OS deletes a file, it just marks it as deleted, and you can often recover deleted files with specialized forensic tools.
Okay, so actual, irreversible deletion of data does not usually happen in software engineering. But we would love to do as much as we can to make sure that the data is not accessible from Cassandra or any Cassandra tooling (such as sstabledump or sstable2json). OS and file-system engineers should do their part at the OS level (if they think that's necessary).
The only way to make sure that the data stays deleted.
Another problem in Cassandra is that it is hard to filter on fields that are not part of the primary key. So, if some of a user's data is held in a table where the primary key is something like deviceId, we would have to search the records for all the deviceIds and remove the corresponding user's data. That does not scale.

Data Deletion and Compactions
As already said, even after a delete statement is issued, it is not guaranteed that the data is gone. Furthermore, if the data model is not well designed, the deleted data might never get evicted. In Cassandra 3.10 this behavior is improved: a compaction can be triggered when a certain percentage of an SSTable consists of expired tombstones (read more about it here), and the deleting compaction strategy looks like it could solve this problem (note that the strategy is not an official part of Apache Cassandra). Also, I'm quite sure I saw a Jira issue in the Apache Cassandra project about some other kind of deleting compaction strategy, one that should guarantee to actually delete the data rather than only mark it as deleted, but I can't find it now. That would be cool.
Speaking of compaction strategies, SizeTieredCompactionStrategy can be tricky: once you end up with one huge SSTable, it will only be compacted together with SSTables of a similar size, which means that its tombstones may stay in that huge SSTable for a very long time, maybe forever. The situation is similar to the one occurring in the 2048 game:
Tile 2048 will not be merged anytime soon.
The main takeaway is: be aware of how different compaction strategies work and know your system's behavior. If you have a problem with tombstone eviction, it might be a good idea to change your compaction strategy and/or redesign your tables.
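To make the 2048 analogy concrete, here is a rough sketch of the size-tiered idea (a simplification; the real bucketing logic in SizeTieredCompactionStrategy has more knobs): an SSTable is only grouped for compaction with others of roughly similar size, so one oversized SSTable may never find enough peers.

```python
def similar_size_buckets(sstable_sizes_mb, low=0.5, high=1.5, min_threshold=4):
    """Very rough sketch of size-tiered grouping: an SSTable joins a bucket
    if its size is within [low, high] times the bucket's average size, and a
    bucket is compacted only once it holds min_threshold SSTables."""
    buckets = []
    for size in sorted(sstable_sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    # Only buckets with enough similarly sized SSTables get compacted.
    return [b for b in buckets if len(b) >= min_threshold]

# The four small SSTables compact together, but the single 10 GB one never
# finds similarly sized peers, so any tombstones inside it are stuck.
print(similar_size_buckets([100, 110, 95, 105, 10_000]))  # [[95, 100, 105, 110]]
```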
Delete User Data That Is Not Part of the Primary Key
Unlike in relational databases, in Cassandra data is stored in a denormalized form. Thus, it is not possible to (easily) filter on fields that are not part of the partition key. So, if we have the following table (the non-key columns and types here are illustrative; the original listing shows only the primary key):

CREATE TABLE device_measurements (
    device_id uuid,
    measurement_type text,
    user_id uuid,
    value double,
    PRIMARY KEY (device_id, measurement_type));
This means that we cannot just:
DELETE FROM device_measurements WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;
It is, however, possible to find the affected rows with a full scan (ALLOW FILTERING works only on SELECT, not on DELETE):

SELECT device_id, measurement_type FROM device_measurements WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b ALLOW FILTERING;

and then delete them key by key. But this might ruin the performance of the entire cluster, since every node has to scan all of its data.
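A toy model (hypothetical data, not the driver API) shows why filtering on a non-key column is so expensive: every partition has to be scanned, while a delete by partition key touches a single partition.

```python
# Toy model of a Cassandra table: data is only addressable by partition key.
table = {
    ("device-1",): [{"user_id": "u1", "value": 21.0}],
    ("device-2",): [{"user_id": "u2", "value": 19.5}],
    ("device-3",): [{"user_id": "u1", "value": 23.4}],
}

def delete_by_partition_key(table, key):
    # Cheap: a single partition lookup, no scanning.
    table.pop(key, None)

def delete_by_non_key_column(table, user_id):
    # What a filtered full scan implies: every partition (on every node)
    # must be visited to find the matching rows.
    scanned = 0
    for key in list(table):
        scanned += 1
        table[key] = [row for row in table[key] if row["user_id"] != user_id]
    return scanned  # grows with the total number of partitions

print(delete_by_non_key_column(table, "u1"))  # 3: the whole table was scanned
```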
Therefore, we should think about deleting the user's data in advance, when designing the tables.
Embracing Privacy by Design
Solution 1: design the tables so that a user's data can be easily deleted from all of them (e.g., with user_id as part of the primary key). This solution will obviously have an impact on the design process, both in greenfield projects and when redesigning existing databases.
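A minimal sketch of Solution 1, with illustrative keys and columns: when user_id is part of the partition key (here paired with an application-side lookup of a user's partitions, which is an assumption of this sketch, not a Cassandra feature), a user's rows are directly addressable and can be deleted without scanning.

```python
# Toy model: the partition key is (user_id, device_id), so a user's data
# can be located, and deleted, without touching unrelated partitions.
table = {
    ("u1", "device-1"): [{"measurement_type": "temp", "value": 21.0}],
    ("u1", "device-3"): [{"measurement_type": "temp", "value": 23.4}],
    ("u2", "device-2"): [{"measurement_type": "temp", "value": 19.5}],
}

# An index from user_id to that user's partitions; in Cassandra this role
# would typically be played by a lookup table maintained by the application.
partitions_by_user = {}
for key in table:
    partitions_by_user.setdefault(key[0], []).append(key)

def delete_user(table, user_id):
    # Touches only the user's own partitions: no full scan required.
    for key in partitions_by_user.pop(user_id, []):
        del table[key]

delete_user(table, "u1")
print(sorted(table))  # [('u2', 'device-2')]
```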
Solution 2: embrace encryption. Okay, this is not a production-ready solution; it's more of an idea we're currently playing with at SmartCat: encrypt the stored user data with homomorphic encryption in a way that preserves the ordering of clustering columns, and, when the data needs to be deleted, simply delete the key. If you have any thoughts on this or experience to share, we would love to hear from you.

Conclusion
Embrace privacy by design. The idea behind GDPR is a good thing from a consumer perspective. A user's data will be seen as a liability for companies, not as an asset, which means that companies will, hopefully, be cautious when storing user data. GDPR is also an excellent opportunity for new players in the database-as-a-service (DaaS) market, or some derivative of the concept; it seems easier to build new systems with privacy in mind from scratch than to refactor existing ones. What I would like to see is a database (as a service) that would allow me to issue a delete for a userId and then, as a programmer/user of the database, stop worrying about it. The DaaS provider would be responsible for the rest.
What are your thoughts on this?