Cassandra.Link

10/3/2018

GDPR and Cassandra - DZone Security

by John Doe

delete /dɪˈliːt/ — verb: remove or obliterate (written or printed matter), especially by drawing a line through it.

Schrödinger’s Data

As we all know, GDPR will be in force from May 2018. After that, users of software products and services will have the right to be forgotten (cool, right? Finally I can rest assured that my browsing history will not be read aloud at my funeral). In other words, if a user from the EU asks a service provider to delete their data, the provider will have to delete all the user’s data or face severe consequences.

But, it is unclear what it means to delete a user’s data. I guess the only way to find out is when the audit occurs.

A user’s data is simultaneously both deleted and not deleted until observed at the time of the audit.

This post is the introduction to a series of blog posts about GDPR and Cassandra databases.

Cassandra and Data Deletion

As Cassandra consultants, our primary concern is: what does it mean to delete data from Cassandra’s point of view? And what can we do to be as sure as possible that a user’s data will stay deleted? As we know, when Cassandra deletes data, it just marks it as deleted. The actual “deletion” occurs during the compaction process.

When Cassandra marks data as deleted:

  • It can’t be fetched anymore using Cassandra’s query language (CQL).
  • The data still exists in Cassandra’s files on the disk (SSTables) but is flagged as deleted.
  • The data is removed (for real) from SSTables when compaction occurs. Before compaction evicts the deleted data, it can still be accessed with specialized (forensic?) tools.
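
The three points above can be sketched with a toy model. This is not how Cassandra is implemented; the class `ToyTable`, its methods, and the one-SSTable-per-write simplification are assumptions made for illustration. Only the idea (a delete writes a marker, and compaction removes the data later) and the default `gc_grace_seconds` of ten days come from Cassandra itself.

```python
from dataclasses import dataclass
from typing import Optional

GC_GRACE_SECONDS = 10 * 24 * 3600  # mirrors Cassandra's default gc_grace_seconds


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None marks a tombstone
    timestamp: int        # write time in seconds


class ToyTable:
    """Hypothetical in-memory stand-in for one table: a list of immutable 'SSTables'."""

    def __init__(self):
        self.sstables = []  # each entry is a dict: key -> Cell

    def write(self, key, value, now):
        # Simplification: every write is flushed to its own SSTable.
        self.sstables.append({key: Cell(value, now)})

    def delete(self, key, now):
        # A delete is just another write: a tombstone with a newer timestamp.
        self.sstables.append({key: Cell(None, now)})

    def read(self, key):
        # What a query would see: the newest cell wins, tombstones read as "no data".
        cells = [t[key] for t in self.sstables if key in t]
        if not cells:
            return None
        return max(cells, key=lambda c: c.timestamp).value

    def raw_values(self, key):
        # What a tool that reads the files directly could still find.
        return [t[key].value for t in self.sstables
                if key in t and t[key].value is not None]

    def compact(self, now):
        # Merge all SSTables, keep only the newest cell per key, and drop
        # tombstones that are older than gc_grace_seconds.
        merged = {}
        for table in self.sstables:
            for key, cell in table.items():
                if key not in merged or cell.timestamp > merged[key].timestamp:
                    merged[key] = cell
        self.sstables = [{
            k: c for k, c in merged.items()
            if not (c.value is None and now - c.timestamp >= GC_GRACE_SECONDS)
        }]
```

After `write("user-1", ...)` followed by `delete("user-1", ...)`, `read` returns nothing while `raw_values` still finds the old value; only `compact` makes it disappear from the toy files.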

Once again: Cassandra, like many other systems, does not actually delete data when it deletes the data. But this is in line with the definition of the verb delete from the Oxford dictionary:

“remove or obliterate (written or printed matter), especially by drawing a line through it.”

A similar thing happens in the underlying OS (Linux). When the OS deletes a file, it just marks it as deleted. And you can recover the deleted files with specialized forensic tools.

Okay, so actual, irreversible deletion of data does not usually happen in software engineering. But we would love to do as much as we can to make sure that the data is not accessible from Cassandra and any Cassandra tooling (like sstabledump, sstable2json). OS and file system engineers should do their part by doing the same at the OS level (if they think that’s necessary).

The only way to make sure that the data stays deleted.

Another problem in Cassandra is that it is hard to filter on fields that are not part of the primary key. So, if some of the user’s data is held in the table where the primary key is something like deviceId, that would mean that we would have to search all the records for all the deviceIds and remove the corresponding user’s data. That does not scale.

Data Deletion and Compactions

As already said, even after a delete statement is issued, it is not guaranteed that the data is physically removed. Furthermore, if the data model is not well designed, the deleted data might never get evicted. In Cassandra 3.10, this behavior is improved: compaction is triggered when there is a certain percentage of expired tombstones (read more about it here). A deleting compaction strategy also looks like it could solve this problem (note that the strategy is not an official part of Apache Cassandra). Also, I’m quite sure that I saw a Jira issue in the Apache Cassandra project about some other kind of deleting compaction strategy, which should guarantee to actually delete the data, not only mark it as deleted, but I can’t find it now. That would be cool.
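
To make the "certain percentage of expired tombstones" idea concrete, here is a minimal sketch of such a check. The names `tombstone_threshold` (default 0.2) and `tombstone_compaction_interval` (default one day) are real compaction subproperties; the `SSTableStats` class and the function names are hypothetical, and the real check works on estimates and also considers overlapping SSTables, which this sketch ignores.

```python
from dataclasses import dataclass

TOMBSTONE_THRESHOLD = 0.2              # default of tombstone_threshold
TOMBSTONE_COMPACTION_INTERVAL = 86400  # default of tombstone_compaction_interval, in seconds


@dataclass
class SSTableStats:
    """Hypothetical summary of one SSTable."""
    total_cells: int
    droppable_tombstones: int  # tombstones already older than gc_grace_seconds
    age_seconds: int           # time since the SSTable was written


def droppable_ratio(stats: SSTableStats) -> float:
    # Share of the SSTable that compaction could throw away.
    if stats.total_cells == 0:
        return 0.0
    return stats.droppable_tombstones / stats.total_cells


def wants_tombstone_compaction(stats: SSTableStats,
                               threshold: float = TOMBSTONE_THRESHOLD,
                               min_age: int = TOMBSTONE_COMPACTION_INTERVAL) -> bool:
    # Only SSTables that are old enough and hold enough droppable
    # tombstones are worth compacting on their own.
    return stats.age_seconds >= min_age and droppable_ratio(stats) > threshold
```

The threshold and interval can be tuned per table, so a table that holds personal data could be given a lower threshold than the default.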

Speaking of compaction strategies, SizeTieredCompactionStrategy can be tricky: if you end up with one huge SSTable, you need other SSTables of a similar size before it gets compacted again. This means that tombstones can stay in a huge SSTable for a very long time, maybe forever. The situation is similar to the one in the 2048 game:

Tile 2048 will not be merged anytime soon.
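
The 2048 analogy can be sketched in a few lines. This is a simplified take on size-tiered bucketing, not Cassandra's code: the function names are hypothetical, and the `min_sstable_size` rule is left out. The parameters `bucket_low` (0.5), `bucket_high` (1.5), and `min_threshold` (4) mirror the strategy's documented defaults.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of 'similar' size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # An SSTable joins a bucket if its size is close to the bucket average.
            if average * bucket_low < size < average * bucket_high:
                bucket.append(size)
                break
        else:
            # No similar bucket found: the SSTable starts a bucket of its own.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4):
    # Only buckets with at least min_threshold SSTables get compacted.
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]
```

With sizes `[2048, 4, 4, 4, 4, 8]` (say, in MB), the four small SSTables form a bucket and are compacted, while the 2048 MB one sits alone in its bucket, together with any tombstones it contains.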

The main takeaway is: be aware of how different compaction strategies work and know your system’s behavior. If you have a problem with tombstone eviction, it might be a good idea to change your compaction strategy and/or to redesign your tables.

Delete User Data That Is Not Part of the Primary Key

Unlike in relational databases, in Cassandra data is stored in denormalized form. Thus, it is not possible to (easily) filter on fields that are not part of the partition key. So, if we have the following table:

CREATE TABLE device_measurements (
  device_id uuid,
  measurement_type text,
  measurement_value text,
  user_id uuid,
  PRIMARY KEY (device_id, measurement_type)
);

This means that we cannot just:

DELETE FROM device_measurements WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b;

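Adding ALLOW FILTERING to the DELETE does not help either: CQL accepts it only on SELECT, and a DELETE must always identify the partition key. It is, however, possible to find the affected rows first:

SELECT device_id, measurement_type FROM device_measurements WHERE user_id = bf884b98-0a72-10e8-ba89-0ed5f89f718b ALLOW FILTERING;

and then delete each of them by primary key. But this scans the whole table and might ruin the performance of the entire cluster.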

Therefore, we should think about the user’s data in advance when designing the tables.

Embracing Privacy by Design

Solution 1: design tables in a way that the user’s data can be easily deleted (user_id part of the primary key) from all the tables. This solution will obviously have an impact on the design process in both greenfield projects and when redesigning existing databases.
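
One way to picture Solution 1 is a second table keyed by user_id that records which rows belong to the user. The sketch below models both tables as Python dictionaries; `MeasurementStore`, `measurements_by_user`, and the method names are hypothetical, and a real implementation would issue CQL statements instead. It only illustrates how many rows each approach has to touch.

```python
class MeasurementStore:
    """Hypothetical in-memory model of two tables."""

    def __init__(self):
        # Models device_measurements: primary key -> (value, user_id)
        self.device_measurements = {}
        # Models a lookup table keyed by user_id: user_id -> set of primary keys
        self.measurements_by_user = {}

    def insert(self, device_id, measurement_type, value, user_id):
        key = (device_id, measurement_type)
        previous = self.device_measurements.get(key)
        if previous is not None:
            # Keep the lookup table consistent when a row changes owner.
            self.measurements_by_user.get(previous[1], set()).discard(key)
        self.device_measurements[key] = (value, user_id)
        self.measurements_by_user.setdefault(user_id, set()).add(key)

    def delete_user_by_scan(self, user_id):
        # Filtering on a non-key column: every row has to be inspected.
        rows_touched = 0
        for key in list(self.device_measurements):
            rows_touched += 1
            if self.device_measurements[key][1] == user_id:
                del self.device_measurements[key]
        self.measurements_by_user.pop(user_id, None)
        return rows_touched

    def delete_user_by_lookup(self, user_id):
        # user_id is a key of the lookup table: only the user's rows are touched.
        keys = self.measurements_by_user.pop(user_id, set())
        for key in keys:
            self.device_measurements.pop(key, None)
        return len(keys)
```

The price is an extra write on every insert and an extra table to keep consistent, which is the usual trade-off of query-driven modeling in Cassandra.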

Solution 2: embrace encryption. Okay, this is not a production-ready solution; it’s more of an idea we’re currently playing with at SmartCat. The idea is to encrypt the stored user’s data with a scheme that preserves the ordering of clustering columns (strictly speaking, that property is order-preserving encryption rather than homomorphic encryption) and, when the data needs to be deleted, just delete the key. If you have any thoughts on this or experience to share, we would love to hear from you.
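
The "just delete the key" part can be sketched as follows. This is a toy: the SHA-256 based keystream is an assumption chosen so the example runs with the standard library only, it is not secure, and it does not preserve ordering. `UserVault` and its methods are hypothetical names, and a real system would use a vetted cipher and a proper key management service.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream: SHA-256 over key, nonce, and a counter. NOT secure.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


class UserVault:
    """Hypothetical key store holding one key per user."""

    def __init__(self):
        self._keys = {}

    def encrypt(self, user_id: str, plaintext: bytes) -> bytes:
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        stream = _keystream(key, nonce, len(plaintext))
        # The nonce is stored in front of the ciphertext.
        return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))

    def decrypt(self, user_id: str, blob: bytes) -> bytes:
        key = self._keys[user_id]  # raises KeyError once the user is forgotten
        nonce, body = blob[:16], blob[16:]
        stream = _keystream(key, nonce, len(body))
        return bytes(a ^ b for a, b in zip(body, stream))

    def forget(self, user_id: str) -> None:
        # Deleting the key makes every stored blob of this user unreadable,
        # wherever copies of the encrypted data still exist.
        self._keys.pop(user_id, None)
```

Once `forget` is called, the encrypted values can stay in SSTables, backups, and snapshots until compaction gets to them, because nothing can decrypt them anymore.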

Conclusion

Embrace privacy by design. The idea of GDPR is a good thing from a consumer perspective. A user’s data will be seen as a liability for companies, not as an asset, which means that companies will, hopefully, be cautious when storing a user’s data. GDPR is also an excellent opportunity for new players in the database-as-a-service (DaaS) market or some derivative of the concept; it seems that it is easier to build new systems with privacy in mind from scratch than to refactor the existing ones. What I would like to see is a database (as a service) that would allow me to issue a delete for the userId and then, as a programmer/user of the database, stop worrying about it. The DaaS provider would be responsible for the rest.

What are your thoughts on this?
