2/20/2018

Reading time:14 mins

Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cass…

by DataStax

Speaker notes:
● Essentially, tombstones will never go away as long as a partition contains data in more than one SSTable, and sometimes not even then (bloom filter collisions).
● When you write to Cassandra, the writes initially go to memtables. When the memtables get full, they flush to disk as an immutable SSTable.
● When you perform a read, Cassandra needs to consider all the SSTables on disk, so as you accumulate lots of small SSTables, read performance will degrade.
● What do you think will be in SSTable 4? Optimally it should be an empty table, since the only record in it has been deleted. However…
● What happened?

Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Summit 2016

  1. Deletes Without Tombstones or TTLs
     Eric Stevens, Principal Architect, ProtectWise, Inc.
  2. About ProtectWise
     ©2016 ProtectWise, Inc. All rights reserved.
     An enterprise security company that records, analyzes, and visualizes your network on demand to detect complex threats that others can’t see.
     Big Data
     Data Ingestion and Availability
     ● Well north of a billion new records per day
     ● Processed, analyzed, and stored in soft real time
     ● Fully indexed and searchable with p95 query response times <1 second
       ○ Shortening the OODA loop
     ● Hundreds of Cassandra servers
     ● Hundreds of billions of records
     ● Multiple petabytes of data
  3. Use Case – Super Bowl 50
     The Broncos weren’t the only team from Denver in Levi’s Stadium.
     With one sensor, ProtectWise captured the following data at Super Bowl 50:
     ● 8.806 terabytes of data seen, primarily HTTP, SSL, and traffic to Amazon AWS, Facebook, Twitter, and Instagram
     ● 1.550 terabytes of data captured (82% optimization)
     ● 17 million URLs hit
     ● 8,085,949 DNS requests
     With a single sensor deployed on the Levi's Public Wi-Fi Network, ProtectWise captured 8.806 terabytes of data and was able to optimize it by 82% to just 1.550 terabytes of data, a true testament to the scale and power of our platform.
  4. Overview
     ● How deletes (tombstones) in Cassandra work today
     ● The limitations of tombstones
     ● Misconceptions about tombstones
     ● How TTL (Time to Live) in Cassandra works today
     ● The limitations of TTLs
     ● Why neither strategy works for ProtectWise
     ● Our unconventional solution
     ● Advantages of our solution
     ● Disadvantages of our solution
  5. The Trouble with Tombstones
     Terrific
     ● Surgically target data for removal
     ● Easy to reason about from a read consistency perspective
     Terrible
     ● Increases both write and read I/O pressure
     ● Not an effective means of reclaiming disk capacity
     ● May be difficult to locate correct records for deletion
     ● Makes reads more expensive
     ● Actual tombstones can often greatly outlive their deleted data (much longer than gc_grace)
  6. When do tombstones (and expired TTL’d records) go away?
     ● Never before the tombstone is gc_grace old (this is a good thing, and you get to control it)
     ● During compaction, for a tombstone past gc_grace, its partition key is checked against the bloom filters of all other SSTables for the given CQL table
     ● If there is a bloom filter collision, the tombstone will remain, even if the bloom filter collision was a false positive
     ● If there is ANY data, even other tombstones, for that partition in any SSTable, the tombstone will not get cleaned up
     ● If bloom filters indicate there is no chance of overlap on that partition key, the tombstone will get cleaned up
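The purge rules on this slide can be sketched as a small decision function. This is a simplified model for illustration only, not Cassandra's actual code: the `SSTable` class, the `maybe_contains` set (a stand-in for a bloom filter, including any false positives), and `can_purge_tombstone` are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    name: str
    # Stand-in for a bloom filter: keys the filter reports as "possibly present".
    # A false positive is modeled by listing a key the SSTable does not really hold.
    maybe_contains: set = field(default_factory=set)


def can_purge_tombstone(partition_key, tombstone_age_s, gc_grace_s,
                        compacting, all_sstables):
    """Return True if this compaction may drop the tombstone (simplified model)."""
    # Rule 1: never before the tombstone is gc_grace old.
    if tombstone_age_s < gc_grace_s:
        return False
    # Rule 2: every SSTable outside this compaction must rule out the partition key.
    compacting_names = {s.name for s in compacting}
    for sstable in all_sstables:
        if sstable.name in compacting_names:
            continue
        if partition_key in sstable.maybe_contains:
            return False  # possible overlap (or a false positive): keep the tombstone
    return True


s1 = SSTable("1", {"pk"})
s2 = SSTable("2", {"pk"})
s3 = SSTable("3", {"other"})
print(can_purge_tombstone("pk", 100, 10, [s1], [s1, s2, s3]))      # False: SSTable 2 may overlap
print(can_purge_tombstone("pk", 100, 10, [s1, s2], [s1, s2, s3]))  # True: no overlap outside the compaction
```

The second call only succeeds because every SSTable that might hold the partition takes part in the same compaction, which is the condition the slide describes.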
  7. Misconception about Tombstone Performance
     ● The performance degradation from tombstones doesn’t come from the tombstone itself.
     ● If you do
       ○ for (n <- 0 to 100000) { INSERT INTO table (partitionKey, clusterKey) VALUES (1, n) }
     ● You can later create a range tombstone that is tiny in terms of bytes:
       ○ DELETE FROM table WHERE partitionKey = 1 AND clusterKey < 99999
     ● But if you then
       ○ SELECT * FROM table WHERE partitionKey = 1 LIMIT 1
     ● Cassandra will have to read and then discard rows with clusterKey values from 0 to 99998 before the LIMIT 1 can be reached
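A toy model of the read described on this slide makes the cost concrete. It assumes, as the slide does, that the shadowed rows are still on disk and must be scanned in clustering order; the function name `select_limit_1` is hypothetical and the real read path is more involved.

```python
def select_limit_1(clustering_keys, range_tombstone_upper):
    """Scan keys in clustering order; return (first_live_key, rows_discarded)."""
    discarded = 0
    for clustering_key in clustering_keys:
        if clustering_key < range_tombstone_upper:
            # Row is shadowed by the range tombstone: read it, then throw it away.
            discarded += 1
            continue
        return clustering_key, discarded
    return None, discarded


# Scala's `0 to 100000` is inclusive, so 100001 rows were inserted.
rows = range(0, 100001)
first_live, discarded = select_limit_1(rows, range_tombstone_upper=99999)
print(first_live, discarded)  # prints: 99999 99999
```

The tombstone itself is a single small marker, but the `LIMIT 1` query still walks 99,999 dead rows before it finds a live one.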
  8. [Diagram: SSTable 1 holds partition PK1 with clustering keys CK1 through CKn; SSTable 2 holds a range tombstone for PK1 (DELETE 1 – n-1). Query shown: SELECT * FROM table WHERE pk1 LIMIT 1]
  9. Compaction Review [Diagram labels: “↑ Writes”; “← Older Data”, “Newer Data →”]
  10. Tombstones in Compaction [Diagram labels: “↑ Delete”; “SSTable containing record to delete ↑”]
  11. Tombstones in Compaction [Diagram labels: “↑ Other Writes”; “SSTable containing record to delete ↑”]
  12. Tombstones in Compaction [Diagram labels: “↑ Other Writes”; “SSTable containing record to delete ↑”]
  13. Tombstones in Compaction [Diagram labels: “↑ Other Writes”; “SSTable containing record to delete ↑”]
  14. Tombstones in Compaction [Diagram labels: “↑ Other Writes”; “Finally Deleted ↑”]
  15. Tombstone Demo
      Showing why tombstones are not the same thing as a delete.
  16. Setup
      cqlsh> CREATE TABLE testing(
         … p blob,
         … c blob,
         … v blob,
         … PRIMARY KEY(p,c)
         … ) WITH gc_grace_seconds=0;
  17. Setup
      cqlsh> INSERT INTO testing (p,c,v) VALUES (0xcafebabe, 0xdeadbeef, 0xfacefeed);
      $ nodetool flush && ls *-Data.db
      testing-testing-ka-1-Data.db
      cqlsh> INSERT INTO testing (p,c,v) VALUES (0xcafebabe, 0xdeadbeef, 0xdeadc0de);
      $ nodetool flush && ls *-Data.db
      testing-testing-ka-1-Data.db
      testing-testing-ka-2-Data.db
      SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
      SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
  18. Setup
      cqlsh> DELETE FROM testing WHERE p=0xcafebabe AND c=0xdeadbeef;
      $ nodetool flush && ls *-Data.db
      testing-testing-ka-1-Data.db
      testing-testing-ka-2-Data.db
      testing-testing-ka-3-Data.db
      cqlsh> select * from testing;
       p          | c          | v
      ------------+------------+------------
       0xcafebabe | 0xdeadbeef | 0xdeadc0de
      SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
      SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
      SSTable 3: 0xcafebabe:0xdeadbeef:DELETE
  19. Let’s look at the data
      $ hexdump testing-testing-ka-1-Data.db
      0000000 4b 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
      0000010 00 01 00 72 0a 00 04 de ad be ef 0e 00 71 05 34
      0000020 3b d8 4e df f1 0d 00 14 0b 19 00 29 01 76 1a 00
      0000030 70 04 fa ce fe ed 00 00 6f 9b 15 17
      SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
  20. Let’s look at the data
      $ hexdump testing-testing-ka-2-Data.db
      0000000 4b 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
      0000010 00 01 00 72 0a 00 04 de ad be ef 0e 00 71 05 34
      0000020 3b e3 86 df 23 0d 00 14 0b 19 00 29 01 76 1a 00
      0000030 70 04 de ad c0 de 00 00 62 de 14 02
      SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
  21. Let’s look at the data
      $ hexdump testing-testing-ka-3-Data.db
      0000000 33 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
      0000010 00 01 00 94 07 00 04 de ad be ef ff 10 0a 00 f0
      0000020 00 01 57 4f 2d 69 00 05 34 3b e6 ab 47 c8 00 00
      0000030 db 77 12 69
      SSTable 3: 0xcafebabe:0xdeadbeef:DELETE
  22. Time to Compact
      Simulate compaction happening on data that has been deleted, but where the tombstone is not involved in the compaction.
      % jmx_invoke -m org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction testing-testing-ka-1-Data.db,testing-testing-ka-2-Data.db
      $ ls *-Data.db
      testing-testing-ka-3-Data.db
      testing-testing-ka-4-Data.db
      SSTable 1: 0xcafebabe:0xdeadbeef:0xfacefeed
      SSTable 2: 0xcafebabe:0xdeadbeef:0xdeadc0de
      SSTable 4: 0xcafebabe:0xdeadbeef:??????????
  23. Let’s look again:
      $ hexdump testing-testing-ka-4-Data.db
      0000000 4b 00 00 00 c3 00 04 ca fe ba be 7f ff ff ff 80
      0000010 00 01 00 72 0a 00 04 de ad be ef 0e 00 71 05 34
      0000020 3b e3 86 df 23 0d 00 14 0b 19 00 29 01 76 1a 00
      0000030 70 04 de ad c0 de 00 00 62 de 14 02
      SSTable 4: 0xcafebabe:0xdeadbeef:0xdeadc0de
  24. What happened?
      ● The tombstone for primary key (0xcafebabe,0xdeadbeef) was written in SSTable 3
      ● SSTable 3 wasn’t involved in the compaction
      ● ∴ The data at rest didn’t get cleaned up
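The outcome of the demo can be reproduced with a toy compaction that keeps the newest write per primary key. This is a simplified sketch, not Cassandra's merge logic: the timestamps 1, 2, 3 are illustrative, `TOMBSTONE` and `compact` are hypothetical names, and the model assumes gc_grace has already passed (the demo table uses gc_grace_seconds=0) and that no SSTable outside the compaction overlaps.

```python
TOMBSTONE = object()  # marker value standing in for a delete


def compact(sstables):
    """Merge SSTables: newest timestamp wins per key; a winning tombstone is dropped."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Drop keys whose newest entry is a tombstone (assumes gc_grace has passed).
    return {k: tv for k, tv in merged.items() if tv[1] is not TOMBSTONE}


key = ("0xcafebabe", "0xdeadbeef")
sstable1 = {key: (1, "0xfacefeed")}
sstable2 = {key: (2, "0xdeadc0de")}
sstable3 = {key: (3, TOMBSTONE)}

# Compacting only SSTables 1 and 2: the tombstone is absent, so the data survives.
print(compact([sstable1, sstable2]))            # the 0xdeadc0de value is still at rest
# Compacting all three together: the tombstone shadows both writes.
print(compact([sstable1, sstable2, sstable3]))  # {}
```

Only the second compaction reclaims the space, which is why a tombstone on its own is not the same thing as a delete.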
  25. Why is this a problem?
      ● In all mainline compaction strategies:
        ○ Data written close together chronologically tends to compact together relatively quickly
        ○ Data written chronologically far apart tends to take a long time to compact together
          ■ This is why it’s an anti-pattern to append to or overwrite the same partition over long periods of time: your reads to that partition will end up needing to read out of a large number of SSTables
        ○ Because disk capacity is not recovered until the tombstone and its underlying data are involved in the same compaction, it can take a long time to recover disk capacity
      ● Some compaction strategies (DateTiered, TimeWindowed) have controls that allow data to permanently stop compacting
        ○ Under these conditions there are times when it’s impossible to ever recover disk capacity
      Note: see CASSANDRA-7019 for an upcoming alternative.
      Also see “Improving Tombstone Compactions” today at 4:10 in 210C.
  26. The Trouble with TTLs
      Terrific
      ● Fire and forget: your data will “go away” fairly predictably
      Trouble
      ● Once a TTL has been written, there is no way to change your mind except to write the record again with a new TTL
      ● Rows written to more than one time may have inconsistent TTLs, leading to dirty or incomplete reads
      ● TTL’d records may remain at rest much longer than you realize in some circumstances
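The "inconsistent TTLs" point follows from TTLs being attached to each written cell rather than to the row as a whole. The sketch below is a toy model under that assumption; the `write` and `read` helpers and the `src`/`dst` column names are hypothetical, not part of any Cassandra API.

```python
from typing import Dict, Optional, Tuple

# column name -> (value, expires_at); expires_at of None means no TTL
Row = Dict[str, Tuple[str, Optional[int]]]


def write(row: Row, now: int, ttl: Optional[int], **columns: str) -> None:
    """Write the given columns; each cell carries its own expiry time."""
    expires_at = None if ttl is None else now + ttl
    for name, value in columns.items():
        row[name] = (value, expires_at)


def read(row: Row, now: int) -> Dict[str, str]:
    """Return only the cells that have not yet expired."""
    return {
        name: value
        for name, (value, expires_at) in row.items()
        if expires_at is None or now < expires_at
    }


row: Row = {}
write(row, now=0, ttl=100, src="10.0.0.1", dst="10.0.0.2")  # full row, short TTL
write(row, now=50, ttl=500, dst="10.0.0.3")                 # partial update, long TTL

early = read(row, now=60)   # both columns are still live
late = read(row, now=120)   # src expired at t=100, dst lives until t=550
```

At `now=120` the read returns only `dst`, so the application sees a partial row even though it never deleted anything.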
  27. Why Neither Strategy Works for Us
      ● Customers get to change their mind about how long they want us to retain their data
      ● Changing TTLs is expensive, both in terms of I/O pressure and in temporarily doubling the size of your data at rest
      ● Disks are cheap… lots of disks are not
      ● Cassandra data at rest has an ongoing cost; if a customer stops paying for it, we need to stop paying for it as well
      ● Timeliness of deletes is important
      ● Sensitive data spillage means we need to remove some data quickly
  28. Our Unconventional Solution
  29. Basic Strategy
      Successfully used to delete significant amounts of data with little to no performance impact
      Step 1: Set RF=1
      ● There are some weird anti-entropy corner cases that are solved if you disable replication
      Step 2: Disconnect Drive
      ● If you have hot swappable drives, this is a lot easier; if not, you might have some temporary downtime due to the RF change
  30. Step 3
  31. Deleting Compaction Strategy
  32. Basic Strategy
      For real this time.
      Delete While Compacting
      ● During compaction, use deterministic logic to determine which records should be removed
      ● Prevent records from surviving the compaction process
      ● Clean up indexes at the time the record is removed
      Evicting Compaction Strategy
      ● Records are removed from the next compaction as soon as they should be evicted
      ● If we need to recover capacity quickly we can use user defined compaction to selectively target our oldest files
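The "delete while compacting" idea can be sketched as a filter placed over the record stream that compaction already reads. This is an illustration only, not the project's code: `filtering_iterator`, the `should_evict` rule, the tenant-prefixed keys, and the dict standing in for a secondary index are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (primary key, value)


def filtering_iterator(
    records: Iterable[Record],
    should_evict: Callable[[str], bool],
    on_evict: Callable[[Record], None],
) -> Iterator[Record]:
    """Yield only records that survive; report the rest for index cleanup."""
    for record in records:
        if should_evict(record[0]):
            on_evict(record)  # e.g. remove the matching index entry
        else:
            yield record


# Hypothetical data: keys are "<tenant>/<id>", the index maps value -> key.
records = [("acme/1", "alpha"), ("globex/1", "beta"), ("acme/2", "gamma")]
index: Dict[str, str] = {value: key for key, value in records}


def evict_globex(key: str) -> bool:
    """Deterministic rule: depends only on the primary key."""
    return key.startswith("globex/")


def drop_from_index(record: Record) -> None:
    index.pop(record[1], None)


compacted = list(filtering_iterator(records, evict_globex, drop_from_index))
```

After the pass, `compacted` holds only the two `acme` records and the index no longer references the evicted one, with no tombstone ever written.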
  33. Features
      Does it support feature X of my preferred compaction strategy?
      Wrapping Compaction Strategy
      ● Acts as a parent strategy with your preferred child compaction strategy
      ● Child strategy is responsible for SSTable selection
      ● You get the characteristics of your strategy, with the deletes of our strategy
      Backing up your deletes
      ● If you choose to, you can automatically create a backup of the deleted records
      ● Save yourself from deletion remorse
        ○ Incorrect deletion logic
        ○ Change of heart by you(r customer)
      ● Move those records to cheaper storage
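The parent/child split described above can be sketched as plain delegation. The classes below are hypothetical stand-ins rather than the real Cassandra strategy classes: `SmallestFirst` plays the role of the preferred child strategy, and `DeletingWrapper` adds record filtering without touching selection.

```python
from typing import Callable, Dict, List

SSTables = Dict[str, List[str]]  # file name -> primary keys it holds


class SmallestFirst:
    """Stand-in child strategy: pick the `count` smallest SSTables."""

    def __init__(self, count: int = 2) -> None:
        self.count = count

    def select(self, sstables: SSTables) -> List[str]:
        by_size = sorted(sstables, key=lambda name: (len(sstables[name]), name))
        return by_size[: self.count]


class DeletingWrapper:
    """Parent strategy: the child chooses files, the parent filters records."""

    def __init__(self, child: SmallestFirst, should_evict: Callable[[str], bool]) -> None:
        self.child = child
        self.should_evict = should_evict

    def select(self, sstables: SSTables) -> List[str]:
        return self.child.select(sstables)  # selection is fully delegated

    def compact(self, sstables: SSTables) -> List[str]:
        keys = set()
        for name in self.select(sstables):
            keys.update(sstables[name])
        return sorted(key for key in keys if not self.should_evict(key))


sstables = {"s1": ["a", "b", "c", "d"], "s2": ["a", "x"], "s3": ["x", "y"]}
wrapper = DeletingWrapper(SmallestFirst(2), lambda key: key.startswith("x"))
chosen = wrapper.select(sstables)
survivors = wrapper.compact(sstables)
```

Because `select` only forwards to the child, swapping in a different child changes which files compact together but not which records are evicted.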
  34. Features
      ● Configurable and extensible
        ○ Several provided implementations can be controlled fairly surgically by reading deletion rules out of a table you specify
        ○ Extend one of several base classes to provide more sophisticated custom logic
      ● Restoring backups
        ○ To restore accidentally deleted records, copy these files to the right path and run nodetool refresh
        ○ Or, if your topology has changed, you can restore them with sstableloader
  35. A Wrapping Compaction Strategy
      Doesn’t change the fundamental characteristics of your preferred compaction strategy
      ALTER TABLE foo WITH compaction = {
        'class': 'DeletingCompactionStrategy',
        'dcs_underlying_compactor': 'SizeTieredCompactionStrategy',
        'min_threshold': '2',
        'max_threshold': '8'
      };
      ALTER TABLE bar WITH compaction = {
        'class': 'DeletingCompactionStrategy',
        'dcs_underlying_compactor': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 160
      };
  36. Compaction’s Inner Workings
      [Figure: Cassandra write path. Credit: DataStax, https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_write_path_c.html]
  37. Compaction’s Inner Workings
      [Figure: Cassandra write path. Credit: DataStax, https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_write_path_c.html]
      Annotation: the compaction strategy selects SSTables and returns SSTableIterators.
  38. Compaction’s Inner Workings
      [Figure: Cassandra write path. Credit: DataStax, https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_write_path_c.html]
      Annotation: FilteringSSTableIterators exclude data which should be deleted, and also notify IndexManager, if appropriate, to clean up associated indexes.
  39. An Evicting Compaction Strategy
      Records involved in compaction which are convicted do not survive into the newly compacted SSTable
      Rules: A => ✓, B => ✗, C => ✓, D => ✗, E => ✓
      SSTable 1: A B C
      SSTable 2: A B D
      SSTable 3: C D E
      New SSTable: A C E
      Backup SSTable*: B D
      * if configured to back up convicted records
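The slide's example can be worked through in a few lines. This is a sketch under simplifying assumptions: records are reduced to their keys, `evicting_compaction` is a hypothetical helper, and the rules table maps each key to whether it survives.

```python
from typing import Dict, List, Tuple

# True means the record survives (✓); False means it is convicted (✗).
RULES: Dict[str, bool] = {"A": True, "B": False, "C": True, "D": False, "E": True}

SSTABLE_1 = ["A", "B", "C"]
SSTABLE_2 = ["A", "B", "D"]
SSTABLE_3 = ["C", "D", "E"]


def evicting_compaction(
    sstables: List[List[str]], rules: Dict[str, bool], backup: bool = True
) -> Tuple[List[str], List[str]]:
    """Merge the inputs, splitting records into survivors and convicted.

    Convicted records go to the backup list only when `backup` is enabled;
    otherwise they are simply discarded.
    """
    survivors, convicted = set(), set()
    for table in sstables:
        for key in table:
            if rules[key]:
                survivors.add(key)
            else:
                convicted.add(key)
    backup_records = sorted(convicted) if backup else []
    return sorted(survivors), backup_records


new_sstable, backup_sstable = evicting_compaction(
    [SSTABLE_1, SSTABLE_2, SSTABLE_3], RULES
)
```

The result is a new SSTable holding A, C, and E and a backup SSTable holding B and D, as on the slide.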
  40. Often Faster than Existing Compaction
      ● Compaction performance is often bounded by available write capacity
      ● Fewer records surviving into the target table reduces write pressure during compaction
      ● Testing of records for conviction is lightweight (depending on the complexity of your business logic), and mostly CPU bound
  41. Limitations
      Eventual Deletes
      ● Like all other baked-in deletion options, disk capacity is reclaimed only eventually
      ● Old SSTables still tend not to compact very frequently
      ● However, by triggering user defined compaction, you can reclaim space immediately without resorting to major compaction
      Boundary Consistency
      ● Records past the deletion boundary may still be visible to your application
      ● You may get inconsistent reads for such records
      ● Evicted records may resurrect temporarily due to repair
      ● They’ll end up in a new SSTable and will be evicted again during the next automatic compaction
  42. Limitations
      Requires deletion determinism
      ● Logic for deletes needs to be deterministic or you’ll end up with consistency issues
      ● Probably not a good idea to base any deletion logic on anything outside of the primary key except in narrow use cases
      Repair = Resurrection
      ● Read repair, and in general any repair, may cause a record to fully resurrect temporarily
      ● The resurrected record will appear in the youngest SSTables
      ● It will disappear again when those new SSTables next compact (generally relatively quickly for an active cluster)
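Both limitations can be seen in a toy model with two replicas. Everything here is assumed for illustration: replicas are sets of keys, the primary key is a hypothetical `(tenant, day)` pair, and `CUTOFF_DAY` stands in for whatever retention rule is configured.

```python
from typing import Set, Tuple

Key = Tuple[str, int]  # (tenant, day bucket): hypothetical primary key

CUTOFF_DAY = 100  # assumed retention boundary


def should_evict(key: Key) -> bool:
    """The verdict depends only on the primary key, so every replica agrees."""
    return key[1] < CUTOFF_DAY


def compact(replica: Set[Key]) -> Set[Key]:
    """Drop every convicted record from the replica."""
    return {key for key in replica if not should_evict(key)}


def repair(target: Set[Key], source: Set[Key]) -> Set[Key]:
    """Anti-entropy: copy over whatever the target is missing."""
    return target | source


old, new = ("acme", 42), ("acme", 150)

replica_a = compact({old, new})  # this replica has already evicted `old`
replica_b = {old, new}           # this replica has not compacted yet

after_repair = repair(replica_a, replica_b)  # `old` resurrects on replica A
after_compaction = compact(after_repair)     # and is evicted again
```

Because `should_evict` looks only at the key, the resurrected record gets the same verdict on every replica and every pass, which is why the resurrection is temporary. A rule based on mutable, non-key data could convict a record on one replica and spare it on another.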
  43. Current Project Status
      ● Supports, and is tested against, the Cassandra 2.x series
      ● In 3.x the package and class names changed, so it needs to be ported
      ● Tests are written in Scala; they cover a lot of surface area but would need to be rewritten prior to contribution
      ● Needs additional general purpose convictors
      ● Principally tested against STCS and deserves better coverage for other child strategies
  44. Availability & Compatibility
      GitHub: https://github.com/protectwise/cassandra-util
      Also includes:
      ● Our DataStax Driver Wrapper for Scala
      ● Our CCM wrapper lib for automating unit tests in Scala
  45. We’re Hiring!
      Scala, Akka, Spark, Node, DevOps
      www.protectwise.com/careers.html
      Especially if you’re in Denver!
  47. See Our Other Talks
      ● Cold Storage that Isn’t Glacial: Tomorrow 10:45, Room LL20D
      ● Using Approximate Data for Small, Insightful Analytics: Tomorrow 2:00, Room LL20A
