
2/13/2018

Reading time: 4 min

More Efficient Repairs in 2.1

by John Doe



Repairs are important for every Cassandra cluster, especially when data is frequently deleted. Running the nodetool repair command initiates the repair process on a specific node, which in turn computes a Merkle tree for each range of data on that node. A Merkle tree is a binary tree of hashes that Cassandra uses to calculate the differences in datasets between nodes in a cluster. This makes repairs network-efficient, since only the rows the Merkle trees identify as inconsistent are sent across the network. Building the trees, however, is not cheap: every time a repair is carried out, each node involved must construct its Merkle tree from all the sstables it stores, making the calculation very expensive.
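
For reference, a classic full repair is kicked off per node with nodetool; the keyspace name below is a placeholder:

    # Full repair of one keyspace on the node this command runs against
    # ("my_ks" is an illustrative keyspace name)
    bin/nodetool repair my_ks

Run without arguments, nodetool repair covers all keyspaces on the node.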

Scanning every sstable to build these Merkle trees is an expensive operation. To avoid the need for constant tree construction, incremental repairs are being introduced in Cassandra 2.1. The idea is to persist already-repaired data and only calculate Merkle trees for sstables that haven't previously undergone repair. As long as repairs are run frequently, this keeps the repair process performant and lightweight even as datasets grow.

[Figure: Tree construction comparison between regular and incremental repairs]

Incremental repairs begin with the repair leader sending out a prepare message to its peers. Each node builds a Merkle tree from its unrepaired sstables, which it can distinguish by the new repairedAt field in each sstable's metadata. Once the leader receives a Merkle tree from each node, it compares the trees and issues streaming requests, just as in the classic repair case. Finally, the leader issues an anticompaction command. Anticompaction is the process of segregating repaired and unrepaired ranges into separate sstables; repaired sstables are written with a new repairedAt field denoting the time of repair. Since sstables are not locked against compaction during the repair, they might be removed via compaction before the process completes. This costs some efficiency, since that data will be repaired again later, but does not harm correctness.
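
While a repair is running, the streaming and anticompaction phases can be watched from another shell with standard nodetool commands:

    # Streaming sessions opened in response to the leader's requests
    bin/nodetool netstats

    # In-progress compactions, including the anticompaction pass
    bin/nodetool compactionstats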

Compaction with incremental repairs

Maintaining separate pools of repaired and unrepaired sstables causes some extra complexity for compaction to deal with. For example, in the diagram below we repair a range covering half of the initial sstable. After repair, anticompaction splits it into a set of repaired and unrepaired sstables, at which point leveled and size-tiered compaction strategies handle segregation of the data differently.

Size-tiered compaction takes the simple approach of splitting repaired and unrepaired sstables into separate pools, each of which is compacted independently. Leveled compaction performs size-tiered compaction on the unrepaired data, moving it into the proper levels after repair. This cuts down on write amplification compared to maintaining two leveling pools.

[Figure: Size-tiered compaction vs. leveled compaction with incremental repairs]

Migrating to incremental repairs

Full repairs remain the default, largely so Cassandra doesn't have to guess the repaired state of existing sstables. Guessing that everything is fully repaired is obviously problematic; guessing that nothing is repaired is less obviously so: LCS would start size-tiering everything, since that is what it does now with unrepaired data! To avoid this, compaction remains unchanged until incremental repair is first performed and compaction detects sstables with the repairedAt flag.

Incremental repairs can be opted into via the -inc option to nodetool repair. This is compatible with both sequential and parallel (-par) repair, e.g., bin/nodetool repair -par -inc <ks> <cf>. When an sstable is fully covered by a repaired range, no anticompaction will occur; only the repairedAt field in the sstable's metadata is rewritten. Recovering from missing data or corrupted sstables still requires a non-incremental full repair. (For more on anticompaction, see Marcus's post here.)
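
Concretely, a parallel incremental repair of a single table looks like this (keyspace and table names are placeholders):

    # Parallel (-par), incremental (-inc) repair of one table
    # "my_ks" and "my_cf" are illustrative names
    bin/nodetool repair -par -inc my_ks my_cf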

Effect of tools and commands on repair status

Since an sstable's repair status is now tracked in its metadata, it is important to understand how the tools shipped with open-source Cassandra can affect that status.

  • Bulk loading - loaded sstables will be marked unrepaired, even if they were repaired in a different cluster.
  • Scrubbing - if scrubbing results in dropped rows, the new sstables become unrepaired; if no bad rows are detected, the sstable keeps its original repairedAt field.
  • Major compaction - under STCS, major compaction combines each of its pools into a single sstable, one repaired and one unrepaired. Major compaction continues to have no effect under LCS.
  • Setting repaired status - a new tool added in 2.1 beta 2, tools/bin/sstablerepairedset, lets users manually mark an sstable as repaired via sstablerepairedset --is-repaired <sstable>, which makes migrating to incremental repairs easy. It's important to only use this tool on sstables that really have been repaired; an sstable's status can be checked with the tools/bin/sstablemetadata tool by looking at the repairedAt field. A minimal migration sketch follows this list.
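
As a sketch of that migration, assuming the node has been stopped first (the usual advice for tools that rewrite sstable metadata); the data paths below are illustrative:

    # Mark every sstable of one table as repaired
    # (paths are placeholders; adjust to your data directory)
    for f in /var/lib/cassandra/data/my_ks/my_cf/*Data.db; do
        tools/bin/sstablerepairedset --is-repaired "$f"
    done

    # Verify: the repairedAt field should now be non-zero
    tools/bin/sstablemetadata /var/lib/cassandra/data/my_ks/my_cf/*Data.db | grep -i repaired

Only do this for data you know has actually been repaired; marking unrepaired data as repaired means incremental repairs will never revisit it.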
