Repair in Cassandra

by John Doe

2/13/2018 · Reading time: 4 min


We have talked about new and advanced features available in repair a few times, but in this post I am going to give an overview of repair and how to think about what happens when you use it.

[Figure: 10 node ring]

When thinking about repair in Cassandra you need to think in terms of token ranges, not tokens. Let's start with the simple case of one data center with 10 nodes and 100 tokens, where node N0 is assigned token 0, N1 is assigned token 10, and so on.

With a replication factor of 3, N3 will own data for tokens 1–30. And if we look at where that data will be replicated, you get range 1–10 shared by N1, N2, N3; 11–20 shared by N2, N3, N4; 21–30 shared by N3, N4, N5.
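To make that ownership concrete, here is a minimal Python sketch of the example ring. The node names, the replicas_for_range helper, and the "primary owner plus the next two nodes clockwise" rule are assumptions about this toy example, not Cassandra's own placement code.

```python
# Toy model of the 10-node example ring (illustrative only, not Cassandra's code).
# Node Ni owns tokens (10*(i-1), 10*i], and with RF=3 a range is replicated on
# the node that owns it plus the next two nodes clockwise.

NUM_NODES = 10
RF = 3

def replicas_for_range(owner: int) -> list[str]:
    """Nodes holding the range owned by node `owner` (primary + next RF-1 clockwise)."""
    return [f"N{(owner + i) % NUM_NODES}" for i in range(RF)]

for owner in range(1, 4):
    lo, hi = 10 * (owner - 1) + 1, 10 * owner
    print(f"tokens {lo}-{hi}: {replicas_for_range(owner)}")
# tokens 1-10:  ['N1', 'N2', 'N3']
# tokens 11-20: ['N2', 'N3', 'N4']
# tokens 21-30: ['N3', 'N4', 'N5']
# ...so N3 holds data for tokens 1-30, as described above.
```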

When nodetool repair is run against a node it initiates a repair for some range of tokens. The range being repaired depends on what options are specified. The default options, just calling “nodetool repair”, initiate a repair of every token range owned by the node. The node you issued the call to becomes the coordinator for the repair operation, and it coordinates repairing those token ranges between all of the nodes that own them.

So for our example cluster, if you call “nodetool repair” on N3, the first thing that will happen is that the 1–30 range is split into sub-ranges based on the groups of nodes which share that data. For each of those sub-ranges, the 3 nodes that own it will compare their data and fix up any differences between themselves. So for sub-range 1–10, N1 and N2 compare, N1 and N3 compare, and N2 and N3 compare. After the repair finishes, those 3 nodes will be completely in sync for that token range. A similar process happens for ranges 11–20 (N2, N3, N4) and 21–30 (N3, N4, N5). So just by running “nodetool repair” on N3, there will be 5 nodes repairing data with each other, and at the end tokens 1–30 will be in sync across their respective owners. Now if we next run a “nodetool repair” on N4, we will repair the range 11–40 in a similar fashion.
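As a rough illustration of the sessions that a full repair on N3 coordinates, here is a sketch; the subranges mapping and the pairwise loop are assumptions about this toy example, not Cassandra internals.

```python
# Illustrative sketch: the sub-range sessions a full "nodetool repair" on N3
# coordinates, and the pairwise comparisons inside each one.
from itertools import combinations

# Sub-ranges of N3's data (tokens 1-30) and the nodes that replicate each one,
# taken from the example ring above.
subranges = {
    "1-10":  ["N1", "N2", "N3"],
    "11-20": ["N2", "N3", "N4"],
    "21-30": ["N3", "N4", "N5"],
}

participants = set()
for rng, nodes in subranges.items():
    participants.update(nodes)
    for a, b in combinations(nodes, 2):
        print(f"range {rng}: {a} and {b} compare and stream differences")

print(f"nodes involved: {sorted(participants)}")  # N1..N5 -> 5 nodes in total
```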

But wait, we just repaired range 11–30; if we repair that again we will be wasting resources. This is where the “-pr” option comes in to help. When you use “nodetool repair -pr”, each node picks a subset of its token range to schedule for repair, such that if “-pr” is run on EVERY node in the cluster, every token range is repaired exactly once. What that means is that whenever you use -pr, you need to be repairing the entire ring (every node in every data center). If you use “-pr” on just one node, or just the nodes in one data center, you will only repair a subset of the data on those nodes. This is very important, so I’m going to say it again: if you are using “nodetool repair -pr” you must run it on EVERY node in EVERY data center, no skipping allowed.
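A minimal sketch of what “-pr” covers on the toy ring, assuming N0's primary range wraps around to tokens 91–100; the primary_range helper is a hypothetical illustration, not part of nodetool.

```python
# Illustrative sketch (toy ring assumptions, not Cassandra's code): with "-pr",
# each node repairs only its primary range -- the range it owns directly.
NUM_NODES = 10

def primary_range(node: int) -> range:
    """Primary range of Ni on the 100-token example ring (N0 wraps around to 91-100)."""
    if node == 0:
        return range(91, 101)  # (90, 100], treating token 0 as the wrap-around point
    return range(10 * (node - 1) + 1, 10 * node + 1)

covered = []
for node in range(NUM_NODES):
    r = primary_range(node)
    covered.extend(r)
    print(f"nodetool repair -pr on N{node} repairs tokens {r.start}-{r.stop - 1}")

# Running -pr on every node covers every token exactly once; skip a node and
# its primary range simply never gets repaired.
assert sorted(covered) == list(range(1, 101))
```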

Back to the example ring. Since we don’t want to repair a given piece of data multiple times, we are going to use the “-pr” option. When you issue “nodetool repair -pr” against node N3, only the range 21–30 will be repaired. Similarly, “nodetool repair -pr” on N4 will repair 31–40. As you can see we no longer have any overlapping data being repaired, but this also means that if you only ran “nodetool repair -pr” on N3, tokens 1–20 would not have been repaired. This is very important to know if you are running repair to fix a problem, and not as part of general maintenance. When running repair to fix a problem, like a node being down for longer than the hint window, you need to repair the entire token range of that node. So you can’t just run “nodetool repair -pr” on it. You need to initiate a full “nodetool repair” on it, or do a full cluster repair with “nodetool repair -pr”.

If you have multiple data centers, by default when running repair all nodes in all data centers will sync with each other on the range being repaired. So for an RF of {DC1:3, DC2:3}, for a given token range there will be 6 nodes all comparing data with each other and streaming any differences back and forth. If you have 4 data centers {DC1:3, DC2:3, DC3:3, DC4:3}, you will have 12 nodes all comparing with each other and streaming data to each other at the same time for each token range. This makes using “-pr” even more important: without it, running repair on every node in the cluster repairs a given token range 3 + 3 + 3 + 3 = 12 times in the 4 DC case.
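The arithmetic, spelled out as a short sketch; the rf mapping simply mirrors the example replication settings above.

```python
# Back-of-the-envelope arithmetic for the multi-DC example above (illustrative only).
rf = {"DC1": 3, "DC2": 3, "DC3": 3, "DC4": 3}

# All replicas of a range, across every data center, take part in the same repair session.
replicas_per_range = sum(rf.values())
print(f"nodes syncing each token range: {replicas_per_range}")     # 3 + 3 + 3 + 3 = 12

# Without -pr, running repair on every node repairs each range once per replica;
# with -pr on every node, each range is repaired exactly once.
print(f"repairs of each range without -pr: {replicas_per_range}")  # 12
print("repairs of each range with -pr on every node: 1")
```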

A new feature available in Cassandra 2.1 is incremental repair. You can see the linked blog post for an in-depth description of it, but in brief, incremental repair only performs the synchronization steps described above on data that has not been repaired previously. This helps to greatly reduce the amount of time it takes to repair new data.
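A purely conceptual sketch of that idea: Cassandra tracks repaired data per SSTable, but the SSTable class and tables_to_sync function below are hypothetical names for illustration, not Cassandra's actual types.

```python
# Conceptual sketch only: incremental repair skips data already marked as repaired.
from dataclasses import dataclass

@dataclass
class SSTable:
    name: str
    repaired: bool  # stands in for Cassandra's per-SSTable repaired metadata

def tables_to_sync(sstables: list[SSTable], incremental: bool) -> list[SSTable]:
    """Full repair looks at everything; incremental repair only at unrepaired data."""
    if incremental:
        return [t for t in sstables if not t.repaired]
    return sstables

tables = [SSTable("old-1", True), SSTable("old-2", True), SSTable("new-3", False)]
print([t.name for t in tables_to_sync(tables, incremental=True)])   # ['new-3']
print([t.name for t in tables_to_sync(tables, incremental=False)])  # all three
```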


DataStax has many ways for you to advance in your career and knowledge. You can take free classes, get certified, or read one of our many white papers.
