
8/19/2020

Reading time: 3 min

Backup and restore a Cassandra cluster

by John Doe



We need a backup/restore mechanism in Cassandra to deal with data loss from failures (e.g. hardware failures, accidental deletes, deletes caused by client errors, etc.). Cassandra provides various ways to back up and restore data (e.g. snapshots, incremental backups, file copying, etc.). Read more about Cassandra backup and restore options here. In this post I'm going to discuss manually backing up and restoring Cassandra data with file copying. Cassandra keeps its data in SSTable files on disk. SSTables are stored in keyspace directories within the data directory path ($CASSANDRA_HOME/data/data). The data directory path is specified by the data_file_directories parameter in the cassandra.yaml file. When taking a backup we can take a copy of this data directory with its SSTables. When we need to restore, we simply copy them back into the new node's data directory and refresh the nodes (e.g. with nodetool refresh or sstableloader). In this post I'm going to discuss backing up and restoring a multi-node Cassandra cluster with this file-copying method. All the source code related to this post is available on GitLab. Please clone the repo and follow along.
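For reference, here is a quick way to check the layout described above from a shell on a node. The CASSANDRA_HOME value is an assumption for the install path and depends on your image or package:

# data_file_directories in cassandra.yaml points at the data directory
CASSANDRA_HOME=${CASSANDRA_HOME:-/opt/elassandra}   # assumed install path, adjust as needed
grep -A1 'data_file_directories' "$CASSANDRA_HOME/conf/cassandra.yaml"

# one sub-directory per keyspace, one per table, with the SSTable
# components (*-Data.db, *-Index.db, ...) inside
ls "$CASSANDRA_HOME/data/data/"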

I run Cassandra with the Elassandra docker image. Elassandra is built by combining Elasticsearch with Cassandra: it ships Elasticsearch as a Cassandra plugin, so it exposes both the Cassandra API and the Elasticsearch API. When data is saved in Cassandra it is automatically indexed in Elasticsearch. Read more about Elassandra here. Following is the docker-compose.yml to run the Cassandra cluster with the Elassandra docker image.
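The exact compose file is in the GitLab repo mentioned above; the sketch below is a reconstruction showing the idea for two nodes. The image tag, cluster name, environment variables, published port, and container data path (/var/lib/cassandra) are assumptions based on the standard Elassandra docker image, so adjust them to your setup:

# write a minimal two-node compose file (assumptions noted above)
cat > docker-compose.yml <<'EOF'
version: '3'
services:
  cass1:
    image: strapdata/elassandra            # assumed image; pin a specific tag in practice
    container_name: cass1
    environment:
      - CASSANDRA_CLUSTER_NAME=casstro     # assumed cluster name
      - CASSANDRA_SEEDS=cass1
    ports:
      - "9200:9200"                        # Elasticsearch HTTP API
    volumes:
      - /private/var/services/casstro/cass1:/var/lib/cassandra
  cass2:
    image: strapdata/elassandra
    container_name: cass2
    environment:
      - CASSANDRA_CLUSTER_NAME=casstro
      - CASSANDRA_SEEDS=cass1
    volumes:
      - /private/var/services/casstro/cass2:/var/lib/cassandra
EOF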

Now I can start the cluster one node at a time. The main thing to notice here is that I have added a docker volume mapping from each Cassandra node's data directory to the /private/var/services/casstro directory on my host machine. Each Cassandra node's data will be stored in this directory.
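With the compose sketch above, starting the nodes one at a time looks like this; the service names (cass1, cass2) come from the sketch, not the original repo:

# start the seed node first and wait for it to report UN (Up/Normal)
docker-compose up -d cass1
docker exec cass1 nodetool status

# then bring up the next node and check the ring again
docker-compose up -d cass2
docker exec cass2 nodetool status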

To demonstrate the backup/restore process I have created a Cassandra keyspace (storage_document) with a table (documents) and bootstrapped sample data into the table.
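Only the keyspace and table names (storage_document, documents) come from the post; the column layout and sample row below are illustrative assumptions:

# create the keyspace and table, then bootstrap one sample row via cqlsh
docker exec cass1 cqlsh -e "
CREATE KEYSPACE IF NOT EXISTS storage_document
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
CREATE TABLE IF NOT EXISTS storage_document.documents (
  id      uuid PRIMARY KEY,
  name    text,
  content text
);
INSERT INTO storage_document.documents (id, name, content)
  VALUES (uuid(), 'doc1', 'sample content');"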

When taking the backup I copy the contents of each Cassandra node's data directory (where the SSTables live) to another location (e.g. another disk, AWS S3, Google Cloud Storage, etc.). In this example I'm copying the content to another directory on my local machine for demonstration purposes. Before backing up the data directory (with its existing SSTables), I need to run nodetool flush on each source node to ensure that any data in Memtables is written out to SSTables on disk; otherwise data still sitting in the Memtables would be missing from the backup. Following is the way to take the backup by copying the files in each Cassandra node's data directory.
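A sketch of that backup step, assuming the host volume mapping from the compose sketch above (the backup target directory is arbitrary, and you may need sudo depending on file ownership):

# flush Memtables to SSTables on every node, then copy its data directory
for node in cass1 cass2; do
  docker exec "$node" nodetool flush
  mkdir -p "/private/var/services/backup/$node"
  cp -r "/private/var/services/casstro/$node/data" \
        "/private/var/services/backup/$node/"
done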

To restore the backup I can simply copy the backup folders back into the new node's data directory. Following is the way to do that. In this example I'm copying the backup folders back into the same Cassandra nodes in the cluster for demonstration purposes. I first truncate the documents table to simulate a fresh cluster. To apply the restore I need to refresh the cluster with nodetool refresh.
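A matching restore sketch under the same assumptions: truncate the table to simulate data loss, copy the backed-up keyspace directory back into each node's data directory, then reload it with nodetool refresh:

# wipe the table, copy the backed-up SSTables back, and reload them
docker exec cass1 cqlsh -e "TRUNCATE storage_document.documents;"

for node in cass1 cass2; do
  cp -r "/private/var/services/backup/$node/data/storage_document" \
        "/private/var/services/casstro/$node/data/"
  docker exec "$node" nodetool refresh storage_document documents
done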

The same method can be used to back up and restore data in Elassandra as well. Elassandra keeps its Elasticsearch Lucene index data in the $CASSANDRA_HOME/elasticsearch.data directory. If we want, we can back up the elasticsearch.data directory and restore it too. Otherwise we can back up only the Cassandra SSTables in the $CASSANDRA_HOME/data/data directory; when the nodes are refreshed with nodetool refresh, the Elasticsearch index is automatically repopulated with the data in Cassandra. Following is the content of the Elassandra index before and after restoring; it contains all the data in the Cassandra documents table after restoring the backup.
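One way to compare the index content before and after the restore is through Elassandra's Elasticsearch HTTP API (port 9200 as published in the compose sketch). By Elassandra convention the index maps to the keyspace, but treat the index name here as an assumption:

# count and inspect the indexed documents via the Elasticsearch API
curl -s "http://localhost:9200/storage_document/_count?pretty"
curl -s "http://localhost:9200/storage_document/_search?pretty"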

There are some open-source tools that can be used to back up and restore data in Cassandra. These tools are built on Cassandra's snapshot and incremental backup strategies. Tablesnap and Medusa are two such open-source tools. Medusa is a command-line backup and restore tool originally built by Spotify to replace their legacy backup system. The Last Pickle (TLP) was then hired to take over development of Medusa, make it production ready, and open source it. In a coming post I will discuss using Medusa to back up and restore Cassandra cluster data.

  1. https://dev.to/ecnepsnai/all-about-apache-cassandra-snapshots-3oo2
  2. https://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
  3. https://docs.datastax.com/en/ddac/doc/datastax_enterprise/operations/opsBackupRestoreTOC.html
  4. https://8kmiles.com/blog/cassandra-backup-and-restore-methods/
  5. https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html
  6. https://medium.com/rahasak/elassandra-936ab46a6516
  7. https://saumitra.me/blog/how-cassandra-stores-data-on-filesystem/
