Snap Cassandra to S3 with tablesnap

by John Doe

4/3/2018

Reading time: 2 min

We were trying to find a solution to back up and restore Cassandra hosted on AWS without losing too much data. We tried the native nodetool snapshot, but with periodic snapshots alone we would lose the data written between snapshots.

We found a tool which seems to do the job well: tablesnap. It provides scripts to back up and restore SSTables, and it lets us restore data from a specific point in the past.

The following scripts are part of the project:

  • tablesnap: backs up the Cassandra data folder to S3
  • tableslurp: restores data files from S3
  • tablechop: deletes older data files from S3

The tablesnap script continuously monitors the Cassandra data folder and copies any newly created files to the configured S3 bucket. It also creates a JSON file listing all the files present in the data folder at that moment, which makes it possible to restore to a specific point in the past.

The usage of the tool is as follows:

tablesnap [-h] -k AWS_KEY -s AWS_SECRET [-r] [-a] [-B] [-p PREFIX]
                 [--without-index] [--keyname-separator KEYNAME_SEPARATOR]
                 [-t THREADS] [-n NAME] [-e EXCLUDE | -i INCLUDE]
                 [--listen-events {IN_MOVED_TO,IN_CLOSE_WRITE}]
                 [--max-upload-size MAX_UPLOAD_SIZE]
                 [--multipart-chunk-size MULTIPART_CHUNK_SIZE]
                 bucket paths [paths ...]

This tool depends on a few basic principles of how Cassandra stores data:

  • The data is stored as a series of files.
  • SSTables are immutable; once created they are never modified.
  • During compaction the old SSTable files are deleted and a new file is created, instead of updating the existing file.
  • New SSTables are created in a temporary location and then moved into the data folder. The tool listens for IN_MOVED_TO or IN_CLOSE_WRITE events, so a file is already complete and consistent when it is uploaded to S3.

The usage of the script is as below; in a cluster this needs to be configured on each node with its respective node name.

tablesnap -B -a -r --aws-region ap-southeast-1 mybucket -n node1 /var/lib/cassandra/data/mykeyspace
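In practice tablesnap has to stay running in the background on every node. A minimal sketch of doing that over SSH is shown below; the node list, the use of nohup, and the log path are assumptions for illustration, not part of tablesnap itself (AWS credentials are omitted here, as in the command above).

# Hypothetical sketch: start tablesnap in the background on each node,
# using that node's own hostname as the -n name so uploads from different
# nodes do not collide in the bucket.
for node in node1 node2 node3; do
  ssh "$node" 'nohup tablesnap -B -a -r --aws-region ap-southeast-1 \
      -n "$(hostname)" mybucket /var/lib/cassandra/data/mykeyspace \
      >> /var/log/tablesnap.log 2>&1 &'
done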

The content of the JSON file, which lists all the files that are part of the snapshot, looks like this:

{
   "/var/lib/cassandra/data/mykeyspace/users-cf815f100f9711e78211dbd467a9ea7d": [
      "backups",
      "lb-1-big-Index.db",
      "lb-1-big-Digest.adler32",
      "lb-1-big-Statistics.db",
      "lb-1-big-CompressionInfo.db",
      "lb-1-big-Data.db",
      "lb-1-big-Summary.db",
      "lb-1-big-Filter.db",
      "lb-1-big-TOC.txt"
   ]
}
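To sanity-check what has actually landed in S3, you can list the bucket with the AWS CLI. The filter below is only a rough sketch; the exact key names depend on the -n name and --keyname-separator tablesnap was started with.

# Sketch: list everything uploaded for this node and look for the
# SSTable files and the JSON listing shown above.
aws s3 ls s3://mybucket/ --recursive | grep node1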

To restore the data from S3 we can use the tableslurp tool. We need to identify the snapshot we want to recover and then use the tool to download the data from S3, as shown below.

sudo tableslurp --aws-region ap-southeast-1 mybucket -n node1 \
    /var/lib/cassandra/data/mykeyspace/users-cf815f100f9711e78211dbd467a9ea7d \
    /var/lib/cassandra/data/mykeyspace/users-f234fgsdfsfdsfsdd \
    --file lb-2-big-Data.db -o cassandra -g cassandra

In the above command we are restoring onto node1 the data that was backed up at the time the SSTable lb-2-big-Data.db was created.

To recover the entire cluster we need to shut down the new nodes, recover each node's data with tableslurp, and then restart the nodes (seed nodes first, followed by the others). Once the cluster is up, run nodetool repair to bring all the data back in sync.
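Put together, the per-node recovery looks roughly like the sketch below. The service commands assume Cassandra is installed as a system service; the tableslurp arguments are the ones shown earlier.

# Hypothetical recovery sequence for one node (repeat on every node, seed nodes first):
sudo service cassandra stop        # stop the node before replacing its data files
sudo tableslurp ...                # restore this node's SSTables (see the command above)
sudo service cassandra start       # start seed nodes first, then the remaining nodes
# Once the whole cluster is back online, make replicas consistent again:
nodetool repair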

One additional step we have to take is a periodic nodetool flush, which writes memtables to disk as SSTables so that data still sitting in memory is not missing from the backups.
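A simple way to do this is a cron entry on every node. The schedule and log path below are assumptions; pick an interval that matches how much recent data you can afford to lose between a crash and the last flush.

# Hypothetical crontab entry: flush memtables to SSTables every hour so the
# resulting files can be picked up and uploaded by tablesnap.
0 * * * * /usr/bin/nodetool flush >> /var/log/nodetool-flush.log 2>&1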
