We were trying to find a solution to back up and restore Cassandra hosted on AWS without losing too much data. We tried the native nodetool snapshot, but we would lose any data written between snapshots.

We found a tool which seems to do the job well: tablesnap. It provides scripts to back up and restore SSTables, and lets us restore data from a specific point in time in the past.

The following scripts are part of this project:

  • tablesnap : Backs up the Cassandra data folder to S3
  • tableslurp : Restores the data files from S3
  • tablechop : Deletes older data files from S3

The tablesnap script monitors the Cassandra data folder continuously and copies any newly created files to the configured S3 bucket. It also uploads a JSON file listing all the files present in the data folder at the time of the snapshot, which is what makes restoring to a specific point in time possible.

The usage of the tool is as follows:

tablesnap [-h] -k AWS_KEY -s AWS_SECRET [-r] [-a] [-B] [-p PREFIX]
                 [--without-index] [--keyname-separator KEYNAME_SEPARATOR]
                 [-t THREADS] [-n NAME] [-e EXCLUDE | -i INCLUDE]
                 [--listen-events {IN_MOVED_TO,IN_CLOSE_WRITE}]
                 [--max-upload-size MAX_UPLOAD_SIZE]
                 [--multipart-chunk-size MULTIPART_CHUNK_SIZE]
                 bucket paths [paths ...]

This tool depends on a few basic properties of Cassandra data:

  • The data is stored as a series of files.
  • SSTables are immutable structures; once created they are never modified.
  • During compaction the old SSTable files are deleted and a new file is created, instead of updating an existing file.
  • New SSTables are created in a temporary folder and then moved into the data folder. The tool listens for IN_MOVED_TO or IN_CLOSE_WRITE events, so a file is always in a consistent state when it is uploaded to S3.
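Immutability is what makes incremental backup simple: a file that has already been uploaded never needs to be re-uploaded, and files deleted by compaction need no action at all. A rough sketch of that bookkeeping (an illustration, not tablesnap's actual code) might look like:

```python
# Sketch: because SSTables are immutable, only files not seen in a
# previous pass need uploading. upload() in real life would be the
# S3 put call; here we just compute the set difference.

def files_to_upload(current_listing, already_uploaded):
    """Return the set of files that appeared since the last pass."""
    return set(current_listing) - set(already_uploaded)

uploaded = {"lb-1-big-Data.db", "lb-1-big-Index.db"}
# After a compaction, lb-1-* may be gone and lb-2-* created; the
# deletions need no re-upload work, only the new files are pushed.
listing = ["lb-2-big-Data.db", "lb-2-big-Index.db"]

print(sorted(files_to_upload(listing, uploaded)))
```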

The usage of the script is as follows; in a cluster this needs to be configured on each node with the respective node name.

tablesnap -B -a -r --aws-region ap-southeast-1 mybucket -n node1 /var/lib/cassandra/data/mykeyspace

The JSON file lists all the files which are part of the snapshot; its content is as shown below.

      [
        "lb-1-big-Index.db",
        "lb-1-big-Digest.adler32",
        "lb-1-big-Statistics.db",
        "lb-1-big-CompressionInfo.db",
        "lb-1-big-Data.db",
        "lb-1-big-Summary.db",
        "lb-1-big-Filter.db",
        "lb-1-big-TOC.txt"
      ]

In order to restore the data from S3, we can use the tableslurp tool. We need to identify the snapshot we want to recover from and use the tool to download the data from S3 as below.
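Identifying the snapshot amounts to finding the newest JSON index uploaded at or before the point in time we want to restore to. A hypothetical helper for that selection (the timestamps and key names here are illustrative; in practice they would come from listing the bucket and reading each object's LastModified metadata) could be:

```python
from datetime import datetime

# Sketch: given (uploaded_at, s3_key) pairs for the JSON index files,
# pick the newest one at or before the target restore time.
# The index list below is hypothetical sample data.

def pick_snapshot(indexes, target):
    """Return the key of the newest index uploaded at or before target."""
    candidates = [(ts, key) for ts, key in indexes if ts <= target]
    if not candidates:
        raise ValueError("no snapshot exists at or before the target time")
    return max(candidates)[1]

indexes = [
    (datetime(2017, 3, 1, 9, 0), "index-0900.json"),
    (datetime(2017, 3, 1, 12, 0), "index-1200.json"),
    (datetime(2017, 3, 1, 15, 0), "index-1500.json"),
]

# Restoring to 13:30 should use the 12:00 index.
print(pick_snapshot(indexes, datetime(2017, 3, 1, 13, 30)))
```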

sudo tableslurp --aws-region ap-southeast-1 mybucket -n node1 \
  /var/lib/cassandra/data/mykeyspace/users-cf815f100f9711e78211dbd467a9ea7d \
  /var/lib/cassandra/data/mykeyspace/users-f234fgsdfsfdsfsdd \
  --file lb-2-big-Data.db -o cassandra -g cassandra

In the above command we are restoring, on node1, the data that was backed up at the time the SSTable lb-2-big-Data.db was created.

To recover the entire cluster, we need to shut down the nodes, recover each node's respective data with tableslurp, and then restart the nodes (seed nodes first, followed by the others). Once the cluster is up, run nodetool repair to make the data consistent across the cluster.

One additional step we have to take care of: data still sitting in the memtable has not yet been written to an SSTable, so the backup would miss it. Running nodetool flush periodically forces memtables to disk as SSTables and helps avoid losing that data.
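As an example, a cron entry could drive the periodic flush (the schedule and log path here are illustrative; tune the interval to the amount of data loss you can tolerate):

```shell
# Illustrative crontab entry: flush Cassandra memtables to SSTables
# every 15 minutes so tablesnap can pick the data up promptly.
*/15 * * * * /usr/bin/nodetool flush >> /var/log/cassandra/flush.log 2>&1
```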