11/3/2017

tbarbugli/cassandra_snapshotter

by John Doe

A tool to back up Cassandra nodes to S3 using snapshots and incremental backups

The scope of this project is to make it easier to back up a cluster to S3 and to combine snapshots and incremental backups.

How to install

The tool must be installed on both the machine that runs the backup and the Cassandra nodes:

pip install cassandra_snapshotter

Nodes in the cluster also need lzop installed so that backups can be compressed before they are archived on S3.

On Debian/Ubuntu you can install it via apt-get:

apt-get install lzop

Make sure JNA is enabled and, if you want to use incremental backups, that they are enabled in your Cassandra config file.

Usage

You can see the list of available parameters via cassandra-snapshotter --help

Create a new backup for mycluster:

# --aws-access-key-id, --aws-secret-access-key, --s3-ssenc and --user are optional
cassandra-snapshotter --s3-bucket-name=Z \
                      --s3-bucket-region=eu-west-1 \
                      --s3-base-path=mycluster \
                      --aws-access-key-id=X \
                      --aws-secret-access-key=Y \
                      --s3-ssenc \
                      backup \
                      --hosts=h1,h2,h3,h4 \
                      --user=cassandra
  • connects via ssh to hosts h1,h2,h3,h4 as user cassandra
  • backs up (using snapshots or incremental backups) to the S3 bucket Z
  • backups are stored in /mycluster/
  • if your bucket is not in the us-east-1 region (the default region), specify the region on the command line; otherwise 'connection reset by peer' errors can appear because files are transferred across regions
  • --aws-access-key-id and --aws-secret-access-key are optional. If they are omitted, the instance IAM profile is used. See http://docs.pythonboto.org/en/latest/boto_config_tut.html for more details.
  • to use AWS S3 server-side encryption, specify --s3-ssenc
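When the backup is launched from a scheduler or a wrapper script, it can help to assemble the command line programmatically so the optional flags are only added when needed. The following is a minimal sketch, not part of cassandra_snapshotter: the helper name build_backup_command and its parameters are hypothetical, and it only uses the flags shown in the example above.

```python
import shlex


def build_backup_command(bucket, base_path, hosts, region=None,
                         access_key=None, secret_key=None,
                         ssenc=False, user=None):
    """Return the argv list for a `cassandra-snapshotter ... backup` run.

    Hypothetical helper: it only arranges the flags documented above.
    """
    if not hosts:
        raise ValueError("at least one host is required")

    argv = ["cassandra-snapshotter", f"--s3-bucket-name={bucket}"]
    if region:
        # Recommended whenever the bucket is not in us-east-1.
        argv.append(f"--s3-bucket-region={region}")
    argv.append(f"--s3-base-path={base_path}")
    # Credentials are optional: without them the instance IAM profile is used.
    if access_key and secret_key:
        argv.append(f"--aws-access-key-id={access_key}")
        argv.append(f"--aws-secret-access-key={secret_key}")
    if ssenc:
        argv.append("--s3-ssenc")
    # Global options come first, then the subcommand and its own options.
    argv.append("backup")
    argv.append("--hosts=" + ",".join(hosts))
    if user:
        argv.append(f"--user={user}")
    return argv


if __name__ == "__main__":
    cmd = build_backup_command("Z", "mycluster", ["h1", "h2", "h3", "h4"],
                               region="eu-west-1", user="cassandra")
    # Print a shell-safe version; subprocess.run(cmd) could execute it.
    print(shlex.join(cmd))
```

Returning a list instead of a single string avoids shell quoting problems if the result is later passed to subprocess.run.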

List existing backups for mycluster:

# --aws-access-key-id, --aws-secret-access-key and --s3-ssenc are optional
cassandra-snapshotter --s3-bucket-name=Z \
                      --s3-bucket-region=eu-west-1 \
                      --s3-base-path=mycluster \
                      --aws-access-key-id=X \
                      --aws-secret-access-key=Y \
                      --s3-ssenc \
                      list

How it works

cassandra_snapshotter connects to your Cassandra nodes over ssh and uses nodetool to generate backups of the keyspaces / tables you want to back up.

Backups are stored on S3 using this convention:

Snapshots:

/s3_base_path/snapshot_creation_time/hostname/cassandra/data/path/keyspace/table/snapshots

Incremental Backups:

/s3_base_path/snapshot_creation_time/hostname/cassandra/data/path/keyspace/table/backups
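To make the convention concrete, here is a small sketch that composes such a path from its parts. The function backup_prefix is hypothetical and is not the tool's own code; the timestamp format and the data directory used in the example are assumptions for illustration only.

```python
import posixpath


def backup_prefix(s3_base_path, snapshot_creation_time, hostname,
                  data_path, keyspace, table, kind="snapshots"):
    """Compose a path following the convention described above.

    Hypothetical helper; `kind` is "snapshots" or "backups"
    (the latter for incremental backups).
    """
    if kind not in ("snapshots", "backups"):
        raise ValueError("kind must be 'snapshots' or 'backups'")
    return "/" + posixpath.join(
        s3_base_path.strip("/"),
        snapshot_creation_time,
        hostname,
        data_path.strip("/"),  # the node's Cassandra data directory
        keyspace,
        table,
        kind,
    )


if __name__ == "__main__":
    # The timestamp format and data directory below are assumed values.
    print(backup_prefix("mycluster", "20171103120000", "h1",
                        "/var/lib/cassandra/data", "ks", "users"))
    print(backup_prefix("mycluster", "20171103120000", "h1",
                        "/var/lib/cassandra/data", "ks", "users",
                        kind="backups"))
```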

S3_BASE_PATH

This parameter makes it possible to use a single S3 bucket to store multiple Cassandra backups.

It can also be seen as a backup profile identifier; the snapshotter uses the s3_base_path to search for existing snapshots on your S3 bucket.

INCREMENTAL BACKUPS

Incremental backups are created only when a snapshot already exists; they are stored in their parent snapshot's path.

Incremental backups are only used when all of these conditions are met:

  • there is a snapshot in the same base_path
  • the existing snapshot was created for the same list of nodes
  • the existing snapshot was created with the same keyspace / table parameters

If any of these conditions is not met, a new snapshot is created.
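The decision described by the conditions above can be sketched as a simple comparison. This is an illustration only, not the tool's actual implementation: the SnapshotInfo record, its fields, and the function can_use_incremental are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class SnapshotInfo:
    """Hypothetical record describing an existing snapshot on S3."""
    base_path: str
    hosts: FrozenSet[str]
    keyspaces: Optional[str]
    table: Optional[str]


def can_use_incremental(existing: Optional[SnapshotInfo], base_path: str,
                        hosts: Iterable[str], keyspaces: Optional[str],
                        table: Optional[str]) -> bool:
    """Return True only when every condition listed above holds."""
    if existing is None:
        # No snapshot in this base path yet: a new snapshot is needed.
        return False
    return (
        existing.base_path == base_path
        and existing.hosts == frozenset(hosts)   # same list of nodes
        and existing.keyspaces == keyspaces      # same keyspace parameter
        and existing.table == table              # same table parameter
    )


if __name__ == "__main__":
    snap = SnapshotInfo("mycluster", frozenset({"h1", "h2"}), "ks", None)
    print(can_use_incremental(snap, "mycluster", ["h2", "h1"], "ks", None))
    print(can_use_incremental(snap, "mycluster", ["h1"], "ks", None))
```

Comparing hosts as a set is a design choice in this sketch, so the order in which nodes are listed does not force a new snapshot.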

To take advantage of incremental backups you need to enable them on your Cassandra cluster (see the cassandra.yaml config file).

NOTE: Incremental backups are not enabled by default in Cassandra.
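In cassandra.yaml the relevant setting is incremental_backups, which must be set to true. As a quick sanity check before relying on incremental backups, something like the following sketch could be run against a node's config file. The function name and the file path in the usage example are assumptions, and the check is a plain text match rather than a full YAML parse.

```python
import re


def incremental_backups_enabled(yaml_text):
    """Return True if the text contains an uncommented
    `incremental_backups: true` line.

    Simplified check: looks for the top-level key at the start of a
    line, so commented-out entries are ignored.
    """
    match = re.search(r"^incremental_backups:\s*(\S+)", yaml_text,
                      flags=re.MULTILINE)
    return bool(match) and match.group(1).lower() == "true"


if __name__ == "__main__":
    # The path below is an assumed default location; adjust for your install.
    path = "/etc/cassandra/cassandra.yaml"
    try:
        with open(path) as fh:
            print(incremental_backups_enabled(fh.read()))
    except FileNotFoundError:
        print(f"{path} not found")
```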

CREATE NEW SNAPSHOT

If you don't want to use incremental backups, or if for some reason you want to create a new snapshot of your data, run cassandra_snapshotter with the --new-snapshot argument.

Data retention / Cleanup old snapshots

Cleaning up your S3 buckets is outside the scope of this project. S3 Lifecycle rules allow you to delete objects, or archive them to Glacier, based on their age.
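As an illustration of such a rule, the sketch below builds a lifecycle configuration in the shape accepted by the S3 lifecycle API (for example via boto3's put_bucket_lifecycle_configuration). The function name, the rule ID, and the day thresholds are assumptions, as is the idea that backup objects live under a key prefix equal to the s3_base_path; verify the actual prefix in your bucket before applying anything like this.

```python
import json


def lifecycle_rules(base_path, glacier_after_days=30, expire_after_days=365):
    """Build a lifecycle configuration for objects under one base path.

    Hypothetical helper: the thresholds are example values, not
    recommendations from the project.
    """
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    prefix = base_path.strip("/") + "/"
    return {
        "Rules": [
            {
                "ID": "cassandra-backups-" + base_path.strip("/"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Archive to Glacier first, then delete much later.
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the configuration for review; nothing is sent to AWS here.
    print(json.dumps(lifecycle_rules("mycluster"), indent=2))
```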

Restore your data

cassandra_snapshotter tries to store data and metadata in a way that makes restores less painful. There is not (yet) a feature-complete restore command; patches / pull requests for this are more than welcome (hint hint).

In case you need it, cassandra_snapshotter stores the ring token description every time a backup is done (you can find it in the ring file in the snapshot base path).

The way data is stored on S3 should make it easy to use the Node Restart Method (https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_backup_snapshot_restore_t.html#task_ds_vf4_z1r_gk).
