In case you missed it, this blog post is a recap of Cassandra Lunch #30, covering the basics of Cassandra Data Operations. We discuss the various ways of moving data into and out of Cassandra clusters. The live recording of Cassandra Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
Offline Methods
Offline methods of moving data in Cassandra involve capturing an image of the data at a particular moment in time. Since offline transfers have no way of staying up to date as the data changes, the application generally needs to be brought down during the transfer. Starting with a single database serving our application, we bring the original database down, transform the data to fit the schema of the destination database, move it, load it into the new database, and verify that the data matches.
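The offline steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain lists and dicts stand in for the source and destination databases, and the column names (`id`, `first_name`, `last_name`, `user_id`, `full_name`) are hypothetical.

```python
import copy


def transform_row(row):
    """Map a source row onto the (hypothetical) destination schema."""
    return {
        "user_id": row["id"],
        "full_name": f'{row["first_name"]} {row["last_name"]}',
    }


def offline_migrate(source_rows):
    """Snapshot, transform, load, and verify, in that order."""
    # 1. Capture a point-in-time image. The application is assumed to be
    #    down, so the source cannot change while we work.
    snapshot = copy.deepcopy(source_rows)

    # 2. Transform every row to fit the destination schema.
    transformed = [transform_row(row) for row in snapshot]

    # 3. Load into the destination, keyed by its primary key.
    destination = {row["user_id"]: row for row in transformed}

    # 4. Verify: every source row must be present in the destination.
    if len(destination) != len(snapshot):
        raise ValueError("row count mismatch between source and destination")
    return destination


if __name__ == "__main__":
    source = [
        {"id": 1, "first_name": "Ada", "last_name": "Lovelace"},
        {"id": 2, "first_name": "Alan", "last_name": "Turing"},
    ]
    print(offline_migrate(source))
```

A real migration would compare more than row counts, for example by checking a sample of rows or per-partition checksums, but the ordering of the steps is the point here.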
Tools for Offline Transfer
Cassandra backup/restore technologies, disk snapshots, and command-line tools can all be used to perform offline data transfers in Cassandra. CQL COPY is built into Cassandra (as a cqlsh command) and can export small numbers of rows to CSV format for transfer. Copying SSTables and then loading them with sstableloader can be used to transfer more data. DSBulk is distributed by DataStax and is compatible with both DSE and Cassandra clusters. DSBulk works with CSV and JSON files and can be combined with command-line tools like sed and awk, but it is also a bit more complicated to use than the methods mentioned above.
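To make the three tools a little more concrete, here is a small Python helper that only builds the statements and command lines rather than running them. Treat it as a sketch: the keyspace `shop`, the table `orders`, the host, and the file paths are hypothetical, and only the most basic options of each tool are shown.

```python
import shlex


def cql_copy_to(keyspace, table, csv_path):
    """cqlsh statement that exports a table to a CSV file."""
    return f"COPY {keyspace}.{table} TO '{csv_path}' WITH HEADER = TRUE;"


def cql_copy_from(keyspace, table, csv_path):
    """cqlsh statement that imports a CSV file into a table."""
    return f"COPY {keyspace}.{table} FROM '{csv_path}' WITH HEADER = TRUE;"


def dsbulk_unload(keyspace, table, out_dir):
    """DSBulk command that writes a table out to a directory."""
    return ["dsbulk", "unload", "-k", keyspace, "-t", table, "-url", out_dir]


def dsbulk_load(keyspace, table, source):
    """DSBulk command that loads a file or directory into a table."""
    return ["dsbulk", "load", "-k", keyspace, "-t", table, "-url", source]


def sstableloader(contact_point, table_dir):
    """sstableloader command; table_dir should end in <keyspace>/<table>."""
    return ["sstableloader", "-d", contact_point, table_dir]


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    print(cql_copy_to("shop", "orders", "orders.csv"))
    print(cql_copy_from("shop", "orders", "orders.csv"))
    print(shlex.join(dsbulk_unload("shop", "orders", "export")))
    print(shlex.join(dsbulk_load("shop", "orders", "orders.csv")))
    print(shlex.join(sstableloader("10.0.0.1", "/backup/shop/orders")))
```

The COPY statements would be run inside cqlsh against the source and destination clusters respectively, while the other commands are run from a shell on a machine that can reach the target cluster.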
Online Methods
Online data migrations in Cassandra involve using dual writes or event queues. These keep the two databases in parity for data written after the initial transfer begins. With dual writes, as soon as the transfer begins, writes start going to database 2 as well as database 1. The bulk of the data is then migrated from database 1 to database 2, after which the validation process begins. Once the data has been confirmed to be in database 2, applications switch to writing exclusively to database 2. Applications reading data are also switched over. Database 1 can then be brought down.
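The dual-write sequence can be sketched as a small state machine. This is a simplified illustration, not a production pattern: Python dicts stand in for the two databases, the class and method names are made up for this example, and concurrency and failure handling are ignored.

```python
class DualWriteMigration:
    """Toy model of a dual-write migration between two key-value stores."""

    def __init__(self, db1, db2):
        self.db1 = db1
        self.db2 = db2
        self.phase = "dual_write"  # becomes "cut_over" after validation

    def write(self, key, value):
        # During the migration every write goes to both databases;
        # after cut-over, writes go exclusively to database 2.
        if self.phase == "dual_write":
            self.db1[key] = value
        self.db2[key] = value

    def read(self, key):
        # Reads stay on database 1 until the cut-over.
        source = self.db1 if self.phase == "dual_write" else self.db2
        return source[key]

    def backfill(self):
        # Copy the bulk of the data, without overwriting keys that dual
        # writes have already placed in database 2.
        for key, value in self.db1.items():
            self.db2.setdefault(key, value)

    def validate(self):
        # Every row in database 1 must exist, unchanged, in database 2.
        return all(self.db2.get(k) == v for k, v in self.db1.items())

    def cut_over(self):
        if not self.validate():
            raise RuntimeError("databases are not in parity yet")
        self.phase = "cut_over"


if __name__ == "__main__":
    old, new = {"a": 1, "b": 2}, {}
    migration = DualWriteMigration(old, new)
    migration.write("b", 20)   # lands in both databases
    migration.backfill()       # copies "a"; keeps the newer "b"
    migration.cut_over()
    migration.write("c", 3)    # database 2 only
    print(old, new)
```

Once `cut_over` has succeeded and nothing reads from the first store any more, it can be retired, which matches the last step in the description above.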
The event queue method is essentially the same, except that writes initially go to an event queue. That queue then facilitates the writing of new data to both databases. Reads continue to go through database 1 until parity is assured, then they switch to database 2.
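A minimal sketch of the event queue variant follows. It assumes an in-memory `deque` in place of a real, durable queue, dicts in place of the two databases, and an existing backfill of older data; the class and method names are hypothetical.

```python
from collections import deque


class EventQueueMigration:
    """Toy model: writes go to a queue, a consumer applies them to both stores."""

    def __init__(self, db1, db2):
        self.db1 = db1
        self.db2 = db2
        self.queue = deque()
        self.read_from = db1  # reads stay on database 1 until parity

    def write(self, key, value):
        # Applications only publish events; they never write directly.
        self.queue.append((key, value))

    def drain(self):
        # The consumer applies each queued event to both databases.
        applied = 0
        while self.queue:
            key, value = self.queue.popleft()
            self.db1[key] = value
            self.db2[key] = value
            applied += 1
        return applied

    def in_parity(self):
        # Parity here means no pending events and identical contents.
        return not self.queue and self.db1 == self.db2

    def switch_reads(self):
        if not self.in_parity():
            raise RuntimeError("databases are not in parity yet")
        self.read_from = self.db2

    def read(self, key):
        return self.read_from[key]


if __name__ == "__main__":
    old, new = {"a": 1}, {"a": 1}  # assume "a" was already backfilled
    migration = EventQueueMigration(old, new)
    migration.write("b", 2)
    migration.drain()
    migration.switch_reads()
    print(migration.read("b"))
```

The design difference from dual writes is that the application has a single write path (the queue), so the logic for fanning writes out to both databases lives in the consumer rather than in every client.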
Tools for Online Transfer
Tools for online transfer often take advantage of systems that are already meant to move data around. For example, if we make changes to the topology of a Cassandra cluster, we can then take advantage of Cassandra’s internal data streaming to migrate our data. We can use queues or streams to mirror our data to different locations. Alternatively, we can use Spark, either via scripts, Spark Jobs, or the Spark Migrator. We can also use the same tools that offline transfers use, DSBulk and CQL COPY, to facilitate a similar transfer.
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity. We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!