Making and Implementing a C* Migration Plan
Migrating to a new database is hard: really hard. It’s almost impossible to do it perfectly. I’d like to break the migration problem into two parts: 1) maintaining the integrity of your data during import and migration, and 2) how to plan and implement a migration from MySQL to C* with no downtime (fingers crossed!).
Maintaining Data Integrity
In my personal experience, the most difficult part of our migration was not writing a set of reliable scripts to read from MySQL and insert into Cassandra, but catching the trivial coding mistakes that caused significant data discrepancies between the MySQL and Cassandra versions of the data.
Because migrating from MySQL to Cassandra will most likely require a change in your data model, the logic required to “convert” your relational MySQL data to its de-normalized form is the hardest part of the migration and certainly carries the biggest risk.
Treat your migration scripts and logic not as one-off code, but as production-quality code that can be run in any order, at any time. Mistakes in migration logic that leave Cassandra with an inconsistent version of the migrated data will most likely have a much greater impact than other migration-related bugs.
Personal Cassandra Migration Horror Story:
One of our datasets is the classification result of a particular domain from various sets of logic and sources. We use all of this data to make a final decision on what category a domain should be classified as (e.g. pornography, business, etc.). When we started going through our historical MySQL data, we found data that had been inserted with incorrect values due to various application bugs over the years. With each example of this incorrect data, we would add conversion logic to “fix” the data as best we could prior to syncing it from MySQL to Cassandra.
In one of these instances, I accidentally wrote logic that used a greater-than-zero check to identify data in MySQL that needed to be fixed prior to migration. Unfortunately, some of the values being checked (in this case epochs) had originally been inserted as “0”, which meant my greater-than check ended up skipping that data during the sync. Although we had spent weeks attempting to validate the final resulting data after the migration, we missed this bug prior to production. The end result was incorrectly classifying high-profile domains and ultimately inconveniencing our customers.
The actual number of rows impacted by the bug was only around 1,000 out of the 100+ million correctly migrated rows; however, the impact of those 1,000 rows on our customers was very real and significant.
Get to know Bulk Loading
Regardless of the migration strategy you end up choosing, in almost all cases you will have to perform an initial bulk import of your existing MySQL data into C*. While it might be tempting to simply iterate over every MySQL result and insert it one mutation at a time into Cassandra, a more efficient way is to use the Cassandra Bulk Loader. At a high level, the Bulk Loader requires you to create a CSV file containing all of the rows and columns that need to be loaded into Cassandra. Using the Java class SSTableSimpleUnsortedWriter, you can create an SSTable from your CSV file, which can then be loaded directly into Cassandra using sstableloader.
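Here is a minimal sketch of that flow. It assumes a hypothetical two-field CSV layout (row_key,value), a keyspace named Keyspace1, and a column family named MigratedData, none of which come from the original article; the constructor arguments follow the SSTableSimpleUnsortedWriter API as it existed in the Cassandra 1.x line:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.dht.RandomPartitioner;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class CsvToSSTable
{
    public static void main(String[] args) throws Exception
    {
        // Directory the generated SSTables are written to; hand this
        // directory to sstableloader once the writer is closed.
        File outputDir = new File("/tmp/Keyspace1/MigratedData");

        // Buffers rows in memory (64 MB here) and flushes them out as
        // SSTables, so the input CSV does not need to be sorted by key.
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                outputDir,
                new RandomPartitioner(),
                "Keyspace1",        // keyspace (hypothetical)
                "MigratedData",     // column family (hypothetical)
                AsciiType.instance, // column name comparator
                null,               // no super columns
                64);

        long timestamp = System.currentTimeMillis() * 1000; // microseconds

        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null)
        {
            // Assumed CSV layout: row_key,value
            String[] fields = line.split(",");
            writer.newRow(bytes(fields[0]));
            writer.addColumn(bytes("value"), bytes(fields[1]), timestamp);
        }
        reader.close();
        writer.close(); // flushes the final SSTable to outputDir
    }
}
```

Once the writer is closed, pointing sstableloader at the output directory streams the generated SSTables to the ring.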
For more details and code samples, reference this article.
Different Ways to Migrate without Downtime
Sync Data Method:
When migrating to Cassandra, choosing a new data model might significantly increase your database workload. Alternatively, you might still need a live dataset in MySQL after the initial migration for legacy scripts that have not yet been migrated to use Cassandra.
Syncing from MySQL to Cassandra
In some cases it might not be practical to add Cassandra to a legacy application. In that case it might be necessary to have an external process sync data from MySQL to Cassandra while running both the new and old logic in parallel.
1st Solution: Add a timestamp column to the MySQL table to be synced. With each update to MySQL, also update the timestamp with the last-updated time. At a scheduled interval, run a SELECT query against all MySQL shards for rows where the last-updated timestamp is greater than or equal to the time your last sync started.
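As a rough sketch of one sync pass, assuming a hypothetical domains table with a last_updated column and stubbed-out checkpoint and Cassandra-write helpers (plain JDBC; you would run one pass like this per shard):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class TimestampSync
{
    public static void main(String[] args) throws Exception
    {
        // Record when this sync pass starts *before* querying, so rows
        // updated mid-query are picked up again on the next pass.
        Timestamp syncStart = new Timestamp(System.currentTimeMillis());
        Timestamp lastSyncStart = loadLastSyncStart(); // persisted from the previous run

        // Repeat for each MySQL shard (one connection shown here).
        Connection mysql = DriverManager.getConnection(
                "jdbc:mysql://shard1.example.com/app", "user", "pass");

        // >= rather than > : re-reading a row twice is harmless,
        // missing one is not.
        PreparedStatement stmt = mysql.prepareStatement(
                "SELECT domain, category, last_updated FROM domains " +
                "WHERE last_updated >= ?");
        stmt.setTimestamp(1, lastSyncStart);

        ResultSet rs = stmt.executeQuery();
        while (rs.next())
        {
            writeToCassandra(rs.getString("domain"), rs.getString("category"));
        }
        rs.close();
        stmt.close();
        mysql.close();

        saveLastSyncStart(syncStart); // checkpoint only after the pass succeeds
    }

    // Hypothetical helpers: persist the checkpoint somewhere durable and
    // write rows with your Cassandra client of choice.
    private static Timestamp loadLastSyncStart() { return new Timestamp(0); }
    private static void saveLastSyncStart(Timestamp t) { }
    private static void writeToCassandra(String domain, String category) { }
}
```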
Syncing from Cassandra back to MySQL
Some data models will be hard to sync from Cassandra back to MySQL (for example, time series data). However, rows containing more de-normalized, “metadata”-like information can be synced.
What won’t work: Creating a sync script that executes via cron every n minutes and attempts to do a SELECT * FROM foo from Cassandra (and then update and insert all of those records into MySQL) is a recipe for failure. Inherent to Cassandra’s design is that data is sharded across multiple nodes by a hash of its key. Performing a SELECT * query is a Cassandra anti-pattern and should be avoided: iterating through every key across all nodes and returning a single paged dataset is both inefficient and impractical.
1st Solution: Implement a queue that your application additionally writes to whenever it modifies a row in Cassandra. Have a script consume from this queue, de-duplicate the modified keys over a time interval, and then bulk insert the updates into MySQL.
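A minimal sketch of such a consumer, using an in-process BlockingQueue as a stand-in for whatever real queue you choose and a hypothetical readFromCassandra helper; the domains table and connection details are likewise placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueSyncConsumer
{
    // Stand-in for a real queue; the application enqueues the row key
    // every time it modifies a row in Cassandra.
    private final BlockingQueue<String> modifiedKeys = new LinkedBlockingQueue<String>();

    public void run() throws Exception
    {
        Connection mysql = DriverManager.getConnection(
                "jdbc:mysql://mysql.example.com/app", "user", "pass");

        while (true)
        {
            // Drain the queue for one interval, de-duplicating keys so a
            // row modified 500 times still costs one MySQL update.
            Set<String> batch = new HashSet<String>();
            long deadline = System.currentTimeMillis() + 60000;
            while (System.currentTimeMillis() < deadline)
            {
                String key = modifiedKeys.poll(1, TimeUnit.SECONDS);
                if (key != null)
                    batch.add(key);
            }
            if (batch.isEmpty())
                continue;

            // Re-read the current Cassandra state for each key and upsert
            // it into MySQL in a single batched round-trip.
            PreparedStatement stmt = mysql.prepareStatement(
                    "REPLACE INTO domains (domain, category) VALUES (?, ?)");
            for (String key : batch)
            {
                stmt.setString(1, key);
                stmt.setString(2, readFromCassandra(key));
                stmt.addBatch();
            }
            stmt.executeBatch();
            stmt.close();
        }
    }

    // Hypothetical helper: fetch the row with your Cassandra client of choice.
    private String readFromCassandra(String key) { return ""; }

    public static void main(String[] args) throws Exception
    {
        new QueueSyncConsumer().run();
    }
}
```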
2nd Solution: If the data can be updated less frequently in MySQL, you could write a Hadoop Map/Reduce job that iterates over the column families that you need to sync. This solution gives a practical and reproducible way to iterate through all keys in a column family. Using this approach as an additional sanity check to catch updates missed by the other incremental sync options is also not a bad idea.
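A skeleton of such a job, built on the ColumnFamilyInputFormat and ConfigHelper classes that shipped with Cassandra’s Hadoop support in the 1.x era; the keyspace, column family, host, and output path are all placeholders, and a real job would upsert into MySQL from the reducer rather than emitting text output:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColumnFamilySync
{
    // The mapper sees each Cassandra row as a key plus its columns;
    // emit whatever your reducer needs to upsert into MySQL.
    public static class RowMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text>
    {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context context) throws IOException, InterruptedException
        {
            context.write(new Text(ByteBufferUtil.string(key)),
                          new Text(Integer.toString(columns.size())));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Job job = new Job(new Configuration(), "cassandra-to-mysql-sync");
        job.setJarByClass(ColumnFamilySync.class);
        job.setMapperClass(RowMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/cf_dump"));

        // Point the input format at the cluster and column family to scan.
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "cass1.example.com");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Domains");

        // Fetch every column of every row, 100 columns per slice.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 100));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        job.waitForCompletion(true);
    }
}
```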
3rd Solution: Another option, if you can afford a greater delay between updates in Cassandra and their arrival back in MySQL, is to use a tool such as sstable2json to dump a column family’s SSTables into a JSON format, which can then be parsed and used to update MySQL. This is a pretty heavy-handed method. Additionally, you’ll have to write logic to ensure you dump the SSTables from all nodes to cover the entire column family.
Write Twice and Forget Method:
If you are able to modify your existing application to also interface with Cassandra, you can start your migration by writing database updates twice: once to MySQL and a second time to Cassandra.
Once all new updates are being written to both MySQL and Cassandra, you can run a migration script that pages through all your existing MySQL data and inserts those records into Cassandra.
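One way to page through the backfill without resorting to expensive OFFSET scans is keyset pagination on the primary key. A sketch, again assuming a hypothetical domains table and a stubbed Cassandra write:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BackfillMigration
{
    public static void main(String[] args) throws Exception
    {
        Connection mysql = DriverManager.getConnection(
                "jdbc:mysql://shard1.example.com/app", "user", "pass");

        // Keyset pagination: each page seeks straight to the primary key
        // instead of rescanning skipped rows the way OFFSET does.
        long lastId = 0;
        while (true)
        {
            PreparedStatement stmt = mysql.prepareStatement(
                    "SELECT id, domain, category FROM domains " +
                    "WHERE id > ? ORDER BY id LIMIT 1000");
            stmt.setLong(1, lastId);
            ResultSet rs = stmt.executeQuery();

            int rows = 0;
            while (rs.next())
            {
                writeToCassandra(rs.getString("domain"), rs.getString("category"));
                lastId = rs.getLong("id"); // resume point for the next page
                rows++;
            }
            rs.close();
            stmt.close();

            if (rows == 0)
                break; // backfill complete for this shard
        }
        mysql.close();
    }

    // Hypothetical helper: insert the row with your Cassandra client of choice.
    private static void writeToCassandra(String domain, String category) { }
}
```

Because dual writes are already flowing while the backfill runs, it is worth writing the backfilled mutations with column timestamps derived from the MySQL data (or simply older than the dual-write cutover) so that Cassandra’s last-write-wins conflict resolution never lets stale backfill data clobber a newer live write.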
Initially, you might want to implement this second write to Cassandra as a completely non-blocking, write-and-forget operation. That way, if you experience issues early in your Cassandra deployment, your existing application is not impacted while Cassandra is down.
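A minimal sketch of that guard, with hypothetical MysqlDao/CassandraDao wrappers standing in for whatever data-access layer the application already has:

```java
interface MysqlDao { void save(String domain, String category); }
interface CassandraDao { void save(String domain, String category); }

public class DualWriteDao
{
    private final MysqlDao mysql;         // existing, authoritative store
    private final CassandraDao cassandra; // new secondary store

    public DualWriteDao(MysqlDao mysql, CassandraDao cassandra)
    {
        this.mysql = mysql;
        this.cassandra = cassandra;
    }

    public void saveClassification(String domain, String category)
    {
        // MySQL remains the source of truth; its failures still propagate.
        mysql.save(domain, category);

        // Fire-and-forget: a Cassandra outage must never fail the request,
        // so swallow (but log) anything the second write throws.
        try
        {
            cassandra.save(domain, category);
        }
        catch (Exception e)
        {
            System.err.println("cassandra write failed for " + domain + ": " + e);
            // Optionally enqueue the key for a later repair/sync pass.
        }
    }
}
```

Pushing the Cassandra write onto a background executor takes this one step further, keeping the second write off the request’s critical path entirely.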
Once you are satisfied with the fire-and-forget writes, you can slowly modify your application logic to start performing reads from Cassandra instead of MySQL. Thanks to the dual writes, if you run into issues you can simply revert to reading from MySQL.
The “Just Rip Off the Band-Aid” Method:
The final method I’ll propose is a hard cutover from MySQL to Cassandra. I’d strongly recommend against this: getting all the data fully migrated from MySQL to Cassandra is an error-prone process, and a hard cutover leaves you nothing to fall back on if the migration goes wrong.
Stay tuned this week for more posts from Michael Kjellman. This post is an excerpt from 'Frustrated with MySQL? Improving the scalability, reliability and performance of your application by migrating from MySQL to Cassandra.' In the meantime, check out our other Cassandra Week posts!