
3/30/2020

Reading time: 15 mins

Managing (Schema) Migrations in Cassandra

by DataStax Academy

We've had some sexy, exciting, cutting-edge topics today. This is not one of them. ... This is more the sort of routine, good-housekeeping, foundational work that can make the exciting stuff a little less exciting. I’m going to be talking about managing migrations in Cassandra and in particular schema migrations.

Let me give a nod to my employer. From the website: “GridPoint is a leader in comprehensive, data-driven energy management solutions (EMS) that leverage the power of real-time data collection, big data analytics and cloud computing to maximize energy savings, operational efficiency, capital utilization and sustainability benefits.” The company is based in Arlington, VA, with a development office in Seattle.
Disclaimer… This is my perspective.

Oh, the statue you see is from Bonn, Germany, according to the photographer.

A live-data migration is the process that runs to take the data in one table and adapt it to another table, such that the data in the first table can eventually be retired.

I’m not going to be focusing so much on live-data migrations.

I’m going to be focusing instead on what I would call source-driven migrations.

For schema migrations, think DDL.

The migrations are stored in source control and subject to source control versioning. They may be published to an artifact repository, where artifact versioning and release versioning can be applied.

I’ll be focusing in particular on schema migrations.

These sorts of problems are covered in depth in this book from the Martin Fowler series that came out in 2006.

I can’t speak to containerizing migrations. We haven’t explored that.

A couple of other established standalone tools are DBMaintain and DBDeploy, although those projects have not been active in recent years.

From Chapter 12 of “NoSQL Distilled”:

12.2. Schema Changes in RDBMS
Liquibase, Mybatis Migrator, DBDeploy, DBMaintain
12.3. Schema Changes in a NoSQL Data Store
the schema needs to change frequently in response to changing business requirements | can use similar techniques as with databases with strong schemas

with schemalessness at the database level, the burden of supporting the effective schema is shifted up to the application | the application still needs to be able to (un)marshal the data
With this slide, I hope you can see that I’m setting up a bit of a straw man. (A straw man with a strong man.)

There was a StackOverflow thread on schema migration tools for Cassandra (http://stackoverflow.com/questions/25286273/is-there-a-schema-versioning-tool-for-cassandra), and there was an erroneous answer I found amusing:
"Cassandra is by its nature… 'schemaless.' It is a structured key-value store, so it is very different from a traditional rdbms in that regard.”
 
Think about it though. With Cassandra as much as with a relational database, you pay a bitter price for getting your schema wrong.
 
You end up defining a good number of tables.
 
I have the fortune of not having worked much with Thrift. But I know that with Thrift, you'd be in the business of manipulating the contents of messages, which obscures the database's desire to have a schema applied to it.
 
With Thrift, you had super columns and super column families. With CQL, you have collections. But the collections still have to be part of a table. The things that might smack of schemalessness still come back to a schema.
===========================================
Thought experiment. Go into cqlsh and execute:
describe keyspace keyspace_name
 
How big is that output getting? How much is it changing over time?
===========================================
At last month's Cassandra Summit, there was an interesting talk by a company called Reltio, and they described how they were using Cassandra to support "metadata-driven documents in columnar storage." So they produced a keyspace that had a generic table like this. And maybe that schema only had one or two tables. But even they acknowledged that this is an atypical use case for Cassandra.
===========================================
So how have teams been managing their keyspace and table definitions? My anecdotal experience is that whenever the question has come up, teams have usually rolled their own, especially because, on the face of it, or in the simple case, this seems like such a simple thing.
 

Next I want to get into the tools that are out there for Cassandra migrations, and the roadblocks teams have faced trying to manage Cassandra schema migrations via Liquibase and Flyway.
===========================================
Some history. The obvious way to integrate Liquibase or Flyway with Cassandra comes back to the prospect of the DataStax Java Driver supporting JDBC. There’s this statement from the 2013 announcement of the introduction of the driver (http://www.datastax.com/dev/blog/new-datastax-drivers-a-new-face-for-cassandra): "Today, DataStax announces version 1.0.0 of a new Java Driver, designed for CQL and based on years of experience within the Cassandra community. This Java driver is a first step; an object mapping and a JDBC extension will be available soon…."
Let’s keep that JDBC extension in mind.
===========================================
There was a liquibase-cassandra project that seemed to hit a wall. So some people gravitated toward Flyway.
===========================================
Then there was a GitHub issue for the Flyway project, “Cassandra support.”
https://github.com/flyway/flyway/issues/823
 
In January, someone mentions a cassandra-jdbc project that’s out there, which also seems to have hit a wall.
"I …recently looked into adding support for Cassandra to Flyway, but using the existing cassandra-jdbc driver from https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/ , just to see how far I could get. I found a few issues:"
Proceeds to list the issues.
"I disabled or stubbed out code to get past these, but gave up soon after."
 
That same poster referenced a thread he started on the DataStax Java Driver user mailing list.
===========================================
So if we go to that thread, which is from last December (https://groups.google.com/a/lists.datastax.com/forum/#!msg/java-driver-user/kspAx0neZlI/8A59HmYc-rwJ):
Subject: "Timeline for JDBC support?"
 
"Is there any timeline for JDBC support in the DataStax Java Driver for Cassandra, please?"
 
Alex Popescu, Senior Product Manager @ DataStax, responds:
"While I cannot (yet) promise an ETA for JDBC support, what I can say is that it's on our todo list (and very close to the top)."
===========================================
I look forward to seeing how DataStax pulls off the Cassandra JDBC support, but to my mind, trying to do JDBC against Cassandra seems like, I dunno, a bit of an uphill climb.
 
So let's set aside the prospect of first-class Cassandra support in Flyway and see what else is out there.
===========================================
Toward the end of the DataStax Java Driver mailing list thread, someone else chimes in and mentions Pillar, which is a dedicated Cassandra migrations tool written in Scala.
 
And here’s roughly what I wrote in my own internal tool evaluation:
“Before settling on (our) Flyway design for Cassandra schema migrations, I evaluated various open-source Cassandra migration tools. They’re listed below. Of them, the most promising tool was Pillar, which is implemented in Scala. The problem with Pillar vs. (Flyway) was the risk. I was afraid I’d invest time with Pillar and come up empty-handed, that it wouldn’t deliver the sort of contract I expect from Flyway.” That’s what I wrote. I’m happy we went down the road we did (if I weren’t I wouldn’t be here talking about it), but I’d still maintain that Pillar is worth checking out.
 
There's mutagen-cassandra, which is a Java tool written against the Astyanax driver but which hasn't been adapted to the DataStax Java Driver.
 
Then there are these three Python-based tools: Trireme, cql-migrate, mschematool.

Here’s a view of a migrations table that’s responsible for several schemas in PostgreSQL, with PostgreSQL’s concept of a schema analogous to a keyspace in Cassandra.

So let’s get back to the two prominent database migration tools in the relational world.

I think of Liquibase as the Martha Stewart of migration tools. It’s somewhat of a control freak. It wants to do everything itself.

On the other hand, I think of Flyway as the Oprah of migration tools. It provides a framework and then gives you the space to figure things out for yourself.

You see, Liquibase wants to generate the SQL from XML constructs. In the typical usage, the SQL is NOT a first-class citizen. You can define Liquibase migrations as SQL, but even then (to the best of my knowledge) you have to define it inline in the XML.

With Flyway, though, SQL is a first-class citizen. You can make migrations out of straight .sql files. It’s Flyway’s lightweight, unobtrusive, extensible approach that’s going to provide the leverage for using it with Cassandra.
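
For a frame of reference, here is a minimal sketch (mine, not from the talk) of what driving Flyway programmatically looked like in the Flyway 3.x/4.x era. The JDBC URL, credentials, and migrations location are placeholders:

import org.flywaydb.core.Flyway;

public class MigrateExample {
    public static void main(String[] args) {
        // Point Flyway at a JDBC database and at the classpath location
        // holding the versioned V<version>__<description>.sql migrations.
        Flyway flyway = new Flyway();
        flyway.setDataSource("jdbc:postgresql://localhost:5432/mydb", "user", "password");
        flyway.setLocations("classpath:db/migration");
        // migrate() applies, in order, every migration not yet recorded in
        // the schema version table, recording each one as it succeeds.
        flyway.migrate();
    }
}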

So instead of first-class Flyway, we’re going to do faked-out Flyway.

The idea is, let Flyway do what it knows, which is migrations. Let Cassandra do what it knows, which is CQL. All we need is an adapter or translator to connect the two.

And one key point. When I say that Flyway knows migrations, I’m saying that Flyway knows migrations in SQL.

So here’s the tradeoff. Or “the weird trick,” to use the parlance of an Internet ad.

Here’s what I wrote in my own internal design doc:
“The reality is that first-class Flyway support for Cassandra doesn’t really gain us anything more than our fake-Flyway solution does, especially considering that we’re fine with persisting the Flyway migrations table to PostgreSQL; once you’re embracing polyglot persistence, you’d realize that a relational database is a better fit anyway for keeping track of the migrations.” 

Failure handling: If a migration produces invalid CQL, the driver throws a RuntimeException. That RuntimeException is the signal I need to tell the JDBC Connection to roll back the transaction. This emulates the JDBC contract, where a RuntimeException causes the transaction in the actual migrate call to roll back. We do this in the beforeEachMigrate hook so that we have a chance to fail the migration before our dummy, token migration has a chance to run. Flyway will have succeeded with all the migrations up to that point; it will fail only with this particular migration. That preserves the expected Flyway behavior.

Our migrations follow a two-step process. At build time, we produce an artifact that gets published to an artifact repository. That’s the work of a proprietary class called MigrationsBuilder. At runtime, we have another custom class called FlywayMigrator that runs the published migrations against the target database.
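
To make the beforeEachMigrate idea concrete, here is a minimal sketch; it is not the speaker’s actual code. It assumes the Flyway 4.x-era callback API (BaseFlywayCallback) and a DataStax Java Driver Session, and the .sql-to-.cql naming rule and resource layout are hypothetical:

import java.io.InputStream;
import java.sql.Connection;
import java.util.Scanner;
import org.flywaydb.core.api.MigrationInfo;
import org.flywaydb.core.api.callback.BaseFlywayCallback;
import com.datastax.driver.core.Session;

public class CqlMigrationCallback extends BaseFlywayCallback {

    private final Session session; // driver session bound to the target keyspace

    public CqlMigrationCallback(Session session) {
        this.session = session;
    }

    @Override
    public void beforeEachMigrate(Connection connection, MigrationInfo info) {
        // Map the dummy .sql migration Flyway is about to run to the real
        // .cql script at the root of the classpath (hypothetical naming rule).
        String cqlResource = "/" + info.getScript().replace(".sql", ".cql");
        String cql = readResource(cqlResource);
        // Execute each CQL statement. Invalid CQL makes the driver throw a
        // DriverException (a RuntimeException), which fails this migration
        // before the dummy token script ever runs; earlier migrations stick.
        for (String statement : cql.split(";")) {
            if (!statement.trim().isEmpty()) {
                session.execute(statement.trim());
            }
        }
    }

    private String readResource(String path) {
        InputStream in = getClass().getResourceAsStream(path);
        if (in == null) {
            throw new IllegalStateException("Missing CQL script: " + path);
        }
        try (Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}

(The naive split on “;” is good enough for a sketch; scripts with semicolons inside string literals would need a real statement splitter.)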

In the simple case with Flyway, there’s only a single step, the deploy-time step, even if it might be executed at build time, or to be precise, by a build tool like Maven or Gradle.

It’s worth noting that we use the same two-step process, with the same classes, in just the same way when the destination database is PostgreSQL.

We have the .cql files organized into directories according to our releases.

Here you can see that MigrationsBuilder is executed in a Maven build. And you can see that the execution for CQL, as opposed to SQL, differs only by some arguments.

Here we can see the output of MigrationsBuilder. MigrationsBuilder creates .sql files in a package structure that Flyway expects. But our .cql files just show up in the root of the classpath. The generated .sql files have the same simple names as the generated .cql files, and those names have been tweaked from the names in source control to comply with Flyway conventions. The file shown contains the CQL script’s contents.
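
The actual renaming logic is proprietary, but as a purely hypothetical illustration of the kind of mapping involved (Flyway expects versioned migrations named V<version>__<description>.sql; the source layout here is made up):

public class FlywayNames {
    // Maps a made-up source-control layout, e.g. release "2015.09" and script
    // "002_add_sensor_readings.cql", to a Flyway-conventional dummy name,
    // "V2015_09_002__add_sensor_readings.sql".
    public static String toFlywayName(String release, String script) {
        String version = release.replace('.', '_') + "_" + script.substring(0, script.indexOf('_'));
        String description = script.substring(script.indexOf('_') + 1).replace(".cql", "");
        return "V" + version + "__" + description + ".sql";
    }

    public static void main(String[] args) {
        // Prints: V2015_09_002__add_sensor_readings.sql
        System.out.println(toFlywayName("2015.09", "002_add_sensor_readings.cql"));
    }
}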

This is the dummy token script that the Flyway class executes with its migrate method.

Now, at deploy time, when we go to execute FlywayMigrator against the destination database, you can see that the CQL and SQL invocations are quite similar.

Here we see the dependencies for the standalone JAR that’s executed at deploy time. Both JARs depend on the flywayMigrator library. The Cassandra JAR has only one other dependency because it has to support only one keyspace. The PostgreSQL JAR has numerous other dependencies because it has to support multiple schemas along with some migrations and constructs that don’t fit nicely in a schema.

Here you can see how the migrations version tracking table for Cassandra has been populated after a FlywayMigrator execution.

Now I want to go beyond our own Cassandra migration solution and share some best practices that I’ve arrived at and that I’d recommend however you do your migrations.

First, it’s worth keeping in mind the distinction between different kinds of versioning.

Regarding effective contract versions, there’s a nice discussion in Chapter 12 of “NoSQL Distilled” of making two schema versions coexist in a running application.

Consistent deployment across environments. You should be trying to execute your migrations the same way on a local dev box as you do in production. Or at least isolate the differences.

Failure handling: This goes back to the rollback semantics I was describing in beforeEachMigrate. The Flyway contract is that every migration up to the one that failed sticks, because every migration up to that failure succeeded.

Baselining: If you haven’t been doing formalized database migrations from the get-go, you can use the current state of production as the starting point for your migrations: take the “describe keyspace” CQL from cqlsh and make that your initial migration, but only for installations that you want to create from scratch. And if you’ve made a lot of changes to your tables but your migrations haven’t made it to production yet, you can scrap all the history and start from your latest definitions. You get to call a mulligan. Declaring migration bankruptcy.
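
As a sketch of one way to capture that baseline programmatically rather than copy-pasting from cqlsh (assuming the DataStax Java Driver’s KeyspaceMetadata.exportAsString(); the contact point, keyspace, and file names are placeholders):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.datastax.driver.core.Cluster;

public class BaselineExport {
    public static void main(String[] args) throws IOException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            // exportAsString() yields the full CREATE KEYSPACE / CREATE TABLE
            // DDL, much like cqlsh's "describe keyspace".
            String ddl = cluster.getMetadata().getKeyspace("my_keyspace").exportAsString();
            // Save it as the initial migration, used only for from-scratch installs.
            Files.write(Paths.get("V001__baseline.cql"), ddl.getBytes(StandardCharsets.UTF_8));
        }
    }
}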

Rollbacks: Something that Liquibase supports. Part of why Liquibase tries to be such a control freak. Flyway, on the other hand, purposely does not support rollbacks. When I first looked into Flyway, that to me was a downside. But I eventually came around to the Flyway way of thinking. You keep progressing forward, even if you’re semantically going backwards. A little like an event sourcing paradigm.

The DataStax Java Driver has a nice mechanism for checking that your schema changes have propagated across the entire cluster. This snippet is taken from the DataStax Java Driver documentation.
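
The snippet itself didn’t survive in these notes; reconstructed roughly from the driver documentation (assuming the DataStax Java Driver 3.x API, with a made-up keyspace and table), it looks something like this:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class SchemaAgreementExample {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        // Execute a schema change, then check whether all nodes have
        // converged on the new schema version before moving on.
        ResultSet rs = session.execute("ALTER TABLE users ADD middle_name text;");
        if (!rs.getExecutionInfo().isSchemaInAgreement()) {
            // Poll cluster metadata until the schema has propagated;
            // real code would bound this loop with a timeout.
            while (!cluster.getMetadata().checkSchemaAgreement()) {
                Thread.sleep(200);
            }
        }
        cluster.close();
    }
}
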
The graphic is showing how a source-driven migration can inevitably expand into incorporating a live-data migration as well. Maybe you’re changing a column or moving from one table to another, and in the process, you need to copy over the data.

This isn’t so much a limitation. In a way, it’s a strength. Because we’re doing everything programmatically, there’s nothing stopping us from coupling a live-data migration with a source-driven migration. It’s just an extra amount of complexity to account for.

Now here is an actual limitation.

The two tables you see represent the same data, but with one having the data clustered in ascending order and the other with the data clustered in descending order. We need to have a time bucket to keep the partitions from growing indefinitely. In the ascending table, we’re able to incorporate the bucket into the partition key. But with the descending table, we want to be able to drop the tables entirely after a certain amount of time. So with those tables, we make the effective bucket part of the table name.

The ascending table, where the bucket is part of the partition key—that we’re able to create statically in the migrations. But the descending table we have to create dynamically on the fly in the application. So it falls outside the realm of the migrations. I’m sure there’s a better solution out there; we’re living with this solution for now.
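
A hypothetical sketch of that workaround (the table and column names are made up; only the general shape follows what’s described above):

import com.datastax.driver.core.Session;

public class BucketedTables {
    // Creates the descending-order table for one time bucket on the fly from
    // the application. Because these tables are dropped wholesale when the
    // bucket expires, the bucket lives in the table name rather than in the
    // partition key, and creation falls outside the migrations.
    public static void createDescendingBucket(Session session, String bucket) {
        String table = "readings_desc_" + bucket; // e.g. "readings_desc_2015_09"
        session.execute(
            "CREATE TABLE IF NOT EXISTS " + table + " ("
            + "  sensor_id uuid,"
            + "  reading_time timestamp,"
            + "  value double,"
            + "  PRIMARY KEY ((sensor_id), reading_time)"
            + ") WITH CLUSTERING ORDER BY (reading_time DESC)");
    }

    // Retiring an expired bucket is then just a cheap table drop.
    public static void dropExpiredBucket(Session session, String bucket) {
        session.execute("DROP TABLE IF EXISTS readings_desc_" + bucket);
    }
}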

Some other considerations…

Making the migrations part of the main app is what I believe a lot of teams do.

There’s another use case, where you want to migrate not CQL but actual SSTables. At that point you might consider storing the data in external storage like S3, or even a separate Cassandra cluster.

I mentioned Chapter 12 of “NoSQL Distilled,” “Schema Migrations.” Well, Chapter 13 is “Polyglot Persistence.”

And the authors proceed to state the obvious, that different databases solve different problems. Relational databases excel at enforcing the existence of relationships. Not good at discovering relationships or pulling data from different tables into a single object. (Of course, these days some folks will say relational databases aren’t good enough at anything to justify their existence, but even then, that doesn’t necessarily mean that Cassandra is the best fit for everything either.)

13.5. Choosing the Right Technology
"Initially, the pendulum had shifted from specialty databases to a single RDBMS database which allows all types of data models to be stored, although with some abstraction. The trend is now shifting back to using the data storage that supports the implementation of solutions natively."

"Encapsulating data access into services reduces the impact of data storage choices on other parts of a system.“

Our Flyway-based solution has the promise to be a unified migrations solution for disparate persistence stores. What you see here is the view in PostgreSQL’s pgAdmin3 GUI of our dedicated flyway schema. There are two tables, one for the Cassandra migration versions, the other for the PostgreSQL migration versions. The name of that one is flyway_schema_version; it should really be called postgresql_schema_version.

Not that I want to be encouraging persistence store proliferation, but you could see how we could create another table for another RDBMS vendor or for another entirely different type of persistence store.

I hope by now you can appreciate that I’m not trying to sell you on our particular solution.

I am trying to sell you on the value of source-driven schema migrations for Cassandra, and more broadly on the value of adding automation in building blocks at the right granularity.

I’d initially figured this talk would be a better fit for the beginners’ track. It’s not one of the more challenging and exciting things you’ll be doing with Cassandra, but it’s doing the routine, boring things like this that I believe will eventually pay off for you and your work with Cassandra.
