The best knowledge base on Apache Cassandra®


Cassandra 3.11.19 to 4.x upgrade: node failure during rolling upgrade and replace_address causing streaming failures

I am testing a rolling upgrade from Cassandra 3.11.19 to 4.x on a 6-node cluster.

  • Total nodes: 6

  • Seed nodes: node 1 and node 3

  • Replication is active and data is continuously being written (real-time workload)

Scenario

During the upgrade, if a node goes down unexpectedly (e.g., due to an AWS instance failure), I try to recover it using replace_address.

However, when I attempt to use replace_address, I encounter streaming failures, and I am unsure how to proceed.
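For context, this is roughly how I am setting the flag on the replacement node before its first start (assuming the packaged cassandra-env.sh; the IP address is a placeholder for the dead node's listen address):

```shell
# On the *replacement* node, in conf/cassandra-env.sh, before first startup.
# replace_address_first_boot is generally preferred over replace_address
# because it only takes effect on the node's first boot, so the flag
# cannot accidentally apply again after a later restart.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"
```

The replacement node then attempts to stream the dead node's data from the surviving replicas, which is where I see the failures.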

Questions / Concerns

  1. Is replace_address the correct approach for a node that fails mid-way through a rolling upgrade?

  2. What is the recommended recovery strategy if a node fails mid-upgrade (especially when mixing 3.11 and 4.x nodes)?

  3. Are there known issues or limitations with streaming between 3.11 and 4.x when using replace_address?

  4. Would it be safer to:

    • remove the failed node and bootstrap a new one, or

    • restore from backup (e.g., Medusa) and rejoin the cluster?
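The remove-and-bootstrap alternative in question 4 would look roughly like this on my test cluster (host ID is a placeholder; commands are run from any live node):

```shell
# Identify the dead node's Host ID (shown as DN in the status output).
nodetool status

# Remove it; the surviving replicas stream its data to restore
# the replication factor. <host-id> is the UUID from nodetool status.
nodetool removenode <host-id>

# Check progress, and only as a last resort (e.g., if streaming is
# stuck between mixed 3.11/4.x nodes) force completion, accepting
# that some ranges may be under-replicated until a repair runs.
nodetool removenode status
nodetool removenode force
```

A fresh node with a new IP would then be started and allowed to bootstrap normally, but I am unsure whether that bootstrap streaming hits the same mixed-version issue as replace_address.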

Additional context

  • This is just a test cluster (6 nodes), but in production I need to upgrade around 300 nodes.

  • The system is under continuous write load, so downtime or data inconsistency is a major concern.

  • Even in this small test setup, I am struggling to define a reliable recovery strategy.

I would appreciate guidance on best practices or recommended approaches for handling node failures during a rolling upgrade in Cassandra.