Cassandra 3.11.19 to 4.x upgrade: node failure during rolling upgrade and replace_address causing streaming failures
Author: 조현재
Originally Sourced from: https://stackoverflow.com/questions/79922550/cassandra-3-11-19-to-4-x-upgrade-node-failure-during-rolling-upgrade-and-replac
I am testing a rolling upgrade from Cassandra 3.11.19 to 4.x on a 6-node cluster.
- Total nodes: 6
- Seed nodes: node 1 and node 3
- Replication is active and data is continuously being written (real-time workload)
Scenario
During the upgrade, if a node goes down unexpectedly (e.g., due to an AWS issue), I try to recover it using replace_address.
However, when I attempt the replace, the streaming sessions fail and I am unsure how to proceed.
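For reference, this is roughly how I am performing the replace on the new node in my test setup (the IP address and file paths are placeholders for my environment):

```shell
# On the replacement node, before first start:
# - cassandra.yaml matches the cluster (cluster_name, seeds, snitch, etc.)
# - data and commitlog directories are empty

# 3.11-style: pass the dead node's address via cassandra-env.sh
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"' \
  >> /etc/cassandra/cassandra-env.sh

# Start the node; it should stream the dead node's token ranges
# from the surviving replicas, but in my test the streaming
# sessions fail partway through.
sudo systemctl start cassandra

# Monitor streaming progress from any live node
nodetool netstats
```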
Questions / Concerns
- Is replace_address the correct approach in this situation during a rolling upgrade?
- What is the recommended recovery strategy if a node fails mid-upgrade (especially when mixing 3.11 and 4.x nodes)?
- Are there known issues or limitations with streaming between 3.11 and 4.x when using replace_address?
- Would it be safer to:
  - remove the failed node and bootstrap a new one, or
  - restore from backup (e.g., Medusa) and rejoin the cluster?
This is only a 6-node test cluster, but in production I need to upgrade around 300 nodes.
The system is under continuous write load, so downtime and data inconsistency are major concerns.
Even in this small test setup, I am struggling to define a reliable recovery strategy.
I would appreciate guidance on best practices or recommended approaches for handling node failures during a rolling upgrade in Cassandra.