DSE production nodes going down frequently

Author: Dhaval Bhatt

We run DSE in production across two datacenters. One datacenter handles Spark workloads and the other handles Solr, in addition to Cassandra data storage.

Recently, nodes have been going down so frequently that we spend almost all of our time monitoring them and bringing the DSE process back up.

So far we have tried removing some old data. We created a C# console application that fetches data in a paged manner and deletes it from the production nodes, simply to reduce the storage load on the node(s).
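For illustration, the paged removal loop could look roughly like the sketch below. It is written in Python rather than C# and is not the actual console application: `fetch_page` and `delete_rows` are hypothetical callables standing in for the real driver calls, and the pause between pages is an assumed throttle, not something the application is known to do.

```python
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical contract: given a paging token (None for the first page),
# return the rows of that page and the token of the next page (None if last).
FetchPage = Callable[[Optional[int]], Tuple[List[str], Optional[int]]]


def purge_in_pages(
    fetch_page: FetchPage,
    delete_rows: Callable[[List[str]], None],
    pause_seconds: float = 0.0,
) -> int:
    """Delete old rows one page at a time and return how many were deleted."""
    deleted = 0
    token: Optional[int] = None
    while True:
        rows, token = fetch_page(token)
        if rows:
            delete_rows(rows)
            deleted += len(rows)
        if token is None:
            break  # no further pages
        # Assumed throttle so deletes do not compete with client requests.
        time.sleep(pause_seconds)
    return deleted
```

A non-zero `pause_seconds` trades a longer cleanup run for less pressure on the nodes while the deletes are in flight.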

However, I have noticed some recent changes that might be affecting performance, although I am not entirely sure about any of them.

Moved machine domain: We are in the process of changing the domain across the organization. As part of this, some machines have already moved to the new domain and others are still in progress. Can this affect inter-machine communication when two machines in the same datacenter are in different domains?

Frequent data removal runs: As mentioned, we created a process that removes old data. Each delete turns the data into a tombstone, which can slow down compaction and keep DSE busy for longer. At the same time, a Scala job may be trying to run alongside client requests, and this combination may be hanging the DSE process. If this is the case, what is the best way to remove old data?
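To make the tombstone concern concrete: a tombstone cannot be dropped by compaction until the table's `gc_grace_seconds` has elapsed, and Cassandra's default for that setting is 864000 seconds (10 days). The sketch below only computes that earliest point in time. It assumes the tables use the default value, which I have not verified for our schema, and the actual purge still depends on compaction picking up the relevant SSTables.

```python
from datetime import datetime, timedelta, timezone

# Cassandra's default gc_grace_seconds (10 days); our tables may differ.
DEFAULT_GC_GRACE_SECONDS = 864_000


def earliest_purge_time(
    deleted_at: datetime,
    gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS,
) -> datetime:
    """Earliest moment at which compaction may drop a tombstone."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


if __name__ == "__main__":
    # Example deletion timestamp, chosen only for illustration.
    deleted = datetime(2019, 9, 1, tzinfo=timezone.utc)
    print(earliest_purge_time(deleted).isoformat())
```

So data deleted by a frequent removal run keeps occupying disk, and keeps being read past, for at least that long after each run.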

Total data load vs. node count: As of now, we have almost 6 TB of data (with replication factor 3) and 15 DSE nodes (9 for ANALYTICS, 6 for SOLR). Do we need to add extra machines to handle this load?
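As a rough back-of-the-envelope check of the numbers above, the average load per node can be estimated as below. Both readings are assumptions: I am not certain whether the 6 TB figure is the cluster-wide total or what each datacenter holds, and the calculation assumes data is spread evenly across nodes.

```python
def per_node_load_tb(total_tb: float, node_count: int) -> float:
    """Average on-disk load per node, assuming an even spread of data."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return total_tb / node_count


# Reading 1 (assumption): 6 TB is the total across all 15 nodes.
cluster_wide = per_node_load_tb(6.0, 15)

# Reading 2 (assumption): each datacenter holds its own 6 TB.
analytics_dc = per_node_load_tb(6.0, 9)
solr_dc = per_node_load_tb(6.0, 6)

print(f"cluster-wide: {cluster_wide:.2f} TB/node")
print(f"analytics DC: {analytics_dc:.2f} TB/node")
print(f"solr DC:      {solr_dc:.2f} TB/node")
```

Under either reading the average stays at or below about 1 TB per node, so the per-node volume alone may not explain the outages.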

Originally Sourced from: https://stackoverflow.com/questions/57883984/dse-production-node-frequent-node-down-issue