Nodes will frequently be marked down with the following message:
ERROR [MessagingService-Incoming-/192.168.1.34] 2019-09-12 10:52:54,054 MessagingService.java:823 - java.util.concurrent.RejectedExecutionException while receiving WRITES.WRITE from /192.168.1.34, caused by: Too many pending remote requests!
I have tried narrowing the problem down and I think it's memtable related, because when I run nodetool tpstats it shows TPC/all/WRITE_MEMTABLE_FULL with approximately 4000 active.
Looking through the OpsCenter graphs, I can see that TP: Write Memtable Full Active is at approximately 4000, TP: Gossip Tasks Pending keeps increasing, and OS: CPU User sits at 80%.
I have also noticed that the commit log directory exceeds the 16GB limit we set; on one node I have seen 7911 files (248GB).
The nodes were previously set to have:
commitlog_segment_size_in_mb: 32
memtable_heap_space_in_mb: 4096
memtable_offheap_space_in_mb: 4096
memtable_cleanup_threshold: 0.5
commitlog_total_space_in_mb: 16384
memtable_flush_writers: 2
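To put the commit log numbers side by side, here is a rough arithmetic sketch. It assumes every file in the commit log directory is a full 32 MB segment, which I have not verified; the function and variable names are only for illustration.

```python
# Values copied from the settings and observations above.
SEGMENT_SIZE_MB = 32        # commitlog_segment_size_in_mb
TOTAL_SPACE_MB = 16384      # commitlog_total_space_in_mb
OBSERVED_SEGMENTS = 7911    # files seen in the commit log directory on one node


def expected_max_segments(total_mb: int, segment_mb: int) -> int:
    """Number of full segments that fit under the configured cap."""
    return total_mb // segment_mb


def observed_size_mb(segments: int, segment_mb: int) -> int:
    """Directory size if every file is a full segment (an assumption)."""
    return segments * segment_mb


max_segments = expected_max_segments(TOTAL_SPACE_MB, SEGMENT_SIZE_MB)
observed_mb = observed_size_mb(OBSERVED_SEGMENTS, SEGMENT_SIZE_MB)
overshoot = observed_mb / TOTAL_SPACE_MB

print(f"segments allowed by the cap: {max_segments}")
print(f"estimated directory size:    {observed_mb} MB ({observed_mb / 1024:.1f} GiB)")
print(f"ratio to configured cap:     {overshoot:.2f}x")
```

Under that assumption the directory holds roughly 15 times the configured cap, which lines up with the 248GB I saw on disk.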
I am currently trying these new settings on one of the nodes to see if they fix the server's performance:
memtable_heap_space_in_mb: 10240
memtable_offheap_space_in_mb: 2048
memtable_cleanup_threshold: 0.11
memtable_flush_writers: 8
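For comparison, here is a small sketch of what the two sets of memtable settings imply. It relies on two assumptions: that the default memtable_cleanup_threshold is 1 / (memtable_flush_writers + 1), as the cassandra.yaml documentation describes, and that a flush is triggered once memtables occupy roughly (heap + offheap) * threshold, which is my reading of the documentation and may not match how this version actually accounts for the two pools. The helper names are hypothetical.

```python
def default_cleanup_threshold(flush_writers: int) -> float:
    """Documented default: 1 / (memtable_flush_writers + 1)."""
    return 1.0 / (flush_writers + 1)


def flush_trigger_mb(heap_mb: int, offheap_mb: int, threshold: float) -> float:
    """Approximate memtable usage (MB) at which a flush starts.

    Assumption: the threshold applies to heap + offheap combined.
    """
    return (heap_mb + offheap_mb) * threshold


# Previous settings: 4096 MB heap, 4096 MB offheap, threshold 0.5, 2 writers.
old_trigger = flush_trigger_mb(4096, 4096, 0.5)
# New settings: 10240 MB heap, 2048 MB offheap, threshold 0.11, 8 writers.
new_trigger = flush_trigger_mb(10240, 2048, 0.11)

print(f"default threshold with 2 writers: {default_cleanup_threshold(2):.2f}")
print(f"default threshold with 8 writers: {default_cleanup_threshold(8):.2f}")
print(f"old flush trigger: {old_trigger:.0f} MB")
print(f"new flush trigger: {new_trigger:.0f} MB")
```

If those assumptions hold, the new threshold of 0.11 is simply the default for 8 flush writers, and flushing should begin at a much lower memtable usage than with the previous settings.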
If anyone has any ideas on what else I can look at, I would appreciate them.
Also, how do I view the MemtablePool.BlockedOnAllocation metric that is mentioned in the memtable_flush_writers section of http://cassandra.apache.org/doc/latest/configuration/cassandra_config_file.html?