I have no benchmarks to back this up today, but I've run plenty of them in the past, and perhaps it's time we as a community drill down on these numbers with a scientific set of tests rather than relying on my experience fixing a bunch of different clusters over a couple of years.
During my time at DataStax I was lucky enough to work on a wide range of hardware, in no small part because as our customer base and priorities shifted, the hardware I got to work with changed dramatically.
Originally, I would only occasionally do heavy tuning for compaction on the high-end (albeit commodity) hardware used by my first customers; the defaults were often good enough to get work done, and for higher-throughput use cases we'd turn up compaction throughput.
Fast forward to the past year: many of my customers have pre-templated hardware they use as an organization no matter how poorly suited it is for database workloads, and changing anything is a year-long slog of meetings with various tiers of decision makers. In the meantime I've had to tune substandard hardware clusters to work well enough to stay up. What follows are my solutions and some of the rationale behind them. Many of you won't agree with me, but I suggest testing these first; I've been through many iterations, and once other bottlenecks are removed these approaches have proved reliable.
Solutions for SANs, no matter how expensive
- Stop trying to do Big Data.
- You probably shouldn't be interested in a distributed system if you're set on centralized storage. This is an existential question to me.
Solutions For Spindle Die Hards
- Reduce contention: set concurrent_compactors to 1 if it isn't already (you can get away with more if you're, say, using JBOD, but that's difficult to do with spindles). This not only frees up a lot of CPU, it also means less concurrent IO fighting over spindle position. The following issue discusses some of the bad things that can happen when this is too high or left at the old default: CASSANDRA-7139 - Default concurrent_compactors is probably too high. A sketch of these settings follows this list.
- Set compaction throughput appropriately: keep raising or lowering the number until you hit a sweet spot on pending compactions. The "sweet spot" is as low as you can go; more than 10 pending compactions I consider suboptimal, and more than 100 makes me worry the cluster will eventually start a slow degradation into oblivion. This number varies with workload, GC tuning, and hardware profile, so do not take it as law; establish your own baselines if for some reason you find these thresholds egregious (see the "Note about pending compactions calculation" below for versions 2.0.16 and 2.1.7). Throughput can be as low as 10 MB/s or as high as 300 MB/s (I have a RAID 0 cluster with 12 drives and a strong controller handling that fine).
- If you're on 2.0 or older, don't use multi-threaded compaction no matter what anyone tells you; it was removed entirely in 2.1 (CASSANDRA-6142).
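A minimal sketch of the knobs above, assuming a stock 2.x cassandra.yaml and that nodetool is on your path; the starting throughput value is purely illustrative, not a recommendation:

```
# cassandra.yaml (restart required to pick these up):
#   concurrent_compactors: 1        # a single compaction thread, less seek contention
#   multithreaded_compaction: false # 2.0 and older only; the option was removed in 2.1

# Compaction throughput can be changed live while you hunt for the sweet spot:
nodetool setcompactionthroughput 16   # MB/s; start low on spindles and adjust
nodetool getcompactionthroughput      # confirm the current setting

# Watch pending compactions settle, then adjust and repeat
nodetool compactionstats
```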
Solutions for sub-8 CPU core (non hyper-threaded) systems with local SSDs
- Reduce CPU usage first: you're going to be CPU bound before you're IO bound, so reduce concurrent_compactors to no more than 2 and make sure internode_compression: dc is set. Fix your GC (easy mode is Java 8 and a decently large heap, which now seems to be 20 GB+ with current testing methods) and tune your flush writer settings so they don't overwhelm the CPU either. A minimal sketch of these settings follows this list.
- Crank compaction throughput: raise the throttle to the level needed to keep from getting backed up; 250 MB/s is not uncommon for SSD-based setups even with the most expensive compaction strategy (see the last point in this list). You must be aware of GC tuning as well (see the point above).
- Also, as noted above, do not use multi-threaded compaction on 2.0 and older systems.
- Consider using Size Tiered Compaction Strategy (STCS) instead of Leveled Compaction Strategy (LCS). LCS has much better read characteristics than STCS, but it is quite a bit more expensive at compaction time.
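A minimal sketch of the settings above, assuming a 2.x-era cassandra.yaml and cassandra-env.sh; the exact numbers are illustrative starting points rather than prescriptions:

```
# cassandra.yaml:
#   concurrent_compactors: 2        # cap compaction threads on small CPU counts
#   internode_compression: dc       # only compress cross-DC traffic to save CPU
#   memtable_flush_writers: 2       # keep flush threads modest on a small box
#
# cassandra-env.sh (Java 8, large heap as discussed above):
#   MAX_HEAP_SIZE="20G"

# Then push compaction throughput up at runtime:
nodetool setcompactionthroughput 250
```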
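If you do decide LCS isn't worth the compaction cost on this class of hardware, switching a table back to STCS is a single CQL statement (the keyspace and table names here are placeholders); existing SSTables will be reorganized under the new strategy over subsequent compactions:

```
cqlsh -e "ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'};"
```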
Solutions for high-CPU (more than 8 non-HT cores) systems with local SSDs
- As above, crank compaction throughput.
- Go ahead and test with higher concurrent_compactors than the default; it will often give you better results if you have the CPU to handle it. However, bear in mind you are consuming resources that could be used for flushing to disk faster.
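For example (again only a sketch, with illustrative numbers): raise concurrent_compactors in cassandra.yaml and keep an eye on both the compaction backlog and the flush threads with nodetool:

```
# cassandra.yaml:
#   concurrent_compactors: 4     # try going above the default if cores are plentiful

nodetool setcompactionthroughput 250   # crank throughput as above
nodetool compactionstats               # pending compactions should stay low
nodetool tpstats                       # watch MemtableFlushWriter for backed-up flushes
```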
A note about unthrottled compaction
All the time I see people simply unthrottle compaction. While that will often get people "out of trouble", and so a confirmation bias sets in that it must be the best way to solve problems, I'd argue it rarely ever is. If you read the original throttling Jira (CASSANDRA-2156 - Compaction Throttling) you can see that there is a hurry-up-and-wait component to unthrottled compaction. Ultimately you will saturate your IO in bursts, backing up other processes and making different bottlenecks spike along the way, potentially causing something OTHER than compaction (such as GC) to get so far behind that the server becomes unresponsive.
Before you get into a state where you throw your hands up and unthrottle compaction, you should have some idea of what your I/O CAN handle, and compaction throughput should be set just under that level. One rough way to estimate it is sketched below.
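A sketch of that measurement, assuming the sysstat package is installed and your data drive is sdb (substitute your own device and numbers): watch the disk under representative load, see how much write bandwidth is left over, and cap compaction below it:

```
# Sample extended disk stats every 5 seconds and note MB written per second
iostat -xm 5 sdb

# If the disk tops out around, say, 200 MB/s and the workload itself needs half
# of that, leave some headroom and cap compaction below the remainder:
nodetool setcompactionthroughput 80
```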
Note about pending compactions calculation
In 2.0.16 and 2.1.7 the pending compactions calculation skyrockets into the tens of thousands on an otherwise bored cluster. The following issues are related to this, and it was fixed in later revisions. I still have not tested this closely, so you may have to take my pending-compactions warning level of 100 with a grain of salt on newer versions until I have a better baseline.
CASSANDRA-9592 - Periodically attempt to submit background compaction tasks
CASSANDRA-9662 - CompactionManager reporting wrong pending tasks
CASSANDRA-11119 - Add bytes-to-compact estimate. TL;DR: STCS is more accurate than LCS with respect to pending compactions. I view pending compactions as a bit of a binary state, though, and I'm less interested in the exact number beyond whether it hits warning levels. It is something to keep in mind when tuning LCS, however.
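Whichever version you are on, the number I keep referring to comes straight out of nodetool; a quick way to eyeball it (the host names below are placeholders) is:

```
# Pending compactions on the local node
nodetool compactionstats | grep -i pending

# Or the same figure for every node, if you have SSH access across the ring
for host in node1 node2 node3; do
  echo -n "$host: "
  ssh "$host" nodetool compactionstats | grep -i pending
done
```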