UPDATE:
This list needs to be updated; as of today it has only been verified with Cassandra 2.0.
Original Blog Post:
This isn’t remotely complete, but a colleague asked me to do a brain dump of my process, and this is by and large it. I’m sure it will leave more questions than answers for many of you, and I’d like to follow up this post at some point with the why and the how of a lot of it so it can be more useful to beginners. Today it is in a very raw form.
What to gather (a quick collection script follows this list):
- logs
- cassandra.yaml
- cassandra-env.sh
- histograms
- tpstats
- schema of all tables
- nodetool status
- ulimit -a as the user that Cassandra is running as (make sure it matches the recommended settings)
- heap usage under load: is it hitting 3/4 of MAX_HEAP?
- Pending compactions (OpsCenter; will have to look up the JMX metric later)
- size of writes/reads
- max partition size (the histograms will show this)
- TPS per node
- list of queries run against the cluster
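To make the gathering less tedious, here’s a minimal sketch (Python 3) that pulls most of the nodetool output above into files on one node. It assumes nodetool is on the PATH; the keyspace/table names for the histograms and the output filenames are placeholders, and you still have to grab the logs, cassandra.yaml, cassandra-env.sh, and schema by hand.

```python
#!/usr/bin/env python3
"""Rough diagnostics grab for a single node. Assumes nodetool is on the
PATH; keyspace/table names and output filenames are placeholders. Run as
(or su to) the user Cassandra runs as so ulimit reflects the right limits."""
import subprocess

COMMANDS = {
    "status.txt": ["nodetool", "status"],
    "tpstats.txt": ["nodetool", "tpstats"],
    "compactionstats.txt": ["nodetool", "compactionstats"],
    "cfstats.txt": ["nodetool", "cfstats"],
    # histograms are per table; repeat for each table you care about
    "histograms.txt": ["nodetool", "cfhistograms", "my_ks", "my_table"],
    "ulimit.txt": ["sh", "-c", "ulimit -a"],
    # schema: run DESCRIBE SCHEMA from cqlsh by hand and save the output
}

for outfile, cmd in COMMANDS.items():
    with open(outfile, "w") as f:
        try:
            f.write(subprocess.check_output(cmd).decode("utf-8", "replace"))
        except Exception as exc:  # keep going; partial data is still useful
            f.write("command failed: %s\n" % exc)
```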
Red flags in the logs (a quick scan script follows this list):
- ERROR
- WARN
- dropped
- GCInspector (watch for ParNew collections over 200ms, and any CMS)
- Emergency
- Out Of Memory
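Not a substitute for reading the logs, but here’s a minimal sketch (Python 3) of that scan. The log path and the GCInspector line format are assumptions based on a stock 2.0 install; adjust the regex if your version logs GC pauses differently.

```python
#!/usr/bin/env python3
"""Quick scan of system.log for the red flags listed above. The GCInspector
regex matches the 2.0-era 'GC for ParNew: 248 ms ...' style; the log path
is an assumption."""
import re

LOG = "/var/log/cassandra/system.log"  # adjust to your install
KEYWORDS = ("ERROR", "WARN", "dropped", "Emergency", "OutOfMemory")
GC_RE = re.compile(r"GC for (ParNew|ConcurrentMarkSweep): (\d+) ms")

with open(LOG, errors="replace") as f:
    for line in f:
        if any(k in line for k in KEYWORDS):
            print(line.rstrip())
            continue
        m = GC_RE.search(line)
        # flag any CMS collection, and ParNew pauses over 200ms
        if m and (m.group(1) != "ParNew" or int(m.group(2)) > 200):
            print(line.rstrip())
```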
Red flags in cassandra.yaml and cassandra-env.sh (a config-check sketch follows this list):
- heap not set to 8GB
- ParNew (heap new size) larger than 800MB (unless using tunings from https://issues.apache.org/jira/browse/CASSANDRA-8150)
- row cache being enabled (unless the workload is ~95% reads with evenly sized rows)
- vnodes enabled together with Solr
- system_auth keyspace still with RF 1 and SimpleStrategy
- memtable_flush_writers set to a crazy high level (varies by disk configuration; follow the documentation’s advice; double digits is suspect)
- rpc_address: 0.0.0.0 (slows down certain versions of the driver)
- multithreaded_compaction: true (almost always wrong)
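Here’s a hedged sketch (Python 3 with PyYAML) of those cassandra.yaml checks, using the 2.0-era option names; the file path is an assumption, and the heap/ParNew settings live in cassandra-env.sh so they still need eyeballing.

```python
#!/usr/bin/env python3
"""Sketch of the cassandra.yaml red-flag checks above (2.0-era option
names). Requires PyYAML; the config path is an assumption."""
import yaml

with open("/etc/cassandra/cassandra.yaml") as f:
    conf = yaml.safe_load(f)

def flag(condition, message):
    if condition:
        print("RED FLAG:", message)

flag(str(conf.get("rpc_address")) == "0.0.0.0",
     "rpc_address 0.0.0.0 (slows down certain driver versions)")
flag(conf.get("multithreaded_compaction") is True,
     "multithreaded_compaction: true (almost always wrong)")
flag((conf.get("memtable_flush_writers") or 0) >= 10,
     "memtable_flush_writers in double digits (suspect)")
flag((conf.get("row_cache_size_in_mb") or 0) > 0,
     "row cache enabled (only OK for ~95% reads with evenly sized rows)")
# Heap size and ParNew (-Xmn) live in cassandra-env.sh, not the yaml,
# so check MAX_HEAP_SIZE / HEAP_NEWSIZE there by hand.
```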
Red flags in the histograms (a partition-size check follows this list):
- double hump (a bimodal latency distribution)
- a long, long tail of latencies
- partitions with cell count over 100k
- partitions with size over in_memory_compaction_limit (default 64MB)
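The double hump and the long tail you have to eyeball, but the partition-size check can be scripted. A rough sketch (Python 3) against `nodetool cfstats` output; the exact label for the maximum compacted partition size varies by version, so both spellings are matched.

```python
#!/usr/bin/env python3
"""Flag tables whose largest compacted partition exceeds
in_memory_compaction_limit (64MB by default). Parses `nodetool cfstats`;
label names vary by version, so this is a sketch, not a supported tool."""
import re
import subprocess

LIMIT = 64 * 1024 * 1024  # in_memory_compaction_limit_in_mb default
MAX_RE = re.compile(r"Compacted (?:row|partition) maximum (?:size|bytes):\s*(\d+)")

table = None
for line in subprocess.check_output(["nodetool", "cfstats"]).decode().splitlines():
    line = line.strip()
    if line.startswith(("Table:", "Column Family:")):
        table = line.split(":", 1)[1].strip()
    m = MAX_RE.match(line)
    if m and int(m.group(1)) > LIMIT:
        print("RED FLAG: %s max partition is %s bytes" % (table, m.group(1)))
```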
Red flags in tpstats (a quick check follows this list):
- dropped anything, especially mutations.
- blocked flush writers (if the all-time blocked count is in the hundreds it’s usually a problem)
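A small sketch (Python 3) of those two tpstats checks, written against the 2.0-era `nodetool tpstats` layout (pool rows ending in the all-time-blocked column, plus the dropped-message section at the bottom); adjust if your version formats it differently.

```python
#!/usr/bin/env python3
"""Pull the two tpstats red flags above out of `nodetool tpstats`:
non-zero dropped message counts, and a FlushWriter all-time-blocked
count in the hundreds. Assumes the 2.0-era output layout."""
import subprocess

for line in subprocess.check_output(["nodetool", "tpstats"]).decode().splitlines():
    parts = line.split()
    # Dropped-message section rows look like "<MESSAGE_TYPE> <count>"
    if len(parts) == 2 and parts[0].isupper() and parts[1].isdigit() and int(parts[1]) > 0:
        print("RED FLAG: dropped %s: %s" % (parts[0], parts[1]))
    # Thread-pool rows: the last column is "All time blocked"
    if parts and parts[0] == "FlushWriter" and parts[-1].isdigit() and int(parts[-1]) >= 100:
        print("RED FLAG: FlushWriter all-time blocked: %s" % parts[-1])
```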
General red flags:
- STCS (size-tiered compaction) in use on SSDs when the customer has a low read-latency SLA (or no defined one).
- Using RF less than 3 per DC
- Using RF more than 3 per DC
- Using SimpleStrategy with multiDC
- read_repair_chance and dclocal_read_repair_chance adding up to more than 0.1 is usually a bad tradeoff.
- Secondary indexes in use (on writes think write amplification, and on reads think synchronous full cluster scan).
- Is system_auth replicated correctly? And has repair been run after this was changed? If you see auth errors in the log, the answer is probably no (see the sketch after this list).
- use of racks is not even (4 nodes in one rack and 2 in another… that’s a no-no)
- number of racks is not enough to fulfil a multiple of RF (if you have 2 racks of 2 nodes and RF 3, how will that get laid out evenly?).
- load (in nodetool status) is wildly uneven. It may not mean anything, but go look on disk; if the Cassandra data files are really badly imbalanced, figure out why.
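For the system_auth item above, this is roughly the fix, sketched with the DataStax Python driver; the contact point and DC names are placeholders, and RF 3 per DC follows the guidance above.

```python
#!/usr/bin/env python3
"""Switch system_auth to NetworkTopologyStrategy with RF 3 per DC, then
repair. Contact point and DC names are placeholders; assumes the DataStax
Python driver (pip install cassandra-driver)."""
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # any reachable node
session = cluster.connect()
session.execute("""
    ALTER KEYSPACE system_auth
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'DC1': 3, 'DC2': 3}
""")
cluster.shutdown()
# Then run `nodetool repair system_auth` on every node; the change only
# takes real effect once the auth data has actually been replicated.
```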
Benchmarking:
- Run cassandra-stress as a baseline.
- Run cassandra-stress with the application’s write size; this will often identify bottlenecks.
- Do the math on write size and desired TPS. Are the writes saturating the network? Don’t forget bits and bytes are different 🙂 (worked example after this list).
- Lower ParNew for a lower peak 99th percentile. This flies in the face of what is happening in this Jira (https://issues.apache.org/jira/browse/CASSANDRA-8150), but until I’ve worked through all of that, a ParNew lower than 800MB is generally a good way to trade off throughput for smaller ParNew GCs.
- Run fio with the following profile https://gist.github.com/tobert/10685735 (adjusted to match the user’s system), using these as a baseline (http://tobert.org/disk-latency-graphs/).
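The bits-versus-bytes math, worked through with made-up numbers; the TPS, write size, node count, and RF below are all assumptions you swap for the application’s real figures.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope network math for writes. All inputs are
placeholder assumptions."""
writes_per_sec = 50_000      # desired TPS (assumption)
write_size_bytes = 2_000     # average mutation size (assumption)
nodes = 6
rf = 3

# Each write lands on RF replicas, so cluster-wide bytes per second:
bytes_per_sec = writes_per_sec * write_size_bytes * rf
per_node_bits_per_sec = bytes_per_sec / nodes * 8  # bytes -> bits!

print("per-node network load: %.2f Gbit/s" % (per_node_bits_per_sec / 1e9))
# 50k TPS * 2KB * RF 3 over 6 nodes = 50MB/s per node = 0.4 Gbit/s, which
# is fine on GigE; double the TPS or the write size and a 1Gb NIC starts
# to look tight before Cassandra is even the bottleneck.
```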
Once you’ve established the system is awesome, review queries and code.
- Things to look for: the BATCH keyword used for bulk loading (fine for consistency, but you have to take the SLA hit into account if the writes are larger than BATCH can handle).
- If using batches, what is the write size?
- Are you using a thrift driver and destroying one or two nodes because of bad load balancing?
- Are you using the DataStax driver? Is it using the token-aware policy (shuffle on with 2.0.8, ideally)? See the sketch after this list.
- If using the DataStax driver, is it the latest 2.0.x? There are lots of useful fixes in each release; it really matters.
- Using LWT? They involve 4 round trips, so… while they’re awesome, they’re slower.
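The list above is talking about the Java driver, but the same idea sketched with the DataStax Python driver looks like this; the contact point, DC, keyspace, and query are placeholders. Note that token-aware routing only helps when the driver knows the partition key, which is one more reason to prepare your statements.

```python
#!/usr/bin/env python3
"""Token-aware load balancing with the DataStax Python driver, as an
equivalent sketch of the Java driver policy discussed above. Contact
point, DC, keyspace, and query are placeholders."""
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

cluster = Cluster(
    ["10.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="DC1")
        # newer Python driver versions also accept shuffle_replicas=True,
        # the analogue of the Java driver's "shuffle" option
    ),
)
session = cluster.connect("my_ks")

# Token-aware routing only kicks in when the driver knows the partition
# key, so prepare the statement instead of sending raw strings.
ps = session.prepare("SELECT * FROM my_table WHERE id = ?")
row = session.execute(ps, ["some-key"]).one()
cluster.shutdown()
```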