Apache Cassandra is great at handling huge volumes of data. Everything works smoothly when you know your data patterns up front and can make configuration decisions based on that knowledge.
We recently worked with a marketing company which serves ads based on user behavior. The assumption about the use case and data patterns was that the majority of users (effectively UUIDs stored in the database) click on ads within a short time period. A user lands on a website, the software writes a UUID for that user into the database, and each click on an ad is an impression counted towards the profile built for that user by the background system (this way the system slowly learns each user's preferences and serves them better ads). After that first short burst of activity (around five minutes), in most cases, the user disappears. There are returning users, but they account for less than 1% of overall traffic.
This use case explanation is not essential for the SSTable debugging story, but it does explain certain decisions we made while configuring Cassandra, and why we eventually had to go down to the SSTable level and search for our data. We use Cassandra 3.x with all the SSTable manipulation tools that ship with it. Based on the access pattern above, we chose TimeWindowCompactionStrategy (TWCS), which at the time was still new and not widely used, with a time window of 1 day and a TTL of 30 days. Our reasoning went like this: 99% of our reads will be served from the latest SSTable, which will be in the page cache (since reads happen in a short window after the write, during the same day), so TWCS seemed like a good fit. The small percentage of returning users who are matched across more than one day will be read sub-optimally from many SSTables, but we could live with that since they represent only 1% of our reads, right?
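For reference, those settings boil down to a table definition along the lines of the sketch below. This is only a sketch: the column names (user_id, event_time, ad_id) are invented for illustration and the keyspace/table names are the placeholders used throughout this post; only the 1-day TWCS window and the 30-day TTL reflect our actual configuration.
CREATE TABLE keyspace.table (
    user_id    uuid,        -- hypothetical columns, for illustration only
    event_time timestamp,
    ad_id      uuid,
    PRIMARY KEY (user_id, event_time)
) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',     -- daily buckets
        'compaction_window_size': '1'
  }
  AND default_time_to_live = 2592000;         -- 30 days, in seconds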
Well, not really. After a few days of collecting and serving data, we were pretty surprised by the results. After 10 days of collecting data, nodetool tablehistograms showed that 50% of our reads were touching 10 SSTables. Latency went up accordingly, since each of those reads meant up to 10 disk seeks, and going to disk is always slower than narrowing the search down or reading directly from memory. This was the alarm that pushed us to dig deeper into our data patterns and find out where our data actually lived, and it turned out to be a good opportunity to survey the available SSTable tools and write this short review of the options.
Naturally, the first instinct was to reach for nodetool, which has many useful commands for figuring out where your data is in more detail. From our incoming data we extracted a couple of UUIDs that showed increased read latency; these were good candidates for further investigation. The experiment below shows the exact steps we took to prove that the UUIDs with increased latency really were spread across many SSTables, and that reading them from disk was killing our performance.
Nodetool tablehistograms [keyspace] [table]
This is the first useful command. Note that in Cassandra 3.x the command was renamed (the old name still works): cfhistograms became tablehistograms, and likewise cfstats became tablestats. For a given table it lists, per percentile, the read and write latency, the number of SSTables touched per read, the partition size and the cell count. The output is a great first indicator that something is wrong. In our case, it showed that we were hitting 10 SSTables for 50% of our reads.
[centos@cassandra-c-2]$ nodetool tablehistograms keyspace table;
keyspace/table histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%            10.00             14.24            182.79               103                 1
75%            10.00             17.08            315.85               149                 2
95%            10.00             20.50           7007.51               372                 7
98%            10.00             24.60          10090.81               642                12
99%            10.00             29.52          12108.97               770                14
Min             8.00              3.31             29.52                43                 0
Max            13.00           1358.10          62479.63              1597                35
Nodetool tablestats [keyspace]
This is the second step we usually take after tablehistograms shows that something is off. It outputs a nice summary of every table inside the keyspace: the total SSTable count, metadata about memtables and bloom filters, information about cells per row, the number of keys, and so on. In our case it showed a total of 13 SSTables, and when we correlated that with the tablehistograms output we could conclude that we were touching almost all of them on reads. Another important thing to look at here is the pair of Bloom filter false positives and Bloom filter false ratio figures. A bloom filter is a probabilistic structure which tells Cassandra which SSTables definitely do not contain the requested data; a false positive is a lookup in an SSTable that turns out not to have the data. With a false ratio of 0, these two figures tell us the bloom filters are not the problem: they are not sending Cassandra to SSTables that do not hold the data. In other words, we really were hitting 10 SSTables because our data really was in 10 SSTables.
Keyspace: keyspace
Read Count: 72628768
Read Latency: 0.7189103991823186 ms.
Write Count: 652226331
Write Latency: 0.012940205744315465 ms.
Pending Flushes: 0
Table: table
SSTable count: 13
Space used (live): 99252365458
Space used (total): 99252365458
Space used by snapshots (total): 0
Off heap memory used (total): 892785184
SSTable Compression Ratio: 0.0
Number of keys (estimate): 279335104
Memtable cell count: 710670
Memtable data size: 40221377
Memtable off heap memory used: 0
Memtable switch count: 903
Local read count: 72628771
Local read latency: 0.981 ms
Local write count: 652226458
Local write latency: 0.015 ms
Pending flushes: 0
Bloom filter false positives: 1
Bloom filter false ratio: 0.00000
Bloom filter space used: 759080720
Bloom filter off heap memory used: 759080536
Index summary off heap memory used: 133704648
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 1597
Compacted partition mean bytes: 150
Average live cells per slice (last five minutes): 13.261506069748489
Maximum live cells per slice (last five minutes): 35
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Sstablemetadata utility
Cassandra ships with a couple of SSTable utilities that can help with debugging. One of them is sstablemetadata; all the details can be found on the DataStax website. It prints valuable information about a given SSTable such as the minimum timestamp, maximum timestamp, estimated droppable tombstones, further details about tombstones, etc. Thanks to Alex Dejanovski and his blog post "TWCS - how does it work and when should you use it" on The Last Pickle website, we managed to get the metadata for every SSTable in our data folder, which is great for TWCS debugging and for checking whether the minimum and maximum timestamps really match the configured bucket.
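The basic invocation is just the tool name plus a Data.db file; for example, run from inside the table's data directory against one of the files that shows up in the listing below:
sudo sstablemetadata mc-647149-big-Data.db
Among its output are the minimum/maximum timestamp and estimated droppable tombstones lines that the one-liner below greps for.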
for f in *Data.db; do meta=$(sudo sstablemetadata $f); echo -e "Max:" $(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" " -f3| cut -c 1-10) '+%m/%d/%Y') "Min:" $(date --date=@$(echo "$meta" | grep Minimum\ time | cut -d" " -f3| cut -c 1-10) '+%m/%d/%Y') $(echo "$meta" | grep droppable) ' \t ' $(ls -lh $f | awk '{print $5" "$6" "$7" "$8" "$9}'); done | sort
This one-liner loops through all SSTable data files, runs sstablemetadata on each one, and pulls out the minimum and maximum timestamps, the number of droppable tombstones, and some file attributes from the ls command. Our output looks like this and shows that each SSTable really holds 1 day of data, except for the latest SSTables, which are still being compacted with SizeTieredCompactionStrategy until the current day's window closes.
[centos@cassandra-c-3]$ for f in *Data.db; do meta=$(sudo sstablemetadata $f); echo -e "Max:" $(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" " -f3| cut -c 1-10) '+%m/%d/%Y') "Min:" $(date --date=@$(echo "$meta" | grep Minimum\ time | cut -d" " -f3| cut -c 1-10) '+%m/%d/%Y') $(echo "$meta" | grep droppable) ' \t ' $(ls -lh $f | awk '{print $5" "$6" "$7" "$8" "$9}'); done | sort
Max: 01/01/2017 Min: 12/31/2016 Estimated droppable tombstones: 0.0 5.0G Jan 6 08:30 mc-647149-big-Data.db
Max: 01/02/2017 Min: 01/01/2017 Estimated droppable tombstones: 0.0 4.7G Jan 6 08:26 mc-647148-big-Data.db
Max: 01/03/2017 Min: 01/02/2017 Estimated droppable tombstones: 0.0 3.7G Jan 6 08:18 mc-647147-big-Data.db
Max: 01/04/2017 Min: 01/03/2017 Estimated droppable tombstones: 0.0 3.7G Jan 6 08:16 mc-647146-big-Data.db
Max: 01/05/2017 Min: 01/04/2017 Estimated droppable tombstones: 0.0 5.6G Jan 6 08:29 mc-647145-big-Data.db
Max: 01/06/2017 Min: 01/05/2017 Estimated droppable tombstones: 0.0 4.0G Jan 7 00:28 mc-647404-big-Data.db
Max: 01/07/2017 Min: 01/06/2017 Estimated droppable tombstones: 0.0 5.5G Jan 8 00:25 mc-647749-big-Data.db
Max: 01/08/2017 Min: 01/07/2017 Estimated droppable tombstones: 0.0 5.1G Jan 9 00:39 mc-648048-big-Data.db
Max: 01/09/2017 Min: 01/08/2017 Estimated droppable tombstones: 0.0 1.7G Jan 9 12:38 mc-648130-big-Data.db
Max: 01/09/2017 Min: 01/09/2017 Estimated droppable tombstones: 0.0 112M Jan 9 12:51 mc-648135-big-Data.db
Max: 01/09/2017 Min: 01/09/2017 Estimated droppable tombstones: 0.0 31M Jan 9 12:55 mc-648136-big-Data.db
Max: 01/09/2017 Min: 01/09/2017 Estimated droppable tombstones: 0.0 31M Jan 9 13:00 mc-648137-big-Data.db
Max: 12/31/2016 Min: 12/30/2016 Estimated droppable tombstones: 0.0 5.2G Jan 6 08:31 mc-647150-big-Data.db
Nodetool getendpoints [keyspace] [table] [primary_key]
Having concluded from tablehistograms that we were reading from many SSTables and that the bloom filters were working correctly, we wanted to prove that for the problematic UUIDs (those showing increased latency) the partition really was spread across many SSTables, so we took a couple of UUIDs for further research. getendpoints is a useful nodetool command which tells you which machines hold a given primary key; before we can get to the SSTables, we have to go to a machine which holds our data.
[centos@cassandra-c-1 ~]$ nodetool getendpoints keyspace table 71cc1d9a-c161-485b-a2a7-ba0fbfb34bef;
cassandra-c-1
cassandra-a-2
cassandra-b-3
Great, now we know that our UUID lives on those 3 replicas, so we can log in to one of them and run the next nodetool command, getsstables, to see exactly which SSTables hold it.
Nodetool getsstables [keyspace] [table] [primary_key]
A really cool command: for each primary key it tells you how many, and which, SSTables on disk hold the data. Building on everything above, we wanted to prove that the UUIDs causing the biggest latency were spread across many SSTables on disk, so we ran it against our list of UUIDs with increased latency.
[centos@cassandra-c-1 ~]$ nodetool getsstables keyspace table 71cc1d9a-c161-485b-a2a7-ba0fbfb34bef;
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-571812-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-572698-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-572403-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-571804-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-571808-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-571807-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-571809-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-571805-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-573006-big-Data.db
/mnt/cassandra/data/data/keyspace/table-62f30431acf411e69a4ed7dd11246f8a/mc-573001-big-Data.db
This shows that this particular UUID really is spread across 10 SSTables. Since we use TWCS with a 1-day bucket, it means the UUID was written on each of those days, ended up in each daily SSTable, and every read of it has to be merged from all 10 of them.
Cqlsh tracing
Great, now we know we have a UUID spread across many SSTables. But what does this mean for our read path? The easiest way to see it is to fire up cqlsh and enable tracing. Tracing is a cool Cassandra feature which, when switched on, shows the full read path of each query (note that it is expensive, so turn it on only while debugging). Once TRACING ON is issued, every subsequent statement is traced, so we can run a select for the UUID we just proved to be in 10 SSTables and watch the read path merge data from 10 SSTables to produce the result.
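Here is a minimal sketch of such a session, assuming the partition key column is called user_id (the real schema is not shown in this post):
-- inside cqlsh on one of the replica nodes
TRACING ON;
SELECT * FROM keyspace.table WHERE user_id = 71cc1d9a-c161-485b-a2a7-ba0fbfb34bef;
-- the result set is followed by the full trace; turn tracing off again with TRACING OFF
The trace contains, among many other steps, a line like this one: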
Merging data from memtables and 10 sstables [SharedPool-Worker-2] | 2017-01-10 10:39:16.092003 | 10.200.254.23 | 1328
This line in the trace confirms exactly what we proved earlier: when we query the UUID that lives in 10 SSTables, the read has to merge data from all 10 of them.
sstable-tools
Browsing the internet, we stumbled upon sstable-tools, a collection of really cool and useful tools for working with SSTables. Andy Tolbert (a developer at DataStax) built it with the idea of eventually shipping it with Cassandra 4.0. Among other features, it provides a cqlsh-like shell, similar to the one Cassandra uses to query tables, but scoped to a single SSTable, so queries run only against that file. It is really useful when you want to figure out what is inside one particular file. It also has a describe command which gives output similar to sstablemetadata but in a nicer, more structured presentation: a histogram of cell counts, a histogram of partition sizes, the tombstone leaders, and the min and max timestamps. Really great when debugging tombstones and wide partitions.
Conclusion
In the text above we walked through our debugging process and proved that the data showing increased latency was spread across many SSTables, which directly hurt our read path. Ours is a latency-sensitive use case, and the compaction strategy we chose was supposed to narrow each disk seek down to a single file (the most recent one), but it did exactly the opposite: each read for a single primary key had to work through many SSTables, and therefore many disk seeks. This was a clear sign that we did not know our data access patterns as well as we thought, and that in reality we did not have a clean time-bucketed pattern at all. A better choice here is LeveledCompactionStrategy, since it handles data that is updated repeatedly much better and organizes SSTables into levels that guarantee a small number of SSTable reads per partition. After we switched, reads dropped to 1-3 SSTables each, which improved our performance significantly (we measured roughly a 2x latency improvement at the 99.9th percentile).
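The switch itself is a single schema change, roughly like the statement below (the sstable_size_in_mb value is shown with its common default of 160 MB purely for illustration, not necessarily the value we tuned):
ALTER TABLE keyspace.table
WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': '160'    -- illustrative; the usual default
};
Keep in mind that after such an ALTER, the existing SSTables are gradually recompacted into the new layout, which generates extra I/O for a while.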
How do you debug data in SSTables? Do you have any cool tools, not mentioned here, that you use to figure out where your data is stored on disk?