
Introduction
This post was written by Gianluca Borello, one of Sysdig's engineers, and was originally featured on Planet Cassandra.

As far as databases go, I'm a huge fan of Cassandra: it's an incredibly powerful and flexible database that can ingest a massive amount of data while scaling across an arbitrary number of nodes. For these reasons, my team uses it very often for our internal applications. But as with any piece of software, Cassandra has its own quirks and habits that you need to understand in order to develop with it effectively. In this article, I'll show you a Cassandra performance issue that we recently dealt with: how we spotted the problem, what kind of troubleshooting we did to better understand it, and how we eventually solved it.

The problem

One of our internal applications requires storing, and later processing, several thousand streams, where each stream consists of about a dozen binary blobs arriving periodically. Since we have many streams, and each blob can be fairly big (in the range of 10 KB), we decided to use Cassandra 2.1 (a very stable version at the time of writing) and this very simple table:

CREATE TABLE streams (
    stream_id int,
    timestamp int,
    column0 blob,
    column1 blob,
    column2 blob,
    ...
    column10 blob,
    primary key (stream_id, timestamp)
);

In this data model, all the blobs for a specific timestamp (10 in the above example) are stored in the same row, as separate columns. This allowed us to write very simple application code, consisting essentially of a single write per stream during every period. Our typical read use case requires only one or two blobs at any given time, and with this data model we have the flexibility of querying an arbitrary portion of the data with a single query, like this:

SELECT columnX from streams where stream_id=STREAM

For a while this schema worked well for us, and the response times we got from the database were very good.

Recently, however, we noticed deteriorated performance when querying streams containing particularly large blobs. Intuitively this seems reasonable, since the data to be processed is larger, but something else felt strange: despite the blobs being bigger on average, the specific ones we were retrieving in our queries were always roughly the same size as before.

In other words, it seemed as if Cassandra was always processing all 10 columns (including the large ones) even though we were asking for a single, small column, thus causing degraded response times. This hypothesis seemed hard to believe at first, because Cassandra stores every single column separately, and there's heavy indexing that allows it to efficiently look up specific columns.

To validate our hypothesis, we wrote a separate test: in a table of N columns like ours, querying one single column should always take roughly the same time, regardless of the number and size of the other columns. With a little script, we got these results:

$ python ./cassandra_benchmark_1.py
Response time for querying a single column on a large table (column size 100 KB):
10 columns: 185 ms
20 columns: 400 ms
30 columns: 613 ms
40 columns: 668 ms
50 columns: 800 ms
60 columns: 1013 ms
70 columns: 1205 ms
80 columns: 1376 ms
90 columns: 1604 ms
100 columns: 1681 ms

We couldn't have been more wrong! The test proved the exact opposite of our assumption: the response time was directly proportional to the number of columns in the table, even though the query asked for just one of them!
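
For reference, here is a minimal sketch of what this kind of benchmark can look like using the Python cassandra-driver. The keyspace, table names and row counts below are illustrative (this is not the exact script we ran), but the structure is the same: create tables with a growing number of blob columns, fill them, and time a SELECT of a single column.

import time
from cassandra.cluster import Cluster

# Illustrative benchmark sketch: connect to a local node and a pre-created
# 'benchmarks' keyspace (both assumptions, adjust to your setup).
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('benchmarks')

BLOB = b'x' * (100 * 1024)   # ~100 KB per column, as in the results above
ROWS = 100                   # rows per partition (illustrative)

for n_cols in range(10, 101, 10):
    table = f'test_{n_cols}'
    cols = ', '.join(f'column{i} blob' for i in range(n_cols))
    session.execute(f'CREATE TABLE IF NOT EXISTS {table} '
                    f'(stream_id int, timestamp int, {cols}, '
                    f'PRIMARY KEY (stream_id, timestamp))')

    # Populate every column of every row with a ~100 KB blob.
    names = ', '.join(f'column{i}' for i in range(n_cols))
    marks = ', '.join(['?'] * n_cols)
    insert = session.prepare(f'INSERT INTO {table} (stream_id, timestamp, {names}) '
                             f'VALUES (?, ?, {marks})')
    for ts in range(ROWS):
        session.execute(insert, [1, ts] + [BLOB] * n_cols)

    # In a real test you would also flush memtables to disk between the write
    # and read phases (e.g. with "nodetool flush") so the reads hit SSTables.

    # Time a query that asks for one single column out of n_cols.
    start = time.perf_counter()
    rows = list(session.execute(f'SELECT column7 FROM {table} WHERE stream_id = 1'))
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f'{n_cols} columns: {elapsed_ms:.0f} ms')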
Digging into Cassandra

In order to understand the problem better, I used sysdig, an open source troubleshooting tool. If you're not familiar with sysdig, it can capture system state and activity from a running Linux instance, and then save, filter and analyze it. Think of sysdig as strace + tcpdump + htop + iftop + lsof in one.

Back to our story: I took a sysdig trace file while executing the same query from the previous test on a table with 100 columns:

SELECT column7 from streams where stream_id=1

This query returns a very small amount of data (around 10 MB) compared to the whole data set, as sysdig can easily tell me by looking at the network activity generated by the database:

$ sysdig -r trace.scap -c topprocs_net
Bytes               Process             PID
-------------------------------------------------------------------------
9.82M               java                34323

Despite this, the query takes almost 4 seconds to run, which is far longer than we're willing to wait in a scenario like this. Let's take a look at Cassandra's file I/O activity while serving this single query:

$ sysdig -r trace.scap -c topfiles_bytes
Bytes               Filename
--------------------------------------------------------------------------------
971.04M             /var/lib/cassandra/data/benchmarks/test-23182450d5c011e5acecb7882d261790/benchmarks-test-ka-130-Data.db
538.80KB            /var/lib/cassandra/data/benchmarks/test-23182450d5c011e5acecb7882d261790/benchmarks-test-ka-130-Index.db
…

Wow, Cassandra seems to have read almost 1 GB from the file storing the data for my table, pretty much the entire size of the table:

$ du -hs /var/lib/cassandra/data/benchmarks/test-23182450d5c011e5acecb7882d261790/*
...
972M    /var/lib/cassandra/data/benchmarks/test-23182450d5c011e5acecb7882d261790/benchmarks-test-ka-130-Data.db
…

This means that Cassandra essentially read the entire file, which probably explains why the response time depends on the total number of columns in the table.

To take a deeper look, I used the spectrogram view in csysdig on the I/O events over the duration of the trace file, so that we can visualize the latency and frequency of Cassandra's I/O operations:

$ csysdig -d 100 -r trace.scap -v spectro_file
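
As an aside, if you want to reproduce this kind of capture yourself, a small wrapper like the following can start sysdig, run the query, and then stop the capture, so that the trace file contains mostly the I/O caused by that single query. This is just a sketch: it assumes a local single-node cluster, the Python cassandra-driver, and the same keyspace and table names used above, and sysdig typically needs root privileges to capture.

import subprocess
import time
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('benchmarks')   # keyspace name is illustrative

# Capture only the Cassandra process' events to keep the trace file small.
# (sysdig usually needs to run as root.)
capture = subprocess.Popen(['sysdig', '-w', 'trace.scap', 'proc.name=java'])
time.sleep(2)  # give sysdig a moment to start capturing

# Run the query under investigation, materializing the full result set.
rows = list(session.execute('SELECT column7 FROM streams WHERE stream_id = 1'))

capture.terminate()
capture.wait()
print(f'captured {len(rows)} rows, trace written to trace.scap')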



The solution

Based on what we observed, we changed our data model so that each blob is stored in its own, much smaller row, keyed by its column number:

CREATE TABLE streams (
    stream_id int,
    column_no int,
    timestamp int,
    column blob,
    primary key (stream_id, column_no, timestamp)
);

With this new schema all our rows are much smaller, while still allowing us to efficiently query a single portion of the data, with a query such as:

SELECT * from streams where stream_id=1 and column_no=7

This approach turned out to be quite effective: the query that took 4 seconds with the old data model in the extreme test case now takes just around 100 ms on the same data set! It's also interesting to analyze the new scenario with sysdig and check Cassandra's I/O while serving the request:

$ sysdig -r trace.scap -c topfiles_bytes
Bytes               Filename
--------------------------------------------------------------------------------
9.85M               /var/lib/cassandra/data/benchmarks/test-55d49260d5cb11e5acecb7882d261790/benchmarks-test-ka-16-Data.db
…

Just 10 MB, exactly the same size as the expected response. We can also use sysdig to answer another question: how did Cassandra know how to efficiently read just the data it needed out of more than 1 GB? We can of course look at the system calls made by the database process on the file:

$ sysdig -r trace.scap fd.filename=benchmarks-test-ka-16-Data.db
11285 15:13:40.896875243 1 java (34482) < open fd=59(/var/lib/cassandra/data/benchmarks/test-55d49260d5cb11e5acecb7882d261790/benchmarks-test-ka-16-Data.db) name=/var/lib/cassandra/data/benchmarks/test-55d49260d5cb11e5acecb7882d261790/benchmarks-test-ka-16-Data.db flags=1(O_RDONLY) mode=0
11295 15:13:40.896926004 1 java (34482) > lseek fd=59(/var/lib/cassandra/data/benchmarks/test-55d49260d5cb11e5acecb7882d261790/benchmarks-test-ka-16-Data.db) offset=71986679 whence=0(SEEK_SET)
11296 15:13:40.896926200 1 java (34482) < lseek res=71986679

Here we see Cassandra opening the table file, but notice how there's immediately an lseek operation that skips about 70 MB of data in a single step, by setting the file descriptor's offset to 71986679 with SEEK_SET. This is essentially how typical "file indexing" looks when observed from the system call point of view: Cassandra relies heavily on data structures that index the contents of the table, so that it can move quickly and efficiently to arbitrary, meaningful locations. In this case, the index contained the information that the columns with "stream_id=1" and "column_no=7" started at offset 71986679.

11400 15:13:40.898199496 1 java (34482) > read fd=59(/var/lib/cassandra/data/benchmarks/test-55d49260d5cb11e5acecb7882d261790/benchmarks-test-ka-16-Data.db) size=65798
11477 15:13:40.899058641 1 java (34482) < read res=65798 data=................................................................................

Right after jumping to the correct position in the file, we see normal sequential reads of 64 KB that bring the data into memory so it can be processed. This loop of jumping to the right position and reading from it continues until all the data (10 MB) has been read.

It's also quite interesting to see how this second case looks in a spectrogram:
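
To make that lseek-then-read pattern a bit more concrete, here is a tiny Python illustration of how index-guided reads work in general: an index maps a key to a byte offset in the data file, so the reader can seek straight to it and then read sequentially. This is only a simplified sketch of the access pattern (the index dictionary, offsets and lengths are made up), not Cassandra's actual on-disk format.

# Simplified illustration of an index-guided read: an index maps a key to a
# byte offset in the data file, so the reader can skip everything before it
# with a single seek and then read sequentially.
CHUNK = 64 * 1024                              # ~64 KB sequential reads, as in the trace

# Hypothetical index: (stream_id, column_no) -> (byte offset, bytes to read)
index = {(1, 7): (71986679, 10 * 1024 * 1024)}

def read_slice(path, key):
    offset, length = index[key]
    data = b''
    with open(path, 'rb') as f:
        f.seek(offset)                         # shows up as lseek(..., SEEK_SET) in sysdig
        while len(data) < length:
            chunk = f.read(min(CHUNK, length - len(data)))
            if not chunk:                      # stop at end of file
                break
            data += chunk
    return data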

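To wrap up, here is roughly what the application code looks like with the new schema, again using the Python cassandra-driver. The keyspace name and helper functions below are illustrative; the point is that a write now becomes one small insert per blob, and a read only touches the rows of the requested column.

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('benchmarks')   # keyspace name is illustrative

# With the new schema every blob becomes its own row, so a stream with N blobs
# now results in N small writes per period instead of one wide write.
insert = session.prepare(
    'INSERT INTO streams (stream_id, column_no, timestamp, column) '
    'VALUES (?, ?, ?, ?)')

def write_period(stream_id, timestamp, blobs):
    """blobs is a list of bytes objects, one per column_no."""
    for column_no, blob in enumerate(blobs):
        session.execute(insert, (stream_id, column_no, timestamp, blob))

# Reading one blob stream now only touches the rows for that column_no.
select = session.prepare(
    'SELECT timestamp, column FROM streams '
    'WHERE stream_id = ? AND column_no = ?')

def read_column(stream_id, column_no):
    return list(session.execute(select, (stream_id, column_no)))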