An aspiring Cassandra engineer-apprentice was fiddling with a Cassandra cluster trying to fetch the data he needed. For a while, he was receiving strange responses from the server. But after hacking his way through the CQL, he finally received the response he was looking for. He felt so proud... For a moment. Then, out of nowhere, a senior engineer appeared next to him. She was looking at the apprentice with a mix of fear, anger, and disappointment.
"What have you done?! Have I not told you not to use ALLOW FILTERING in the queries?!"
"But I just wanted to fetch one little piece of data, and the output of the query suggested that I should enable ALLOW FILTERING if I wanted the data. So I did."
"You fool! You doomed us all!" — but the senior engineer's voice was silenced by the slow-query alarms. She looked at the apprentice with terror in her eyes. The explosion of their data center echoed for centuries.
Pictured above is the explosion of the datacenter caused by 'ALLOW FILTERING', New Year's Eve 2017, colorized.
The Curious Case of ALLOW FILTERING and Cassandra’s Data Model
"And that's why you should never use 'ALLOW FILTERING' in your queries, Sofia." I had finished telling the ancient fable to my daughter. She was still playing with Lego.
"Never, father? Did you not tell me that only Sith deal in absolutes?" Sofia responded without stopping building a birthday cake (using the Lego Elves set, damn it!)
"You're right, my child." And then, I remembered the teachings.
There is one case in which 'ALLOW FILTERING' can be used in production: when the query is already restricted to a single partition.
But what is the performance of such a query? Would 'ALLOW FILTERING', even within one partition, still have a heavy impact on the performance? Let’s measure it!
"To the laboratory!" I yelled, excited. Sofia rolled her eyes and followed me, unenthusiastically, to our testing ground.
"In order to test the performance of the query with ALLOW FILTERING, I’ll spin up a three-node Cassandra cluster in the AWS," I started explaining patiently,
"Then, I’ll create the following table:
CREATE TABLE users_by_tag (
    tag text,
    last_name text,
    first_name text,
    id uuid,
    age int,
    email text,
    PRIMARY KEY ((tag), last_name, first_name, id)
);
Which will simulate the categorization of users into tags. 'tag' is the partition key, and the rows within a partition are sorted by the users' last_name and first_name (and then by id). So, for one tag, e.g. 'premium', there'll be many users. Then we'll compare the execution of the following queries:
SELECT * FROM users_by_tag WHERE tag = 'premium';
SELECT * FROM users_by_tag WHERE tag = 'premium' AND age > 35 ALLOW FILTERING;
Before we continue, a question: what would happen if we omitted ALLOW FILTERING in the second query, my little apprentice?" I asked and raised an eyebrow.
"Cassandra would not allow it." Sofia responded correctly.
"Exactly! Why?" I asked with pride in my voice.
"Because of how the data is stored on the disk; we can’t filter on 'age' property, because it isn’t part of the primary key."
"Exactly! Why?" I was on fire.
"Please move on, dad."
"Okay, okay, we’ll deep dive into the explanation some other time. Meanwhile, read this post."
While I was on fire, the cluster was up and running:
Note that I took the screenshot after I had filled in the data; that’s why there is ~370 MB of load.
I created a test keyspace with replication factor (RF) 3. RF=3 plays nicely with the QUORUM consistency level, because QUORUM, in that case, needs a response from two out of three replicas.
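Concretely, the keyspace looks like this (the same statement appears later in the Berserker bootstrap commands), with the quorum arithmetic spelled out:

CREATE KEYSPACE IF NOT EXISTS smartcat_test
  WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-central': 3};
-- with RF=3, QUORUM = floor(3/2) + 1 = 2, so two of the three replicas must respond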
"Now to fill it with data. Pass me the Ranger and Berserker tools, please... Or should I say R&B!" I yelled, excited, again.
"But can’t you just use the cassandra-stress tool, dad?"
"I could, but can cassandra-stress target particular partitions with the read queries? Do you enjoy reading a bunch of random strings from the tables? Can you declaratively describe your data set and pass the values around in the configuration files with cassandra-stress? Can you..."
"Okay, okay, I get it. Your friends at SmartCat and you created the most awesome load generator/test tool ever," she rolled her eyes again. I’ll have to do something about this eye-rolling behavior…
Anyway, a year ago, while performing tests and experiments, we were constantly wondering where to find test data and how to create large volumes of meaningful test data in order to measure the performance of the cluster. That’s why we developed and open sourced Ranger (data generator) and Berserker (load generator). Pull requests are welcome, by the way.
Unlike cassandra-stress, Ranger and Berserker generate sensible data in a flexible manner. This is what the generated data looks like:
I was too lazy to bother with the declaration of the email field, because that one is not important for this test. I should have made it more real-world-like, though.
We specified the tag to have this form:
tag: string('{}_{}', $tag_base, randomContentString(2, ['A'..'Z']))
Where tag_base is a random word from a predefined set of strings, declared like this:
tag_base: random(['regular', 'semi', 'drama', 'frequent', 'graceful','peaceful', 'nice', 'naughty', 'sudoers', 'male', 'female', 'elf', 'bot', 'lamp', 'linux', 'window', 'mac', 'buyer', 'lurker', 'joker', 'runner'])
With this, we can generate a real-world-like (but made-up) classification of users into groups/tags (e.g. 'regular') and subgroups (e.g. 'AA'), producing tags such as 'regular_AA' or 'linux_AB'.
"Look, we generated millions of rows, Sofia! Look, I say!"
Do not run count(*) in production; this is just for showing off.
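That show-off query is nothing more exotic than a plain (and expensive) full-table aggregate, something along the lines of:

-- forces a scan across all partitions on all nodes; fine for a test cluster, not for production
SELECT count(*) FROM smartcat_test.users_by_tag;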
Performance Test
"Okay, kiddo, let’s compare the performance of the queries."
SELECT * FROM users_by_tag WHERE tag = 'premium';
SELECT * FROM users_by_tag WHERE tag = 'premium' AND age > 35 ALLOW FILTERING;
Just as a reminder, this is the primary key:
PRIMARY KEY ((tag), last_name, first_name, id)
The first query fetches all the rows for the particular partition (particular partition — that sounds nice). Cassandra executes this query by linearly scanning the partition (tag = ‘premium’) on disk, scooping up the data from the starting offset and sending the results back to the coordinator node.
The second query has an additional step: keeping only the users whose ‘age’ value is greater than 35.
"How long it will take? Will it ruin the performance of the entire cluster? Can it be used in production? The suspense is killing me!"
Testing With TRACING ON
"Now, the actual testing. For starters, let’s run the queries from cqlsh with consistency level QUORUM and tracing on."
Running this query...
SELECT * FROM smartcat_test.users_by_tag WHERE tag = 'linux_AB';
...returned 497 rows. The rows were returned in five batches (up to 100 rows per batch), with each batch having its own tracing session, which looked like this:
As we can see, it took ~2.6 ms to fetch the first batch of data (100 results). The subsequent batches (screenshots not included, you’ll have to trust me on that one) took ~2.5 ms, ~3.2 ms, ~3 ms, and ~2.8 ms, respectively.
Now, let’s do the same for the query with ALLOW FILTERING:
SELECT * FROM smartcat_test.users_by_tag WHERE tag = 'linux_BB' AND age > 35 ALLOW FILTERING;
Again, a reminder about the primary key:
PRIMARY KEY ((tag), last_name, first_name, id)
The query returned 383 rows in 4 batches. The tracing sessions looked like this:
It took ~3.2 ms to fetch the first batch of data, and ~2.6 ms, ~2.9 ms, and ~2.9 ms for the remaining batches.
Therefore, we can conclude that the query with ALLOW FILTERING that searches only within one partition performed more or less the same as the query that fetches the whole partition.
"So, we’re done with that, right, Sofia?"
"Whatever, dad…"
"You’re right! Of course we’re not done! But we got the intuition about what we can expect. Now, let’s run a real benchmark. To the laboratory!"
"We’re already in the 'laboratory'/your home office..."
Real Performance Test
For a real performance test, we’ll run 10k requests with the ALLOW FILTERING query (within a partition) and 10k requests with the query that fetches the whole partition. As mentioned, we’ll use the Ranger and Berserker tools to declare the queries and run the load.
Fetching the Whole Partition With QUORUM Consistency: Performance Test 1
This is what the load generator config looks like:
load-generator-configuration:
  data-source-configuration-name: Ranger
  rate-generator-configuration-name: default
  worker-configuration-name: Cassandra
  metrics-reporter-configuration-name: SimpleConsoleReporter
  thread-count: 10
  queue-capacity: 100000
data-source-configuration:
  values:
    tag_base: random(['regular', 'semi', 'drama', 'frequent', 'graceful','peaceful', 'nice', 'naughty', 'sudoers', 'male', 'female', 'elf', 'bot', 'lamp', 'linux', 'window', 'mac', 'buyer', 'lurker', 'joker', 'runner'])
    tag: string('{}_{}', $tag_base, randomContentString(2, ['A'..'Z']))
    statement:
      consistencyLevel: QUORUM
      query: string("SELECT * FROM users_by_tag WHERE tag = '{}'", $tag)
  output: $statement
rate-generator-configuration:
  rates:
    r: 100
  output: $r
worker-configuration:
  connection-points: 172.31.0.121:9042
  keyspace: smartcat_test
  async: true
  bootstrap-commands:
    - "CREATE KEYSPACE IF NOT EXISTS smartcat_test WITH replication = {'class': 'NetworkTopologyStrategy','eu-central': 3};"
    - USE smartcat_test;
    - CREATE TABLE IF NOT EXISTS users_by_tag (tag text, last_name text, first_name text, age int, email text, id uuid, PRIMARY KEY ((tag), last_name, first_name, id));
metrics-reporter-configuration:
  domain: berserker
This runs a load of ~100 requests per second containing the following query:
"SELECT * FROM users_by_tag WHERE tag = '{}'", $tag
After a little over a minute and a half (10,000 requests at ~100 requests per second takes about 100 seconds), we have our 10k requests.
Results: Performance Test 1
And the results after 10k requests looked like this:
-- Histograms ------------------------------------------------------------------
io.smartcat.berserker.responseTime
count = 9998
min = 225850
max = 43595325
mean = 5726809.75
stddev = 3659994.91
median = 5519473.00
75% <= 8245024.00
95% <= 10283130.00
98% <= 10813927.00
99% <= 11000156.00
99.9% <= 43595325.00
As we can see (the response times in the histogram are in nanoseconds), 99.9% of the requests were served below ~43.59 ms. If we take a look at the lower percentiles (98%, 99%), we see that those requests were served below ~11 ms.
Full reports and config files can be viewed here.
Now let’s do the same for the query with ALLOW FILTERING.
Filter the Partition Data With ALLOW FILTERING, QUORUM Consistency: Performance Test 2
The load generator configuration is the same, except that we changed the query part so that it filters on the ‘age’ property:
query: string("SELECT * FROM users_by_tag WHERE tag = '{}' AND age > 35 ALLOW FILTERING;", $tag)
Results: Performance Test 2
The results, after 10k requests, looked like this (pasting only the response time part):
-- Histograms ------------------------------------------------------------------
io.smartcat.berserker.responseTime
count = 9998
min = 245874
max = 14871343
mean = 5689765.53
stddev = 3187364.28
median = 5588926.00
75% <= 8269086.00
95% <= 10278023.00
98% <= 10448252.00
99% <= 10953691.00
99.9% <= 14871343.00
99.9% of the requests were served below ~14.87 ms. Lower percentiles look quite similar to the previous results, where we tested fetching the whole partition.
The bottom line is, in this particular test setup, the ALLOW FILTERING query performed slightly better than the query that fetched the whole partition.
In subsequent test runs, I got both better and worse results for the 99.9th percentile of the query that fetches the whole partition. So, take the results with a grain of salt. The main point is:
It is safe to run ALLOW FILTERING queries within one partition
These results are for consistency level QUORUM.
Performance Test: Conclusion
With this kind of load, and a data model with over 400 rows per partition, the query with ALLOW FILTERING performed more or less the same (even slightly better!) as the queries that fetched the whole partition.
Just for fun, I ran additional tests with consistency level ONE. Again, the ALLOW FILTERING query performed slightly better, perhaps because there were fewer rows to send over the network (the rows not matching 'age' > 35 are filtered out on the cluster side).
You can see configurations and report logs here, so you can reproduce the results if you want.
If you have more ideas about what could be the reason for this, please comment. I like this kind of troubleshooting and would be happy to discuss it in the comments section.
One more side note: in this test, I set dclocal_read_repair_chance to 0.0 (the property will probably be deprecated in Cassandra 4.0 anyway). Before turning off the read repair chance, the results for fetching all the rows within a partition were slightly worse, because read repair was triggered every once in a while even though the data was fully synced.
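For reference, this is roughly how that option is set per table in CQL for Cassandra 3.x (assuming the table option name dclocal_read_repair_chance):

-- disable datacenter-local read repair for this table
ALTER TABLE smartcat_test.users_by_tag WITH dclocal_read_repair_chance = 0.0;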
Test configuration:
- Apache Cassandra 3.0.14
- Three nodes on large instances, one per availability zone (three availability zones)
- CentOS 7
- Default cassandra.yaml (except Ec2Snitch is used)
- Berserker 0.8