The cassandra-stress tool is a powerful way to benchmark Cassandra performance. It allows quite sophisticated specification of data and load profiles to run against almost any table definition you can create in Cassandra. We’ve previously published detailed blog posts on the use of cassandra-stress: Part 1, Part 2 and Part 3.
One significant limitation of cassandra-stress has been that it can only execute operations against one table at a time. You could work around that by running multiple instances of cassandra-stress, but that was not ideal.
I recently submitted a patch for Apache Cassandra that now enables multiple tables to be stressed simultaneously with cassandra-stress (https://issues.apache.org/jira/browse/CASSANDRA-8780). This blog post provides some more explanation of how to use this new feature. (While the feature won’t hit release until Cassandra 4.0, it’s pretty easy to download the code and build cassandra-stress yourself if you want to use it in the meantime.)
The three core changes you need to know to stress multiple tables in one run are as follows:
- The profile= command line argument now accepts a comma-delimited list of profile yaml files.
- Profile yaml files can now optionally contain a specname attribute, which provides a way to identify the profile. If it’s not specified, the specname is inferred as <keyspace>.<table>.
- When specifying operation counts using the ops= command line argument, you can prefix them with a specname to refer to an operation from a particular profile (e.g. spec1.insert). If you don’t specify a specname, the specname from the first listed yaml file will be inferred.
The inferred specnames mean that existing single yaml file cassandra-stress configurations will continue to run without requiring any change.
The following provides an example of how this can be used in practice:
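A minimal sketch of such an invocation, assuming two profile files named table1.yaml and table2.yaml in the current directory (hypothetical file names) with specnames t1 and t2. The 1:1:8 weights are an assumption chosen to match the 10%/10%/80% split described later in this post. The snippet only assembles and prints the command so it can be reviewed before running:

```shell
# Assumed file names; the specnames t1 and t2 must match the
# specname attributes inside the corresponding yaml files.
PROFILES="table1.yaml,table2.yaml"

# Relative weights per operation: 1 + 1 + 8 = 10 parts in total.
OPS="ops(t1.insert=1,t1.single_read=1,t2.insert=8)"

# Assemble the command line and print it for review.
STRESS_CMD="cassandra-stress user profile=${PROFILES} ${OPS}"
echo "${STRESS_CMD}"
```

If you type the command directly at a shell prompt rather than building it in a variable, quote or escape the parentheses in the ops argument so the shell does not interpret them.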
(Add all the other standard arguments you would pass to a cassandra-stress run.)
The contents of table1.yaml might look something like:
# Keyspace name and create CQL
#
specname: t1
keyspace: stressexample
keyspace_definition: |
  CREATE KEYSPACE stressexample WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
#
# Table name and create CQL
#
table: test1
table_definition: |
  CREATE TABLE test1 (
    pk int,
    val text,
    PRIMARY KEY (pk)
  )
columnspec:
  - name: pk
    size: fixed(64)
    population: seq(1..100000)
#
# Specs for insert queries
#
insert:
  partitions: fixed(1)      # 1 partition per batch
  batchtype: UNLOGGED       # use unlogged batches
  select: fixed(10)/10      # no chance of skipping a row when generating inserts
#
# Read queries to run against the schema
#
queries:
  single_read:
    cql: select * from test1 where pk = ?
    fields: samerow
table2.yaml would be similar, but with specname: t2.
When you run this command, cassandra-stress will first ensure that the keyspaces and tables specified in each of the yaml files exist, creating them itself if necessary. It will then execute operations in the ratios specified in the ops argument – in this case, 10% inserts to table test1 as specified in table1.yaml, 10% reads using the single_read query definition from table1.yaml, and 80% inserts to the table specified in table2.yaml.
One interesting feature is that the multiple yaml files can all reference the same table in your Cassandra cluster. I can see this being useful, for instance, where you want to simulate one read/write pattern against the bulk of your partitions while simultaneously simulating a different pattern against a small number of hot partitions. One thing to be aware of with this approach is that the data overlap between two different specs addressing the same table is hard to predict, so it may bear close inspection if it’s important to your scenario.
I really hope this feature is useful to the Cassandra community. Let me know in the comments section (or the Cassandra user mailing list) if you have any queries or suggestions.