9/13/2018

Testing Multiple Tables with Cassandra Stress - Instaclustr

by John Doe

The cassandra-stress tool is a powerful tool for benchmarking Cassandra performance. It allows quite sophisticated specification of data and load profiles to run against almost any table definition you can create in Cassandra. We’ve previously published detailed blog posts on the use of cassandra-stress: Part 1, Part 2 and Part 3.

One significant limitation of cassandra-stress has been that it is only able to execute operations against one table at a time. You could work around that by running multiple instances of cassandra-stress, but that was not ideal.

I recently submitted a patch for Apache Cassandra that now enables multiple tables to be stressed simultaneously with cassandra-stress (https://issues.apache.org/jira/browse/CASSANDRA-8780). This blog post provides some more explanation of how to use this new feature. (While the feature won’t hit release until Cassandra 4.0, it’s pretty easy to download the code and build cassandra-stress yourself if you want to use it in the meantime.)

The three core changes you need to know to stress multiple tables in one run are as follows:

  1. The profile= command line argument now accepts a comma-delimited list of profile yaml files.
  2. Profile yaml files can now optionally contain a specname attribute, which provides a way to identify the profile. If it’s not specified, the specname is inferred as <keyspace>.<table>.
  3. When specifying operation counts using the ops= command line argument, you can prefix them with a specname to refer to an operation from a particular profile (e.g. spec1.insert). If you don’t specify a specname, the specname from the first listed yaml file will be inferred.
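The naming rules in the list above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the code cassandra-stress itself uses: the function names infer_specname and resolve_op are hypothetical, profiles are modelled as plain dictionaries, and splitting on the last dot is an assumption made so that inferred names of the form <keyspace>.<table> still work as prefixes.

```python
from typing import Dict, List, Tuple


def infer_specname(profile: Dict[str, str]) -> str:
    """Return the explicit specname, else fall back to <keyspace>.<table>."""
    return profile.get("specname") or f"{profile['keyspace']}.{profile['table']}"


def resolve_op(op: str, profiles: List[Dict[str, str]]) -> Tuple[str, str]:
    """Map an ops entry such as 't1.insert' or 'insert' to (specname, operation)."""
    specnames = [infer_specname(p) for p in profiles]
    # Assumption: the operation name is whatever follows the last dot,
    # so an inferred specname like 'ks.table' can still be used as a prefix.
    prefix, _, operation = op.rpartition(".")
    if not prefix:
        # No prefix given: fall back to the first listed profile.
        return specnames[0], operation
    if prefix not in specnames:
        raise ValueError(f"unknown specname: {prefix!r}")
    return prefix, operation


# Hypothetical profiles: the first names itself, the second relies on inference.
profiles = [
    {"specname": "t1", "keyspace": "stressexample", "table": "test1"},
    {"keyspace": "stressexample", "table": "test2"},
]

print(resolve_op("insert", profiles))
print(resolve_op("t1.single_read", profiles))
print(resolve_op("stressexample.test2.insert", profiles))
```

The first call shows why an existing single-profile configuration keeps working: an unprefixed operation simply lands on the first profile.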
The inferred specnames mean that existing single yaml file cassandra-stress configurations will continue to run without requiring any change.

The following provides an example of how this can be used in practice:

(add all the other standard arguments you would pass to a cassandra-stress run).

table1.yaml might look something like:

    #
    # Keyspace name and create CQL
    #
    specname: t1
    keyspace: stressexample
    keyspace_definition: |
      CREATE KEYSPACE stressexample WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
    #
    # Table name and create CQL
    #
    table: test1
    table_definition: |
      CREATE TABLE test1 (
        pk int,
        val text,
        PRIMARY KEY (pk)
      )
    columnspec:
      - name: pk
        size: fixed(64)
        population: seq(1..100000)
    #
    # Specs for insert queries
    #
    insert:
      partitions: fixed(1) # 1 partition per batch
      batchtype: UNLOGGED # use unlogged batches
      select: fixed(10)/10 # no chance of skipping a row when generating inserts
    #
    # Read queries to run against the schema
    #
    queries:
      single_read:
        cql: select * from test1 where pk = ?
        fields: samerow

table2.yaml would be similar but with specname set to t2.

When you run this command, cassandra-stress will first ensure that the keyspaces and tables specified in each of the yaml files exist, creating them itself if necessary. It will then execute operations in the ratios specified in the ops argument: in this case, 10% inserts to table test1 as specified in table1.yaml, 10% reads using the single_read query definition and 80% inserts in the table specified in table2.yaml.

One interesting feature is that the multiple yaml files can all reference the same table in your Cassandra cluster. I can see this being useful, for instance, where you want to simulate one read/write pattern against the bulk of your partitions while simultaneously simulating a different pattern against a small number of hot partitions.
One thing to be aware of with this approach is that the data overlap between two different specs addressing the same table is hard to predict and so may bear close inspection if it’s important to your scenario.

I really hope this feature is useful to the Cassandra community. Let me know in the comments section (or the Cassandra user mailing list) if you have any queries or suggestions.
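To make the 10% / 10% / 80% split described earlier concrete, here is a minimal Python sketch that assembles a two-profile argument list and works out each operation's share. The exact command line is not reproduced above, so treat the details as assumptions: the weights 1, 1 and 8 are simply chosen to match the stated ratios, the helper names build_stress_args and op_shares are hypothetical, and the parenthesised ops(...) form follows the usual cassandra-stress user-mode syntax. The sketch only builds and prints the command; it does not run it, and you would still add your other standard arguments.

```python
import shlex
from typing import Dict, List


def build_stress_args(profiles: List[str], op_weights: Dict[str, int]) -> List[str]:
    """Assemble a cassandra-stress 'user' invocation for several profiles."""
    ops = ",".join(f"{name}={weight}" for name, weight in op_weights.items())
    return [
        "cassandra-stress",
        "user",
        "profile=" + ",".join(profiles),  # comma-delimited list of yaml files
        f"ops({ops})",                    # specname-prefixed operation weights
    ]


def op_shares(op_weights: Dict[str, int]) -> Dict[str, float]:
    """Convert relative weights into the percentage of operations each receives."""
    total = sum(op_weights.values())
    return {name: 100.0 * weight / total for name, weight in op_weights.items()}


# Assumed weights matching the ratios in the text: 10% / 10% / 80%.
weights = {"t1.insert": 1, "t1.single_read": 1, "t2.insert": 8}
args = build_stress_args(["table1.yaml", "table2.yaml"], weights)

# shlex.join quotes the ops(...) argument so the shell does not interpret the parentheses.
print(shlex.join(args))
print(op_shares(weights))
```

Because the weights are relative, ops(t1.insert=10,t1.single_read=10,t2.insert=80) would describe the same mix.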

