In the previous two posts of this series (Part 1 and Part 2) I covered some of the basic commands of cassandra-stress. In this post I will start looking at the use of the stress YAML file for more advanced stress scenarios, particularly where you want to run stress against a schema that matches one you are planning to use for your application.
It’s worth noting up front that cassandra-stress with a YAML file uses a significantly (80%?) different set of code to the standard read/write/mixed commands. So, some assumptions and learnings from the standard commands won’t hold for YAML-driven stress. To cite one example, when running based on YAML, cassandra-stress does not validate that data returned from a select has the expected values, as it does with read or mixed.
For this article, I’ll reference the following YAML specification file:
Before explaining the contents here, let’s see what happens when we run it with the following simple scenario:
cassandra-stress user profile=file:///eg-files/stressprofilemixed.yaml no-warmup ops(insert=1) n=100 -rate threads=1 -node x.x.x.x
After running this on an empty cluster, I ran select count(*) from eventsrawtest; and the result was 345 rows – probably not what you would have guessed. Here’s how cassandra-stress gets to that:
- n=100 counts number of insert batches, not number of individual insert operations
- Each batch will contain 1 partition’s data (due to the partitions=fixed(1) setting) and all 15 of the rows in the partition. There are 15 rows in every partition because the single clustering key (time) has a cluster setting of fixed(15). All the rows in the partition will be included in the batch due to the select: fixed(10)/10 setting (i.e. changing this to, say, fixed(5)/10 would result in half the rows from the partition being included in any given batch).
- 100 batches of 15 rows each gets you to 1500 rows, so how did we end up with 345? This is due (primarily, in this case) to the relatively small range of potential values for bucket_time. This results in a high overlap in the partition key values generated by the uniform distributions, so many batches write to partitions that already exist. To demonstrate, changing the population of bucket_time to uniform(1..1288) results in 540 rows. In most cases, you want to initially insert data with no overlap to build up a base data set for testing. To facilitate this, I’ve recently submitted a cassandra-stress enhancement that provides sequential generation of seed values, the same as used with the write command (https://issues.apache.org/jira/browse/CASSANDRA-12490). Changing the uniform() distribution to seq() results in the expected 1500 rows being inserted by this command.
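The effect of overlapping partition keys can be sketched with a small simulation. This is only an illustration, not how cassandra-stress generates data internally: KEY_RANGE is a hypothetical number of distinct partition keys, and the sketch assumes that a repeated key regenerates the same 15 rows, so later batches overwrite earlier ones instead of adding rows. It is not expected to reproduce the 345 figure above.

```python
import random

ROWS_PER_PARTITION = 15  # cluster: fixed(15), as in the post
BATCHES = 100            # n=100, one partition per batch
KEY_RANGE = 30           # hypothetical count of distinct partition keys

def rows_after_inserts(keys):
    """Rows left in the table if every distinct key owns one full partition."""
    return len(set(keys)) * ROWS_PER_PARTITION

rng = random.Random(42)

# Uniform seeds over a small range: many batches hit an existing partition.
uniform_keys = [rng.randint(1, KEY_RANGE) for _ in range(BATCHES)]

# Sequential seeds: every batch gets a partition key of its own.
sequential_keys = list(range(1, BATCHES + 1))

print("uniform keys   :", rows_after_inserts(uniform_keys))
print("sequential keys:", rows_after_inserts(sequential_keys))
```

With sequential keys the count is always 100 * 15 = 1500; with uniform keys it can never exceed KEY_RANGE * 15, however many batches are run.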
Let’s look at some of the other column settings:
- population – determines the distribution of seed values used in the random data generation. By controlling the distribution of the seed values you control the distribution of the actual inserted values. So, for example, uniform(1..100) will allow for up to 100 different values, each with the same chance of being selected. gaussian(1..100) will also allow for up to 100 different values but, as they follow a gaussian (otherwise known as normal or bell-curve) distribution, the values around the middle will have a much higher chance of being selected than the values at the extremes (so some values are likely to be repeated often and others will occur very infrequently).
- size – determines the size (length in bytes) of the values created for the field.
- cluster – only applies to clustering columns; specifies the number of values for the column appearing in a single partition. The maximum number of rows in a partition is therefore the product of the maximum number of values of each clustering column (e.g. max(col1) * max(col2) * max(col3)).
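To get a feel for how the population setting shapes the seed values, here is a rough Python sketch comparing a uniform and a gaussian draw over 1..100. The standard deviation used for the gaussian case is an assumption made for illustration; check the cassandra-stress help output for the spread it actually uses.

```python
import random

LOW, HIGH = 1, 100
SAMPLES = 10_000

def uniform_seed(rng):
    """Every value in LOW..HIGH is equally likely."""
    return rng.randint(LOW, HIGH)

def gaussian_seed(rng):
    """Values cluster around the middle of LOW..HIGH."""
    mean = (LOW + HIGH) / 2
    stdev = (HIGH - LOW) / 6  # assumed spread, for illustration only
    value = round(rng.gauss(mean, stdev))
    return min(HIGH, max(LOW, value))  # clamp to the allowed range

def share_in_middle(draw, rng):
    """Fraction of samples that land in the middle fifth of the range (41..60)."""
    hits = sum(1 for _ in range(SAMPLES) if 41 <= draw(rng) <= 60)
    return hits / SAMPLES

rng = random.Random(1)
uniform_share = share_in_middle(uniform_seed, rng)
gaussian_share = share_in_middle(gaussian_seed, rng)
print(f"uniform : {uniform_share:.2f} of seeds fall in 41..60")
print(f"gaussian: {gaussian_share:.2f} of seeds fall in 41..60")
```

The uniform share should sit close to 0.20 (20 of 100 values), while the gaussian share is noticeably higher, which is what leads to some values being repeated far more often than others.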
We covered most of the insert settings in the introductory points but here’s a recap:
- partitions: the number of different partitions to include in each generated insert batch. Once a partition is chosen for inclusion in a batch, all rows in the partition become eligible for inclusion and are then filtered according to the select setting. Using uniform(1..5) would result in each batch containing between 1 and 5 partitions’ worth of data (with an equal chance of each number in the range).
- batchtype: logged or unlogged – determines the Cassandra batch type to use.
- select: determines the portion of rows from a partition to select (at random) for inclusion in a particular batch. So, for example, fixed(5)/10 would include 50% of rows from the selected partition in each batch. uniform(1..10)/10 would result in between 10% and 100% of rows in the partition being included in the batch, with a different select percentage being randomly picked for each partition in each batch.
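The select ratio can be illustrated with a short sketch. The function name and the 20-row partition are hypothetical, chosen so that every tenth is a whole number of rows; how cassandra-stress rounds when the ratio does not divide evenly is not modelled here.

```python
import random

def rows_for_batch(partition_rows, numerator, denominator, rng):
    """Pick numerator/denominator of a partition's rows at random for one batch."""
    count = len(partition_rows) * numerator // denominator
    return rng.sample(partition_rows, count)

rng = random.Random(0)
partition = list(range(20))  # hypothetical partition with 20 rows

# select: fixed(10)/10 -> every row in the partition goes into the batch
all_rows = rows_for_batch(partition, 10, 10, rng)

# select: fixed(5)/10 -> half of the rows, chosen at random
half_rows = rows_for_batch(partition, 5, 10, rng)

# select: uniform(1..10)/10 -> a fresh ratio between 10% and 100% each time
some_rows = rows_for_batch(partition, rng.randint(1, 10), 10, rng)

print(len(all_rows), len(half_rows), len(some_rows))
```

The first two counts are 20 and 10; the third varies between 2 and 20 depending on the ratio drawn for that partition.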
The final section of the YAML file that bears some explanation is the queries section. For each query, you specify:
- A name for the query (pull-for-rollup, get-a-value), which is used to refer to the query when specifying the mix of operations on the cassandra-stress command line.
- cql – The actual query with ? characters where values from the population will be substituted in.
- fields – either samerow or multirow. For samerow, the key values to use for the select will be picked at random (following the same general population rules as for insert) from the list of row keys that has been generated for inserting. For multirow, each of the column values making up the key will be independently randomly selected, so there is a chance of generating keys for the selection parameters that don’t exist in the set of data that will/could be inserted according to the population settings.
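The difference between samerow and multirow can be sketched as follows. This is a simplified model, not the cassandra-stress implementation: the two-column keys and the value ranges are hypothetical, and multirow is modelled by drawing each key column independently from the values seen for that column.

```python
import random

rng = random.Random(7)

# Hypothetical two-column keys generated for inserting
inserted_keys = [(rng.randint(1, 50), rng.randint(1, 50)) for _ in range(20)]
existing = set(inserted_keys)

def samerow_key(rng):
    """All key columns come from one generated row, so the key always exists."""
    return rng.choice(inserted_keys)

def multirow_key(rng):
    """Each key column is drawn independently, so the combination may not exist."""
    first = rng.choice(inserted_keys)[0]
    second = rng.choice(inserted_keys)[1]
    return (first, second)

DRAWS = 1000
samerow_hits = sum(samerow_key(rng) in existing for _ in range(DRAWS))
multirow_hits = sum(multirow_key(rng) in existing for _ in range(DRAWS))

print("samerow queries that match a row :", samerow_hits)
print("multirow queries that match a row:", multirow_hits)
```

Every samerow query matches an inserted key, while most multirow queries in this model miss, so expect multirow reads to return empty results more often.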
The ops command specified as part of the command line controls the mix of different operations to run. Take for example the following command:
cassandra-stress user profile=file:///eg-files/stressprofilemixed.yaml ops(insert=1, pull_for_rollup=1, get-value=10) n=120 -node x.x.x.x
This will execute insert batches, pull_for_rollup queries and get-value queries in the ratio 1:1:10. So for this specific example, we’d get 10 inserts, 10 pull_for_rollup queries and 100 get-value queries.
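The arithmetic behind that split can be written out as a short sketch. The function name is made up for illustration, and it only computes the expected share of each operation from the ratios; it makes no claim about how cassandra-stress schedules individual operations.

```python
def split_operations(total, ratios):
    """Divide a total operation count according to the ops() ratios."""
    weight_sum = sum(ratios.values())
    return {name: total * weight // weight_sum for name, weight in ratios.items()}

mix = {"insert": 1, "pull_for_rollup": 1, "get-value": 10}
expected = split_operations(120, mix)
print(expected)  # {'insert': 10, 'pull_for_rollup': 10, 'get-value': 100}
```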
Hopefully that’s explained the key information you need to use a YAML profile for running cassandra-stress. In future instalments I’ll take a look at some of the remaining command line options and walk through a full end-to-end example of designing and executing a test.