8/24/2017

Reading time: 9 min

Cassandra - to BATCH or not to BATCH

by John Doe



This post is about Cassandra’s batch statements and which kinds of batch statements are OK and which are not. Often, when batch statements are discussed, it’s not clear whether a particular statement refers to single- or multi-partition batches or to both - which is the most important question IMO (you should know why after you’ve read this post).

Instead, the distinction is often drawn between logged and unlogged batches ([1], [2], [3]). For one thing, logged/unlogged may cause misunderstandings (more on this below). For another, unlogged batches (for multiple partitions) have been deprecated since Cassandra 2.1.6 ([1], [4], [5]). Therefore I think the logged/unlogged distinction is no longer that useful.

In the following I describe multi- and single-partition batches: what they provide, what they cost and when it’s OK to use them.

TL;DR: Multi partition batches should only be used to achieve atomicity for a few writes on different tables. Apart from this they should be avoided because they’re too expensive. Single partition batches can be used to get atomicity + isolation; they’re not much more expensive than normal writes.

Definitions

First of all, what are single partition and what are multi partition batches?

A single partition BATCH only contains statements with a single partition key. If, for example, we have a table user_tweets with PRIMARY KEY(user_id, tweet_id), then a batch with inserts for user_id=1, tweet_id=1 and user_id=1, tweet_id=2 would target the same partition (on a single node).

A multi partition batch is a BATCH that contains statements with different partition keys. If, for example, we have a table users with PRIMARY KEY(user_id), then a batch with inserts for user_id=1 and user_id=2 would target 2 partitions (which may live on different nodes). The same holds true if a batch contains writes for different tables or even keyspaces: obviously such writes also target different partitions (maybe on different nodes).
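To make the distinction concrete, here is a minimal sketch using the (pre-4.0) DataStax Java driver API this post refers to. The keyspace, extra columns and values are only illustrative, not part of a real schema:

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class BatchKinds {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");

            // Single partition batch: both inserts share the partition key user_id=1,
            // so the whole batch targets one partition.
            BatchStatement singlePartition = new BatchStatement();
            singlePartition.add(new SimpleStatement(
                    "INSERT INTO user_tweets (user_id, tweet_id, body) VALUES (?, ?, ?)", 1, 1, "first"));
            singlePartition.add(new SimpleStatement(
                    "INSERT INTO user_tweets (user_id, tweet_id, body) VALUES (?, ?, ?)", 1, 2, "second"));
            session.execute(singlePartition);

            // Multi partition batch: two different partition keys (user_id=1 and user_id=2),
            // which may live on different nodes.
            BatchStatement multiPartition = new BatchStatement();
            multiPartition.add(new SimpleStatement(
                    "INSERT INTO users (user_id, name) VALUES (?, ?)", 1, "alice"));
            multiPartition.add(new SimpleStatement(
                    "INSERT INTO users (user_id, name) VALUES (?, ?)", 2, "bob"));
            session.execute(multiPartition);

            cluster.close();
        }
    }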

Multi Partition Batches

Let’s see what the characteristics of multi partition batches are…

What does a multi partition batch provide?

Atomicity. Nothing more, notably no isolation.

That means that from the writer’s point of view all or nothing gets written (atomicity). But other clients can see the first write (insert/update/delete) before the other writes from the batch are applied (no isolation).

What does a multi partition batch cost?

Let me just quote Christopher Batey, because he has summarized this very well in his post “Cassandra anti-pattern: Logged batches” [3]:

Cassandra [is first] writing all the statements to a batch log. That batch log is replicated to two other nodes in case the coordinator fails. If the coordinator fails then another replica for the batch log will take over. [..] The coordinator has to do a lot more work than any other node in the cluster.

Again, in bullets what has to be done:

  • serialize the batch statements
  • write the serialized batch to the batch log system table
  • replicate this serialized batch to 2 nodes
  • coordinate writes to nodes holding the different partitions
  • on success remove the serialized batch from the batch log (also on the 2 replicas)

Here’s an image (from Christopher Batey’s blog post) that visualizes this for a batch with 8 insert statements for different partitions (equally distributed across nodes) in an 8-node cluster:

[image: the coordinator writes the batch to the batch log on two replica nodes and then coordinates the writes to the nodes owning the 8 partitions]

And this doesn’t even show arrows for responses and batch log cleanup (between the coordinator and the batch log replicas).

Hopefully it’s clear that such a multi partition batch is very expensive!

What are the limitations of multi partition batches?

Cassandra by default logs a warning for batches larger than 5 KB and fails/refuses batches larger than 50 KB, according to the configuration of batch_size_{warn|fail}_threshold_in_kb ([7], [8]).
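For reference, the corresponding cassandra.yaml settings look roughly like this (defaults at the time of writing; verify against the configuration of your own version):

    # cassandra.yaml - batch size thresholds (values in kilobytes)
    batch_size_warn_threshold_in_kb: 5    # a WARN is logged for batches larger than 5 KB
    batch_size_fail_threshold_in_kb: 50   # batches larger than 50 KB are rejected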

When should you use logged / multi partition batches?

Usually - just don’t!

When you’re writing to multiple partitions you should prefer multiple async writes, as described in “Cassandra: Batch loading without the Batch keyword” [9].

Multi partition batches can be used when atomicity / consistency across tables is needed - in such cases very small “logged” batches are fine.

But imagine a situation where you’re writing to two tables users (PK: id) and users_by_username (PK: username) in the context of a “batch” job (e.g. data import/replication, housekeeping, whatever). Then tens of thousands of records might be updated in a very short time. In such a case I’d also prefer two async writes for each user (plus appropriate failure handling), because I’d consider batches too expensive in this case.
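As a sketch of what “two async writes per user” could look like with the same (pre-4.0) Java driver API (column names are illustrative, and real code would need retries/repair because this approach is not atomic):

    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class AsyncDenormalizedWrites {
        // Write one user to both tables with two independent async statements
        // instead of a multi partition batch. Unlike a logged batch this is NOT
        // atomic: if one write fails, the caller has to retry or repair it.
        static void saveUser(Session session, int id, String username) {
            ResultSetFuture usersWrite = session.executeAsync(new SimpleStatement(
                    "INSERT INTO users (id, username) VALUES (?, ?)", id, username));
            ResultSetFuture byUsernameWrite = session.executeAsync(new SimpleStatement(
                    "INSERT INTO users_by_username (username, id) VALUES (?, ?)", username, id));
            // In a real bulk job you would collect many futures and handle failures;
            // here we simply wait for both writes to complete.
            usersWrite.getUninterruptibly();
            byUsernameWrite.getUninterruptibly();
        }
    }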

Single Partition Batches

Now let’s see what the characteristics of single partition batches are…

What does a single partition batch provide?

Atomicity + isolation.

That means that all or nothing gets written. Other clients don’t see partial updates, but only all of them after they’ve been applied.

Single partition batches (compared to multiple async statements) save network round-trips between the client and the server and therefore may improve throughput.

What does a single partition batch cost?

There’s no batch log written for single partition batches. The coordinator doesn’t have any extra work (as it does for multi partition batches) because everything goes into a single partition. Single partition batches are optimized: they are applied with a single RowMutation [10].

In a few words: single partition batches don’t put much more load on the server than normal writes.

Sorry that I have no image for this, but a single arrow between client and server would be just too boring ;-)

What are the limitations of single partition batches?

Today (01/2016) Cassandra’s batch_size_{warn|fail}_threshold_in_kb config is also applied to single partition batches. Therefore you’ll get warnings even for a batch targeting a single partition. This should be resolved soon [11], so that single partition batches can then be used without this limitation.

As is the case for other writes, single partition batches should not be too large, because otherwise they cause heap pressure and in consequence lead to long GC pauses [12].

Because single partition batches are often used in combination with wide rows, you should make sure that a row (partition) doesn’t become too large. The recommendation of DataStax is that rows should not become much bigger than 100 MB (you can check this with nodetool [13]). While this is not directly a limitation of single partition batches, you should be aware of this fact to keep your C* cluster happy - partition your partitions.

When should you use single partition batches?

Single partition batches should be used when atomicity and isolation are required.

Even if you only need atomicity (and no isolation) you should model your data so that you can use single partition instead of multi partition batches.

Single partition batches may also be used to increase throughput compared to multiple un-batched statements. Of course you must benchmark your workload with your own setup/infrastructure to verify this assumption. If you don’t want to do this, you shouldn’t use single partition batches unless you need atomicity/isolation.

Logged vs. Unlogged

In this section I want to explain why I find the differentiation between logged and unlogged not that useful in the first place.

I think a major issue is that “logged” or “unlogged” may refer to different things:

  • server side: “logged” means that C* writes a batch log (plus all the ceremony around this), “unlogged” means that no batch log is written.
  • client side: “logged” refers to BEGIN BATCH (respectively the driver’s logged batch), “unlogged” refers to BEGIN UNLOGGED BATCH. E.g. the Java driver allows you to create a new BatchStatement(BatchStatement.Type.UNLOGGED), the default type being Type.LOGGED (see the sketch after this list).
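A minimal illustration of this client-side choice, again with the pre-4.0 Java driver API mentioned above:

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.BatchStatement.Type;

    public class ClientSideBatchType {
        // The type only controls what the client asks for (BEGIN BATCH vs.
        // BEGIN UNLOGGED BATCH). Whether the server actually writes a batch log
        // also depends on how many partitions the batch touches (see the table below).
        static BatchStatement loggedBatch() {
            return new BatchStatement(); // Type.LOGGED is the default
        }

        static BatchStatement unloggedBatch() {
            return new BatchStatement(Type.UNLOGGED);
        }
    }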

Even when the client is asking for a logged batch (with BEGIN BATCH or the driver’s equivalent), this does not necessarily mean that the server (C*) will really write a batch log (and do all these costly things). Let’s have a look at the possible combinations of the client’s instruction for LOGGED/UNLOGGED and the partition property (single/multi). For each combination we see if C* writes a batch log, if the batch is applied atomically and if it’s applied in isolation.

↓ client instruction / partitions →    single partition                  multi partition
logged (BEGIN BATCH)                   no batch log, atomic, isolated    batch log, atomic
unlogged (BEGIN UNLOGGED BATCH)        no batch log, atomic, isolated    no batch log

Remember that unlogged batches for multiple partitions are deprecated since Cassandra 2.1.6 ([1], [4], [5])? With the table above it should be clear that one can just forget this “unlogged” thing. Hopefully at some point it will be completely removed from CQL and drivers.

Rant: Improve the docs!

An issue today (01/2016) is that blog posts and documentation often only talk about multi partition batches (without making this explicit).

E.g. the CQL commands reference for BATCH [14] explains all the bad things around BATCHes and talks about logged and unlogged. Unfortunately it does not make clear how important it is whether the batch targets a single partition or multiple partitions. Today it only says “However, transactional row updates within a partition key are isolated: clients cannot read a partial update.” Nothing more about single partition batches.

The same holds true for the “Using and misusing batches” section [1] in the CQL docs. There you can read

Batches place a burden on the coordinator for both logged and unlogged batches.

True for multi partition batches, wrong for single partition batches.

Another example is the javadoc for the Java driver’s BatchStatement.Type [15]. There LOGGED is documented with

A logged batch: Cassandra will first write the batch to its distributed batch log to ensure the atomicity of the batch (atomicity meaning that if any statement in the batch succeeds, all will eventually succeed).

And for UNLOGGED you read

A batch that doesn’t use Cassandra’s distributed batch log. Such batch are not guaranteed to be atomic.

This should also mention the relevance of the partition keys involved in the batch statement.

C* docs team, please mention that for single partition batches there’s no batch log and no ceremony involved and they’re atomic and isolated!

Summary

  • When you hear/read something about C* batch statements, ask (yourself) if it’s about single- or multi-partition batches.
  • The common mantra that batch statements should be avoided most often refers to multi-partition batches.
  • Multi partition batches should only be used to achieve atomicity for a few writes on different tables. Apart from this they should be avoided because they’re too expensive.
  • Single partition batches can be used to achieve atomicity and isolation. They’re not much more expensive than normal writes.

References

  1. CQL for Cassandra 2.2+: Using and misusing batches (latest version)
  2. Christopher Batey’s Blog: Cassandra anti-pattern: Misuse of unlogged batches
  3. Christopher Batey’s Blog: Cassandra anti-pattern: Logged batches
  4. CASSANDRA-9282: Warn on unlogged batches
  5. DataStax Help Center: New messages about the use of batches in Cassandra 2.1
  6. CQL Glossary: partition key
  7. CASSANDRA-6487: Log WARN on large batch sizes
  8. CASSANDRA-8011: Fail on large batch sizes
  9. Cassandra: Batch loading without the Batch keyword
  10. CASSANDRA-6737: A batch statements on a single partition should not create a new CF object for each update
  11. CASSANDRA-10876: Alter behavior of batch WARN and fail on single partition batches
  12. DataStax Help Center: Common Causes of GC pauses
  13. Cassandra 2.1 tools: nodetool cfhistograms (for C* 3.+ tablehistograms)
  14. CQL for Cassandra 2.2+: CQL commands / BATCH
  15. Javadoc for BatchStatement.Type
