
1/30/2019

Reading time: 9 min

Multi data center Apache Spark and Apache Cassandra benchmark - Instaclustr

by John Doe


Goal

The aim of this benchmark is to compare the performance of a one-data-center setting, where Spark and Cassandra are colocated, with a two-data-center setting where Spark runs in the second data center. The idea behind the two-data-center setting is to use the first data center (DC1) to serve Cassandra reads/writes, while using the second data center (DC2) for Spark analytics. This can be achieved by performing reads/writes on a node of DC1 with the consistency level set to LOCAL_QUORUM (or LOCAL_ONE). This setting ensures that writes are acknowledged by DC1 only (enabling high throughput and low latency) while the data is sent asynchronously to DC2. This separation of workloads brings a few benefits (a minimal client-configuration sketch follows the list below):

  1. The first DC remains consistently responsive for operational reads/writes.
  2. The second DC can serve heavy analytic workloads without interfering with the first DC.
  3. All the data eventually becomes consistent between the first and the second DC.
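As an illustration of this write path, the sketch below shows how a client could be pinned to DC1 with LOCAL_QUORUM. This is a minimal sketch only, assuming the DataStax Java driver 3.x from Scala; the contact point IP is hypothetical and the data center names match the setup described in this post.

import com.datastax.driver.core.{Cluster, ConsistencyLevel, QueryOptions}
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy

// Sketch (DataStax Java driver 3.x assumed): pin the client to DC1 and use
// LOCAL_QUORUM so writes are acknowledged by the local DC only, while Cassandra
// sends the data asynchronously to the analytics DC (DC2).
val cluster = Cluster.builder()
  .addContactPoint("10.0.0.1") // hypothetical IP of a DC1 node
  .withLoadBalancingPolicy(
    DCAwareRoundRobinPolicy.builder()
      .withLocalDc("AWS_VPC_US_EAST_1") // DC1, the operational data center
      .build())
  .withQueryOptions(
    new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
  .build()

val session = cluster.connect()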

Benchmark setup

The single data center cluster was composed of three m4xl-400 nodes, each running both Spark and Cassandra. The two data center cluster had its first data center (DC1) composed of three m4xl-400 nodes running Cassandra, and its second data center (DC2) composed of three m4xl-400 nodes running Cassandra and Spark. In our offering, m4xl-400 nodes have 4 CPUs, 16 GB of RAM, and 400 GB of AWS EBS storage. Two separate stress boxes (one for each cluster) were created to run cassandra-stress and a custom Spark job. The stress boxes were connected via VPC peering, in the same AWS region in which the clusters were running. Importantly, all the cassandra-stress and Spark benchmark jobs were run at the same time against the two clusters. This ensures that each cluster had the same amount of time to perform background compaction and to recover its full AWS EBS credits when the EBS burst was used. Therefore, the comparison between the single DC and multi DC cluster performance is always meaningful. On the other hand, the raw performance numbers should be interpreted with caution, especially given the drop in EBS performance once the burst credit is exhausted.

Benchmark

Initial loading

For each cluster, two sets of data (“data1” and “data2”) were created using the cassandra-stress tool. The command run was:

cassandra-stress write n=400000000 no-warmup cl=LOCAL_QUORUM -rate threads=400 -schema replication\(strategy=NetworkTopologyStrategy,spark=3,AWS_VPC_US_EAST_1=3\) keyspace="data1" -node $IP

Note that:

  • The number of records inserted (400 million) is high, resulting in about 120 GB of data stored on each node. This ensures that on future reads the OS will not be able to cache the SSTables in RAM, as happens with smaller data sets. This is necessary to expose the underlying I/O performance.
  • Although two data centers are specified, only AWS_VPC_US_EAST_1 is used for the single data center cluster.
  • The consistency level is set to LOCAL_QUORUM. For the two data center cluster, the data is written and acknowledged by DC1 (AWS_VPC_US_EAST_1), as the IP used for the stress command was that of a node in DC1. In the background, the data is asynchronously streamed to DC2. For both clusters, LOCAL_QUORUM means that two out of the three nodes in the queried data center must respond for the operation to succeed (a keyspace sketch showing the equivalent replication settings follows this list).
  • We also created a second data set named “data2”. “data1” will be used later for running a simple Spark analytic job, while “data2” will be used for running a background read/write cassandra-stress command, which will cause it to grow in size. Since the size of “data1” stays constant, it can be used to fairly compare Spark performance.
  • Importantly, the performance measured under stress conditions in this benchmark should not be considered as baseline performance to use in production. With Cassandra, it is critical to keep some resources free for background activities (large compactions, node repairs, backups, failover recovery, etc.).
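For reference, the replication option passed to cassandra-stress above corresponds roughly to creating the keyspace with NetworkTopologyStrategy and a replication factor of 3 in each data center. The sketch below only illustrates that translation (cassandra-stress creates the keyspace and table itself); the contact point IP is hypothetical and the data center names are taken from the stress command.

import com.datastax.driver.core.Cluster

// Sketch: rough CQL equivalent of -schema replication\(strategy=NetworkTopologyStrategy,...\).
// 'AWS_VPC_US_EAST_1' is DC1 (operational traffic) and 'spark' is DC2 (analytics).
val session = Cluster.builder().addContactPoint("10.0.0.1").build().connect()
session.execute(
  """CREATE KEYSPACE IF NOT EXISTS data1 WITH replication =
    |{'class': 'NetworkTopologyStrategy', 'AWS_VPC_US_EAST_1': 3, 'spark': 3}""".stripMargin)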

The write performances were:

Cassandra write performance   Single DC    Multi DC
Insert rate                   27439 / s    23141 / s

The multi DC cluster has to asynchronously stream the data to its second data center, which explains the expected drop in write performance of about 15%.

Maximum read performance

After the initial loading of “data1” and “data2”, both the single DC and multi DC clusters were left unused for a few hours, giving pending compactions time to complete. The baseline read performance was assessed with the following cassandra-stress command:

cassandra-stress read n=400000000 no-warmup cl=LOCAL_QUORUM -rate threads=500 -schema keyspace="data2" -node $IP

The read performances were:

Cassandra read performance   Single DC   Multi DC
Read rate                    5283 / s    5253 / s

As expected, the read performance is effectively identical, since only one data center is used to serve the reads (consistency level = LOCAL_QUORUM).

Maximum read/write performance

The baseline read/write performance was assessed with:

cassandra-stress mixed ratio\(write=1,read=3\) n=400000000 no-warmup cl=LOCAL_QUORUM -rate threads=500 -schema replication\(strategy=NetworkTopologyStrategy,spark=3,AWS_VPC_US_EAST_1=3\) keyspace="data2" -node $IP

The read/write benchmark is an important indication of peak performance in production as it combines both types of operations. The ratio between reads and writes is application specific; a ratio of 1 write to 3 reads is a reasonable representation of many production scenarios. In the table below we also report latency measurements, as they are an important indication of performance.

                                  Single DC   Multi DC
Operation rate                    7071 / s    7016 / s

Latency (ms)
  Mean                            64.3        66
  Median                          61.0        77.2
  95th percentile                 177.5       200.7
  99th percentile                 215.9       249.7

Latency when limiting read/write to half capacity, 3500 op/s (ms)
  Mean                            3.1         3.2
  Median                          2.6         2.7
  95th percentile                 6.6         7.0
  99th percentile                 83.9        92.2

Latency when limiting read/write to a third of capacity, 2300 op/s (ms)
  Mean                            2.7         2.8
  Median                          2.5         2.6
  95th percentile                 4.6         4.5
  99th percentile                 34.2        36.8

In this scenario, both clusters perform identically in terms of operations per second and latency. In the following sections, we run a Spark job with no background read/write, with peak background read/write (7000 op/s), with half (3500 op/s), and with a third (2300 op/s).

Spark benchmark without background Cassandra activity.

As there is no standard Spark-Cassandra stress tool available at the moment, we ran the following Spark application on both clusters. The Spark application uses the data written by cassandra-stress. The application is not particularly meaningful in itself, but it uses a combination of map and reduceByKey operations that are commonly employed in real-life Spark applications.

import com.datastax.spark.connector._ // provides sc.cassandraTable

// Read the cassandra-stress table, keep the key and the "C1" column,
// then group on a 3-character slice of "C1" and sum the key lengths per group.
val rdd1 = sc.cassandraTable("data1", "standard1")
  .map(r => (r.getString("key"), r.getString("C1")))
  .map { case (k, v) => (v.substring(5, 8), k.length()) }

val rdd2 = rdd1.reduceByKey(_ + _)
rdd2.collect().foreach(println)

The job uses the key and the “C1” column of the stress table “standard1”. A three-character substring (indices 5 to 8) of the “C1” value is used as the grouping key, allowing a reduceByKey over 16^3 = 4096 possible keys. reduceByKey is a simple yet commonly used Spark RDD transformation that groups keys together while applying a reduce operation (in this case a simple addition). This is somewhat similar to the classic word count example. Importantly, this job reads the full data set.
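In the multi DC case, the Spark job only needs to be pointed at nodes in the analytics data center so that its reads stay in DC2 and do not compete with the operational traffic served by DC1. The sketch below shows one way the SparkContext used above might be configured, assuming the spark-cassandra-connector 2.x property names; the host IP is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch (spark-cassandra-connector 2.x property names assumed): point the
// connector at the analytics DC so Spark reads remain local to DC2.
val conf = new SparkConf()
  .setAppName("cassandra-benchmark")
  .set("spark.cassandra.connection.host", "10.0.1.1")     // hypothetical DC2 node
  .set("spark.cassandra.connection.local_dc", "spark")    // name of the analytics DC
  .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")

val sc = new SparkContext(conf)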

Spark performance, no background Cassandra activity   Single DC   Multi DC
Spark time                                            3829 s      3857 s

As expected, the performance is identical for the two clusters.

Spark benchmark: with maximum background Cassandra read/write.

The same Spark job was run with a background Cassandra read/write stress command:

cassandra-stress mixed ratio\(write=1,read=3\) n=400000000 no-warmup cl=LOCAL_QUORUM -rate threads=500 -schema keyspace="data2" replication\(strategy=NetworkTopologyStrategy,spark=3,AWS_VPC_US_EAST_1=3\) -node $IP

Spark performance, full read/write Cassandra activity   Single DC   Multi DC
Spark time                                              11528 s     4894 s
Effective Cassandra operation rate                      2866 / s    5630 / s

This test result demonstrates the gain of having a separate DC for Spark activity. Note that a large quantity of reads/writes against the single DC timed out, as Cassandra cannot keep up with simultaneously serving Spark read requests and cassandra-stress read/write requests.

Spark benchmark: with half background Cassandra read/write.

The cassandra-stress command was used with a rate limit so that the number of Cassandra operations would be held at half of the capacity (3500 op/s):

cassandra-stress mixed ratio\(write=1,read=3\) n=400000000 no-warmup cl=LOCAL_QUORUM -rate threads=500 limit=3500/s -schema keyspace="data2" replication\(strategy=NetworkTopologyStrategy,spark=3,AWS_VPC_US_EAST_1=3\) -node $IP

The test results were:

Spark performance, half read/write Cassandra activity   Single DC   Multi DC
Spark time                                              12914 s     4919 s

Note that this test was run shortly after the previous benchmark (with full read/write cassandra-stress), preventing the EBS volumes of both clusters from fully recovering their burst credits. During this test, we still observed a large number of Cassandra timeouts on the single DC cluster.

Spark benchmark: with a third background Cassandra read/write.

The limit rate was set to a third of the baseline read/write performance, which is about 2300 operations per second. Unlike in the previous test, the EBS volumes had enough time to recover their full burst capacity credits.

cassandra-stress mixed ratio\(write=1,read=3\) n=400000000 no-warmup cl=LOCAL_QUORUM -rate threads=500 limit=2300/s -schema keyspace="data2" replication\(strategy=NetworkTopologyStrategy,spark=3,AWS_VPC_US_EAST_1=3\) -node $IP

Spark performance, a third read/write Cassandra activity   Single DC   Multi DC
Spark time                                                 7877 s      4572 s

During this test, we again observed a large number of Cassandra timeouts on the single DC cluster.

Conclusion

Through this benchmark, we compared the performance of a single DC and a multi DC cluster while simultaneously running Cassandra read/write operations and executing a Spark application. In order to perform a fair comparison, we ran every stress command at the same time on both clusters, ensuring that any bias in the protocol (files cached in memory, time to perform background Cassandra compaction, EBS burst recovery, etc.) would apply to both clusters. The multi DC cluster was able to process the Spark job up to 2.3 times faster than the single DC cluster when running cassandra-stress (without limiting the rate) in the background. Perhaps more importantly, the multi DC cluster was able to process twice the rate of Cassandra operations per second compared with the single DC cluster. This demonstrates that a multi DC topology ensures overall higher performance for both Cassandra and Spark, with more stability (the cassandra-stress run on the single DC cluster resulted in a large number of timeout errors during read operations). The multi DC topology is therefore particularly relevant when running a large number of Spark analytic jobs on a cluster hosting mission-critical data.
