
Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real-time data platforms.

2/24/2021

Reading time: 2 min

Spark and Cassandra’s SSTable loader


Arunkumar · May 13, 2018 · 3 min read


Why: We had a lot of very useful data in our warehouse and wanted to take advantage of it in some of our production services to enhance the user experience. So we chose to serve it from Cassandra, for all its pros, which I'm not going to get into in this blog.

As a first stage, we went about writing a Spark-to-Cassandra exporter. It's pretty simple and only a couple of lines:
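The embedded code snippet didn't survive the repost; a minimal sketch of such an exporter, assuming a Parquet source in the warehouse and an illustrative ks.users table (host, path, keyspace, and table names are not the author's), could look like this:

    import org.apache.spark.sql.SparkSession

    // Read the warehouse data and write it straight to Cassandra through the
    // spark-cassandra-connector's DataFrame API. All names here are illustrative.
    val spark = SparkSession.builder()
      .appName("warehouse-to-cassandra")
      .config("spark.cassandra.connection.host", "cassandra-host-1")
      .getOrCreate()

    val df = spark.read.parquet("hdfs:///warehouse/users")

    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "users", "keyspace" -> "ks"))
      .mode("append")
      .save()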

This worked and took around 30 minutes to write roughly 150 million rows. But once our services went live, we saw read latencies climb during the bulk-insertion window.

Latencies during Cassandra row writes

The spark-cassandra-connector we are using has a few configs that can be used to tune the writes. We tried a bunch of tuning along the lines of reducing concurrent writes and reducing throughput_mb_per_sec. That helped a bit, but there was still a clear increase in read latency.
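For reference, this is a sketch of setting the connector's write-tuning properties (2.x-era names, matching the ones mentioned above; check the connector docs for your version, and the values here are only illustrative):

    import org.apache.spark.sql.SparkSession

    // Fewer concurrent writes and a per-core throughput cap make the bulk load
    // gentler on live reads, at the cost of a longer job. The defaults are
    // 5 concurrent writes and no throughput cap.
    val tunedSpark = SparkSession.builder()
      .config("spark.cassandra.output.concurrent.writes", "2")
      .config("spark.cassandra.output.throughput_mb_per_sec", "5")
      .getOrCreate()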

Cassandra also ships with sstableloader, and we thought of testing it for this case. So we changed the code to use it and saw that there was barely any notable read latency during the load (only a slight increase at the 99th percentile, caused by IO waits).

Latencies during Cassandra SSTable loads

Also, if you look at the network graph, the traffic is only on "network in", since we now generate the SSTables in Spark and push them directly to Cassandra. The last spike in the network graph below is from the SSTable method; the rest are from batched writes.

Network Traffic (Row writes vs SSTable load)

Now let's get into how to do that in code:

  • Using CQLSSTableWriter, build the SSTables per Spark partition.
  • We need to define the CREATE and INSERT statements, but it's easy to build them from the Spark DataFrame's schema.
  • Then stream the SSTables to Cassandra. We pick a random Cassandra server and stream the SSTables to it; choosing the host at random gives better load balancing of the network traffic.
  • And finally, the code that runs it all (a condensed sketch follows this list).
  • Since Cassandra's suggestion is that SSTables be at least several tens of megabytes in size to minimize the cost of compaction, we cap each SSTable at 256 MB. "sizeInMB" can be calculated from HDFS.
  • Say the input size is 60 GB; at 256 MB per SSTable that works out to roughly 240 SSTables.
  • Set the config "mapreduce.output.bulkoutputformat.streamthrottlembits" to throttle streaming traffic to Cassandra.
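A condensed sketch of those steps, assuming the illustrative ks.users table from above and that the Cassandra tooling (cassandra-all on the classpath, the sstableloader binary) is available on the Spark workers; for simplicity this shells out to sstableloader and throttles with its -t flag rather than the mapreduce setting mentioned above:

    import java.nio.file.Files
    import org.apache.cassandra.io.sstable.CQLSSTableWriter
    import org.apache.spark.sql.{Row, SparkSession}
    import scala.sys.process._
    import scala.util.Random

    object SSTableExportSketch {
      // CREATE and INSERT statements; in practice both can be generated from df.schema.
      val createStmt =
        "CREATE TABLE ks.users (id bigint PRIMARY KEY, name text, score double)"
      val insertStmt =
        "INSERT INTO ks.users (id, name, score) VALUES (?, ?, ?)"

      // Runs on each Spark partition: write the partition's rows into local
      // SSTables, then stream them to a randomly chosen Cassandra host.
      def exportPartition(rows: Iterator[Row], hosts: Seq[String]): Unit = {
        // sstableloader expects the directory path to end in <keyspace>/<table>
        val dir = Files.createTempDirectory("sstables").resolve("ks").resolve("users").toFile
        dir.mkdirs()

        val writer = CQLSSTableWriter.builder()
          .inDirectory(dir)
          .forTable(createStmt)
          .using(insertStmt)
          .withBufferSizeInMB(256) // roll over to a new SSTable roughly every 256 MB
          .build()
        rows.foreach { r =>
          writer.addRow(Long.box(r.getLong(0)), r.getString(1), Double.box(r.getDouble(2)))
        }
        writer.close()

        // Random host for rough load balancing; -t throttles streaming in Mbit/s.
        val host = hosts(Random.nextInt(hosts.size))
        Seq("sstableloader", "-d", host, "-t", "100", dir.getAbsolutePath).!
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sstable-export").getOrCreate()
        val df = spark.read.parquet("hdfs:///warehouse/users") // illustrative source

        // Size the Spark partitions so each one yields roughly 256 MB of SSTables;
        // sizeInMB can be read from HDFS (e.g. FileSystem.getContentSummary).
        val sizeInMB = 60L * 1024 // pretend the input is 60 GB
        val numPartitions = math.max(1, (sizeInMB / 256).toInt)
        val hosts = Seq("cassandra-host-1", "cassandra-host-2", "cassandra-host-3")

        df.repartition(numPartitions).rdd
          .foreachPartition(rows => exportPartition(rows, hosts))
      }
    }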

Fyi,

  • SSTables have to be at least several tens of megabytes in size to minimize the cost of compacting the partitions on the server side.
  • This method increases IO wait, since it writes directly to disk rather than to memory as regular Cassandra writes do. Depending on your data size and throughput, you may need an SSD with high IOPS.

We've been using this method in production for over 6 months now, writing around 300 million rows in under 30 minutes without any impact on read latencies.

Full example code can be found here: https://github.com/therako/sparkles/blob/master/src/main/scala/util/cassandra/SSTableExporter.scala
