2/24/2021
Reading time: 2 min
Spark and Cassandra’s SSTable loader
by John Doe
Why: We had a lot of very useful data in our warehouse and wanted to take advantage of it in some of our production services to enhance the user experience. So we chose to serve it from Cassandra, for all its pros, which I'm not going to get into in this post.

As a first stage, we went about writing a Spark-to-Cassandra exporter. It's pretty simple and only a couple of lines, along the lines of the sketch below.
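Here is a minimal sketch of such an exporter using the spark-cassandra-connector's DataFrame sink; the host, keyspace, table, and source path are placeholders rather than the post's actual values.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object RowWriteExporter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-row-exporter")
      // Contact point for the spark-cassandra-connector
      .config("spark.cassandra.connection.host", "cassandra-host")
      .getOrCreate()

    // Placeholder source: any DataFrame whose columns match the target table
    val df: DataFrame = spark.read.parquet("hdfs:///warehouse/events")

    // Batched row writes through the connector's DataFrame sink
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "prod", "table" -> "events"))
      .mode(SaveMode.Append)
      .save()
  }
}
```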
This worked and took around ~30 minutes to write ~150 million rows. But once our services went live, we saw read latencies climb during the bulk-insertion window.

[Figure: Latencies during Cassandra row writes]

The spark-cassandra-connector we are using here has a few configs that can be used to tune the writes. We tried a bunch of tuning along the lines of reducing concurrent writes and throughput_mb_per_sec, roughly as sketched below. That helped a bit, but there was still a clear increase in read latency.
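For reference, this is roughly what that tuning looks like with the connector 2.x property names; the values are illustrative, not the ones we settled on.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-row-exporter")
  .config("spark.cassandra.connection.host", "cassandra-host")
  // Fewer concurrent in-flight batches (the connector default is 5)
  .config("spark.cassandra.output.concurrent.writes", "2")
  // Cap write throughput per core; effectively unthrottled by default
  .config("spark.cassandra.output.throughput_mb_per_sec", "10")
  .getOrCreate()
```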
Cassandra has sstableloader, and we thought of testing it for this case. So we changed the code to use it, and saw barely any notable read latency during the task (only a slight increase at the 99th percentile, caused by the IO waits).

[Figure: Latencies during Cassandra SSTable loads]

Also, if you look at the network graphs, the traffic is only on "network in", since we now generate the SSTables in Spark and then push those tables directly to Cassandra. The last spike in the network graph below is from the SSTable method; the rest are from batched writes.

[Figure: Network Traffic (Row writes vs SSTable load)]

Now let's get into how to do that in code. First, use CQLSSTableWriter to build the SSTables per partition.
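A sketch of the per-partition writer, assuming cassandra-all is on the executor classpath; writeSSTables is a hypothetical helper name, and createStmt/insertStmt are the statements built in the next sketch.

```scala
import java.nio.file.Files

import org.apache.cassandra.dht.Murmur3Partitioner
import org.apache.cassandra.io.sstable.CQLSSTableWriter
import org.apache.spark.sql.DataFrame

// Build SSTables for each Spark partition in an executor-local scratch dir
def writeSSTables(df: DataFrame, createStmt: String, insertStmt: String): Unit = {
  val columns = df.columns
  df.rdd.foreachPartition { rows =>
    val outDir = Files.createTempDirectory("sstables-").toFile

    val writer = CQLSSTableWriter.builder()
      .inDirectory(outDir)  // where the SSTable files are written
      .forTable(createStmt) // CREATE TABLE statement defines the schema
      .using(insertStmt)    // INSERT statement defines the column order
      .withPartitioner(new Murmur3Partitioner) // match the cluster's partitioner
      .build()

    rows.foreach { row =>
      // addRow takes values in the INSERT statement's column order
      writer.addRow(columns.map(c => row.getAs[AnyRef](c)): _*)
    }
    writer.close() // flush and finalize the SSTable files

    // In the full pipeline, the streaming step sketched further below runs
    // here, before the scratch directory goes away.
  }
}
```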
We need to define the CREATE and INSERT statements, but it's easy to build them from the Spark DataFrame's schema.
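A sketch of deriving both statements from the schema; the type mapping and the assumption that the first column is the partition key are simplifications for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Minimal Spark-SQL-to-CQL type mapping; extend for the types you use
def cqlType(dt: DataType): String = dt match {
  case StringType    => "text"
  case IntegerType   => "int"
  case LongType      => "bigint"
  case DoubleType    => "double"
  case BooleanType   => "boolean"
  case TimestampType => "timestamp"
  case other         => sys.error(s"unmapped Spark type: $other")
}

// Build the CREATE TABLE and INSERT statements CQLSSTableWriter needs
def cqlStatements(df: DataFrame, keyspace: String, table: String): (String, String) = {
  val fields  = df.schema.fields
  val colDefs = fields.map(f => s"${f.name} ${cqlType(f.dataType)}").mkString(", ")

  val create = s"CREATE TABLE $keyspace.$table ($colDefs, PRIMARY KEY (${fields.head.name}))"
  val insert = s"INSERT INTO $keyspace.$table (${fields.map(_.name).mkString(", ")}) " +
    s"VALUES (${fields.map(_ => "?").mkString(", ")})"

  (create, insert)
}
```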
Then we stream the SSTables to Cassandra. We pick a random Cassandra server and stream each partition's SSTables to it; the host is chosen at random for better load balancing of the network traffic.
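A sketch of that step. The throttle config mentioned later suggests the original code used Cassandra's Hadoop bulk-output classes; shelling out to the sstableloader CLI, as here, is a simpler way to illustrate the same streaming.

```scala
import java.io.File

import scala.sys.process._
import scala.util.Random

// Stream one scratch directory's SSTables to a randomly chosen node
def streamSSTables(sstableDir: File, hosts: Seq[String]): Unit = {
  val host = hosts(Random.nextInt(hosts.size))
  // sstableloader expects the directory layout <dir>/<keyspace>/<table>/
  val exitCode = Seq("sstableloader", "-d", host, sstableDir.getAbsolutePath).!
  require(exitCode == 0, s"sstableloader failed streaming to $host")
}
```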
And finally, the code that runs it all.
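A sketch of the driver, reusing the hypothetical cqlStatements and writeSSTables helpers from above; the input path, keyspace, and table are placeholders.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object SSTableExportJob {
  // Cap each SSTable at 256 MB by sizing the Spark partitions accordingly
  val MaxSSTableSizeMB = 256L

  // Dataset size on HDFS in MB, from the filesystem's content summary
  def hdfsSizeInMB(spark: SparkSession, path: String): Long = {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.getContentSummary(new Path(path)).getLength / (1024L * 1024L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sstable-exporter").getOrCreate()
    val input = "hdfs:///warehouse/events" // placeholder source path
    val df    = spark.read.parquet(input)

    val sizeInMB      = hdfsSizeInMB(spark, input)
    val numPartitions = math.max(1L, sizeInMB / MaxSSTableSizeMB).toInt

    val (createStmt, insertStmt) = cqlStatements(df, "prod", "events")
    // Each partition becomes one ~256 MB SSTable, built and streamed
    // by the helpers sketched above
    writeSSTables(df.repartition(numPartitions), createStmt, insertStmt)
  }
}
```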
As for the number of partitions: Cassandra's suggestion is that SSTables be several tens of megabytes in size to minimize the cost of compacting, so we use a maximum of 256 MB per SSTable; "sizeInMB" can be calculated from HDFS. Say the dataset is 60 GB: that works out to roughly 240 SSTables of 256 MB each.

Set the config "mapreduce.output.bulkoutputformat.streamthrottlembits" to throttle streaming traffic to Cassandra.

FYI:

- SSTables have to be at least several tens of megabytes in size to minimize the cost of compacting the partitions on the server side.
- This method increases IO wait, since it writes directly to disk and not to memory the way regular Cassandra writes do. Depending on the size of your data and the throughput you need, you may want an SSD with high IOPS.

We've been using this method in production for over 6 months now, writing around ~300 million rows in under 30 minutes without any issues for read latencies.

Full example code can be found here: https://github.com/therako/sparkles/blob/master/src/main/scala/util/cassandra/SSTableExporter.scala