In case you missed it, this blog post is a recap of Cassandra Lunch #30, covering the basics of using Spark and Cassandra together. We discuss the strengths of each technology and then cover the advantages of using them together. We also discuss potential drawbacks and the configuration methods for avoiding them. The live recording of Cassandra Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
Cassandra & Spark Separately
In Common
Both Cassandra and Spark are proven open-source technologies. Both can be deployed to physical machines, virtual machines, containers, and cloud providers, either entirely on a single deployment method or in a hybrid arrangement spanning two or more. Both are massively distributable and scalable, and each has an SQL-like interface: CQL for Cassandra and SparkSQL for Spark.
Cassandra
Cassandra is a distributed NoSQL datastore. It offers multi-region, masterless data replication and is optimized as a data transport/communication engine rather than an analytics engine.
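As a rough illustration of that multi-region, masterless replication, here is a minimal Scala sketch using the DataStax Java driver. The keyspace name, datacenter names, and replication factors are placeholders, not recommendations.

```scala
import com.datastax.oss.driver.api.core.CqlSession

object CreateKeyspace extends App {
  // Connect to the cluster; with no explicit contact point the driver
  // defaults to localhost, and "us_east" is a hypothetical DC name.
  val session = CqlSession.builder()
    .withLocalDatacenter("us_east")
    .build()

  // NetworkTopologyStrategy keeps full replicas in each named datacenter,
  // with no primary/master region.
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS shop
      |WITH replication = {
      |  'class': 'NetworkTopologyStrategy',
      |  'us_east': 3,
      |  'eu_west': 3
      |}""".stripMargin)

  session.close()
}
```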
Spark
Spark is a distributed analytics engine. It has APIs for Python, Scala, Java, and R. It can perform both batch and stream processing operations.
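To show how the same Spark APIs cover both modes, the sketch below runs a batch aggregation over a static file and the same aggregation over files arriving in a directory. The file paths and the event_type column are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object BatchAndStream extends App {
  val spark = SparkSession.builder()
    .appName("batch-and-stream")
    .master("local[*]")
    .getOrCreate()

  // Batch: read a static CSV once and aggregate it.
  val batch = spark.read.option("header", "true").csv("data/events.csv")
  batch.groupBy("event_type").count().show()

  // Stream: treat new CSV files landing in a directory as an unbounded
  // source and keep a running count per event type.
  val stream = spark.readStream
    .option("header", "true")
    .schema(batch.schema)
    .csv("data/incoming/")

  stream.groupBy("event_type").count()
    .writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination()
}
```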
Cassandra & Spark Together
The first method for using Spark and Cassandra together that we discussed is a translytical setup. Here we run a multi-datacenter Cassandra cluster in which one datacenter is used purely for transactions (reads, writes, client interactions). Data is replicated to both datacenters, and the second is connected to Spark and used for analytics.
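A minimal sketch of pointing Spark at only the analytics datacenter, assuming the Spark Cassandra Connector is on the classpath; the datacenter name "analytics" and the node address are placeholders. The localDC setting keeps Spark's connections inside that datacenter, so analytics load never reaches the transactional one.

```scala
import org.apache.spark.sql.SparkSession

object AnalyticsDcSession extends App {
  // Hypothetical layout: DC "transactional" serves the application,
  // DC "analytics" holds the replicas that Spark reads from.
  val spark = SparkSession.builder()
    .appName("translytical-analytics")
    .config("spark.cassandra.connection.host", "10.0.2.10")    // a node in the analytics DC (placeholder)
    .config("spark.cassandra.connection.localDC", "analytics") // keep Spark traffic in the analytics DC
    .config("spark.sql.extensions",
      "com.datastax.spark.connector.CassandraSparkExtensions")
    .getOrCreate()
}
```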
In the second method, we pull the Cassandra data into Spark as one of Spark's core data types (RDDs, Datasets, and DataFrames) and rely on data locality to keep operations fast. Data locality means that the Spark node connected to each Cassandra node knows, via token ranges, which data that node is responsible for, and the work against that data is distributed accordingly.
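As a sketch of what that looks like with the Spark Cassandra Connector (the keyspace and table names are hypothetical), the RDD and DataFrame reads below each split the table by token range, and the connector prefers executors running next to the replicas that own those ranges.

```scala
import com.datastax.spark.connector._   // adds cassandraTable to SparkContext
import org.apache.spark.sql.cassandra._ // adds cassandraFormat to DataFrameReader
import org.apache.spark.sql.SparkSession

object ReadCassandra extends App {
  val spark = SparkSession.builder().appName("read-cassandra").getOrCreate()

  // RDD API: each Spark partition covers a group of Cassandra token ranges.
  val ordersRdd = spark.sparkContext.cassandraTable("shop", "orders")
  println(ordersRdd.count())

  // DataFrame API over the same table.
  val ordersDf = spark.read
    .cassandraFormat("orders", "shop")
    .load()
  ordersDf.show(10)
}
```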
The last method is one in which Cassandra data is made available through SparkSQL. When SparkSQL is passed standard SQL statements against Cassandra tables, it can filter based on native indexes (the primary key), secondary indexes, and attached indexes such as SASI or SAI. It can also use all of the same indexes while additionally filtering by token range, as in the data locality example above. Finally, a Spark feature called predicate pushdown lets these filters be pushed down to Cassandra rather than applied after the data has been read into Spark, alongside all of the filtering methods.
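For example, with the connector's Cassandra catalog registered (a connector 3.x feature; the catalog, keyspace, table, and column names below are placeholders), a SparkSQL query whose WHERE clause hits the partition key can be pushed down so Cassandra only reads the matching partitions.

```scala
import org.apache.spark.sql.SparkSession

object PushdownQuery extends App {
  val spark = SparkSession.builder()
    .appName("pushdown-query")
    // Register the connector's catalog so Cassandra tables are
    // addressable as cass.<keyspace>.<table> in SQL.
    .config("spark.sql.catalog.cass",
      "com.datastax.spark.connector.datasource.CassandraCatalog")
    .getOrCreate()

  // The customer_id filter can be pushed down to Cassandra rather than
  // applied after a full table scan in Spark.
  val result = spark.sql(
    """SELECT order_id, total
      |FROM cass.shop.orders
      |WHERE customer_id = 'c-123'""".stripMargin)

  result.explain() // the physical plan shows which filters were pushed down
  result.show()
}
```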
If not set up carefully, problems can arise in a Spark and Cassandra environment. Cassandra data skews spill over into Spark and become Spark data skews and compute skews, where individual nodes end up with an outsized share of the data or of the computational workload. Compute skews can also arise from poor functional programming practices.
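One knob for spreading that work more evenly is the connector's input split size: smaller splits produce more, finer-grained Spark partitions per token range, and an explicit repartition after loading can further even out a skewed DataFrame. A hedged sketch, with illustrative values and the same hypothetical table as above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

object MitigateSkew extends App {
  val spark = SparkSession.builder()
    .appName("mitigate-skew")
    // Smaller input splits -> more Spark partitions per Cassandra token range.
    .config("spark.cassandra.input.split.sizeInMB", "64")
    .getOrCreate()

  val orders = spark.read.cassandraFormat("orders", "shop").load()

  // Redistribute rows evenly across executors before heavy computation,
  // at the cost of a shuffle.
  val balanced = orders.repartition(200)
  balanced.groupBy("status").count().show()
}
```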
Resources
Cassandra and Spark: Optimizing for Data Locality – Databricks
Spark And Cassandra: 2 Fast, 2 Furious – Databricks
Cassandra and SparkSQL: You Don’t Need Functional Programming for Fun…
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spit…
Spark and Cassandra: An Amazing Apache Love Story – Databricks
Cassandra and Spark, closing the gap between no sql and analytics c…
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!