
5/20/2021


Apache Cassandra Lunch #30: Cassandra & Spark Foundations

by Obioma Anomnachi



In case you missed it, this blog post is a recap of Cassandra Lunch #30, covering the basics of using Spark and Cassandra together. We discuss the advantages of each, then the advantages of using them together, as well as potential drawbacks and the configuration choices that avoid them. The live recording of Cassandra Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Cassandra & Spark Separately

In Common

Both Cassandra and Spark are proven open-source technologies. Both can be deployed to physical machines, virtual machines, containers, and cloud providers, either uniformly, with an entire cluster on a single deployment method, or hybrid, spread across two or more. Both are massively distributable and scalable. Each also has an SQL-like interface: CQL for Cassandra and SparkSQL for Spark.
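As a rough illustration of that parity, the sketch below runs the same logical query through both interfaces. The store.orders table, its columns, and the contact point are hypothetical, and it assumes the cassandra-driver package plus a Spark build with the Spark Cassandra Connector on the classpath:

```python
# Sketch: the same logical query via CQL and SparkSQL.
# The store.orders table and local contact point are hypothetical.
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession

# CQL: Cassandra's native query interface
session = Cluster(["127.0.0.1"]).connect("store")
rows = session.execute(
    "SELECT order_id, total FROM orders WHERE customer_id = 42")

# SparkSQL: the equivalent query over the same table exposed to Spark
spark = SparkSession.builder.getOrCreate()
(spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="store", table="orders")
      .load()
      .createOrReplaceTempView("orders"))
df = spark.sql("SELECT order_id, total FROM orders WHERE customer_id = 42")
```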

Cassandra

Cassandra is a distributed NoSQL datastore. It offers multi-region, masterless data replication and is optimized as a data transport/communication engine.
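That masterless, multi-region replication is configured per keyspace at the schema level. A minimal sketch, assuming hypothetical datacenter names dc_us and dc_eu that would have to match your actual cluster topology:

```python
# Sketch: per-keyspace, multi-datacenter replication via CQL.
# Datacenter names (dc_us, dc_eu) and replication factors are hypothetical.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS store
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_us': 3,
        'dc_eu': 3
    }
""")
```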

Spark

Spark is a distributed analytics engine with APIs for Python, Scala, Java, and R. It can perform both batch and stream processing.
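A minimal sketch of that batch/stream symmetry in PySpark, where the input path and its JSON layout are hypothetical placeholders:

```python
# Sketch: the same Spark API family handles batch and streaming reads.
# The /data/events/ path and event_type column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Batch: read the dataset once as a static DataFrame
batch_df = spark.read.json("/data/events/")

# Streaming: treat files arriving in the same directory as a stream
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events/")
query = (stream_df.groupBy("event_type").count()
         .writeStream.outputMode("complete")
         .format("console")
         .start())
```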

Cassandra & Spark Together

The first method for using Spark and Cassandra together that we discussed is a translytical setup: a multi-datacenter Cassandra deployment in which one datacenter is used purely for transactions (reads, writes, client interactions). Data is replicated to both datacenters, and the second is connected to Spark and used for analytics.
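On the Spark side, such a setup can be pinned to the analytics datacenter through Spark Cassandra Connector settings. A sketch, assuming a hypothetical host address and a datacenter named analytics_dc; spark.cassandra.connection.localDC is the connector setting that keeps traffic within the named datacenter:

```python
# Sketch: pointing Spark at the analytics datacenter of a two-DC cluster.
# The host 10.0.2.10 and DC name "analytics_dc" are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "10.0.2.10")
         .config("spark.cassandra.connection.localDC", "analytics_dc")
         .getOrCreate())

orders = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="store", table="orders")  # hypothetical table
          .load())
```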

In the second method, we pull Cassandra data into Spark as one of Spark’s core data types (RDDs, Datasets, and DataFrames) and rely on data locality to keep operations fast. Data locality here means that the Spark node connected to each Cassandra node knows, via token ranges, which data that node is responsible for, and the workload over the Cassandra data is distributed accordingly.
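From the DataFrame side this mapping is visible directly: each input partition the connector creates corresponds to one or more token ranges, so tasks can be scheduled on or near the replica that owns them. A sketch, with hypothetical keyspace and table names:

```python
# Sketch: Cassandra data arriving in Spark as a DataFrame whose
# partitions follow token ranges. store.orders is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="store", table="orders")
      .load())

# Each input partition maps to one or more token ranges, which is
# what lets Spark place work next to the owning Cassandra node.
print(df.rdd.getNumPartitions())
```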

The last method makes Cassandra data available through SparkSQL. When SparkSQL is passed standard CQL-style statements, it can apply filters based on native indices (the primary key), secondary indices, and storage-attached indices such as SASI or SAI. It can use those same indices while also filtering by token range, as in the data locality example above. We can also use a Spark feature called predicate pushdown to filter by arbitrary tokens inside an index, alongside all of these filtering methods.
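One way to see which filters actually reach Cassandra is Spark's query plan: predicates the connector pushes down appear under "PushedFilters" in the physical plan. A sketch, again with a hypothetical store.orders table whose partition key is customer_id:

```python
# Sketch: checking which predicates get pushed down to Cassandra.
# Assumes a hypothetical store.orders table keyed by customer_id.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="store", table="orders")
      .load())

# A filter on the partition key can be served by Cassandra itself;
# the physical plan lists it as a pushed filter when pushed down.
df.filter(df.customer_id == 42).explain()
```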

If not set up carefully, problems can arise in a combined Spark and Cassandra environment. Cassandra data skews spill over into Spark, becoming Spark data skews and compute skews, in which individual nodes end up with an outsized share of the data or the computational workload. Compute skews can also arise from poor functional programming practices.
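A quick way to spot a data skew before it becomes a compute skew is to count rows per partition key from Spark; a sketch using the same hypothetical table and key:

```python
# Sketch: surfacing Cassandra partition-key skew from Spark.
# A handful of keys holding most rows will float to the top.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="store", table="orders")
      .load())

(df.groupBy("customer_id")        # hypothetical partition key
   .count()
   .orderBy(F.desc("count"))
   .show(10))                     # the heaviest keys indicate skew
```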

Resources

Architecting & Managing a Global Data & Analytics Platform Part 1: Foundation of a Business Data, Computing, & Communication Framework – Business Platform Team
Spark + Cassandra, All You Need to Know: Tips and Optimizations | by Javier Ramos | Nov, 2020 | ITNEXT
Cassandra and Spark: Optimizing for Data Locality – Databricks
Spark And Cassandra: 2 Fast, 2 Furious – Databricks
Cassandra and SparkSQL: You Don’t Need Functional Programming for Fun…
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spit…
Spark and Cassandra: An Amazing Apache Love Story – Databricks
SparkCassandraLocality.key
Cassandra and Spark, closing the gap between no sql and analytics c…

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
