In this blog post, we discuss how to connect Apache Spark to DataStax Astra. A webinar recording is embedded below if you would rather watch a live demo in which we use Gitpod and Spark-Shell to connect to DataStax Astra. This is Part 1 of the “Apache Spark and DataStax Astra” series; Part 2 will cover how to run a Spark job against our DataStax Astra database.

If you are not familiar with Apache Spark, it is a unified analytics engine for large-scale data processing. Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. It offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R, and SQL shells. For our demo, we use spark-shell with Scala. Apache Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and you can combine these libraries in the same application. You can run Spark in its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. We will run our instance in standalone cluster mode with one worker. Spark can also access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources. In our case, we will access Cassandra data that lives in the cloud in the form of DataStax Astra.

If you have been following us here at Anant, you know that we have been working with DataStax Astra for some time. If you are not familiar with it, DataStax Astra is a cloud-native Cassandra-as-a-Service built on Apache Cassandra. It eliminates the overhead of installing, operating, and scaling Cassandra, and it offers a 5 GB free tier with no credit card required, so it is a perfect way to get started with Cassandra in the cloud or simply experiment with it.

Now we will discuss how to connect Apache Spark to DataStax Astra using Spark-Shell. We will use Gitpod as our dev environment so that you can replicate this task without having to worry about OS incompatibilities. You can open this repo: https://github.com/adp8ke/Apache-Spark-and-DataStax-Astra in Gitpod by going to https://gitpod.io/#https://github.com/adp8ke/Apache-Spark-and-DataStax-Astra. Otherwise, you can go to the repository on GitHub and get started there. Once the workspace is open, run cd Connect and follow the instructions in the README.md to download Apache Spark. After downloading and extracting Apache Spark, follow the instructions in the README.md to start Spark in standalone cluster mode with one worker.
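
The README.md in the repository remains the authoritative guide for these steps. As a rough sketch of what starting a standalone master and a single worker can look like with the scripts that ship with Spark 3.0 (sbin/start-master.sh and sbin/start-slave.sh), the snippet below assumes that SPARK_HOME points at the extracted Spark directory and that the master listens on the default port 7077. The SPARK_HOME and MASTER_URL variables and the start_standalone function are names made up for this example, not taken from the README.

```shell
#!/bin/sh
# Sketch only: start a standalone master and one worker.
# SPARK_HOME and MASTER_URL are assumed names; adjust them to your setup.
SPARK_HOME="${SPARK_HOME:-$PWD}"
# Guess at the master URL; the exact value is shown in the master web UI.
MASTER_URL="${MASTER_URL:-spark://$(hostname):7077}"

start_standalone() {
  if [ ! -x "$SPARK_HOME/sbin/start-master.sh" ]; then
    echo "Spark not found under $SPARK_HOME; set SPARK_HOME first" >&2
    return 1
  fi
  # Start the master; its web UI is served on port 8080 by default.
  "$SPARK_HOME/sbin/start-master.sh"
  # Start one worker and register it with the master.
  # In Spark 3.0 the script is start-slave.sh (start-worker.sh arrived in 3.1).
  "$SPARK_HOME/sbin/start-slave.sh" "$MASTER_URL"
}

start_standalone || echo "standalone cluster not started"
```

If the guessed MASTER_URL does not match what the master web UI reports, export the correct value before running the script.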

Now, if you do not already have a DataStax Astra database, you can follow the set-up instructions in the README.md for https://github.com/Anant/cassandra.api up to Step 1.4, “Download Secure Connect Bundle”. Once you have downloaded the Astra Secure Connect Bundle, drag-and-drop it into the “Connect” directory in Gitpod.

While we have DataStax Astra open, let’s go ahead and download this notebook file. Once the notebook is downloaded, open Studio on DataStax Astra and then drag-and-drop the notebook file into Studio. Now, we can run the cells in the notebook and set up our Astra database for when we connect to it using Apache Spark. When running the cells, do not forget to select your keyspace.

As a note, if you do additional research on this topic, you may find this article: https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all. It mentions that “Spark Cassandra Connector 2.5.0 fully supports DataStax Astra”, and if you look at the Spark-Cassandra-Connector on GitHub, the 2.5.0 version is only compatible with Apache Spark 2.4. This raised the question of whether we could also connect to Astra with Apache Spark 3.0, which is compatible with Spark-Cassandra-Connector 3.0. After testing the method from the article, I tested the same method with Apache Spark 3.0 and Spark-Cassandra-Connector 3.0, and it worked fine as well. As you can see below, we use Apache Spark 3.0 and Spark-Cassandra-Connector 3.0 to connect our Spark-Shell to our DataStax Astra table.

Now, we can run our Spark-Shell and connect to our Astra database. We will need to run the following after running cd spark-3.0:

./bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
--master {master-url} \
--files /workspace/Apache-Spark-and-DataStax-Astra/Connect/secure-connect-{your-db-name}.zip \
--conf spark.cassandra.connection.config.cloud.path=secure-connect-{your-db-name}.zip \
--conf spark.cassandra.auth.username={username} \
--conf spark.cassandra.auth.password={password} \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions

Don’t forget to substitute your own database name, username, and password in the --files and --conf options. If you are not doing this on Gitpod, the --files value is {path}/secure-connect-{your-db-name}.zip. To find the master-url in Gitpod, open port 8080 in a new browser tab; the Spark master web UI shows the URL near the top of the page.
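
Because the same placeholders appear in several options, it can help to fill them in once and let a small script assemble the command. The following is a minimal sketch of that idea, not part of the repository: the variable names (MASTER_URL, BUNDLE_DIR, DB_NAME, ASTRA_USERNAME, ASTRA_PASSWORD) and the astra_spark_shell function are hypothetical, and the default values are stand-ins that you would replace with your own.

```shell
#!/bin/sh
# Sketch only: fill in the placeholders once, then build the spark-shell call.
# Every variable name below is an assumption made for this example.
MASTER_URL="${MASTER_URL:-spark://localhost:7077}"
BUNDLE_DIR="${BUNDLE_DIR:-/workspace/Apache-Spark-and-DataStax-Astra/Connect}"
DB_NAME="${DB_NAME:-your-db-name}"
ASTRA_USERNAME="${ASTRA_USERNAME:-username}"
ASTRA_PASSWORD="${ASTRA_PASSWORD:-password}"

# Pass "run" to launch spark-shell; anything else only prints the arguments.
astra_spark_shell() {
  mode="${1:-print}"
  bundle="secure-connect-${DB_NAME}.zip"
  # --files ships the bundle to the cluster; cloud.path refers to it by name.
  set -- ./bin/spark-shell \
    --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
    --master "$MASTER_URL" \
    --files "${BUNDLE_DIR}/${bundle}" \
    --conf "spark.cassandra.connection.config.cloud.path=${bundle}" \
    --conf "spark.cassandra.auth.username=${ASTRA_USERNAME}" \
    --conf "spark.cassandra.auth.password=${ASTRA_PASSWORD}" \
    --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
  if [ "$mode" = "run" ]; then
    "$@"
  else
    # One argument per line; note that this also prints the password.
    printf '%s\n' "$@"
  fi
}

astra_spark_shell print
```

Run it from the Spark directory with the variables exported, and change the last line to astra_spark_shell run once the printed arguments look right. Keeping each value in a quoted variable also avoids surprises when a password contains spaces or shell metacharacters.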

Once the shell has started, you should see something like this:

[Screenshot: started Spark shell]

Now we can run:

import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("leaves", "{your-keyspace}").load
data.printSchema
data.show

You should see something like this:

[Screenshot: spark.read result]

We could also run:

spark.conf.set("spark.sql.catalog.mycatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.sql("SHOW TABLES FROM mycatalog.{your-keyspace};").show
spark.sql("use mycatalog.{your-keyspace};")
spark.sql("select * from leaves").show

This would show something like this:

[Screenshot: spark.sql result]
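
As an alternative to calling spark.conf.set inside the shell, the catalog can be registered when spark-shell is launched by passing the same key and value as one more --conf option. The sketch below only builds and prints that extra option; CATALOG_NAME and CATALOG_CONF are hypothetical variable names, and mycatalog simply mirrors the catalog name used in this post.

```shell
#!/bin/sh
# Sketch only: extra spark-shell option that registers the catalog at launch.
# CATALOG_NAME and CATALOG_CONF are assumed names for this example.
CATALOG_NAME="${CATALOG_NAME:-mycatalog}"
CATALOG_CONF="spark.sql.catalog.${CATALOG_NAME}=com.datastax.spark.connector.datasource.CassandraCatalog"

# Append the printed option to the spark-shell command shown earlier.
printf '%s\n' "--conf ${CATALOG_CONF}"
```

With the option in place, the spark.sql statements above should work without the spark.conf.set line.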

And that wraps up Part 1 of our “Apache Spark and DataStax Astra” series. In Part 2, we will cover running an Apache Spark job against our DataStax Astra database, using Gitpod, Spark-Submit, SBT, and Scala. As mentioned earlier, a webinar recording is embedded below where you can watch this demo live. Don’t forget to like and subscribe while you are there!
