This blog is the second part of a series “Spark — the Pragmatic bits”. Get the full overview here.
In my recent blog post, I explored a few cases where using Cassandra and Spark together can be useful. My focus was on the functional behaviour of such a stack and what you need to do as a developer to interact with it. However, I did not describe the infrastructure needed to run such Spark code, or any deployment considerations. In this post, I will explore this in more detail and offer some practical advice on how to deploy Spark and Apache Cassandra.
The simplest and most obvious choice is to get DataStax Enterprise which contains Cassandra, Spark with a highly available Spark Master, and many more components bundled up. DataStax has good documentation about how to install and configure their solution: https://docs.datastax.com/en/latest-dse/
If you are already a user of DSE or considering adopting it, this is definitely the way to go.
On the other hand, the components needed to configure such a setup from scratch are all available as open source software. Going through the process also helps in understanding how DSE ultimately works under the hood.
To do this, the following components are needed:
Be careful about the various versions of frameworks and libraries. There is a good “version compatibility” matrix on the GitHub wiki of the Spark-Cassandra connector. At the time of writing, the following versions were used:
- Cassandra 3.10
- Scala 2.11.8
- Spark 2.1.0
- Spark-Cassandra connector 2.0.0 (for Scala 2.11)
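As a quick way to check that these versions fit together, the connector can be pulled into an interactive Spark shell by its Maven coordinates. The snippet below is only a sketch: it assumes `spark-shell` is on the `PATH`, that Maven Central is reachable, and that `10.0.10.73` (one of the example nodes used later in this post) is a valid Cassandra contact point. The function name `launch_spark_shell` is purely illustrative.

```shell
#!/usr/bin/env bash
# Versions taken from the list above. The artifact name carries the Scala
# *binary* version (2.11), not the full 2.11.8.
SCALA_BINARY_VERSION="2.11"
CONNECTOR_VERSION="2.0.0"
CONNECTOR_PACKAGE="com.datastax.spark:spark-cassandra-connector_${SCALA_BINARY_VERSION}:${CONNECTOR_VERSION}"

# Assumption: one reachable Cassandra node to use as the initial contact point.
CASSANDRA_HOST="10.0.10.73"

launch_spark_shell() {
  # --packages resolves the connector and its dependencies at startup.
  spark-shell \
    --packages "${CONNECTOR_PACKAGE}" \
    --conf "spark.cassandra.connection.host=${CASSANDRA_HOST}"
}

echo "Connector coordinates: ${CONNECTOR_PACKAGE}"
```

Calling `launch_spark_shell` then starts the shell with the connector on the classpath. If the Scala suffix or connector version does not match the Spark build, dependency resolution or the first Cassandra call will typically fail, which makes this a cheap compatibility check.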
In order to keep things simple for this blogpost, I am going to use IP addresses alone to configure everything. In any production deployment you should consider using DNS names instead.
Let’s take a look at the process step-by-step.
First, we are going to need a running Cassandra cluster. The latest version is available for download from the Apache Cassandra website: https://cassandra.apache.org/download/. We start with a simple Cassandra deployment, which needs no extra configuration (see the “Isolating workloads” section for more production-like recommendations).
Having configured and started the cluster, we can use nodetool to check its state, which should look as follows:
[centos@ip-10-0-10-73 ~]$ nodetool status
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.0.12.48 266.17 KiB 256 32.2% 01e17441-61c3-4179-a9aa-cbe1795a0259 rack1
UN 10.0.11.65 272.64 KiB 256 34.4% 99f6d8c8-8c01-46c4-99b9-665b8bb98b4c rack1
UN 10.0.10.73 247.37 KiB 256 33.4% 1d6edc64-f625-4613-b8b9-f4d0131aa9ff rack1
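Each row starts with a two-letter code: the first letter is the status (U for Up, D for Down) and the second is the state (N for Normal, L for Leaving, J for Joining, M for Moving). Before layering Spark on top, it can be handy to script this check. The helper below is a minimal sketch with a made-up name (`count_unhealthy_nodes`); it reads `nodetool status` output from standard input and counts the nodes that are not `UN`.

```shell
#!/usr/bin/env bash
# Count nodes whose status/state code is anything other than UN (Up/Normal).
# Reads `nodetool status` output from stdin, so it can be tried without a cluster.
count_unhealthy_nodes() {
  # Only rows whose first field is a two-letter status/state code are nodes;
  # header lines such as "-- Address Load ..." are ignored.
  awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { n++ } END { print n + 0 }'
}

# Hypothetical usage on a Cassandra node:
#   unhealthy=$(nodetool status | count_unhealthy_nodes)
#   [ "$unhealthy" -eq 0 ] || echo "WARNING: $unhealthy node(s) not Up/Normal"
```

For the three-node cluster shown above, every row starts with `UN`, so the helper would report zero unhealthy nodes.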
To run Spark jobs in a distributed fashion, we are going to need some kind of processing cluster overlaid on the Cassandra nodes. Spark is to a certain degree agnostic to what resource management framework we use: out of the box it can run on Mesos, YARN or its own standalone cluster manager. The simplest approach is to just use a standalone Spark cluster, as it’s easy to set up and will do the job.
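To give a feel for how little is involved, a standalone cluster can be brought up with the scripts shipped in Spark's `sbin` directory. The following is a rough sketch, not a hardened setup: the `SPARK_HOME` default, the choice of `10.0.10.73` as the master node, and the function names are all assumptions made for illustration. Port 7077 is the standalone master's default.

```shell
#!/usr/bin/env bash
# Assumptions (not from the post): install location, master node, default port.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
MASTER_IP="10.0.10.73"
MASTER_URL="spark://${MASTER_IP}:7077"

start_master() {
  # Run once, on the node chosen as the Spark Master.
  "${SPARK_HOME}/sbin/start-master.sh"
}

start_worker() {
  # Run on every Cassandra node that should also execute Spark tasks.
  # The worker registers itself with the master at MASTER_URL.
  "${SPARK_HOME}/sbin/start-slave.sh" "${MASTER_URL}"
}

echo "Workers will register with ${MASTER_URL}"
```

Running `start_master` on one node and `start_worker` on each Cassandra node places a Spark worker next to every Cassandra instance, which is what allows Spark tasks to read data from the local node.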
The diagram below shows a running Spark job along with the major components involved: