This tutorial goes through the steps required to install Cassandra and Spark on a Debian system, and shows how to get them to play nice via Scala. Spark and Cassandra exist for the sake of Big Data applications, and as such they are intended for installation on a cluster of computers, possibly spread over multiple geographic locations. This tutorial, however, deals with a single-computer installation. The aim is to give you a starting point from which to configure your cluster for your specific application, and to give you a few ways to make sure your software is running correctly.
Major Components
This tutorial touches on quite a few different technologies. Below is a little description for each of the major ones we'll be dealing with here.
Spark
Spark has been described as the swiss army knife of big data, but what does that mean? Spark started off as a replacement for Hadoop, and Hadoop is a sort of industry standard tool for doing large scale distributed map-reduce calculations. Hadoop solved a bunch of problems - it took one kind of algorithm and turned it into a distributed production line thus creating an efficient and robust system for solving very specific kinds of problems. Hadoop was initially a tool (well, actually it was first a small yellow elephant, then it was a tool), and then the word started being used to refer to an ecosystem of compatible tools.
Hadoop is awesome. But it has its problems. Firstly, it can be a bit horrible to use. Writing map-reduce code can be tedious and leaves a lot of room for error - not everyone can do it. A few scripting tools such as Apache Pig have been developed in order to abstract away from this, but it's still a problem. Besides that, it only does map-reduce - there are many big data problems that simply cannot fit into that paradigm (or whatever you want to call it). Also, it's slow. Slower than it could be, anyway - the technical details are beyond the scope of this tutorial.
Spark is a relatively recent addition to the Hadoop ecosystem and works to solve a few of the problems of vanilla Hadoop. For one thing, it is easier to use, allowing users to specify map-reduce jobs by simply opening up a shell and writing code that is generally readable, maintainable and quick to write. Users can thus execute ad-hoc queries or submit larger jobs. Spark aims to make better use of system resources, especially RAM, and is touted as being 10 times faster than Hadoop for some workloads. Spark also does stuff that doesn't fall into the map-reduce way of thinking - for example, it allows for iterative processing, something vanilla Hadoop is ill-suited for. Spark also works with any Hadoop-compatible storage, which means converting from Hadoop to Spark isn't quite as hideous as it could be. On top of all this, Spark is one of the Apache Foundation's top projects. And that is friggin awesome on its own.
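To make that concrete, here is a rough sketch of what a map-reduce style job looks like in the Spark shell (which we'll install later in this tutorial). The input path is a hypothetical text file on your machine, and sc is the SparkContext that the shell provides:

// count word occurrences in a text file - a minimal sketch, assuming /tmp/words.txt exists
val counts = sc.textFile("/tmp/words.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// show the ten most common words
counts.sortBy(_._2, ascending = false).take(10).foreach(println)

Compare that with the amount of Java you would need to express the same thing as a vanilla Hadoop map-reduce job.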
Cassandra
Cassandra is a distributed database based on Google's BigTable and Amazon's Dynamo. Like other Big Data databases, it allows for malleable data structures. One of Cassandra's coolest features is the fact that it scales in a predictable way - every single node on a Cassandra cluster has the same role and works in the same way; there is no central gatekeeper node through which all traffic must pass, or anything like that. No single node is special, so no single node becomes a bottleneck for overall cluster performance, and there is no single point of failure. And that is pretty wonderful.
Besides that, Cassandra provides many of the same guarantees as other Hadoop-y databases - it works on a cluster and can span multiple data centers; it can talk to Hadoop and Spark in such a way as to maintain data-locality mechanisms (although the importance of this aspect is questionable); the format of the data it stores is malleable to an extent; it provides robust storage... all the good stuff.
Scala
Scala is a programming language that runs on the JVM. Spark talks to many different languages, but the Spark-Cassandra connector we are going to use likes Scala best. Scala is cool for a bunch of reasons: the fact that it runs on the JVM means that Scala components can be incorporated into Java software, and Java components into Scala. Besides that, it is a lot faster to write than Java, removing the need for a lot of Java's annoying boilerplate.
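As a tiny illustration of that conciseness, here is some plain Scala you could paste into a Scala REPL (nothing Spark-specific going on here):

// a complete data class in one line - the Java equivalent needs a constructor,
// getters, equals, hashCode and toString
case class Sensor(id: String, reading: Double)

val readings = List(Sensor("a", 1.5), Sensor("b", 2.5))
println(readings.map(_.reading).sum) // prints 4.0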
Installation
Now that we've covered the basics, it's time to install this stuff. You'll need sudo access. The steps that follow work fine on a fresh Ubuntu 14.04 server.
Let's get cracking.
Install Prerequisites
First up we'll need Java installed. Oracle Java 7 is the most stable version to use in this setup at the time of writing. This expects a fresh system without any nasty unwanted Java bits and pieces installed. Open up a terminal and type this stuff in:
sudo apt-get install software-properties-common
sudo apt-add-repository ppa:webupd8team/java
You may be asked to press [ENTER] at this point.
sudo apt-get update
sudo apt-get install oracle-java7-installer
Agree to the license if you dare. Now just check that it was installed correctly:
java -version
This should execute without error and output some text that looks somewhat like:
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
Installing Cassandra
There are other ways to do this, but this is the simplest.
First we add the DataStax community repository, and tell your system that it is trusted:
echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
Next up we do our install:
sudo apt-get update
sudo apt-get install dsc21=2.1.9-1 cassandra=2.1.9 cassandra-tools=2.1.9
Once the installation is complete, Cassandra will be running with a default configuration and some default system data. If you want to change any major configuration of your cluster then it would be best to do so before continuing. Configuring a Cassandra cluster is beyond the scope of this text.
Stopping and starting a node's Cassandra service can be achieved like so:
sudo service cassandra stop
sudo service cassandra start
And to see if your Cassandra cluster is up and running use the following command:
nodetool status
This should output something like:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  104.55 KB  256     100.0%            a6f88d74-d436-4669-9065-6854543598b3  rack1
CQL
CQL is Cassandra's version of SQL. Once you have successfully installed Cassandra, type cqlsh to enter the CQL shell. This is not a full CQL tutorial; you can run this stuff to verify that everything you've done so far works.
Open up a shell and try this out:
// Create a keyspace
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };

// create a table in that keyspace
USE "test";
CREATE TABLE my_table(key text PRIMARY KEY, value int);

// store some goodies
INSERT INTO my_table(key, value) VALUES ('key1', 1);
INSERT INTO my_table(key, value) VALUES ('key2', 2);

// and retrieve them again
SELECT * from my_table;
There shouldn't be any errors.
Installing Spark
This one is nice and straightforward...
Download Spark and decompress it:
wget http://apache.is.co.za//spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
tar xvzf spark-1.4.1-bin-hadoop2.6.tgz
You can move that to wherever you like. Now let's test it out. Open up a Spark shell by doing a cd into your Spark directory and then:
bin/spark-shell
This will take a few seconds and there will be a lot of log output. You'll eventually be presented with a Scala prompt. Now let's get Spark to do a calculation for us:
sc.parallelize( 1 to 50 ).sum()
This will eventually output the result 1275.
The Spark Cassandra Connector
Spark doesn't natively know how to talk to Cassandra, but its functionality can be extended through the use of connectors. Lucky for us, the nice people at DataStax have produced one, and it is available for download from GitHub.
You could install git and do a clone. Like so:
sudo apt-get install git
git clone https://github.com/datastax/spark-cassandra-connector.git
If you have no idea what that sentence means then you can follow this link and download the latest zip, and unzip it. You can also spend some time learning about Git.
Once it's cloned we'll need to build it:
cd spark-cassandra-connector
git checkout v1.4.0
./sbt/sbt assembly
Note that we are checking out v1.4.0. At the time of writing the latest version has some build issues. This one works.
Go get yourself a cup of tea. You've earned it. Maybe two cups - this is going to take a while.
When the build is finally finished there will be two jar files in a directory named "target", one for Scala and one for Java. We are interested in the Scala one. It's good to have the jar accessible via a path that is easy to remember. For now let's copy it to your home directory - you can put it wherever you want really.
cp spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.4.0-SNAPSHOT.jar ~
Using The Connector
Now we have all the bits and pieces sorted out. Start the spark shell again (from within your spark directory), but this time load up the jar:
bin/spark-shell --jars ~/spark-cassandra-connector-assembly-1.4.0-SNAPSHOT.jar
Now enter the following at the scala prompt:
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
This stops the default Spark context and replaces it with one that is connected to your local Cassandra instance.
Now let's take it for a spin. Remember we made a keyspace called test and a table called my_table? We'll be making use of those again now. Enter the following in the Scala shell:
val test_spark_rdd = sc.cassandraTable("test", "my_table")
test_spark_rdd.first
Isn't that lovely.
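If you want to poke at the connector a little further, here is a rough sketch of a few other things it can do with the same test keyspace and my_table table: selecting columns, pushing a filter down to Cassandra, and writing rows back. Treat it as a starting point rather than gospel - behaviour can vary a little between connector versions:

// read just the value column for a single key - the where clause is pushed down to Cassandra
val one = sc.cassandraTable("test", "my_table")
  .select("value")
  .where("key = ?", "key1")
  .first
println(one)

// write a couple of new rows back into the same table
val newRows = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
newRows.saveToCassandra("test", "my_table", SomeColumns("key", "value"))

The select, where, saveToCassandra and SomeColumns bits all come from the com.datastax.spark.connector._ import we did earlier.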
Conclusion
We've taken a fresh Ubuntu installation, set up Cassandra and Spark, and gotten them to talk. That's quite a feat on its own. But making them talk in a way that is actually useful takes a lot more. Firstly, we've got everything set up on a single computer. It works, but the real strength of these technologies comes from the fact that they are aimed at solving problems within the sphere of big data. They should be installed on a cluster, possibly a multi-data-centre cluster.
Besides installation, there is a lot to be said about how Cassandra actually works - it is a very configurable database and can be optimised for all sorts of workloads. Then there is the topic of schema design and optimisation. The little bit of CQL we covered in this tutorial barely scratches the surface of Cassandra's capabilities. If you want to use it in any serious way it'll be best to spend some time digging into how it works.
Spark also deserves more attention than this tutorial could give it. For example, did you know that you can use a Python-based Spark shell (called PySpark)? Unfortunately, at the time of writing, Python support for the Cassandra connector was called "experimental". Meh. If you want to use Spark in any useful way it would at least be worthwhile to learn about the Spark context, and what can be done with an RDD.
That's all folks.