
How to Execute Spark Code on Spark Shell With Cassandra

by Piyush Rana, 2/27/2020



In this blog, we will see how to execute Spark code on the Spark shell against Cassandra. This is very handy for testing and learning, when it is quicker to run code interactively in a shell than to set up a project in an IDE.

Here, we will use Spark v1.6.2. You can download that version from here, and its matching Spark Cassandra connector, spark-cassandra-connector_2.10-1.6.2.jar, can be downloaded from here.

So, let's begin with an example.

Create a test table in your Cassandra cluster (I am using Cassandra v3.0.10).
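The table lives in the test_smack keyspace. If that keyspace does not exist yet, create it first. This is a minimal sketch for a single-node development setup; adjust the replication settings for a real cluster:

CREATE KEYSPACE IF NOT EXISTS test_smack
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};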

CREATE TABLE test_smack.movies_by_actor(
 actor text,
 release_year int,
 movie_id uuid,
 genres set<text>,
 rating float,
 title text,
 PRIMARY KEY(actor, release_year, movie_id)
) WITH CLUSTERING ORDER BY(release_year DESC, movie_id ASC);

Insert some test data (run USE test_smack; first, since these statements use the unqualified table name):

INSERT INTO movies_by_actor(actor, release_year, movie_id, genres, rating, title) VALUES('Johnny Depp', 2010, now(), {'Drama', 'Thriller'}, 7.5, 'The Tourist');
INSERT INTO movies_by_actor(actor, release_year, movie_id, genres, rating, title) VALUES('Johnny Depp', 2011, now(), {'Animated', 'Comedy'}, 8.5, 'Rango');
INSERT INTO movies_by_actor(actor, release_year, movie_id, genres, rating, title) VALUES('Johnny Depp', 2012, now(), {'Crime', 'Dark Comedy'}, 6.5, 'Dark Shadows');
INSERT INTO movies_by_actor(actor, release_year, movie_id, genres, rating, title) VALUES('Johnny Depp', 2013, now(), {'Adventurous', 'Thriller'}, 9.5, 'Transcendence');
INSERT INTO movies_by_actor(actor, release_year, movie_id, genres, rating, title) VALUES('Johnny Depp', 2013, now(), {'Adventurous', 'Thriller'}, 6.5, 'The Lone Ranger');
INSERT INTO movies_by_actor(actor, release_year, movie_id, genres, title) VALUES('Johnny Depp', 2014, now(), {'Thriller'}, 'Black Mass');
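To sanity-check the data before moving to Spark, you can query the table from cqlsh. Because of the clustering order we declared, rows for an actor come back newest release year first:

SELECT release_year, title, rating FROM test_smack.movies_by_actor WHERE actor = 'Johnny Depp';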

Go to the path where you keep your Spark binaries (e.g., Desktop/spark-1.6.2-bin-hadoop2.6/bin) and start the Spark shell, including the connector JAR we downloaded above.

$ sudo ./spark-shell --jars /PATH_TO_YOUR_CASSANDRA_CONNECTOR/spark-cassandra-connector_2.10-1.6.2.jar
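As an aside, spark-shell also accepts configuration flags on the command line, so you can point the connector at Cassandra when the shell starts and skip recreating the context by hand. The in-shell steps below still work either way:

$ sudo ./spark-shell --jars /PATH_TO_YOUR_CASSANDRA_CONNECTOR/spark-cassandra-connector_2.10-1.6.2.jar --conf spark.cassandra.connection.host=localhost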

When you start Spark via the Spark shell, it creates a SparkContext named sc by default.

Now, we need to stop that default context and create a new one that knows where Cassandra is running:

  sc.stop
  import com.datastax.spark.connector._
  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._
  import org.apache.spark.SparkConf
  val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
  // localhost is the address of your Cassandra node
  val sc = new SparkContext(conf)
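As a quick smoke test of the connection, a one-liner can count the rows in our table through the connector (this simply reads the table we created above):

  sc.cassandraTable("test_smack", "movies_by_actor").count()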

It's all done! Now you can query your database and play with the results. For example, let's count the number of Johnny Depp movies for each year:

sc.cassandraTable("test_smack", "movies_by_actor").select("release_year").as((year: Int) => (year, 1)).reduceByKey(_ + _).collect.foreach(println)

Output:

(2010,1)
(2012,1)
(2013,2)
(2011,1)
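The connector also works in the other direction: you can persist an RDD back to Cassandra with saveToCassandra. A minimal sketch, assuming you first create a summary table in cqlsh (the movies_per_year table below is hypothetical, not part of the original schema):

  // Hypothetical target table, created beforehand in cqlsh:
  // CREATE TABLE test_smack.movies_per_year(release_year int PRIMARY KEY, movie_count int);
  val countsByYear = sc.cassandraTable("test_smack", "movies_by_actor")
    .select("release_year")
    .as((year: Int) => (year, 1))
    .reduceByKey(_ + _)

  // Each (release_year, movie_count) tuple maps to one row of the target table.
  countsByYear.saveToCassandra("test_smack", "movies_per_year", SomeColumns("release_year", "movie_count"))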
