Cassandra.Link

4/28/2021

Apache Cassandra Lunch #46: Apache Spark Jobs in Scala for Cassandra Data Operations - Business Platform Team

by John Doe

In Apache Cassandra Lunch #46, we discuss how to use Apache Spark jobs written in Scala to perform Cassandra data operations. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a live demo, is embedded below in case you were not able to attend. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

The walkthrough below shows how to run Apache Spark jobs that perform a few Cassandra data operations. Also check out this blog for an additional walkthrough covering other Cassandra data operations that we did not cover in this Apache Cassandra Lunch session.

Walkthrough

In this walkthrough, we will run a few different Spark jobs that perform ETL operations on Cassandra data. You can follow along on this blog, or check out this GitHub repo and follow along with the README.md there.

Prerequisites

  • Docker
  • sbt
  • Apache Spark 3.0.x

1. Build Fat JAR

1.1 – Clone repo and cd into it

git clone https://github.com/Anant/example-cassandra-spark-job-scala.git
cd example-cassandra-spark-job-scala

1.2 – Start the sbt shell in the project directory

sbt

1.3 – Run assembly in the sbt shell

assembly
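For readers who want to see what the assembly step relies on, here is a minimal build.sbt sketch. It is not the repo's actual build file. The project name and version are inferred from the JAR path used later in this walkthrough, while the Scala patch version, the Spark version, and the connector version are assumptions chosen to match Spark 3.0.x. The sbt-assembly plugin line is shown as a comment because it belongs in project/plugins.sbt.

```scala
// build.sbt (sketch; versions below are assumptions, not taken from the repo)
name := "example-cassandra-spark-job-scala"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  // Spark is supplied by the cluster at runtime, so keep it out of the fat JAR.
  "org.apache.spark" %% "spark-sql" % "3.0.1" % "provided",
  // The connector is not part of Spark, so it must be bundled into the fat JAR.
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0"
)

// project/plugins.sbt would need a line such as (version is an assumption):
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
```

With a build like this, running assembly should produce target/scala-2.12/example-cassandra-spark-job-scala-assembly-0.1.0-SNAPSHOT.jar. If assembly reports duplicate-file conflicts, a merge strategy may need to be configured.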

2. Navigate to Spark Directory and Start Spark

2.1 – Start Master

./sbin/start-master.sh

2.2 – Get Master URL

Open localhost:8080 in a browser and copy the master URL shown at the top of the page (it begins with spark://).

2.3 – Start Worker

./sbin/start-slave.sh <master-url>

3. Start Apache Cassandra Docker Container

docker run --name cassandra -p 9042:9042 -d cassandra:latest

3.1 – Run cqlsh

docker exec -it cassandra cqlsh

3.2 – Create demo keyspace

CREATE KEYSPACE demo WITH REPLICATION={'class': 'SimpleStrategy', 'replication_factor': 1};

4. Read Spark Job

In this job, we will read a CSV with 100,000 records and load it into a DataFrame. Once it is read, we will display the first 20 rows.

./bin/spark-submit --class sparkCassandra.Read \
--master <master-url> \
--files /path/to/example-cassandra-spark-job-scala/previous_employees_by_title.csv \
/path/to/example-cassandra-spark-job-scala/target/scala-2.12/example-cassandra-spark-job-scala-assembly-0.1.0-SNAPSHOT.jar
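As a rough idea of what a job like sparkCassandra.Read might contain, here is a minimal sketch. It is not the code from the repo. The object name ReadSketch and the helper readCsv are hypothetical, and the sketch assumes the CSV has a header row. It resolves the file passed with --files through SparkFiles, which works when the driver and the worker share a filesystem, as in the single-machine setup above.

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadSketch {
  // Loads a CSV into a DataFrame (assumes the first line is a header row).
  def readCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit's --master flag.
    val spark = SparkSession.builder().appName("ReadSketch").getOrCreate()

    // Files shipped with --files are looked up by file name via SparkFiles.
    val path = "file://" + SparkFiles.get("previous_employees_by_title.csv")

    // show() prints the first 20 rows by default.
    readCsv(spark, path).show()

    spark.stop()
  }
}
```

Keeping the read logic in a small function makes it easy to try against any local CSV before submitting the job to the cluster.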

5. Manipulate Spark Job

In this job, we will do the same read; however, we will now take the first_day and last_day columns and calculate the absolute difference between them as the number of days worked. We then display the first 20 rows again.

./bin/spark-submit --class sparkCassandra.Manipulate \
--master <master-url> \
--files /path/to/example-cassandra-spark-job-scala/previous_employees_by_title.csv \
/path/to/example-cassandra-spark-job-scala/target/scala-2.12/example-cassandra-spark-job-scala-assembly-0.1.0-SNAPSHOT.jar
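The days-worked calculation described for this job can be sketched with Spark's built-in datediff and abs functions. This is an illustration under assumptions rather than the repo's code: the object name ManipulateSketch and the output column name days_worked are hypothetical, and the dates are assumed to be in yyyy-MM-dd form. The sketch uses a tiny in-memory DataFrame and a local master so it can run without the CSV.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{abs, col, datediff, to_date}

object ManipulateSketch {
  // Adds a days_worked column: the absolute number of days between
  // first_day and last_day, regardless of which date comes first.
  def withDaysWorked(df: DataFrame): DataFrame =
    df.withColumn(
      "days_worked",
      abs(datediff(to_date(col("last_day")), to_date(col("first_day"))))
    )

  def main(args: Array[String]): Unit = {
    // local[*] is set here only so the sketch runs on its own;
    // a submitted job would normally leave the master to spark-submit.
    val spark = SparkSession.builder()
      .appName("ManipulateSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the CSV data.
    val sample = Seq(
      ("2019-01-01", "2019-01-31"),
      ("2020-03-01", "2020-02-28") // last_day before first_day
    ).toDF("first_day", "last_day")

    withDaysWorked(sample).show()
    spark.stop()
  }
}
```

Taking the absolute value means rows whose dates are swapped still yield a non-negative day count.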

6. Write to Cassandra Spark Job

In this job, we will do the same thing we did in the manipulate job; however, we will now write the resulting DataFrame to Cassandra instead of just displaying it in the console.

./bin/spark-submit --class sparkCassandra.Write \
--master <master-url> \
--conf spark.cassandra.connection.host=127.0.0.1 \
--conf spark.cassandra.connection.port=9042 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
--files /path/to/example-cassandra-spark-job-scala/previous_employees_by_title.csv \
/path/to/example-cassandra-spark-job-scala/target/scala-2.12/example-cassandra-spark-job-scala-assembly-0.1.0-SNAPSHOT.jar
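A write to Cassandra through the Spark Cassandra Connector's DataFrame source might look roughly like the sketch below. It is not the repo's code. The object name WriteSketch, the table name days_worked_sketch, and its two columns are hypothetical. The sketch assumes the demo keyspace created earlier exists and that the connection host and port are supplied through the --conf flags shown above.

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object WriteSketch {
  // Options that tell the connector which keyspace and table to target.
  def cassandraOptions(keyspace: String, table: String): Map[String, String] =
    Map("keyspace" -> keyspace, "table" -> table)

  // Appends the DataFrame's rows to an existing Cassandra table.
  def writeToCassandra(df: DataFrame, keyspace: String, table: String): Unit =
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(cassandraOptions(keyspace, table))
      .mode(SaveMode.Append)
      .save()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WriteSketch").getOrCreate()
    import spark.implicits._

    // Create the (hypothetical) target table if it does not exist yet.
    CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
      session.execute(
        "CREATE TABLE IF NOT EXISTS demo.days_worked_sketch " +
          "(employee_id int PRIMARY KEY, days_worked int)")
    }

    // Hypothetical rows standing in for the manipulated CSV data.
    val df = Seq((1, 30), (2, 2)).toDF("employee_id", "days_worked")
    writeToCassandra(df, "demo", "days_worked_sketch")

    spark.stop()
  }
}
```

After the job finishes, a SELECT against the table in cqlsh should show the written rows.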

7. SparkSQL Spark Job

In this job, we will write the CSV data into one Cassandra table, then read it back using SparkSQL and transform it in the same step. We will then write the newly transformed data into a new Cassandra table.

./bin/spark-submit --class sparkCassandra.ETL \
--master <master-url> \
--conf spark.cassandra.connection.host=127.0.0.1 \
--conf spark.cassandra.connection.port=9042 \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
--files /path/to/example-cassandra-spark-job-scala/previous_employees_by_title.csv \
/path/to/example-cassandra-spark-job-scala/target/scala-2.12/example-cassandra-spark-job-scala-assembly-0.1.0-SNAPSHOT.jar
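The read-transform-write part of such an ETL job can be sketched with a temporary view and a Spark SQL query. This is one possible approach, not necessarily the one the repo uses. The object name EtlSketch, the table names employees_raw_sketch and employees_days_sketch, and the employee_id column are hypothetical, and both tables are assumed to already exist in the demo keyspace.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object EtlSketch {
  private val CassandraFormat = "org.apache.spark.sql.cassandra"

  // Registers the input as a temp view and derives days_worked with Spark SQL.
  def transform(spark: SparkSession, source: DataFrame): DataFrame = {
    source.createOrReplaceTempView("employees_raw")
    spark.sql(
      """SELECT employee_id,
        |       ABS(DATEDIFF(TO_DATE(last_day), TO_DATE(first_day))) AS days_worked
        |FROM employees_raw""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("EtlSketch").getOrCreate()

    // Extract: load the source table from Cassandra.
    val source = spark.read
      .format(CassandraFormat)
      .options(Map("keyspace" -> "demo", "table" -> "employees_raw_sketch"))
      .load()

    // Transform with SQL, then load the result into the second table.
    transform(spark, source).write
      .format(CassandraFormat)
      .options(Map("keyspace" -> "demo", "table" -> "employees_days_sketch"))
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```

Because transform only needs a DataFrame, it can be tried locally on in-memory rows before pointing the job at Cassandra.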

That wraps up the walkthrough on how to do some Cassandra data operations with Apache Spark jobs. Again, check out this blog for more Cassandra data operations that we can do with Apache Spark. As mentioned above, the live recording, which includes a live walkthrough of this demo, is embedded below, so be sure to check it out and subscribe to keep up to date.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
