

2/24/2021

Reading time: 2 min

jparkie/Spark2Cassandra

by jparkie


Spark Library for Bulk Loading into Cassandra


Requirements

Spark2Cassandra supports Spark 1.5 and above.

Spark2Cassandra Version | Cassandra Version
2.1.X                   | 2.1.5+
2.2.X                   | 2.1.X

Downloads

SBT

libraryDependencies += "com.github.jparkie" %% "spark2cassandra" % "2.1.0"

Or:

libraryDependencies += "com.github.jparkie" %% "spark2cassandra" % "2.2.0"

Add the following resolvers if needed:

resolvers += "Sonatype OSS Releases" at "https://oss.sonatype.org/content/repositories/releases"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

Maven

<dependency>
  <groupId>com.github.jparkie</groupId>
  <artifactId>spark2cassandra_2.10</artifactId>
  <version>x.y.z-SNAPSHOT</version>
</dependency>

It is planned for Spark2Cassandra to be available on the following:
http://spark-packages.org/

Features

Usage

Bulk Loading into Cassandra

// Import the following to have access to the `bulkLoadToCass()` function for RDDs or DataFrames.
import com.github.jparkie.spark.cassandra.rdd._
import com.github.jparkie.spark.cassandra.sql._
// Spark imports required by the setup below.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val sparkConf = new SparkConf()
val sc = SparkContext.getOrCreate(sparkConf)
val sqlContext = SQLContext.getOrCreate(sc)
val rdd = sc.parallelize(???) // placeholder: substitute a real collection of rows
val df = sqlContext.read.parquet("<PATH>") // placeholder path
// Specify the `keyspaceName` and the `tableName` to write.
rdd.bulkLoadToCass(
  keyspaceName = "twitter",
  tableName = "tweets_by_date"
)
// Specify the `keyspaceName` and the `tableName` to write.
df.bulkLoadToCass(
  keyspaceName = "twitter",
  tableName = "tweets_by_author"
)

For more details, refer to SparkCassRDDFunction.scala and SparkCassDataFrameFunctions.scala.

Configurations

Because Spark2Cassandra relies on https://github.com/datastax/spark-cassandra-connector for serialization from Spark and for session management, refer to https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md for additional configuration options.
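For example, cluster connection details come from the connector's own properties rather than from Spark2Cassandra itself; a minimal sketch (the host and port values below are placeholders):

import org.apache.spark.SparkConf

// Connection is handled by the DataStax Spark Cassandra Connector;
// the property names below come from its reference documentation.
val sparkConf = new SparkConf()
  .setAppName("spark2cassandra-bulk-load")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point
  .set("spark.cassandra.connection.port", "9042")      // placeholder native protocol port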

SparkCassWriteConf

For more details, refer to SparkCassWriteConf.scala.

Property Name | Default | Description
spark.cassandra.bulk.write.partitioner | org.apache.cassandra.dht.Murmur3Partitioner | The 'partitioner' defined in cassandra.yaml.
spark.cassandra.bulk.write.throughput_mb_per_sec | Int.MaxValue | The maximum throughput, in MB per second, at which SSTable streaming is throttled.
spark.cassandra.bulk.write.connection_per_host | 1 | The number of connections per host to utilize when streaming SSTables.
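As a sketch of how these properties might be tuned on a SparkConf (the values below are illustrative, not recommendations):

import org.apache.spark.SparkConf

// Illustrative values: throttle SSTable streaming to 50 MB/s and
// open two streaming connections per host.
val sparkConf = new SparkConf()
  .set("spark.cassandra.bulk.write.partitioner", "org.apache.cassandra.dht.Murmur3Partitioner")
  .set("spark.cassandra.bulk.write.throughput_mb_per_sec", "50")
  .set("spark.cassandra.bulk.write.connection_per_host", "2")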

SparkCassServerConf

For more details, refer to SparkCassServerConf.scala.

Property Name | Default | Description
spark.cassandra.bulk.server.storage.port | 7000 | The 'storage_port' defined in cassandra.yaml.
spark.cassandra.bulk.server.sslStorage.port | 7001 | The 'ssl_storage_port' defined in cassandra.yaml.
spark.cassandra.bulk.server.internode.encryption | "none" | The 'server_encryption_options:internode_encryption' defined in cassandra.yaml.
spark.cassandra.bulk.server.keyStore.path | conf/.keystore | The 'server_encryption_options:keystore' defined in cassandra.yaml.
spark.cassandra.bulk.server.keyStore.password | cassandra | The 'server_encryption_options:keystore_password' defined in cassandra.yaml.
spark.cassandra.bulk.server.trustStore.path | conf/.truststore | The 'server_encryption_options:truststore' defined in cassandra.yaml.
spark.cassandra.bulk.server.trustStore.password | cassandra | The 'server_encryption_options:truststore_password' defined in cassandra.yaml.
spark.cassandra.bulk.server.protocol | TLS | The 'server_encryption_options:protocol' defined in cassandra.yaml.
spark.cassandra.bulk.server.algorithm | SunX509 | The 'server_encryption_options:algorithm' defined in cassandra.yaml.
spark.cassandra.bulk.server.store.type | JKS | The 'server_encryption_options:store_type' defined in cassandra.yaml.
spark.cassandra.bulk.server.cipherSuites | TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA | The 'server_encryption_options:cipher_suites' defined in cassandra.yaml.
spark.cassandra.bulk.server.requireClientAuth | false | The 'server_encryption_options:require_client_auth' defined in cassandra.yaml.
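For clusters that enable internode encryption, these properties should mirror the server_encryption_options section of cassandra.yaml; a sketch with placeholder paths and passwords:

import org.apache.spark.SparkConf

// Placeholder keystore/truststore locations and passwords; copy the real
// values from server_encryption_options in cassandra.yaml.
val sparkConf = new SparkConf()
  .set("spark.cassandra.bulk.server.sslStorage.port", "7001")
  .set("spark.cassandra.bulk.server.internode.encryption", "all")
  .set("spark.cassandra.bulk.server.keyStore.path", "/etc/cassandra/conf/.keystore")
  .set("spark.cassandra.bulk.server.keyStore.password", "changeit")
  .set("spark.cassandra.bulk.server.trustStore.path", "/etc/cassandra/conf/.truststore")
  .set("spark.cassandra.bulk.server.trustStore.password", "changeit")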

Documentation

Scaladocs are currently unavailable.
