
DataStax Spark Cassandra Connector

by John Doe, 10/31/2017


Quick Links

  • Packages: Spark Cassandra Connector Spark Packages Website
  • Community: Chat with us at DataStax Academy's #spark-connector Slack channel
  • Scala Docs: Most Recent Release (2.0.5): Spark-Cassandra-Connector, Embedded-Cassandra

Features

Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®.

This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications; a short usage sketch follows the feature list below.

  • Compatible with Apache Cassandra version 2.0 or higher (see table below)
  • Compatible with Apache Spark 1.0 through 2.0 (see table below)
  • Compatible with Scala 2.10 and 2.11
  • Exposes Cassandra tables as Spark RDDs
  • Maps table rows to CassandraRow objects or tuples
  • Offers customizable object mapper for mapping rows to objects of user-defined classes
  • Saves RDDs back to Cassandra by implicit saveToCassandra call
  • Deletes rows and columns from Cassandra by implicit deleteFromCassandra call
  • Joins with a subset of Cassandra data using the joinWithCassandraTable call
  • Partitions RDDs according to Cassandra replication using the repartitionByCassandraReplica call
  • Converts data types between Cassandra and Scala
  • Supports all Cassandra data types including collections
  • Filters rows on the server side via the CQL WHERE clause
  • Allows for execution of arbitrary CQL statements
  • Plays nice with Cassandra Virtual Nodes
  • Works with PySpark DataFrames
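
A minimal sketch of the RDD and DataFrame APIs, assuming a local Spark master, a Cassandra node at 127.0.0.1, and a keyspace test with a table words (word text PRIMARY KEY, count int):

import org.apache.spark.sql.SparkSession
import com.datastax.spark.connector._  // implicit cassandraTable / saveToCassandra

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("connector-demo")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()
val sc = spark.sparkContext

// Expose the table as an RDD of CassandraRow objects; the WHERE clause
// is pushed down and filtered on the Cassandra side.
val rows = sc.cassandraTable("test", "words").where("count > ?", 10)
rows.collect().foreach(r => println(s"${r.getString("word")}: ${r.getInt("count")}"))

// Map rows to a user-defined case class instead of CassandraRow.
case class WordCount(word: String, count: Int)
val typed = sc.cassandraTable[WordCount]("test", "words")

// Save an RDD of tuples back to Cassandra via the implicit saveToCassandra call.
sc.parallelize(Seq(("cat", 30), ("fox", 40)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))

// Read the same table as a DataFrame.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "words"))
  .load()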

Version Compatibility

The connector project has several branches, each of which maps to a different set of supported Spark and Cassandra versions. For previous releases the branch is named "bX.Y", where X.Y is the major+minor version; for example, the "b1.6" branch corresponds to the 1.6 release. The "master" branch normally contains development for the next connector release.

Connector | Spark     | Cassandra        | Cassandra Java Driver | Minimum Java Version | Supported Scala Versions
2.0       | 2.0, 2.1, 2.2 | 2.1.5*, 2.2, 3.0 | 3.0              | 8                    | 2.10, 2.11
1.6       | 1.6       | 2.1.5*, 2.2, 3.0 | 3.0                   | 7                    | 2.10, 2.11
1.5       | 1.5, 1.6  | 2.1.5*, 2.2, 3.0 | 3.0                   | 7                    | 2.10, 2.11
1.4       | 1.4       | 2.1.5*           | 2.1                   | 7                    | 2.10, 2.11
1.3       | 1.3       | 2.1.5*           | 2.1                   | 7                    | 2.10, 2.11
1.2       | 1.2       | 2.1, 2.0         | 2.1                   | 7                    | 2.10, 2.11
1.1       | 1.1, 1.0  | 2.1, 2.0         | 2.1                   | 7                    | 2.10, 2.11
1.0       | 1.0, 0.9  | 2.0              | 2.0                   | 7                    | 2.10, 2.11

*Compatible with 2.1.X where X >= 5

Hosted API Docs

API documentation for the Scala and Java interfaces is available online:

  • 2.0.5: Spark-Cassandra-Connector, Embedded-Cassandra
  • 1.6.9: Spark-Cassandra-Connector, Embedded-Cassandra
  • 1.5.2: Spark-Cassandra-Connector, Spark-Cassandra-Connector-Java, Embedded-Cassandra
  • 1.4.5: Spark-Cassandra-Connector, Spark-Cassandra-Connector-Java, Embedded-Cassandra
  • 1.3.1: Spark-Cassandra-Connector, Spark-Cassandra-Connector-Java, Embedded-Cassandra
  • 1.2.0: Spark-Cassandra-Connector, Spark-Cassandra-Connector-Java, Embedded-Cassandra

Download

This project is available on Spark Packages; this is the easiest way to start using the connector: https://spark-packages.org/package/datastax/spark-cassandra-connector

This project has also been published to the Maven Central Repository. For SBT to download the connector binaries, sources and javadoc, put this in your project SBT config:

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.5"
  • The default Scala version for Spark 2.0+ is 2.11, so choose the appropriate build. See the FAQ for more information.
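
For quick experiments, the connector can also be pulled in when launching an interactive shell via Spark Packages; a minimal sketch, assuming the 2.0.5 release and the Scala 2.11 build noted above:

$SPARK_HOME/bin/spark-shell --packages datastax:spark-cassandra-connector:2.0.5-s_2.11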

Building

See Building And Artifacts

Documentation

Online Training

DataStax Academy

DataStax Academy provides free online training for Apache Cassandra and DataStax Enterprise. In DS320: Analytics with Spark, you will learn how to effectively and efficiently solve analytical problems with Apache Spark, Apache Cassandra, and DataStax Enterprise. You will learn about the Spark API, the Spark Cassandra Connector, Spark SQL, Spark Streaming, and crucial performance optimization techniques.

Community

Reporting Bugs

New issues may be reported using JIRA. Please include all relevant details including versions of Spark, Spark Cassandra Connector, Cassandra and/or DSE. A minimal reproducible case with sample code is ideal.

Mailing List

Questions and requests for help may be submitted to the user mailing list.

Slack

The project uses Slack to facilitate conversation in our community. Find us in the #spark-connector channel at DataStax Academy Slack.

Contributing

To protect the community, all contributors are required to sign the DataStax Spark Cassandra Connector Contribution License Agreement. The process is completely electronic and should only take a few minutes.

To develop this project, we recommend using IntelliJ IDEA. Make sure you have installed and enabled the Scala Plugin. Open the project with IntelliJ IDEA and it will automatically create the project structure from the provided SBT configuration.

Tips for Developing the Spark Cassandra Connector

Checklist for contributing changes to the project:

  • Create a SPARKC JIRA
  • Make sure that all unit tests and integration tests pass
  • Add an appropriate entry at the top of CHANGES.txt
  • If the change has any end-user impacts, also include changes to the ./doc files as needed
  • Prefix the pull request description with the JIRA number, for example: "SPARKC-123: Fix the ..."
  • Open a pull request on GitHub and await review

Testing

To run unit and integration tests:

./sbt/sbt test
./sbt/sbt it:test

By default, integration tests start up a separate, single Cassandra instance and run Spark in local mode. It is possible to run integration tests with your own Cassandra and/or Spark cluster. First, prepare a jar with testing code:

./sbt/sbt test:package

Then copy the generated test jar to your Spark nodes and run:

export IT_TEST_CASSANDRA_HOST=<IP of one of the Cassandra nodes>
export IT_TEST_SPARK_MASTER=<Spark Master URL>
./sbt/sbt it:test

Generating Documents

To generate the Reference Document, run:

./sbt/sbt spark-cassandra-connector-unshaded/run (outputLocation)

outputLocation defaults to doc/reference.md
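
For example, to write the reference file to a different location, pass it as the argument to run, quoted so sbt treats the task and argument as one command (the output path here is hypothetical):

./sbt/sbt "spark-cassandra-connector-unshaded/run target/reference.md"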

License

Copyright 2014-2017, DataStax, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
