8/28/2018


Spark SQL Against Cassandra Example - DZone Database

by John Doe



Spark SQL is awesome. It allows you to query any Resilient Distributed Dataset (RDD) using SQL (including data stored in Cassandra!).

The first thing to do is to create a SQLContext from your SparkContext. I'm using Java, so...
(sorry, I'm still not hip enough for Scala)

// conf is your existing SparkConf
JavaSparkContext context = new JavaSparkContext(conf);
JavaSQLContext sqlContext = new JavaSQLContext(context);

Now you have a SQLContext, but you have no data. Go ahead and create an RDD, just like you would in regular Spark:

// Read the products table and key each row by its product id
JavaPairRDD<Integer, Product> productsRDD =
    javaFunctions(context)
        .cassandraTable("test_keyspace", "products", productReader)
        .keyBy(new Function<Product, Integer>() {
          @Override
          public Integer call(Product product) throws Exception {
            return product.getId();
          }
        });

(The example above comes from the spark-on-cassandra-quickstart project, as described in my previous post.)

Now that we have a plain vanilla RDD, we need to spice it up with a schema, and let the sqlContext know about it. We can do that with the following lines:

JavaSchemaRDD schemaRDD = sqlContext.applySchema(productsRDD.values(), Product.class);
sqlContext.registerRDDAsTable(schemaRDD, "products");

Shazam. Now your sqlContext is ready for querying. Notice that it inferred the schema from the Java bean (Product.class). (In the next blog post, I'll show how to do this dynamically.)
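To make "inferred the schema from the Java bean" a bit more concrete, here is a minimal sketch that uses the JDK's own java.beans.Introspector to list the readable properties of a bean. This is only an illustration of bean-style inference, not Spark's actual code. The Product class below is a hypothetical stand-in: the id and price properties are taken from the query in this post, but their Java types are my assumption, as are the names BeanSchemaSketch and describe.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BeanSchemaSketch {

  // Hypothetical bean: property names come from the post, types are assumed.
  public static class Product implements Serializable {
    private Integer id;
    private Double price;

    public Product() {}

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public Double getPrice() { return price; }
    public void setPrice(Double price) { this.price = price; }
  }

  // Returns one "name: type" entry per readable bean property, sorted by name.
  static List<String> describe(Class<?> beanClass) {
    try {
      // Stopping at Object.class leaves out the built-in "class" property.
      BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
      List<String> columns = new ArrayList<>();
      for (PropertyDescriptor property : info.getPropertyDescriptors()) {
        if (property.getReadMethod() != null) {
          columns.add(property.getName() + ": "
              + property.getPropertyType().getSimpleName());
        }
      }
      Collections.sort(columns);
      return columns;
    } catch (IntrospectionException e) {
      throw new IllegalStateException("Could not introspect " + beanClass, e);
    }
  }

  public static void main(String[] args) {
    // Print the column-like view of the bean.
    for (String column : describe(Product.class)) {
      System.out.println(column);
    }
  }
}
```

The takeaway is that the getters on the bean are what drive the column names, so the names you use in your SQL need to match the bean's property names.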

You can prime the pump with a count:

System.out.println("Total Records = [" + productsRDD.count() + "]");

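The count operation forces Spark to evaluate the RDD, which reads the data from Cassandra. If the RDD has been cached (for example, by calling productsRDD.cache() before the count), the data then stays in memory, which makes queries like the following lightning fast: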

JavaSchemaRDD result = sqlContext.sql("SELECT id FROM products WHERE price < 0.50");
for (Row row : result.collect()) {
  System.out.println(row);
}
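If you want to sanity check what that query should return, here is a rough plain-Java equivalent of the same filter and projection over an in-memory list. It does not use Spark at all. The Product class, the cheapProductIds helper, and the sample rows are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class QuerySketch {

  // Hypothetical stand-in for the Product bean; only id and price are modeled.
  public static class Product {
    private final int id;
    private final double price;

    public Product(int id, double price) {
      this.id = id;
      this.price = price;
    }

    public int getId() { return id; }
    public double getPrice() { return price; }
  }

  // Plain-Java equivalent of: SELECT id FROM products WHERE price < 0.50
  static List<Integer> cheapProductIds(List<Product> products) {
    return products.stream()
        .filter(product -> product.getPrice() < 0.50) // WHERE price < 0.50
        .map(Product::getId)                          // SELECT id
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Invented sample rows; a price of exactly 0.50 is excluded by the strict "<".
    List<Product> products = Arrays.asList(
        new Product(1, 0.25),
        new Product(2, 0.50),
        new Product(3, 0.49),
        new Product(4, 1.99));
    System.out.println(cheapProductIds(products)); // prints [1, 3]
  }
}
```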

That's it. You're off to the SQL races.

P.S. If you try querying the sqlContext without applying a schema and/or without registering the RDD as a table, you may see something similar to this:

Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'id, tree:
'Project ['id]
 'Filter ('price < 0.5)
  NoRelation$
