Cassandra.Link

1/28/2021

brianmhess/DSE-Spark-HDFS

by brianmhess

# Introduction

The goal of this exercise is to learn how to access data in HDFS via Spark in DSE 4.6.

A simple scenario we want to address is loading data from files in HDFS on an external Hadoop system into Cassandra. Spark is well suited to this task.
Moreover, Spark enables "blending" of data between HDFS and Cassandra.
For example, we can join data between the two sources.

# Resources

This exercise will use DSE 4.6 and the integrated Spark that comes with it.

For HDFS, we will use the Hortonworks HDP 2.2 Sandbox, which can be downloaded from http://hortonworks.com/products/hortonworks-sandbox/

We will use WebHDFS to access HDFS in the Hadoop cluster. WebHDFS is a REST API for HDFS. It avoids some of the issues that arise when the client and the server must run the same HDFS libraries.
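To make the WebHDFS addressing concrete, here is a minimal Python sketch that only builds the two URL forms involved: the REST endpoint (`http://host:port/webhdfs/v1/<path>?op=...`) and the `webhdfs://` filesystem URI that Hadoop-compatible clients accept as a path. The hostname, port, file paths, and helper function names are assumptions chosen for illustration and do not come from this repo; substitute the values for your own Hadoop cluster.

```python
from urllib.parse import quote, urlencode

# Assumed values for illustration only; replace with your NameNode's host and port.
NAMENODE_HOST = "sandbox.hortonworks.com"  # hypothetical sandbox hostname
NAMENODE_HTTP_PORT = 50070                 # assumed NameNode HTTP port (Hadoop 2.x default)


def webhdfs_rest_url(path, op, host=NAMENODE_HOST, port=NAMENODE_HTTP_PORT, **params):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1<path>?op=<OP>.

    `path` must be an absolute HDFS path starting with "/".
    Extra keyword arguments are appended as query parameters.
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def webhdfs_fs_uri(path, host=NAMENODE_HOST, port=NAMENODE_HTTP_PORT):
    """Build the webhdfs:// URI form for an absolute HDFS path."""
    return f"webhdfs://{host}:{port}{path}"


if __name__ == "__main__":
    # Hypothetical file and directory names, used only to show the URL shapes.
    print(webhdfs_rest_url("/user/guest/data.csv", "OPEN"))
    print(webhdfs_rest_url("/user/guest", "LISTSTATUS"))
    print(webhdfs_fs_uri("/user/guest/data.csv"))
```

The REST form is convenient for checking from any HTTP client that a file is reachable, while the `webhdfs://` form is the kind of path one would typically hand to a Hadoop-aware reader such as Spark.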

You will need to clone this repo.