1/28/2021

brianmhess/DSE-Spark-HDFS

by brianmhess

#Introduction

The goal of this exercise is to learn how to access data in HDFS via Spark in DSE 4.6.

A simple scenario we want to address is loading data from files in HDFS in an external Hadoop system into Cassandra. Spark is well suited to this task.
Moreover, Spark enables "blending" of data between HDFS and Cassandra.
For example, we can join data between the two sources.

#Resources

This exercise will use DSE 4.6 and the integrated Spark that comes with it.

For HDFS, we will use the Hortonworks HDP 2.2 Sandbox, which can be downloaded from http://hortonworks.com/products/hortonworks-sandbox/

We will use WebHDFS to access HDFS in the Hadoop cluster. WebHDFS is a REST API for HDFS. It avoids some of the issues related to having the same HDFS libraries on both the client and the server.
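To make the REST idea concrete, here is a minimal Python sketch of how WebHDFS addresses are formed. It only builds strings and does not contact a cluster. The host name `sandbox.example.com`, the file path, and the helper names `webhdfs_url` and `spark_webhdfs_uri` are hypothetical, and port 50070 is assumed to be the NameNode HTTP port (the Hadoop 2.x default); check the values for your own sandbox.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op="OPEN", port=50070, **params):
    """Build a WebHDFS REST URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>.

    host, port, and path are placeholders for illustration; 50070 is
    assumed to be the NameNode HTTP port (Hadoop 2.x default).
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, e.g. /user/guest/file.csv")
    # op goes first, followed by any extra query parameters.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def spark_webhdfs_uri(host, path, port=50070):
    """Build the webhdfs:// URI form that Hadoop-compatible clients accept."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, e.g. /user/guest/file.csv")
    return f"webhdfs://{host}:{port}{path}"


# Hypothetical host and file, for illustration only.
read_url = webhdfs_url("sandbox.example.com", "/user/guest/data.csv")
list_url = webhdfs_url("sandbox.example.com", "/user/guest", op="LISTSTATUS")
spark_uri = spark_webhdfs_uri("sandbox.example.com", "/user/guest/data.csv")

print(read_url)
print(list_url)
print(spark_uri)
```

The `http://` form is what a plain REST client such as `curl` would request, while the `webhdfs://` form is the kind of path one would typically hand to Spark when reading a file from the remote cluster.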

You will need to clone this repo (brianmhess/DSE-Spark-HDFS).
