

8/3/2018


Yannael/kafka-sparkstreaming-cassandra

by John Doe



This Dockerfile sets up a complete streaming environment for experimenting with Kafka, Spark Streaming (PySpark), and Cassandra. It installs:

  • Kafka 0.10.2.1
  • Spark 2.1.1 for Scala 2.11
  • Cassandra 3.7

It additionally installs:

  • Anaconda distribution 4.4.0 for Python 2.7.10
  • Jupyter notebook for Python

Run the container using the DockerHub image

docker run -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged yannael/kafka-sparkstreaming-cassandra

See the following video for a usage demo.

Note that any changes you make in the notebook will be lost once you exit the container. To keep your changes, put your notebooks in a folder on your host and share it with the container, for example:

docker run -v `pwd`:/home/guest/host -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged yannael/kafka-sparkstreaming-cassandra

Note:

  • The "-v pwd:/home/guest/host" shares the local folder (i.e. folder containing Dockerfile, ipynb files, etc...) on your computer - the 'host') with the container in the '/home/guest/host' folder.
  • Port are shared as follows:
    • 4040 bridges to Spark UI
    • 8888 bridges to the Jupyter Notebook
    • 23 bridges to SSH

SSH lets you open a connection to the container:

ssh -p 23 guest@containerIP

where 'containerIP' is the IP of the container (127.0.0.1 on Linux). The password is 'guest'.

Start services

Once the container is running, you are logged in as root. Run startup_script.sh (in /usr/bin) to start:

  • SSH server. You can connect to the container using user 'guest' and password 'guest'
  • Cassandra
  • Zookeeper server
  • Kafka server
startup_script.sh
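
To sanity-check that the services came up, you can probe their ports from Python. This is just a sketch, assuming the stock default ports for this image: 22 for SSH, 9042 for Cassandra, 2181 for Zookeeper, and 9092 for Kafka.

# Sketch: probe the service ports; the port numbers are assumed defaults
import socket

for name, port in [('SSH', 22), ('Cassandra', 9042),
                   ('Zookeeper', 2181), ('Kafka', 9092)]:
    s = socket.socket()
    s.settimeout(2)
    status = 'up' if s.connect_ex(('127.0.0.1', port)) == 0 else 'down'
    s.close()
    print('%-10s port %-5d %s' % (name, port, status))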

Connect, create Cassandra table, open notebook and start streaming

Connect as user 'guest' and go to the 'host' folder (shared with the host):

su guest
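
For the "create Cassandra table" step in this section's heading, the repository's notebooks define their own schema. As a rough illustration only, a keyspace and table can be created from Python with the cassandra-driver client; the names 'test_ks' and 'test_table' below are assumptions, not the notebook's actual names.

# Hypothetical sketch: create a keyspace and table with the cassandra-driver client
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS test_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS test_ks.test_table (
        id int PRIMARY KEY,
        ts double
    )
""")
cluster.shutdown()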

Start Jupyter notebook

notebook

and connect from your browser at host:8888 (where 'host' is your host's IP; if running locally on your computer, this is typically 127.0.0.1, or 192.168.99.100 with Docker Toolbox; check the Docker documentation).

Start Kafka producer

Open kafkaSendDataPy.ipynb and run all cells.
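
For reference, a producer cell typically looks something like the sketch below. This is not the notebook's exact code; it assumes the kafka-python client and a topic named 'test'.

# Sketch of a Kafka producer cell; the 'test' topic name is an assumption
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

for i in range(10):
    producer.send('test', {'id': i, 'ts': time.time()})
    time.sleep(1)

producer.flush()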

Start Kafka receiver

Open kafkaReceiveAndSaveToCassandraPy.ipynb and run the cells up to the one that starts streaming. Check in the subsequent cells that Cassandra is collecting the data properly.
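
The receiving side boils down to consuming the Kafka stream with Spark Streaming and writing each micro-batch to Cassandra. A minimal sketch is below; it is not the notebook's exact code, and it assumes the topic, consumer group, keyspace, and table names, that 'sc' (the SparkContext) and a SparkSession/SQLContext exist in the notebook, and that the spark-cassandra-connector package is on the classpath.

# Sketch of a Spark Streaming cell that saves Kafka messages to Cassandra
import json
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 5)  # 5-second micro-batches; 'sc' comes from the notebook

# Receiver-based stream via Zookeeper; topic and group names are assumptions
kafkaStream = KafkaUtils.createStream(ssc, '127.0.0.1:2181', 'spark-group', {'test': 1})

def save_to_cassandra(rdd):
    # Each Kafka record is a (key, value) pair; the value holds the JSON payload
    if not rdd.isEmpty():
        df = rdd.map(lambda kv: Row(**json.loads(kv[1]))).toDF()
        (df.write
           .format('org.apache.spark.sql.cassandra')  # needs spark-cassandra-connector
           .options(keyspace='test_ks', table='test_table')
           .mode('append')
           .save())

kafkaStream.foreachRDD(save_to_cassandra)
ssc.start()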

Connect to Spark UI

It is available in your browser at host:4040.

The container is based on the CentOS 6 Linux distribution. The main steps of the build process are:

  • Install some common Linux tools (wget, unzip, tar, SSH tools, ...) and Java (1.8)
  • Create a guest user (the UID matters for sharing folders with the host, see below), and install Spark and sbt, Kafka, Anaconda, and Jupyter notebooks for the guest user
  • Go back to the root user, and install the startup script (for starting the SSH and Cassandra services), the sentenv.sh script to set up environment variables (Java, Kafka, Spark, ...), spark-default.conf, and Cassandra

User UID

In the Dockerfile, the line

RUN useradd guest -u 1000

creates the 'guest' user under which the container runs. The username is 'guest', the password is 'guest', and the '-u' parameter sets the Linux UID for that user.

To make sharing folders between the container and your host easier, make sure this UID matches your user's UID on the host. You can check your host UID with:

echo $UID

Clone this repository

git clone https://github.com/Yannael/kafka-sparkstreaming-cassandra

Build

From the Dockerfile folder, run:

docker build -t kafka-sparkstreaming-cassandra .

It may take about 30 minutes to complete.

Run

docker run -v `pwd`:/home/guest/host -p 4040:4040 -p 8888:8888 -p 23:22 -ti --privileged kafka-sparkstreaming-cassandra
