5/21/2021

Reading time: 7 min

How to Build a Distributed Big Data Pipeline Using Kafka and Docker

by John Doe

A microservice architecture is different, where the entire system is modularized as much as possible by identifying specific tasks that can be pulled out from the core of the system and performed by a much smaller component that is responsible for only that single task. Components with such narrow scope and well defined functionality are more naturally referred to as “services” -- hence the term “microservice”.


Although microservices can help engineers avoid many of the issues that may arise in a traditional monolith, they do come with their own set of new challenges. You may have come across this image before, depicting the differences between the two architectures:

However, in the case of a data pipeline, we believe that thinking from a microservice architecture perspective is the right approach, and we will explain how we tackle this problem with the help of Kafka and Docker. For a more thorough discussion of this design choice, see some of the linked articles in the Further Readings section at the end of this article.

While it is true that building a fault-tolerant, distributed, real-time stream processing data pipeline using a microservice-based architecture may seem rather ambitious to cover in a single “How To” blog post, luckily most of the heavy lifting has already been done by the smart folks who originally built Apache Kafka at LinkedIn and later founded Confluent.

Kafka is an open source event stream processing platform based on an abstraction of a distributed commit log. The aim of Kafka is to provide a unified, high-throughput, low-latency platform for managing these logs, which Kafka calls “topics”. Kafka combines three main capabilities to aid the entire process of implementing various event streaming use cases:

To publish (write) and subscribe to (read) streams of events.
To store streams of events durably and reliably for as long as you want.
To process streams of events as they occur or retrospectively.
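To make the commit-log abstraction a little more concrete, here is a minimal in-memory sketch in Python. It is not Kafka's API: the `Topic` and `Consumer` classes are hypothetical names invented for this illustration, and everything that makes Kafka distributed and durable is left out. It only shows the core idea that a topic is an append-only log, and that each reader keeps its own position (offset) in it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    """Hypothetical stand-in for a topic: an append-only list of events."""
    records: List[str] = field(default_factory=list)

    def publish(self, event: str) -> int:
        """Append an event to the end of the log and return its offset."""
        self.records.append(event)
        return len(self.records) - 1

    def read_from(self, offset: int) -> List[str]:
        """Return every event stored at or after the given offset."""
        return self.records[offset:]


@dataclass
class Consumer:
    """Hypothetical reader that tracks its own offset into one topic."""
    topic: Topic
    offset: int = 0

    def poll(self) -> List[str]:
        """Return events not yet seen by this consumer and advance its offset."""
        events = self.topic.read_from(self.offset)
        self.offset += len(events)
        return events
```

Because events are stored rather than handed off and forgotten, a consumer created later can still start at offset 0 and replay everything, which is the "retrospectively" part of the third capability above.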

For more information on Kafka, the best introduction is found on the official webpage. Give it a read -- it is very well written and fairly easy to digest.

Aside: Confluent Kafka or Apache Kafka?

As previously mentioned, the founders of Confluent are the original developers of Apache Kafka, and they offer an alternative and significantly more complete distribution of Kafka together with Confluent Platform. Many of the additional features present in Confluent’s distribution of Kafka are available to use for free, and are what Confluent refers to as “community components”. For our example application, we will be using one of these community components by Confluent, but rely on the standard Apache Kafka distribution for our underlying Kafka cluster, as it is more than capable of supporting our system. We encourage readers to take a look at Confluent’s distribution if they intend to use Kafka in production, as it offers many benefits that are beyond the scope of this article.

In short, Docker offers a way for developers to isolate and deliver software, together with all the dependencies required for that software to run, in virtualized packages called containers. These containers are isolated from one another and also from the host machine on which they are running. However, containers can communicate with one another and with the host through well-defined channels. In contrast to virtual machines, containers all share the resources of the host, and are thus much more lightweight, which in turn makes them easier to distribute and replicate when needed.

We will assume that the reader is already somewhat familiar with Docker. In this post we will highlight some of its more advanced uses, such as isolating a set of containers on a specific user-defined bridge network, and using docker-compose commands to manage services both individually and at the level of the whole application. For more information, see the official website, as well as some additional resources included in the Further Readings section at the end of this article.

Another technology we’ll be utilizing from the Apache ecosystem is Apache Cassandra. Cassandra is an open-source, distributed, wide-column NoSQL database management system designed to handle large amounts of data and to be fault tolerant.

For the sake of brevity, we’ll spare the details and just say that we’ve chosen Apache Cassandra mainly for its fault tolerance and scalability. These make Cassandra an excellent fit alongside the other big data tools we are using to build our pipeline. Again, for more information, you know where to go: the official webpage and the Further Readings section at the end of this article.

Overall, our data pipeline architecture is as follows:

Containerized microservices communicating over Kafka message queues

Phew. Enough with the background info. Let’s start building something already!

Each component of the application will be containerized in Docker. Here is a diagram depicting the connectivity between the various services, each isolated in its own Docker container. We will be referring to this diagram throughout the step-by-step guide to try and keep things grounded.

Docker networking and container connectivity diagram

First, you need to clone the Git repository from: https://github.com/salcaino/sfucmpt733.git

Every command is meant to be run from the top-level directory of the repository. We’ll start the services in logical order, from the base of the pipeline to its final output, the visualization.

The next step is to create two user-defined Docker bridge networks. We will use these networks to provide an abstract separation between containers that require direct communication with the Kafka cluster (Kafka servers and clients) and those that do not (Cassandra and the data visualization service). For more background information on Docker networking, please refer to the resources linked at the bottom of this article, and be sure to take a closer look at our different compose files, which define which services reside in which networks. To define the networks, run the following commands from the top-level directory of the project:

$ docker network create kafka-network
$ docker network create cassandra-network

First we will bring up our database, given that it does not depend logically on any other module. Cassandra stores the pre-processed data that comes into the pipeline through the Kafka Connect sink service, and serves as the data source for our visualization tools.

How:

It runs in a docker container with a persistent volume for its data directories.
It needs to be set up with a keyspace and tables the first time it runs. A script called ‘bootstrap.sh’ looks for a flag file to decide whether it needs to set up the database using CQL files. This setup only runs the first time.
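The flag-file idea can be sketched in a few lines of Python. This is not the contents of ‘bootstrap.sh’, which we do not reproduce here; the function name `bootstrap_once`, the `setup` callback, and the assumption that the flag file marks a completed setup are ours, for illustration only.

```python
from pathlib import Path
from typing import Callable


def bootstrap_once(flag_file: Path, setup: Callable[[], None]) -> bool:
    """Run `setup` only if `flag_file` does not exist yet.

    Assumption: the flag file marks a completed setup. Returns True if the
    setup ran on this call, False if it was skipped.
    """
    if flag_file.exists():
        return False
    setup()            # e.g. apply the CQL files that create keyspace and tables
    flag_file.touch()  # written only after setup succeeded, so a failure retries
    return True
```

Storing the flag on the same persistent volume as the data directories would mean that recreating the container does not re-run the schema setup, while wiping the volume does.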

How to run the service:

$ docker-compose -f cassandra/docker-compose.yml up -d

This command will start Cassandra and set everything up the first time it runs.

How to verify the service is working:

Running `docker ps` should show a container called Cassandra, and if you watch the container logs, they should end with “Startup Complete”.
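If you would rather script this check than eyeball the logs, something along these lines should work. It is a sketch under assumptions: the container name you pass in is whatever `docker ps` reports on your machine, the marker is matched case-insensitively because its exact capitalization may differ, and both helper names are hypothetical.

```python
import subprocess


def startup_complete(log_text: str) -> bool:
    """Return True if the log text contains the startup marker (any casing)."""
    return "startup complete" in log_text.lower()


def container_logs(container: str) -> str:
    """Fetch a container's logs via the `docker logs` CLI command.

    `docker logs` can write to both stdout and stderr, so both are returned.
    """
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr
```

With Docker running, `startup_complete(container_logs("cassandra"))` would perform the check, assuming the container is named `cassandra`.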

We will be using the wurstmeister/kafka-docker image available on Docker Hub, and we base our docker-compose.yml file on the default compose file found in the associated GitHub repository. The docker-compose.yml file we will be using is pasted below:

version: "3"

services:
  zookeeper:
    image: zookeeper
    restart: always
    container_name: zookeeper
    hostname: zookeeper
    ports:
      - 2181:2181
    networks:
      - default
    environment:
      ZOO_MY_ID: 1

  broker:
    container_name: broker
    image: wurstmeister/kafka
    ports:
      - 9092:9092
    networks:
      - default
    environment:
      KAFKA_ADVERTISED_HOST_NAME: broker
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_MESSAGE_MAX_BYTES: 2000000
      # Create topic, with 1 partition, 1 replica
      KAFKA_CREATE_TOPICS: "${TWEET_TOPIC}:1:1,${WEATHER_TOPIC}:1:1,${TWITTER_SINK_TOPIC}:1:1"

  kafka_manager:
    image: hlebalbau/kafka-manager:stable
    container_name: kakfa-manager
    restart: always
    ports:
      - 9000:9000
    networks:
      - default
    environment:
      ZK_HOSTS: "zookeeper:2181"
      # ...<some env vars excluded for brevity>...
    command: -Dpidfile.path=/dev/null

  connect:
    build:
      context: .
      dockerfile: Dockerfile-connect
    hostname: kafka-connect
    container_name: kafka-connect
    depends_on:
      - zookeeper
      - broker
    ports:
      - "8083:8083"
    networks:
      - default
      - secondary
    environment:
      CONNECT_BOOTSTRAP_SERVERS: 'broker:9092'
      CONNECT_REST_ADVERTISED_HOST_NAME: kafka-connect
      # ...<some env vars excluded for brevity>...

networks:
  default:
    external:
      name: kafka-network
  secondary:
    external:
      name: cassandra-network
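The KAFKA_CREATE_TOPICS value packs several topics into one string, with each entry written as name:partitions:replicas (the wurstmeister image also accepts an optional fourth field for the cleanup policy). The small parser below is our own illustration of how to read that format; it is not code from the image, and the topic names in the usage example are placeholders, since the real names come from the TWEET_TOPIC, WEATHER_TOPIC, and TWITTER_SINK_TOPIC environment variables.

```python
from typing import List, NamedTuple, Optional


class TopicSpec(NamedTuple):
    """One entry of a KAFKA_CREATE_TOPICS string."""
    name: str
    partitions: int
    replicas: int
    cleanup_policy: Optional[str] = None


def parse_create_topics(value: str) -> List[TopicSpec]:
    """Parse 'name:partitions:replicas[:policy]' entries separated by commas."""
    specs = []
    for entry in value.split(","):
        parts = entry.strip().split(":")
        if len(parts) not in (3, 4):
            raise ValueError(f"expected name:partitions:replicas, got {entry!r}")
        name, partitions, replicas = parts[:3]
        policy = parts[3] if len(parts) == 4 else None
        specs.append(TopicSpec(name, int(partitions), int(replicas), policy))
    return specs
```

For example, `parse_create_topics("tweets:1:1,weather:1:1")` yields two specs, each with one partition and one replica, which is what the compose file above requests for all three topics.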

The docker-compose.yml file has been slightly modified:

We have added two additional services: ‘kafka_manager’ and ‘connect’.
We have specified a user-defined external bridge network, kafka-network, for the application to run in (see the explanation above).
We have specified another user-defined bridge network, cassandra-network, and placed the connect service on both kafka-network and cassandra-network. This service acts as the bridge that lets data flow between services isolated on the two separate networks.
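One way to reason about this layout is to treat each service as a set of networks it is attached to: two containers can talk directly only if those sets overlap. The toy model below encodes the memberships from the compose file above. The `cassandra` entry is an assumption based on the description of the two networks (its compose file is not shown here), and `can_talk` is a hypothetical helper, not a Docker API.

```python
from typing import Dict, Set

# Network membership per service, taken from the compose file above.
# The "cassandra" entry is an assumption: its compose file is not shown.
MEMBERSHIP: Dict[str, Set[str]] = {
    "zookeeper": {"kafka-network"},
    "broker": {"kafka-network"},
    "kafka_manager": {"kafka-network"},
    "connect": {"kafka-network", "cassandra-network"},
    "cassandra": {"cassandra-network"},
}


def can_talk(a: str, b: str, membership: Dict[str, Set[str]] = MEMBERSHIP) -> bool:
    """Two services can reach each other directly if they share a network."""
    return bool(membership[a] & membership[b])
```

In this model the broker and Cassandra never share a network, so the connect service is the only path between them.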

How to run the service:

With this file saved as kafka/docker-compose.yml in our project, we can simply run ‘docker-compose -f kafka/docker-compose.yml up’ from the top-level directory, and we will have the first stage of the data pipeline up and running: the Kafka cluster!

How to verify the service is working:

At this point, as a simple test, you should be able to navigate to localhost:9000, and view the Kafka-manager web front end.

To view the cluster here, you first need to click on “Add Cluster”. Then give it any name and fill in the cluster settings. The important one is “Cluster Zookeeper Hosts”, which must match the zookeeper service we defined in the kafka/docker-compose.yml file (zookeeper:2181, the same value as ZK_HOSTS).

At this point, you should be able to navigate to the topics view and confirm that the topics exist.
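For a quick check without a browser, you can also test whether the ports published in the compose file accept TCP connections. This is only a rough sketch: an open port shows that something is listening, not that the service behind it is healthy. The helper names are ours, and the port numbers are the ones from the ports sections above.

```python
import socket
from typing import Dict

# Host ports published by the Kafka compose file shown earlier.
PUBLISHED_PORTS: Dict[str, int] = {
    "zookeeper": 2181,
    "broker": 9092,
    "kafka_manager": 9000,
    "connect": 8083,
}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> Dict[str, bool]:
    """Map each service name to whether its published port accepts connections."""
    return {name: port_open(host, port) for name, port in PUBLISHED_PORTS.items()}
```

Calling `check_services()` once the stack is up should report True for all four services. If only some are True, `docker-compose logs` for the failing service is a good next stop.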
