
Build an Event-Driven Architecture with Apache Kafka, Apache Spark, and Apache Cassandra

by DataStax


Author: Cédrick Lunven, DataStax

Published in Building Real-World, Real-Time AI · 9 min read · May 27, 2022


Knowing how to construct event-driven architectures is a crucial skill for developers as enterprises are relying on real-time data to drive business growth. In this post, we show you how to build a full event-driven toolkit with highly-scalable technologies like Apache Kafka™, Apache Spark™, and Apache Cassandra®.

Event-driven architectures (EDAs) are software patterns that enable organizations to detect “events” (significant changes in state, or updates) and respond to them in real time or near real time. In contrast to the traditional “request/response” architecture, applications built with EDAs provide faster response times, a more seamless user experience, and better scalability, because threads are not blocked waiting on responses.

DataStax recently collaborated with Rahul Singh, CEO of Anant Corporation and creator of Cassandra.Link, a knowledge base of all things Cassandra, to produce a 3-part series on building an Event Driven Toolkit for Apache Cassandra®, Apache Spark™, and Apache Kafka™.

In Part 1, we discussed how REST relates to event-driven systems and created a REST API for Cassandra. In Part 2, we explored different ways to event source information for a particular event with Kafka and connected it to Cassandra with Kafka Connect and Kafka Streams. In this post, we connect Kafka to Cassandra with Spark Streaming and process the data once it comes into Cassandra.

Kafka and Cassandra are a dynamic duo in microservice architectures. Kafka fits naturally as a distributed queue for event-driven landscapes and acts as a buffer layer that transports messages to the database and surrounding technologies.

Cassandra scales linearly by just adding more nodes, making it an excellent persistent data storage choice for microservices applications. When combined with Spark Streaming, an equally scalable, high-throughput, and fault-tolerant streaming processing system, it creates a robust event-driven toolkit.

This content was created before the integration of Apache Pulsar™, another messaging system, into DataStax Astra DB, the Cassandra-as-a-service platform. Everything you’ll do in this post can be implemented the same way with Pulsar.

By the end of this post, you’ll have a basic understanding of Spark Streaming in the context of Cassandra and Kafka, and you’ll know how to run streaming jobs and Spark batches. Ultimately, you’ll build a full event-driven data toolkit.

About Cassandra API for Leaves platform

Figure 1. Leaves Platform.

In the previous workshop, we created a Cassandra API GitPod for Anant Corporation’s Leaves platform, which is used to generate their Cassandra.link website. Leaves is a knowledge-curation platform with an admin screen, an advanced view built on MySQL and PHP, and a mirror of the data in Cassandra and Solr.

Anant Corporation built the Cassandra API to make this platform scalable and serverless. The front end of this application uses the Jamstack approach and Netlify, a web hosting and automation platform that accelerates development by hosting interfaces that can both talk to APIs and generate a full website using Gatsby. All 1,500+ pages on Cassandra.link are statically generated, but they run off an API.

The plan for Anant Corporation is to stop managing the API themselves and migrate it to Astra DB, where they can take advantage of a wide range of APIs, such as the GraphQL API or API Ingress. Anant also wants to make the platform event-driven. Once data arrives through the event-driven pipeline and has been updated, they will process it and analyze the Cassandra.link website for correlations using machine learning.

There are currently two APIs: Cassandra.API, plus parity implementations in Leaves API Python and Leaves API Node. Anant Corporation was one of the first users of Astra DB, and back then they needed to create another API, separate from the Cassandra API, to run some custom scraping. Watch this YouTube video for a detailed breakdown of the two APIs.

If you didn’t catch the previous workshop, you can create the Cassandra API from scratch with this step-by-step guide or initialize the ready-made GitPod on GitHub. The first part of this series focused on using Kafka as a broker, a registry, and a REST proxy. Now, we’ll cover the stream process that transfers data from Kafka into Cassandra, and a batch process that reads from Cassandra, processes it, and puts it back into Cassandra using Spark Streaming.

Before we do, let’s understand more about how REST, microservices, and Event Driven Architectures relate to each other.

REST vs. Microservices vs. Event Driven Architecture

Microservices are loosely coupled services, each with its own database. They can communicate synchronously using REST or asynchronously using the AMQP protocol, Kafka, or Pulsar. Each microservice runs as an independent application, and these autonomous functions are exposed through REST APIs that compose into larger applications.

Generally, each microservice requires its own database to avoid any resource sharing and coupling between the services. But this isn’t true on Cassandra, because Cassandra can scale to hundreds of thousands of services and servers. A keyspace or a table on Cassandra works as its own kind of microservice that scales independently.

On Astra DB especially, you won’t have to worry about using a different database, keyspace, or table per microservice because it can scale infinitely. When you want to add new functionality, simply create another table instead of a whole database and have one cluster that contains several microservices powered by different keyspaces.
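To make this concrete, here’s a minimal sketch of the keyspace-per-microservice idea with the Python driver. The contact point, keyspace, table, and datacenter names are all hypothetical, not from the workshop:

```python
from cassandra.cluster import Cluster

# Hypothetical contact point; on Astra DB you'd connect with a secure
# connect bundle instead (see the Astra DB section below).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# One keyspace per microservice: the "links" service owns leaves_links,
# and it scales independently of any other keyspace on the same cluster.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS leaves_links
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS leaves_links.links (
        url   text PRIMARY KEY,
        title text,
        tags  set<text>
    )
""")
```

Adding a new piece of functionality then means adding another table (or keyspace) to the same cluster, rather than standing up a whole new database.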

The DataStax Enterprise ecosystem also offers additional features, such as bringing data into Cassandra and retrieving it as JSON, or adding data using DSE Graph 6.8 and retrieving it using the Cassandra Query Language (CQL).
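As a sketch of the JSON round trip (INSERT JSON and SELECT JSON are standard CQL since Cassandra 2.2, so they work on DSE as well; the table is the hypothetical one from the previous sketch):

```python
import json
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("leaves_links")

# Write a JSON document straight into the table defined above.
doc = {"url": "https://cassandra.link", "title": "Cassandra.Link", "tags": ["kb"]}
session.execute("INSERT INTO links JSON %s", (json.dumps(doc),))

# Read the row back out as a JSON string.
row = session.execute("SELECT JSON * FROM links WHERE url = %s",
                      ("https://cassandra.link",)).one()
print(row[0])  # the whole row serialized as JSON
```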

Figure 2. How each keyspace on Cassandra works as a microservice.

In short, Cassandra is awesome for microservices because you can split your microservices by data center, keyspace, table, or query. When you execute a query against Cassandra, the load is distributed among the nodes for you, so there’s no coupling among different microservices. DataStax software engineer Jeff Carpenter explains this in more detail in this book.

Event sourcing and Command and Query Responsibility Segregation (CQRS) are software patterns that people implement with event-driven applications and microservices. CQRS is a method to scale systems: when an event or an update comes in, the processor saves that data to different places, one for the event itself and another for where the data is going to be queried.

For example, if you were using DataStax without the built-in DSE Search, and you needed to materialize data in both Cassandra and Elasticsearch, your event processor would take the event and save it in both places.

With CQRS, updating and retrieving data are seen as two different types of requests. CQRS uses commands to update data, and queries to read data. Figure 3 illustrates the architecture of an event-driven CQRS application. The commands write updates to Kafka in the corresponding topics while Kafka Streams creates projections via aggregations or joins of the data in the topics on the query side. Using event sourcing, we can send one event, process it, and save that data into several different places.

Figure 3. Architecture of a CQRS application. Source: IBM.
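On the command side, that can be as simple as producing the update to a Kafka topic instead of writing to the read store directly. A minimal sketch with the kafka-python client, where the broker address, topic name, and event shape are all assumptions:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A "command" is just an event describing the requested change; downstream
# processors project it into one or more query-side stores.
command = {"type": "LinkAdded", "url": "https://cassandra.link",
           "title": "Cassandra.Link"}
producer.send("leaves-commands", value=command)               # hypothetical topic
producer.flush()
```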

Before we get into the hands-on exercises, let’s learn more about the technologies that you’ll be working with.

What is Apache Spark?

Figure 4. Apache Spark ecosystem.

Apache Spark is a unified analytics engine built on top of Apache Spark Core. It includes a collection of technologies, such as:

  • Spark SQL: a Hive-compatible module for structured data processing on Spark. Coupled with Spark Streaming, you can perform joins and create batches with different events and queues using Spark SQL.
  • Spark Streaming: scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming transforms real-time data from sources like Kafka, Flume, and Amazon Kinesis using complex algorithms, and delivers the processed data to file systems, databases, and live dashboards.
    There are two kinds of streaming: classic Spark Streaming and Structured Streaming. Structured Streaming gives you a schema, so you can run Spark SQL transformations on incoming event data before you send it through, and you can treat your live data stream as an unbounded table (see the sketch after this list).
  • Machine Learning Library (MLlib): built-in library to make practical machine learning scalable and easy. There are MLlib extensions that you can run on top of Spark.
  • GraphX: an API for graphs and graph-parallel computation on top of Spark.
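Here’s the sketch promised above: a Structured Streaming read from Kafka with an explicit schema, registered as a view so plain Spark SQL runs against the live stream. It assumes the spark-sql-kafka package is on the classpath, and the broker, topic, and field names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("structured-stream-sketch").getOrCreate()

# Structured Streaming needs a schema to expose SQL over the stream.
schema = StructType([
    StructField("url", StringType()),
    StructField("title", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "leaves-commands")               # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# The live stream behaves like an unbounded table, so plain Spark SQL applies.
events.createOrReplaceTempView("events")
transformed = spark.sql("SELECT url, upper(title) AS title FROM events")
```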

As a unified analytics engine, Spark can talk to any datastore you’re looking to connect your data with. In other words, you can import data into Spark, export data to different systems, and work with all of it in a common dataset format.

When properly configured, a single Spark job can run on one computer or on hundreds of computers. Thanks to this scalability, Spark fits in nicely as Cassandra and Kafka’s best friend in scalable data processing. On DataStax Enterprise, you can have both Kafka and Spark on the same node and scale both of them at the same time.

Figure 5 illustrates the Spark cluster architecture, which consists of a driver program running a SparkContext, a cluster manager that allocates resources, and worker nodes that host executors, each with its own cache and tasks. This YouTube video explains the architecture in detail.

Figure 5. Spark cluster architecture.
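As a rough illustration of that layout, the same application code can target a laptop or a cluster just by changing the master URL and executor settings; the values below are placeholders, not workshop configuration:

```python
from pyspark.sql import SparkSession

# local[*] runs driver and executors in one JVM; pointing master at
# spark://host:7077, yarn, or k8s:// fans the same job out to a cluster.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("cluster-layout-demo")
         .config("spark.executor.memory", "1g")   # memory per executor
         .config("spark.executor.cores", "2")     # task slots per executor
         .getOrCreate())
```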

Spark is well-known for large-scale processing, machine learning, and analytics. Internet powerhouses like Uber, Netflix, eBay, and Conviva use it for the following:

  • Streaming Extract, Transform, Load (ETL)
  • Data enrichment
  • Trigger event detection
  • Complex session analysis
  • Machine learning
  • Fog computing

What is Astra DB?

Astra DB is a data-platform-as-a-service in the cloud, built on the massively scalable Apache Cassandra, with a selection of tools, such as APIs, to help you build applications on top of Cassandra. Astra eliminates operations overhead and reduces deployment time from months to minutes, as everything from provisioning to backups is fully automated. What’s more, you can instantly create a Cassandra database on Astra DB, free forever up to 5 GB, with no credit card required.

Astra DB secures your data with the most advanced security available for Cassandra. Through auto-configured developer tools that you can deploy with a few clicks, Astra DB greatly simplifies app development.

Astra can run on any cloud and spin up instances in any region you like. On top of the database, there are tools like the REST and GraphQL APIs, the CQL Console, DataStax Studio, and the Data Loader. If you want to work with Kubernetes, our K8ssandra initiative is all you need to connect Cassandra with Kubernetes. With the service broker, you can tell K8ssandra to spin up your Cassandra instance directly in Astra DB.
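For reference, connecting from Python looks roughly like this with the DataStax driver’s cloud support; the bundle path, token, and keyspace below are placeholders you’d replace with your own:

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Download the secure connect bundle from the Astra DB console; the literal
# username "token" plus an application token is the usual Astra pattern.
cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-db.zip"}
auth = PlainTextAuthProvider("token", "AstraCS:your-app-token")  # placeholder

cluster = Cluster(cloud=cloud_config, auth_provider=auth)
session = cluster.connect("leaves")  # hypothetical keyspace
print(session.execute("SELECT release_version FROM system.local").one())
```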

Hands-on exercise overview

Now that you’re familiar with the technologies, let’s get started on the hands-on workshop. You’ll first read data from Kafka in a structured stream, select information, and materialize a new dataset in Cassandra. Then, you’ll run a batch job to take all the data from Cassandra, crunch it, and save it back into another table.
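Here’s a condensed sketch of the streaming half, assuming the Spark Cassandra Connector and the Kafka source package are on the classpath; the broker, topic, keyspace, and table names are placeholders, and the workshop repository linked below has the real code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
       .option("subscribe", "leaves-commands")               # assumed topic
       .load())

def write_to_cassandra(batch_df, batch_id):
    # foreachBatch hands each micro-batch to the connector's batch writer.
    (batch_df
     .selectExpr("CAST(key AS STRING) AS url", "CAST(value AS STRING) AS payload")
     .write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="leaves", table="raw_events")         # hypothetical table
     .mode("append")
     .save())

query = raw.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```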

Follow along with this YouTube tutorial, and get the code from this GitHub repository. You won’t need to download or run any code on your computer; everything you need to run a Spark job is serverless. Click the links below to get started!

  1. Create a Cassandra database on Astra DB
  2. Open Cassandra.API in GitPod
  3. Start and setup Apache Kafka
  4. Consume data from Kafka and write to Cassandra
  5. Run Apache Spark jobs against Astra DB
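And a sketch of the batch half: read the table back from Cassandra, crunch it, and save the result to another table. Again, the keyspace and table names are placeholders, and a simple aggregation stands in for the workshop’s actual processing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-batch").getOrCreate()

# Read the whole table back from Cassandra as a DataFrame.
events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="leaves", table="raw_events")   # hypothetical source
          .load())

# "Crunch" step: count events per URL.
counts = events.groupBy("url").count()

# Save the result into a second table.
(counts.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="leaves", table="url_counts")            # hypothetical target
 .mode("append")
 .save())
```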

Conclusion

In this post, we delved into the basics of a scalable, fault-tolerant event-driven architecture built from Kafka, Cassandra, and Spark, and how you can process heavy real-time data and analyze it in databases and live dashboards.

Once you’ve mastered the basics, you can try pushing your data to different analytics platforms. For more Cassandra and Kafka workshops, check out DataStax Developers on YouTube. If you have any questions about Cassandra, post them on our DataStax Community, the Stack Overflow of Cassandra.

Follow the DataStax Tech Blog on Medium for more developer stories. Follow DataStax Developers on Twitter for the latest news about our developer community.

Resources

  1. Astra DB: the multi-cloud database-as-a-service
  2. Create your Cassandra database on Astra DB
  3. Apache Cassandra®
  4. Apache Spark™
  5. Apache Kafka™
  6. Apache Pulsar™
  7. Anant Corporation
  8. Kafka Connect
  9. Kafka Streams
  10. Spark Streaming
  11. DataStax Enterprise
  12. DataStax Graph Documentation
  13. YouTube Workshop Part 1: Build a REST API with Apache Cassandra
  14. YouTube Workshop Part 2: Cassandra.API CRUD UI
  15. YouTube Workshop Part 3: Running a Spark Job on Apache Cassandra
  16. GitHub: Cassandra API
  17. GitHub: Cassandra in Real-Time
  18. Definitive Guide for Apache Cassandra 4.0
