11/4/2022


Can Spark Applications Coexist with NoSQL Databases? | Capital One

by John Doe



Apache Spark

Apache Cassandra

MongoDB

These are not unknown names in the tech industry. Each has earned a commendable place in the field of distributed computing: Apache Spark as a unified, parallel analytics framework, and Apache Cassandra and MongoDB as leading NoSQL databases. Each offers great benefits, such as massively parallel in-memory processing, fast read responses, and flexible schema design. But when it comes to online transactional processing (OLTP), using them together in one application requires some tactical maneuvering.

This blog focuses on tips for running Apache Spark applications on NoSQL (Apache Cassandra and MongoDB) backends. These tips are based on issues my team came across while building a cloud-native platform to process customer credit card transactions. Building and managing distributed applications in a cloud-native environment brings its own challenges. Anyone who is into distributed systems will agree that TCP/IP is their lifeblood, so let’s visit an imaginary place I like to call TCP/IP sPark. I hope your time in TCP/IP sPark helps you overcome some of those challenges.

[Image: TCP/IP sPark, the theme park where readers pick up tips for Apache Spark applications]

Welcome to TCP/IP sPark

Who doesn't like going to theme parks? They are fun and memorable, so in this blog we are going to TCP/IP sPark, which has lots of interesting rides in its famous CassandraLand and MongoLand sections. If you are like me, someone who enjoys both theme park rides and building and managing distributed applications, follow along.

We will start our journey by going to CassandraLand and learning a few lessons/tips for better ridership in TCP/IP sPark. Then we will go to MongoLand and pick up a few more. I hope you will find this journey interesting and joyous!

To properly enjoy your visit and take more memories home, some refreshers on Apache Spark, MongoDB, and Cassandra may help.

CassandraLand - Two tips for using Cassandra with Spark

[Image: The Token Ring Ferris Wheel in CassandraLand, which illustrates the first tip for Apache Spark and Apache Cassandra]

The signature ride here in CassandraLand is the Token Ring Ferris Wheel. Riders go around and around and each time the wheel reaches the ground, their entry is registered in CassandraLand like so:

[Image: Schema definition of the sample customer table in Apache Cassandra for CassandraLand]

Cassandra Lesson 1 - Cassandra key sequence matters

Oh! Season pass holder DOE has gone missing after riding the Token Ring Ferris Wheel and park security is trying to find his whereabouts within CassandraLand.

[Image: Sample query used to find season pass holder DOE's whereabouts after the Token Ring Ferris Wheel in CassandraLand; this query did not work]

While querying for DOE’s details like above, our Spark application became unhappy and started spinning its own wheel. Ferris wheels > loading icons.

[Image: The query did not return the expected results and left the application spinning]

Why did this happen? The lesson here is that the Cassandra key sequence matters while querying. Cassandra locates data by its partition key and orders the rows inside each partition by the clustering columns, so a query has to restrict the key columns in the order in which they are declared. If the key sequence is not followed, Cassandra cannot go straight to the right partition and fetch the data quickly. Instead, every query turns into a table scan, straining the Cassandra database cluster.

But no worries! FooBar identified the issue and fixed it like below:

[Image: Corrected query used to find season pass holder DOE's whereabouts after the Token Ring Ferris Wheel in CassandraLand]

It seems that DOE had gone to the gift shop to buy TCP/IP sPark souvenirs. It’s always important to get souvenirs to remember your park visit, just like it’s important to mind your Cassandra key sequences.
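To make the key-sequence lesson concrete, here is a minimal toy model in plain Python. It is not Cassandra and not the schema from the original screenshots: the table layout (`customer_id` as the partition key, `visit_date` and `ride` as clustering columns) and the `ToyTable` class are assumptions made purely for illustration. It only counts how many rows each style of query has to touch.

```python
from bisect import bisect_left


class ToyTable:
    """Toy stand-in for a table with the hypothetical key
    PRIMARY KEY ((customer_id), visit_date, ride)."""

    def __init__(self):
        self.partitions = {}   # customer_id -> sorted list of (visit_date, ride)
        self.rows_touched = 0  # work counter, reset by the caller

    def insert(self, customer_id, visit_date, ride):
        rows = self.partitions.setdefault(customer_id, [])
        rows.append((visit_date, ride))
        rows.sort()  # rows inside a partition stay ordered by clustering columns

    def by_key_sequence(self, customer_id, visit_date):
        """Partition key first, then the leading clustering column."""
        rows = self.partitions.get(customer_id, [])
        start = bisect_left(rows, (visit_date, ""))
        found = []
        for row in rows[start:]:
            if row[0] != visit_date:
                break
            self.rows_touched += 1
            found.append(row)
        return found

    def by_clustering_only(self, visit_date):
        """No partition key given: every partition has to be scanned."""
        found = []
        for customer_id, rows in self.partitions.items():
            for row in rows:
                self.rows_touched += 1
                if row[0] == visit_date:
                    found.append((customer_id,) + row)
        return found


table = ToyTable()
for i in range(1000):
    table.insert(f"CUST{i}", "2022-11-04", "ferris_wheel")
table.insert("DOE", "2022-11-04", "ferris_wheel")
table.insert("DOE", "2022-11-04", "gift_shop")

table.rows_touched = 0
keyed = table.by_key_sequence("DOE", "2022-11-04")
keyed_cost = table.rows_touched

table.rows_touched = 0
scanned = [r for r in table.by_clustering_only("2022-11-04") if r[0] == "DOE"]
scan_cost = table.rows_touched

print("rows touched with key sequence:", keyed_cost)
print("rows touched without partition key:", scan_cost)
```

Both queries find DOE's two entries, but the second one has to walk every row of every rider to get there, which is roughly what a Spark job does to a Cassandra cluster when its filters skip the partition key.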

Cassandra Lesson 2 - Use case-based data modeling

[Image: The Partitioner Roller Coaster in CassandraLand, which illustrates the second tip for Apache Spark and Apache Cassandra]

Another popular CassandraLand ride is the Partitioner Roller Coaster. If you select the correct seat (partition key), you will get back to the base station in exhilarating milliseconds. While customers are enjoying their roller coaster ride, their information is persisted like below so the ride operators can track who has ridden it each day:

This is important because like most roller coasters, the Partitioner Roller Coaster provides lockers for keeping your things in while you’re on the ride. But some customers might like to keep their things close by, or are in a hurry to board the ride and forget to use the lockers. If you’re one of these people, it’s easy for things to fall out of your pockets, or to be so excited to move on to another ride that you forget your things.

So whenever the roller coaster operator finds something under the coaster or in one of the seats, they need to find all the riders processed that day.

While it is possible to find this information in the customer schema, it is not the optimal way to do so; in Cassandra it is considered something of an anti-pattern. Cassandra data is partitioned, so the use case (the query you need to serve) defines the schema.

The lesson here is to design your schema around the use case in the first place. If new use cases emerge as usage evolves, the data producer has to duplicate the data into a table modeled for each one.
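The "one table per use case" idea can be sketched in a few lines of plain Python. The two dictionaries below stand in for two Cassandra tables, and their names and key choices (`rides_by_customer`, `riders_by_ride_day`) are hypothetical, not taken from the original schema. The point is only that the producer writes each event twice so that each question is answered by a single partition read.

```python
from collections import defaultdict

# Hypothetical tables, one per access pattern:
#   rides_by_customer  -> partition key: customer_id
#   riders_by_ride_day -> partition key: (ride, ride_date)
rides_by_customer = defaultdict(list)
riders_by_ride_day = defaultdict(list)


def record_ride(customer_id, ride, ride_date):
    """The data producer duplicates the event into both tables."""
    rides_by_customer[customer_id].append((ride_date, ride))
    riders_by_ride_day[(ride, ride_date)].append(customer_id)


def riders_on(ride, ride_date):
    """Lost-and-found lookup: read one partition, no scan over customers."""
    return list(riders_by_ride_day.get((ride, ride_date), []))


record_ride("DOE", "partitioner_coaster", "2022-11-04")
record_ride("ROE", "partitioner_coaster", "2022-11-04")
record_ride("DOE", "token_ring_wheel", "2022-11-04")
record_ride("POE", "partitioner_coaster", "2022-11-05")

print(riders_on("partitioner_coaster", "2022-11-04"))
```

The trade-off is extra writes and storage in exchange for reads that never have to search the customer table for everyone who rode on a given day.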

MongoLand - Two Tips for using MongoDB with Spark

[Image: Welcome to MongoLand, the next stop after CassandraLand in TCP/IP sPark]

After gathering some lessons and tips from CassandraLand, our park visitors are heading to the much-awaited MongoLand.

While customers are going round and round on the Schemaless Carousel, their information is persisted in the backend like so:

Mongo Lesson 1 - Manage MongoDB connections properly

From our visit to CassandraLand, we know DOE is a season pass holder to TCP/IP sPark. They have been bumped up into a higher membership tier and we need to update this in the system. But in order to process this information update, the carousel has to stop, and the other riders are unhappy that their ride is being slowed down and interrupted.

The lesson here is that MongoDB connections should be managed at the JVM or partition level, like below:

[Image: How to use MongoDB connections optimally in an Apache Spark for-each operation]

If connections are not handled at the partition or JVM level, your application may open far more connections than it needs (for example, one per record), depending on where in the job they are created. This can bring down your application and the database cluster, as well as making the other carousel riders unhappy.
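The difference between the two placements can be sketched without Spark or MongoDB at all. In the toy example below, `FakeMongoClient` is a hypothetical stand-in that only counts how many connections get opened, and the nested lists stand in for Spark partitions. In a real job, the body of `update_per_partition` is the kind of logic you would hand to Spark's `foreachPartition`, with a real driver client in place of the fake one.

```python
class FakeMongoClient:
    """Hypothetical stand-in for a driver client; it only counts connections."""

    opened = 0

    def __init__(self):
        FakeMongoClient.opened += 1
        self.updates = []

    def update(self, doc):
        self.updates.append(doc)

    def close(self):
        pass


def update_per_record(partitions):
    """Anti-pattern: a new connection for every single record."""
    for partition in partitions:
        for doc in partition:
            client = FakeMongoClient()
            try:
                client.update(doc)
            finally:
                client.close()


def update_per_partition(partitions):
    """One connection per partition, reused for all of its records."""
    for partition in partitions:
        client = FakeMongoClient()
        try:
            for doc in partition:
                client.update(doc)
        finally:
            client.close()


# Four partitions of 250 membership updates each.
partitions = [[{"customer": f"C{p}-{i}"} for i in range(250)] for p in range(4)]

FakeMongoClient.opened = 0
update_per_record(partitions)
per_record_connections = FakeMongoClient.opened

FakeMongoClient.opened = 0
update_per_partition(partitions)
per_partition_connections = FakeMongoClient.opened

print("connections opened per record:", per_record_connections)
print("connections opened per partition:", per_partition_connections)
```

The same 1,000 updates go through in both cases; only the number of connections the database has to accept changes, and it now scales with the number of partitions rather than the number of records.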

Mongo Lesson 2 - Indexes are very helpful

After recovering from the Schemaless Carousel debacle, we still haven't processed DOE’s membership update.

[Image: Sample MongoDB query to update DOE's membership status]

While we attempt to do so, we get the spinning Ferris wheel (or in this case, the carousel) again, just like in our visit to CassandraLand.

The lesson here is that MongoDB indexes matter. When you need to find and update a specific document, such as DOE’s new park membership status, an index on the fields used in the query filter lets MongoDB locate the document without scanning the whole collection.
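A rough way to see why the update stalls without an index is to count how many documents have to be examined before the right one is found. The sketch below is plain Python with made-up data: the `members` list stands in for a collection, and the dictionary built by `build_index` is a toy single-field index (a real MongoDB index is a B-tree, not a hash map). With a real driver, the corresponding step would be creating an index on the filter field, for example with PyMongo's `create_index`.

```python
def find_without_index(docs, customer_id):
    """Collection scan: examine documents until the match is found."""
    examined = 0
    for doc in docs:
        examined += 1
        if doc["customer_id"] == customer_id:
            return doc, examined
    return None, examined


def build_index(docs, field):
    """Toy single-field index: field value -> position in the collection."""
    return {doc[field]: pos for pos, doc in enumerate(docs)}


def find_with_index(docs, index, customer_id):
    """Index lookup: jump straight to the matching document."""
    pos = index.get(customer_id)
    if pos is None:
        return None, 0
    return docs[pos], 1


members = [{"customer_id": f"C{i}", "tier": "basic"} for i in range(10_000)]
members.append({"customer_id": "DOE", "tier": "season"})

doc, scanned = find_without_index(members, "DOE")
index = build_index(members, "customer_id")
doc_ix, looked_up = find_with_index(members, index, "DOE")
doc_ix["tier"] = "premium"  # the membership update itself

print("documents examined without index:", scanned)
print("documents examined with index:", looked_up)
```

Every update has to find its target before it can change anything, so a missing index turns each membership change into a full pass over the collection.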

Hope your visit to TCP/IP sPark has been enjoyable!

I hope your journey through TCP/IP sPark, CassandraLand, and MongoLand was beneficial and memorable. Remember that when working with Cassandra and Spark, you should make sure your queries follow the key sequence and that your schema is planned around your use case. And don’t forget that with MongoDB and Spark, you should manage your connections properly and make sure indexes support your targeted updates.

See you another time in the magic land of TCP/IP sPark!
