1/15/2018

Reading time:6 min

SMACK (Spark, Mesos, Akka, Cassandra and Kafka) & Fast Data – Cuelogic Blog

by John Doe

 by Rajdeep Limbu, June 13, 2017  Category: Analytics, Big Data, Big Data Analytics, Fast Data | Tag: Akka, big data, Cassandra, fast data, hadoop, Kafka, Lambda Architecture, MesosAs enterprises strive to realize and identify the value in Big Data, many now seek more agile and capable analytic systems. Some of the key factors supporting this include improving customer retention, influence product development and quality and increasing operational efficiencies among others. Organizations are looking to enhance their analytics workloads to drive several real-time cases such as recommendation engines, fraud detection and advertising analytics. In this whitepaper, I will try to draw your attention toward the applications of SMACK and how it has emerged as a crucial open source technology to handle high volume and velocity of “Big Data”. Big Data is not only Big it’s Fast alsoThe way that big data gets big is through a continuous and constant steam of incoming data. In high-volume environments, that information/data arrives at incredible rates, yet still needs to be stored and analyzed.Less than a dozen years ago, it was nearly impossible to analyze and process those large chunks of data using commodity hardware. Today, Hadoop clusters built from thousands of nodes are almost everywhere to analyze petabytes of historical data. Apache Hadoop is a software framework that is being adopted by many enterprises as a cost-effective analytics platform for Big Data analytics. Hadoop is an ecosystem of several services rather than a single product, and is designed for storing and processing petabytes of data in a linear scale-out model. Each service in the Hadoop ecosystem may use different technologies in the processing pipeline to ingest, store, process and visualize data. What’s next After Big Data?What Hadoop Can’t DoThe big data movement was pretty much driven by the demand for scale in volume, variety and velocity of data. This is often supported by data that is generated at high speeds, such as financial ticker data, click-stream data, sensor data or log aggregation. Certainly, Apache Hadoop is one of the best technologies to analyze all those large amount of data. However, the technology has a limitation as it cannot be used when the data is moving or in motion. Some of the areas where the use of Hadoop is not recommended are transactional data. Transactional data, by its very nature, is highly complex, as a transaction on an ecommerce site can generate many steps that all have to be implemented quickly. That scenario is not at all ideal for Hadoop.Nor would it be optimal for structured data sets that require very minimal latency, like when a Web site is served up by a MySQL database in a typical LAMP stack. That’s a speed requirement that Hadoop would poorly serve.Fast Data – Analyze data when they are in motion or in real-timeSMACK stands for Spark, Mesos, Akka, Cassandra and Kafka – a combination that’s being adopted for ‘fast data’.The SMACK stack (Spark, Mesos, Akka, Cassandra and Kafka) is known to be as the ideal platform for constructing “fast data” applications. The term fast data underlines how big data applications and architectures are evolving to be stream oriented, so that the data is extracted as quickly as possible from incoming data, while still supporting traditional data scenarios such as batch processing, data warehousing, and interactive exploration. The SMACK StackThe SMACK Stack architecture pattern consists of following technologies:Apache Spark (Processing Engine) – Fast and general engine for distributed, large-scale data processing. Often, used to implement ETL, queries, aggregations and applications of machine learning etc. Spark is well-suited to machine learning algorithms, and exposes its API in Scala, Python, R and Java. The approach of Spark is to provide a unified interface that can be used to mix SQL queries, machine learning, graph analysis, and streaming (micro-batched) processing.Apache Mesos (The Container) – A flexible cluster resource management system that provides efficient resource isolation and sharing across distributed applications. Mesos uses a two-level scheduling mechanism where resource offers are made to frameworks (applications that run on top of Mesos). The Mesos master node decides how many resources to offer each framework, while each framework determines the resources it accepts and what application to execute on those resources. This method of resource allocation allows near-optimal data locality when sharing a cluster of nodes amongst diverse frameworks.Akka (The Model) – Microservice development with high scalability, durability, and low-latency processing. A toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM (Java Virtual Machine). The actor model in computer science is a mathematical model of concurrent computation that treats “actors” as the universal primitives of concurrent computation: in response to a message that it receives, an actor can make local decisions, create more actors, send more messages, and determine how to respond to the next message received.Apache Cassandra (The Storage) – Scalable, resilient, distributed database designed for persistent, durable storage across multiple data centers. Most environments will also use a distributed file system like HDFS or S3.Apache Kafka (The Broker) – A high-throughput, low-latency distributed messaging system designed for handling real-time data feeds. In Kafka, data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.The SMACK stack helps in simplifying the processing of both real-time and batch workloads as single data stream in real or near real time.SMACK stack: How to put the pieces togetherCapturing value in fast dataThe best way to capture the value of incoming data is to react to it the instant it arrives. Processing of incoming data in batches always takes lot of time which ultimately takes away the value of that data.Basically, to handle or process data which is arriving at thousands to millions of events per second, two technologies are required i.e. a streaming system capable of delivering events as fast as they come in and a data store capable of processing each item as fast as it arrives. Delivering the fast data In order to deliver fast data, two popular streaming systems i.e. Apache Storm and Apache Kafka have evolved in the market over the past few years. Apache Storm (developed by Twitter) reliably processes unbounded streams of data at rates of millions of messages per second. On the other hand, Kafka (developed by LinkedIn), is a high-throughput distributed message queue system. Both streaming systems address the need of processing fast data. Kafka, however, stands apart.Processing the fast data It is noteworthy that Hadoop MapReduce has very well solved the batch processing needs of customers across the industries. However, the introduction of more flexible developed big data tools for fast data processing paved the demand of big data darling Apache Spark. Owing to its fast data processing power, the technology is setting fire across the entire big data world. The technology has actually taken over Hadoop in terms of fast data processing of interactive data mining algorithms and iterative machine learning algorithms.Lambda Architecture is data processing architecture and has both stream and batch processing capabilities. However, most of the lambdas solutions cannot serve the two needs at the same time:Handles a massive data stream in real time.Handles different and multiple data models from multiple data source.To easily meet both of these needs, we have Apache Spark which is mainly responsible for real-time analysis of both recent and historical information. The analysis results are often persisted in Apache Cassandra which helps the user to extract the real-time information anytime in case of failure. Spark is increasingly adopted as an alternate processing framework to MapReduce, due to its ability to speed up batch, interactive and streaming analytics. Spark enables new analytics use cases like machine learning and graph analysis with its rich and easy to use programming libraries.

Read this article if you want to know more about SMACK (Spark, Mesos, Akka, Cassandra and Kafka) & Fast Data – Cuelogic Blog

by Rajdeep Limbu, June 13, 2017

Category: Analytics, Big Data, Big Data Analytics, Fast Data | Tag: Akka, big data, Cassandra, fast data, hadoop, Kafka, Lambda Architecture, Mesos

As enterprises strive to realize and identify the value in Big Data, many now seek more agile and capable analytic systems. Some of the key factors supporting this include improving customer retention, influence product development and quality and increasing operational efficiencies among others. Organizations are looking to enhance their analytics workloads to drive several real-time cases such as recommendation engines, fraud detection and advertising analytics. In this whitepaper, I will try to draw your attention toward the applications of SMACK and how it has emerged as a crucial open source technology to handle high volume and velocity of “Big Data”.

Big Data is not only Big it’s Fast also

The way that big data gets big is through a continuous and constant steam of incoming data. In high-volume environments, that information/data arrives at incredible rates, yet still needs to be stored and analyzed.

Less than a dozen years ago, it was nearly impossible to analyze and process those large chunks of data using commodity hardware. Today, Hadoop clusters built from thousands of nodes are almost everywhere to analyze petabytes of historical data. Apache Hadoop is a software framework that is being adopted by many enterprises as a cost-effective analytics platform for Big Data analytics. Hadoop is an ecosystem of several services rather than a single product, and is designed for storing and processing petabytes of data in a linear scale-out model. Each service in the Hadoop ecosystem may use different technologies in the processing pipeline to ingest, store, process and visualize data.

What’s next After Big Data?

What Hadoop Can’t Do

The big data movement was pretty much driven by the demand for scale in volume, variety and velocity of data. This is often supported by data that is generated at high speeds, such as financial ticker data, click-stream data, sensor data or log aggregation. Certainly, Apache Hadoop is one of the best technologies to analyze all those large amount of data. However, the technology has a limitation as it cannot be used when the data is moving or in motion. Some of the areas where the use of Hadoop is not recommended are transactional data. Transactional data, by its very nature, is highly complex, as a transaction on an ecommerce site can generate many steps that all have to be implemented quickly. That scenario is not at all ideal for Hadoop.

Nor would it be optimal for structured data sets that require very minimal latency, like when a Web site is served up by a MySQL database in a typical LAMP stack. That’s a speed requirement that Hadoop would poorly serve.

Fast Data – Analyze data when they are in motion or in real-time

SMACK stands for Spark, Mesos, Akka, Cassandra and Kafka – a combination that’s being adopted for ‘fast data’.

The SMACK stack (Spark, Mesos, Akka, Cassandra and Kafka) is known to be as the ideal platform for constructing “fast data” applications. The term fast data underlines how big data applications and architectures are evolving to be stream oriented, so that the data is extracted as quickly as possible from incoming data, while still supporting traditional data scenarios such as batch processing, data warehousing, and interactive exploration.

The SMACK Stack

The SMACK Stack architecture pattern consists of following technologies:

Apache Spark (Processing Engine) – Fast and general engine for distributed, large-scale data processing. Often, used to implement ETL, queries, aggregations and applications of machine learning etc. Spark is well-suited to machine learning algorithms, and exposes its API in Scala, Python, R and Java. The approach of Spark is to provide a unified interface that can be used to mix SQL queries, machine learning, graph analysis, and streaming (micro-batched) processing.
Apache Mesos (The Container) – A flexible cluster resource management system that provides efficient resource isolation and sharing across distributed applications. Mesos uses a two-level scheduling mechanism where resource offers are made to frameworks (applications that run on top of Mesos). The Mesos master node decides how many resources to offer each framework, while each framework determines the resources it accepts and what application to execute on those resources. This method of resource allocation allows near-optimal data locality when sharing a cluster of nodes amongst diverse frameworks.
Akka (The Model) – Microservice development with high scalability, durability, and low-latency processing. A toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM (Java Virtual Machine). The actor model in computer science is a mathematical model of concurrent computation that treats “actors” as the universal primitives of concurrent computation: in response to a message that it receives, an actor can make local decisions, create more actors, send more messages, and determine how to respond to the next message received.
Apache Cassandra (The Storage) – Scalable, resilient, distributed database designed for persistent, durable storage across multiple data centers. Most environments will also use a distributed file system like HDFS or S3.
Apache Kafka (The Broker) – A high-throughput, low-latency distributed messaging system designed for handling real-time data feeds. In Kafka, data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.

The SMACK stack helps in simplifying the processing of both real-time and batch workloads as single data stream in real or near real time.

SMACK stack: How to put the pieces together

Capturing value in fast data

The best way to capture the value of incoming data is to react to it the instant it arrives. Processing of incoming data in batches always takes lot of time which ultimately takes away the value of that data.

Basically, to handle or process data which is arriving at thousands to millions of events per second, two technologies are required i.e. a streaming system capable of delivering events as fast as they come in and a data store capable of processing each item as fast as it arrives.

Delivering the fast data

In order to deliver fast data, two popular streaming systems i.e. Apache Storm and Apache Kafka have evolved in the market over the past few years. Apache Storm (developed by Twitter) reliably processes unbounded streams of data at rates of millions of messages per second. On the other hand, Kafka (developed by LinkedIn), is a high-throughput distributed message queue system. Both streaming systems address the need of processing fast data. Kafka, however, stands apart.

Processing the fast data

It is noteworthy that Hadoop MapReduce has very well solved the batch processing needs of customers across the industries. However, the introduction of more flexible developed big data tools for fast data processing paved the demand of big data darling Apache Spark. Owing to its fast data processing power, the technology is setting fire across the entire big data world. The technology has actually taken over Hadoop in terms of fast data processing of interactive data mining algorithms and iterative machine learning algorithms.

Lambda Architecture is data processing architecture and has both stream and batch processing capabilities. However, most of the lambdas solutions cannot serve the two needs at the same time:

Handles a massive data stream in real time.

Handles different and multiple data models from multiple data source.

To easily meet both of these needs, we have Apache Spark which is mainly responsible for real-time analysis of both recent and historical information. The analysis results are often persisted in Apache Cassandra which helps the user to extract the real-time information anytime in case of failure. Spark is increasingly adopted as an alternate processing framework to MapReduce, due to its ability to speed up batch, interactive and streaming analytics. Spark enables new analytics use cases like machine learning and graph analysis with its rich and easy to use programming libraries.

sstable

cassandra

spark

Spark and Cassandra’s SSTable loader

Arunkumar

11/1/2024

analytics

cassandra

spark

GitHub - apache/cassandra-analytics: Apache cassandra

apache

9/4/2024

cassandra

event.driven

spark

Build an Event-Driven Architecture with Apache Kafka, Apache Spark, and Apache Cassandra

DataStax

8/3/2024

analytics

streaming

visualization

Keen - Event Streaming Platform

John Doe

2/3/2024

mongo

cassandra

kafka

Top 10 Real-Time Databases to Use in 2024

K.sabreena

1/5/2024

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

andreia-negreira

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

airscholar

12/2/2023

cassandra

python

kafka

GitHub - princebhatt9588/Stock_Market_Real_Time_Data_Pipeline_Project_with-Apache-Kafka-and-Cassandra: This app utilizes Python, Apache Kafka, and Cassandra to fetch and process real-time stock market data, providing valuable insights for investors and traders.

princebhatt9588

12/2/2023

flink

beam

dataflow

• Google Dataflow - Awesome-Astra

John Doe

5/10/2023

data.modeling

cassandra

spark

Dealing with Large Spark Partitions

John Doe

2/17/2023

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt!  We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Contact Info

Resources

Properties

Follow Us