by Rajdeep Limbu, June 13, 2017
As enterprises strive to identify and realize the value in Big Data, many now seek more agile and capable analytic systems. Key drivers include improving customer retention, influencing product development and quality, and increasing operational efficiency, among others. Organizations are looking to extend their analytics workloads to real-time use cases such as recommendation engines, fraud detection and advertising analytics. In this whitepaper, I will draw your attention to SMACK and how it has emerged as a crucial open source stack for handling the high volume and velocity of “Big Data”.
Big Data is not only Big, it’s also Fast
The way that big data gets big is through a continuous, constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be stored and analyzed.
Less than a dozen years ago, it was nearly impossible to analyze and process such large volumes of data on commodity hardware. Today, Hadoop clusters built from thousands of nodes routinely analyze petabytes of historical data. Apache Hadoop is a software framework adopted by many enterprises as a cost-effective platform for Big Data analytics. Hadoop is an ecosystem of several services rather than a single product, designed to store and process petabytes of data in a linear scale-out model. Each service in the Hadoop ecosystem may use different technologies in the processing pipeline to ingest, store, process and visualize data.
What’s next after Big Data?
What Hadoop Can’t Do
The big data movement was largely driven by demand for scale in the volume, variety and velocity of data, often data generated at high speed such as financial ticker data, click-stream data, sensor data or log aggregation. Apache Hadoop is certainly one of the best technologies for analyzing such large amounts of data. However, it has a limitation: it cannot be used while the data is still in motion. One area where Hadoop is not recommended is transactional data. Transactional data is, by its very nature, highly complex, as a single transaction on an ecommerce site can generate many steps that all have to execute quickly. That scenario is not at all ideal for Hadoop.
Nor is it optimal for structured data sets that require minimal latency, such as a Web site served by a MySQL database in a typical LAMP stack. That is a speed requirement Hadoop would serve poorly.
Fast Data – Analyzing data in motion, in real time
SMACK stands for Spark, Mesos, Akka, Cassandra and Kafka – a combination that’s being adopted for ‘fast data’.
The SMACK stack (Spark, Mesos, Akka, Cassandra and Kafka) has emerged as an ideal platform for constructing “fast data” applications. The term fast data underlines how big data applications and architectures are evolving to be stream oriented, so that value is extracted from incoming data as quickly as possible, while traditional scenarios such as batch processing, data warehousing and interactive exploration are still supported.
The SMACK Stack
The SMACK Stack architecture pattern consists of the following technologies:
Spark – a fast, general-purpose engine for large-scale distributed data processing
Mesos – a cluster resource manager that provides resource isolation and sharing across the stack
Akka – a toolkit and runtime for building highly concurrent, distributed, message-driven applications on the JVM
Cassandra – a distributed, highly available database designed to handle large amounts of data across datacenters
Kafka – a high-throughput, distributed message broker for ingesting high-velocity event streams
The SMACK stack simplifies the processing of both real-time and batch workloads as a single data stream, in real or near-real time.
SMACK stack: How to put the pieces together
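To make the pattern concrete, here is a minimal sketch in Scala of the canonical SMACK pipeline: Spark Streaming consumes events from a Kafka topic and persists per-key counts to a Cassandra table via the DataStax spark-cassandra-connector. The broker address and the topic, keyspace and table names (“events”, “smack”, “event_counts”) are illustrative assumptions, not part of any particular deployment.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

    object SmackPipeline {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("smack-pipeline")
          .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local Cassandra node
        val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092", // assumed local Kafka broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "smack-demo"
        )

        // Direct stream from the (hypothetical) "events" topic
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
        )

        // Count events per key in each micro-batch and persist to Cassandra
        stream
          .map(record => (record.key, 1L))
          .reduceByKey(_ + _)
          .saveToCassandra("smack", "event_counts", SomeColumns("key", "total"))

        ssc.start()
        ssc.awaitTermination()
      }
    }

In a full deployment, Akka services typically sit in front of Kafka as the ingestion endpoints, and the whole job is scheduled on a Mesos-managed cluster (for example, spark-submit --master mesos://...).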
Capturing value in fast data
The best way to capture the value of incoming data is to react to it the instant it arrives. Processing incoming data in batches introduces latency, and that delay drains away the value of the data.
Essentially, to handle data arriving at thousands to millions of events per second, two technologies are required: a streaming system capable of delivering events as fast as they come in, and a data store capable of processing each item as fast as it arrives.
Delivering the fast data
Two popular streaming systems, Apache Storm and Apache Kafka, have matured in the market over the past few years to deliver fast data. Apache Storm (developed at Twitter) reliably processes unbounded streams of data at rates of millions of messages per second. Kafka (developed at LinkedIn) is a high-throughput distributed message queue. Both address the need to move fast data; Kafka, however, stands apart as the ingestion backbone, because it durably buffers events so downstream systems can consume them at their own pace.
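As an illustration of the delivery side, the sketch below publishes events with Kafka’s standard producer API from Scala. The broker address and the “events” topic are assumptions for the example, not a prescribed configuration.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object EventProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("acks", "all") // broker confirms replication before acknowledging

        val producer = new KafkaProducer[String, String](props)
        try {
          // Publish one click event to the (hypothetical) "events" topic
          producer.send(new ProducerRecord("events", "user-42", "page:/checkout"))
        } finally {
          producer.close()
        }
      }
    }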
Processing the fast data
It is worth noting that Hadoop MapReduce has solved the batch-processing needs of customers across industries very well. The demand for more flexible fast data processing, however, paved the way for the big data darling Apache Spark. Owing to its in-memory data processing power, Spark has caught fire across the big data world, and it has overtaken Hadoop MapReduce for interactive data mining and iterative machine learning workloads.
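Much of that speed-up for iterative and interactive work comes from Spark’s ability to cache a working set in cluster memory, so each pass reuses it instead of re-reading from disk as MapReduce would. A minimal sketch, where the HDFS path and search patterns are placeholders:

    import org.apache.spark.sql.SparkSession

    object IterativeScan {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("iterative-scan")
          .master("local[*]") // swap for mesos://... on a real cluster
          .getOrCreate()

        // Load once and cache in memory; the path is a placeholder
        val logs = spark.sparkContext
          .textFile("hdfs:///data/app-logs/*.log")
          .cache()

        // Each pass hits the in-memory copy rather than going back to HDFS
        val patterns = Seq("ERROR", "TIMEOUT", "RETRY")
        patterns.foreach { p =>
          println(s"$p -> ${logs.filter(_.contains(p)).count()} lines")
        }

        spark.stop()
      }
    }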
Lambda Architecture is a data processing architecture with both stream and batch processing capabilities. However, most Lambda solutions cannot serve two needs at the same time:
Handling a massive data stream in real time.
Handling multiple, differing data models from multiple data sources.
Apache Spark meets both of these needs: it takes responsibility for real-time analysis of both recent and historical information. The analysis results are typically persisted in Apache Cassandra, which lets users retrieve that information at any time, even after a failure. Spark is increasingly adopted as an alternative processing framework to MapReduce due to its ability to speed up batch, interactive and streaming analytics, and its rich, easy-to-use programming libraries enable new analytics use cases such as machine learning and graph analysis.
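To close the loop, the same connector lets Spark read the persisted results back out of Cassandra for batch or interactive analysis over recent and historical data alike. A sketch, again assuming the illustrative smack.event_counts table from earlier:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

    object HistoricalQuery {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("historical-query")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        // Scan the (hypothetical) smack.event_counts table as (key, total) pairs
        val counts = sc.cassandraTable[(String, Long)]("smack", "event_counts")
          .select("key", "total")

        // Report the ten busiest keys across all persisted history
        counts.sortBy(_._2, ascending = false).take(10).foreach {
          case (key, total) => println(s"$key: $total")
        }

        sc.stop()
      }
    }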