Big Data becomes Fast Data
Big Data is changing. Buzzwords such as Hadoop, Storm, Pig and Hive are no longer the darlings of the industry; they are being replaced by a powerful duo: Fast Data and SMACK. Such a rapid change in such a (relatively) young ecosystem raises several questions: What is wrong with the current approach? What is the difference between Fast Data and Big Data? And what is SMACK?
Google retired MapReduce at I/O 2014: by then, the company had already switched to its new Dataflow framework and removed the existing MapReduce jobs. This announcement caused a stir, since Hadoop and its ecosystem were still seen as ‘innovative’. After a few apocalyptic blog posts and some vigorous debate, calm was restored. Many companies dipped their toes into the Big Data universe and learned a valuable lesson: many of these technologies are too limited for the time frames in which results are expected. A new concept was needed. This article shows how Big Data (on Hadoop) became Fast Data (with SMACK).
In the beginning there was Lambda
Over the years, the Big Data world evolved into a confusing zoo of interwoven frameworks and infrastructure components: HDFS, Ceph, ZooKeeper, HBase, Storm, Kafka, Pig, Hive, etc. Many of these components are highly specialized and cover only a subset of the intended functionality. Only their combination, which is not without problems, allows more complicated use cases to be implemented. Over time, it became apparent that many frameworks fall into one of two groups. On the one side are the frameworks that respond immediately (box: “Real Time”). This category contains Storm, Samza and various CEP engines, but also reactive frameworks like Akka, Vert.x or Quasar. The other group consists of frameworks that require some time to respond. Here everything is based on MapReduce, e.g. Pig or Hive. Since both groups usually appeared together, a certain style of architecture came into being, which Nathan Marz named the Lambda Architecture (Fig. 1).
Unfortunately, most people understand something quite different when they hear the term Real Time. The term refers to the capability of a system to deliver results within a fixed timeframe. Braking controllers, medical devices and many parts of satellites need to be Real Time-capable to prevent catastrophes. A brake needs to react quickly when the pedal is pushed; otherwise the driver is in serious trouble. It is not about “the answer comes as quickly as possible”, but about “the answer will come within the time span”. The term Near Time would be much more appropriate for what Big Data and Fast Data applications are trying to achieve.
In the Lambda Architecture, incoming data (1) is consumed by two layers. Within the Batch Layer, long-running analyses are processed on the stored raw data. The results of these analyses are provided to the Serving Layer (4), where clients can request them. The Speed Layer (3) does without performance-hungry persistence solutions and does most of its work in main memory. Since it can never hold all data at the same time, it must focus on a subset. Certain results can thus be determined a lot faster. In the end, these aggregates also move into the serving layer, where they can be related to the results of the batch layer. On closer inspection, the bidirectional relation between serving and speed layer stands out: certain values, such as distributed counters, are held as distributed in-memory data structures and can be read live.
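The interplay of the three layers can be sketched in a few lines of Python. This is a minimal illustration only and does not come from any of the frameworks named above: the event format, the function names `batch_view`, `speed_view` and `serve`, and the idea of a fixed batch cut-off timestamp are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical events: (timestamp in seconds, page that was viewed).
events = [(1, "home"), (2, "cart"), (3, "home"), (11, "home"), (12, "cart")]

# Assumption: the last batch run covered everything before t=10.
BATCH_CUTOFF = 10


def batch_view(all_events, cutoff):
    """Batch layer: slow but complete recomputation over the stored raw data."""
    return Counter(page for ts, page in all_events if ts < cutoff)


def speed_view(all_events, cutoff):
    """Speed layer: in-memory counts for the subset the batch run has not seen."""
    return Counter(page for ts, page in all_events if ts >= cutoff)


def serve(batch, speed):
    """Serving layer: relate both views when a client asks for a result."""
    return batch + speed


result = serve(batch_view(events, BATCH_CUTOFF), speed_view(events, BATCH_CUTOFF))
print(dict(result))
```

Note that the counting logic appears twice, once per layer. In this toy it is a one-liner, but with two different frameworks behind the two layers it becomes two separate implementations.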
Redundant logic can be problematic
In these three layers, many technically sophisticated frameworks have become established and have been successful in recent years. But growing success brings greater demands. New analyses had to be formulated ever faster and more flexibly. Results needed to be available in different resolutions: second, minute, hour, or day. MapReduce quickly reached its limits. Thanks to its flexibility and significantly lower response times, more and more analyses moved into the speed layer. However, this did not result in less logic in the batch layer. Since in-memory processing is limited by the size of main memory, many analyses still need to be carried out in batches.
The often incompatible programming models meant that a lot of logic had to be implemented several times. Such redundancy could lead to markedly different results when evaluating the same data sources. At this point, it is clear why a unified programming model that covers large areas of the analysis is desirable.
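To make the argument concrete, the following sketch defines the analysis logic exactly once and reuses it for a complete batch run and for chunk-by-chunk, stream-style processing. The sensor data and the names `update`, `run_batch` and `run_stream` are assumptions for illustration; a real platform would express the same idea in the API of its framework.

```python
from collections import defaultdict

# Hypothetical sensor readings: (sensor id, measured value).
readings = [("a", 3), ("b", 5), ("a", 7), ("b", 1), ("a", 2)]


def update(state, reading):
    """The single place where the logic lives: track the maximum per sensor."""
    sensor, value = reading
    state[sensor] = max(state[sensor], value)
    return state


def run_batch(all_readings):
    """Batch style: fold over the complete data set in one go."""
    state = defaultdict(lambda: float("-inf"))
    for reading in all_readings:
        update(state, reading)
    return dict(state)


def run_stream(chunks):
    """Stream style: the same update function, applied chunk by chunk."""
    state = defaultdict(lambda: float("-inf"))
    for chunk in chunks:
        for reading in chunk:
            update(state, reading)
    return dict(state)


batch_result = run_batch(readings)
stream_result = run_stream([readings[:2], readings[2:]])
print(batch_result == stream_result)
```

Because both paths call the same `update` function, they cannot drift apart the way two independent implementations can.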
The birth of Fast Data
In 2010, the AMPLab at the University of California, Berkeley published a new open source analysis tool intended to solve exactly this problem: Spark. Spark was donated to the Apache Software Foundation in 2013 and has developed impressively ever since. In essence, everything in Spark revolves around the so-called Resilient Distributed Datasets (RDDs): distributed, resilient, parallelized data structures. These can be used in conjunction with many different modules:
- Processing of graphs (GraphX)
- Spark SQL, to deal with data from various structured data sources (Hive, JDBC, Parquet, etc.)
- Streaming (Kafka, HDFS, Flume, ZeroMQ, Twitter)
- Machine learning based on MLlib
Besides Scala, Spark also supports Python and R. This makes it easier for data scientists who are already familiar with one of the two. As the list above shows, Spark is a very well-connected framework and is thus able to combine many of the existing data sources under a unified API. The resulting analysis framework has caught on rapidly.
The combination of structured data sources and streams makes it possible to merge much of the speed layer with the batch layer behind a single interface. Analyses can be performed at almost any resolution. Even non-developers can develop and deploy Spark jobs in an astonishingly short time. The arguments for Spark are quite clear:
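The core idea behind RDDs, a partitioned data set that remembers how it was derived so that lost parts can be recomputed, can be illustrated without Spark itself. The class below is a toy written for this article and is explicitly not Spark's API: `ToyDataset`, `lose_partition` and the caching scheme are assumptions chosen to keep the sketch short.

```python
class ToyDataset:
    """A toy stand-in for the RDD idea; this is NOT Spark's API."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions  # raw data, only set on the root data set
        self._parent = parent      # lineage: where this data set came from
        self._fn = fn              # lineage: how it is derived from the parent
        self._cache = {}           # partition index -> computed values

    def map(self, fn):
        """Record a transformation; nothing is computed yet."""
        return ToyDataset(parent=self, fn=fn)

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def compute(self, index):
        """Compute one partition, reusing the cached result if it still exists."""
        if index not in self._cache:
            if self._parent is None:
                values = list(self._source[index])
            else:
                values = [self._fn(v) for v in self._parent.compute(index)]
            self._cache[index] = values
        return self._cache[index]

    def collect(self):
        """Gather all partitions into one list."""
        return [v for i in range(self.num_partitions()) for v in self.compute(i)]

    def lose_partition(self, index):
        """Simulate a node failure that wipes one computed partition."""
        self._cache.pop(index, None)


numbers = ToyDataset(partitions=[[1, 2], [3, 4], [5, 6]])
squares = numbers.map(lambda x: x * x)

first = squares.collect()
squares.lose_partition(1)   # partition 1 is gone ...
second = squares.collect()  # ... and is rebuilt from its lineage
print(first == second)
```

Only the lost partition is recomputed; the other two are served from the cache. In a real cluster the partitions would live on different machines and be processed in parallel.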
- Scalable enough to deal with millions of records
- Fast enough to provide answers in Near Time
- Suitable to implement analyses of any duration
- A unified, comprehensible programming model to handle various data sources
But even Spark has its limits. Tools for data delivery and persistence are still necessary. This is where we can draw on the experience of recent years.
Enter the SMACK
One often hears of Spark in conjunction with the SMACK stack — the combination of known technologies from different areas of Big Data analysis into a powerful base framework. The acronym SMACK stands for Spark, Mesos, Akka, Cassandra, and Kafka.
At first glance, one might assume that somebody has opened the hipster technologies box. You have probably heard of most of the aforementioned frameworks before. Nevertheless, I would like to briefly explain the tasks they fulfill within the stack.
Apache Mesos is a kernel for distributed systems. It represents an abstraction layer over a cluster of machines. Rather than being deployed on a single node, an application is passed to Mesos together with a description of its requirements (number of CPUs, required RAM, etc.) and distributed to appropriate machines. That way, several thousand machines can be put to targeted use fairly easily. Mesos is the central nervous system of the stack. Every component of the SMACK stack is available on Mesos and well integrated into its resource management. In addition, the commercial version by Mesosphere is already available on Amazon AWS and Microsoft Azure. A convenient cloud data center can thus be built in record time.
The reactive framework Akka is based on the actor model known from Erlang. In recent years, Akka has evolved into one of the leading frameworks for distributed, resilient applications. In this context, it is mainly used in the ingestion layer and as the access layer in the serving layer.
Another member of the Apache ecosystem is Cassandra. Cassandra is a distributed, resilient, scalable database capable of storing gigantic amounts of data. It supports distribution across multiple data centers and survives the concurrent failure of multiple nodes. In this case, it is used as the primary data store.
Apache Kafka is often considered a distributed messaging system, and that is true for the most part. In fact, it is nothing more than a distributed commit log. Its simple structure allows users to transfer huge amounts of data between a number of systems and to scale linearly while doing so.
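The principle of handing over an application together with its requirement description can be sketched as a simple first-fit placement. This is a deliberately naive illustration: Mesos's actual scheduling, which is based on resource offers made to frameworks, is considerably more involved, and the types `Node` and `AppSpec` as well as the `place` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: float
    free_mem_mb: int


@dataclass
class AppSpec:
    name: str
    cpus: float      # CPUs required per instance
    mem_mb: int      # RAM required per instance
    instances: int   # how many copies should run


def place(app, nodes):
    """First-fit placement: put each instance on the first node with enough room."""
    placements = []
    for _ in range(app.instances):
        for node in nodes:
            if node.free_cpus >= app.cpus and node.free_mem_mb >= app.mem_mb:
                node.free_cpus -= app.cpus
                node.free_mem_mb -= app.mem_mb
                placements.append(node.name)
                break
        else:
            # No node had enough free resources for this instance.
            raise RuntimeError(f"not enough resources for {app.name}")
    return placements


cluster = [Node("node-1", 2, 4096), Node("node-2", 4, 8192)]
spark_job = AppSpec("spark-job", cpus=1.5, mem_mb=2048, instances=3)
placements = place(spark_job, cluster)
print(placements)
```

The caller never names a machine. It only states what the application needs, and the cluster layer decides where the instances end up.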
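What "commit log" means here can be shown with a single-process toy: records are only ever appended, every record gets an offset, and each consumer keeps track of its own position. The classes `CommitLog` and `Consumer` are assumptions for illustration and not Kafka's client API; partitioning, replication and durability are left out entirely.

```python
class CommitLog:
    """A toy, single-process append-only log; not Kafka's client API."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer remembers its own position; the log keeps no consumer state."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        records = self._log.read(self.offset, max_records)
        self.offset += len(records)
        return records


log = CommitLog()
for event in ["click:home", "click:cart", "click:checkout"]:
    log.append(event)

fast = Consumer(log)
slow = Consumer(log)

fast_batch = fast.poll()               # reads everything available so far
slow_batch = slow.poll(max_records=1)  # reads at its own pace
log.append("click:thanks")
late_batch = fast.poll()               # only sees what was appended since
print(fast_batch, slow_batch, late_batch)
```

Since reading does not remove anything, any number of consumers can process the same data independently and at different speeds.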
When put together, they form a solid base for Fast Data infrastructures (Fig. 2): Akka (1) consumes incoming data like MQTT events, click streams or binaries and writes it directly into corresponding topics in Kafka (2). Now that the data is persisted, we can decide how fast we want to get different answers. Various Spark jobs (3) consume the data and evaluate it at different resolutions:
- Raw data persistence: A job that writes incoming raw data to S3, HDFS or Ceph, and prepares it for later processing.
- Speed Layer: Implementation of “quick win” analyses, whose results are available within seconds.
- Batch Layer: Long-term analysis or machine learning
Results are written to HDFS (5) and Cassandra (4), and can be used as input for other jobs. Finally, Akka appears again as the HTTP layer that presents the data, e.g. in a web interface.
In addition to the technical core components, automation is a key factor in the success or failure of a real Fast Data platform, and Mesos already provides many important building blocks for it. Nevertheless, we will continue to need tools like Terraform, Ansible, Kubernetes and comprehensive monitoring infrastructure. At this point it should be clear where I am heading: without DevOps, it is difficult to achieve these goals. Cooperation between developers and operators is essential for a system that is intended to scale elastically and run on hundreds of machines.
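Evaluating the same data at different resolutions boils down to grouping events into time windows of different lengths. The following sketch uses plain Python and made-up sample data to show the idea; the `aggregate` function and the event format are assumptions, and in the stack described here this work would be done by Spark jobs reading from Kafka.

```python
from collections import defaultdict

# Hypothetical events: (timestamp in seconds since start, value).
events = [(5, 1), (42, 2), (61, 3), (125, 4), (3599, 5), (3600, 6)]


def aggregate(all_events, window_seconds):
    """Sum the values per tumbling window of the given length."""
    totals = defaultdict(int)
    for ts, value in all_events:
        window_start = ts - ts % window_seconds  # start of the window ts falls into
        totals[window_start] += value
    return dict(totals)


per_minute = aggregate(events, 60)
per_hour = aggregate(events, 3600)
print(per_minute)
print(per_hour)
```

The same function serves both the quick per-minute view and the coarser per-hour view; only the window length changes.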
To Scala or not
Scala notoriously divides opinion. However, I deliberately do not want this article to start yet another debate on language features. In this particular case, the normative power of the factual applies, because every framework used is either written in Scala or is very Scala-like:
- Akka: written in Scala; primary API in Scala, but also available in Java
- Spark: written in Scala; primary API available in Scala, Python and R
- Kafka: written in Scala; integrated into Spark, so little direct coding is necessary (since 0.9, however, the primary API is Java)
- Cassandra: written in Java; interaction primarily takes place via CQL and the Spark integration
A SMACK developer will not get past Scala code. Scala is the pragmatic choice if you want to succeed with the stack.
Hadoop is not dead. In the future, HDFS, YARN, ZooKeeper and Co. will still remain important components of our Big Data world. However, what changes is the way in which they are being used. Existing components will, thanks to Spark, be usable within one unified programming model. Spark is just a tool. Much more important are the goals behind the catchphrase “Fast Data”:
- Low entry barrier for Data Scientists
- Differences between Speed and Batch Layer will disappear
- Exploratory analyses will be significantly easier
- The deployment of new jobs will be easier and faster
- Existing infrastructure is easier to use
SMACK offers a combination of all those goals and relies on proven technologies. The key lies in their use and in their highly automated combination. The result is a platform which is hard to beat in its flexibility.