In the October community call, we discussed the opportunity for partners to produce domain-specific solutions using Azure Container Service. The November focus for the Open Source Solutions Partner Community is OSS and big data. In this post, I'll provide an in-depth look at a trending solution called the SMACK stack - a valuable addition to every partner's digital transformation toolkit.
Sign up for the November 30 community call
Watch the October community call on demand
SMACK is a technology stack that comprises Spark, Mesos, Akka, Cassandra, and Kafka. It is a data-processing architecture designed to handle massive quantities of data using both batch and stream processing. It is especially important for problems such as ingesting and querying data produced by the Internet of Things and today's other big data producers.
Data ingestion from various systems was typically achieved through Extract, Transform, and Load (ETL) systems. However, ETL has some inherent problems:
- Data loss
- Duplicate data after failover
- Reduced throughput
- High cost to scale
- Added pipeline complexity
The SMACK stack is an attempt to rationalize various data processing scenarios. It is made up of highly scalable, reactive frameworks that together deliver a fast, Highly-Available Redundantly-Distributed (HARD) system.
The diagram below shows how the SMACK stack relates to the first-party services in Microsoft Azure. Partners are increasingly plugging Microsoft first-party services into the SMACK stack where it makes sense. One example is running Apache Spark on Azure HDInsight.
Spark
Apache Spark is the processing layer, an open source cluster computing framework that addresses the disk-based limitations of traditional MapReduce solutions. Specifically, Spark focuses on providing distributed shared memory primitives that drastically improve performance and make interactive work with the data practical. Spark provides a unified interface for SQL queries, machine learning, graph analysis, and streaming (micro-batched) processing.
Advantages:
- Distributed analytics platform
- Simple abstraction of datasets
- Multiple language support
- Streaming support
- Machine learning
- Integrated SQL queries
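To make the "streaming (micro-batched)" idea concrete, here is a minimal sketch in plain Python. It does not use Spark's API; the function names, the sample events, and the one-second interval are all assumptions chosen for illustration. It only shows the general technique of cutting an event stream into small time windows and running an ordinary batch computation on each window.

```python
from collections import Counter
from itertools import groupby


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds.

    Yields (window_start, values) pairs in timestamp order.
    """
    ordered = sorted(events, key=lambda e: e[0])
    for window, group in groupby(ordered, key=lambda e: int(e[0] // interval)):
        yield window * interval, [value for _, value in group]


def count_per_batch(events, interval):
    """Run a small batch job (a count per value) on every micro-batch."""
    return {start: Counter(values) for start, values in micro_batches(events, interval)}


# Hypothetical sensor events: (seconds since start, event type)
events = [(0.2, "temp"), (0.9, "temp"), (1.1, "humidity"), (2.5, "temp"), (2.7, "humidity")]
result = count_per_batch(events, interval=1)
```

Each window is processed with the same code a batch job would use, which is why one engine can serve both batch and streaming workloads.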
R & Spark as Yin and Yang of Scalable Machine Learning in Azure HDInsight
Leverage R and Spark in Azure HDInsight for scalable machine learning
Big, Fast, and Data-Furious...with Spark
Build interactive data analysis environments using Apache Spark
Mesos
Mesos can be thought of as the resource manager or service fabric for the other frameworks. Apache Mesos is an open-source cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. The software enables fine-grained resource sharing, which improves cluster utilization.
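As a rough illustration of fine-grained resource sharing, the sketch below places tasks from different frameworks onto whichever agent still has enough free CPU and memory. This is not the Mesos API, and Mesos's actual allocation is more involved; the `Agent` class, the `place` function, the first-fit policy, and all the numbers are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A cluster node with its remaining free resources."""
    name: str
    cpus: float
    mem_mb: int
    tasks: list = field(default_factory=list)


def place(task_name, cpus, mem_mb, agents):
    """First-fit placement: run the task on the first agent with enough free resources.

    Returns the chosen agent's name, or None if no agent can host the task.
    """
    for agent in agents:
        if agent.cpus >= cpus and agent.mem_mb >= mem_mb:
            agent.cpus -= cpus
            agent.mem_mb -= mem_mb
            agent.tasks.append(task_name)
            return agent.name
    return None


# Hypothetical two-node cluster shared by several frameworks
agents = [Agent("agent-1", 4, 8192), Agent("agent-2", 2, 4096)]
placements = [
    place("spark-executor", 3, 6144, agents),
    place("kafka-broker", 2, 2048, agents),
    place("cassandra-node", 1, 2048, agents),
    place("akka-worker", 1, 1024, agents),
]
```

Because tasks are sized in fractions of a machine, several frameworks can share one node instead of each reserving whole servers, which is where the utilization gain comes from.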
On-demand webcast: Mesosphere and Azure Container Service
Akka
Akka is the ingestion layer, an open-source toolkit and runtime that simplifies the construction of concurrent and distributed applications on the JVM. Akka is message focused, emphasizes actor-based concurrency, and is similar to Azure Service Fabric.
Advantages:
- Fault tolerant
- Hierarchical supervision
- Customizable failure strategies and detection
- Asynchronous data passing
- Parallelization
- Adaptive/predictive
- Load-balanced across cluster nodes
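The supervision and message-passing ideas in the list above can be sketched in a few lines of plain Python. This is not Akka code (Akka runs on the JVM); the `Worker` and `Supervisor` classes, the restart limit, and the messages are hypothetical, and the mailbox is drained synchronously to keep the example deterministic.

```python
from collections import deque


class Worker:
    """Child actor: keeps a running total and fails on non-numeric messages."""

    def __init__(self):
        self.total = 0

    def receive(self, message):
        self.total += int(message)  # raises ValueError on bad input


class Supervisor:
    """Parent actor: owns the child's mailbox and applies a restart strategy."""

    def __init__(self, child_factory, max_restarts=3):
        self.child_factory = child_factory
        self.child = child_factory()
        self.mailbox = deque()
        self.restarts = 0
        self.max_restarts = max_restarts

    def tell(self, message):
        """Fire-and-forget send: the message is only enqueued here."""
        self.mailbox.append(message)

    def run(self):
        """Deliver queued messages; on failure, replace the child with a fresh one."""
        while self.mailbox:
            message = self.mailbox.popleft()
            try:
                self.child.receive(message)
            except ValueError:
                if self.restarts >= self.max_restarts:
                    raise  # give up and escalate after too many failures
                self.restarts += 1
                self.child = self.child_factory()  # restart with clean state


supervisor = Supervisor(Worker)
for msg in ["1", "2", "oops", "5"]:
    supervisor.tell(msg)
supervisor.run()
```

The bad message does not stop the pipeline: the failure is contained in the child, the supervisor restarts it, and the remaining messages are still processed.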
Cassandra
Apache Cassandra is the storage layer of the stack, an open source distributed database management system designed to handle large amounts of data across many commodity servers. Cassandra is used to persist distributed events, providing high availability with no single point of failure.
- Massively scalable
- High performance
- Always on
- Masterless
- Multiple datacenter cluster support
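A masterless design means any node can work out where a row lives without asking a coordinator. The sketch below shows one common way to do that, a hash ring with replication. It is an illustration only and not Cassandra's implementation: the `Ring` class, the node names, the replication factor, and the use of MD5 as the hash function are all assumptions.

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring (MD5 used here as a stand-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Nodes placed on a hash ring; each key is stored on several consecutive nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        """Return the nodes responsible for `key`, walking clockwise from its token."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("sensor-42")
```

Every node that knows the membership list computes the same owners for a key, so there is no master to fail, and losing one replica still leaves copies on the others.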
Learn more about Apache Cassandra
December 16: Building geo-distributed public cloud apps on Cassandra
Kafka
Apache Kafka is the transportation layer and buffer for dealing with event streams. It provides:
- High-throughput distributed messaging
- Decoupled data pipelines
- Capacity for massive data loads
- Support for a massive number of consumers
- Distribution and partitioning across cluster nodes
- Automatic recovery from broker failures
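The decoupling and partitioning points above can be illustrated with a small in-memory model of a partitioned log. This is a sketch, not the Kafka client API: `PartitionedLog`, `Consumer`, the CRC32-based partitioner, and the sample records are hypothetical names and choices made for the example.

```python
import zlib


class PartitionedLog:
    """An append-only log split into partitions; records with the same key share a partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # CRC32 is a stand-in for a real partitioner; it is stable across runs.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset, max_records=10):
        return self.partitions[partition][offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own offsets, so consumers never interfere with each other."""

    def __init__(self, log):
        self.log = log
        self.offsets = [0] * len(log.partitions)

    def poll(self):
        records = []
        for p, offset in enumerate(self.offsets):
            batch = self.log.read(p, offset)
            self.offsets[p] += len(batch)
            records.extend(batch)
        return records


log = PartitionedLog(num_partitions=3)
for key, value in [("truck-1", 10), ("truck-2", 20), ("truck-1", 11), ("truck-3", 30)]:
    log.append(key, value)

analytics, archiver = Consumer(log), Consumer(log)
seen_by_analytics = analytics.poll()
seen_by_archiver = archiver.poll()
```

The producer never needs to know who reads the data, and because the log retains records, a slow or newly added consumer can read at its own pace from its own offset.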
The SMACK stack simplifies streaming analytics, but there is a need for partners with a full stack knowledge and domain expertise. For example, ESRI, a Microsoft Gold Application Development Partner, recently demonstrated a fantastic partner-created solution with its forthcoming ArcGIS service, which utilizes DC/OS by way of Azure Container Service. ArcGIS takes advantage of data systems such as Spark Streaming, Kafka, and Elasticsearch, as well as Azure IoT Hub, in order to analyze and visualize geospatial data in real time. This is a packaged offering that ESRI can provide to their customers as a managed service.
Whether your partner business focuses on data platform, advanced analytics, IoT, or application development, understanding SMACK is critical for your architects. The attributes of each of the frameworks that make up the SMACK stack act as patterns for reactive, Highly-Available Redundantly-Distributed systems.
The demand for SMACK expertise is growing rapidly. The stack provides deep business value and is a perfect fit for the hyperscale properties of Microsoft Azure.
Microsoft Ignite sessions on demand
Streaming in the Cloud: We've Got It All Covered
Azure Container Service sessions
Training recommendations
Webcast series about open source on Microsoft Azure