1/15/2018

Reading time:4 min

Open Source Partners: Focus on fast big data

by John Doe


Tim Walton - Technology Solutions Professional Open Source

In the October community call, we discussed the opportunity for partners to produce domain-specific solutions using Azure Container Service. The November focus for the Open Source Solutions Partner Community is OSS and big data. In this post, I'll provide an in-depth look at a trending solution called the SMACK stack, a valuable addition to every partner's digital transformation toolkit.

Sign up for the November 30 community call

Watch the October community call on demand

SMACK is a technology solution stack that comprises Spark, Mesos, Akka, Cassandra, and Kafka. It is a data-processing architecture designed to handle massive quantities of data that can take advantage of both batch and stream processing methods. It becomes incredibly important when trying to solve problems such as ingesting and querying data produced from the Internet of Things and today's big data producers.

Data ingestion from various systems has typically been handled by Extract, Transform, and Load (ETL) systems. However, ETL pipelines have some inherent problems:

Data loss
Duplicate data after failover
Reduced throughput
High cost of scaling
Increased pipeline complexity
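
To make the "duplicate data after failover" problem concrete, here is a minimal sketch in plain Python. It is not taken from any ETL product: the event shape, the `event_id` key, and the function names are all assumptions made for illustration. It replays events from the last checkpoint after a simulated crash, then compares a sink that appends every delivery with a sink that writes by key.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (event_id, value); event_id is a hypothetical unique key


def replay_after_failover(events: List[Event], crash_at: int, last_checkpoint: int) -> List[Event]:
    """Deliver events, 'crash' at crash_at, then resume from the last checkpoint."""
    delivered = events[:crash_at]            # sent before the crash
    delivered += events[last_checkpoint:]    # resent from the checkpoint onward
    return delivered


def append_sink(delivered: List[Event]) -> List[Event]:
    """Naive sink: stores every delivery, so replayed events become duplicates."""
    return list(delivered)


def idempotent_sink(delivered: List[Event]) -> Dict[str, int]:
    """Keyed sink: writing the same event_id twice still leaves a single row."""
    table: Dict[str, int] = {}
    for event_id, value in delivered:
        table[event_id] = value
    return table


events = [(f"e{i}", i * 10) for i in range(5)]
delivered = replay_after_failover(events, crash_at=4, last_checkpoint=2)
```

With five source events, the append-only sink ends up holding seven rows, while the keyed sink holds five. Writing by a stable key is one common way to make replays harmless; it is shown here only as an illustration of the problem, not as a recommendation from the original post.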

The SMACK stack is an attempt to rationalize various data processing scenarios. It is made up of highly scalable, reactive frameworks to deliver a fast, Highly-Available Redundantly-Distributed (HARD) system.

The diagram below shows how the SMACK stack relates to the first-party services in Microsoft Azure. Partners are increasingly plugging Microsoft first-party services into the SMACK stack where it makes sense, for example by running Apache Spark on Azure HDInsight.

[Diagram: the SMACK stack and the related first-party services in Microsoft Azure]

Spark

Apache Spark is the processing layer, an open source cluster computing framework that addresses the disk-based limitations of traditional MapReduce solutions. Specifically, Spark provides distributed in-memory primitives that drastically improve performance for iterative and interactive workloads. Spark also provides a unified interface for SQL queries, machine learning, graph analysis, and streaming (micro-batched) processing.

Advantages:

Distributed analytics platform
Simple abstraction of datasets
Multiple language support
Streaming support
Machine learning
Integrated SQL queries
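
The "micro-batched" streaming idea can be sketched without Spark itself. The following is plain Python, not the Spark API; the event shape, the one-second interval, and the function names are assumptions for illustration. Events are grouped into fixed-width time buckets, and the same aggregation function that would run over a full dataset is run over each bucket.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, word); the shape is illustrative


def count_words(words: Iterable[str]) -> Counter:
    """The 'batch' job: the same function is reused for every micro-batch."""
    return Counter(words)


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Assign each event to a fixed-width time bucket (one micro-batch per bucket)."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for ts, word in events:
        batches[int(ts // interval)].append(word)
    return dict(batches)


stream = [(0.1, "spark"), (0.7, "kafka"), (1.2, "spark"), (1.9, "spark"), (2.5, "akka")]
per_batch = {b: count_words(ws) for b, ws in sorted(micro_batches(stream, 1.0).items())}
running_total = sum(per_batch.values(), Counter())
```

Because the batch function is reused unchanged, the same logic serves both the batch and the streaming path, which is the property the unified interface is meant to provide.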

Learn more about Apache Spark

R & Spark as Yin and Yang of Scalable Machine Learning in Azure HDInsight

Leverage R and Spark in Azure HDInsight for scalable machine learning

Big, Fast, and Data-Furious...with Spark

Build interactive data analysis environments using Apache Spark

Mesos

Mesos can be thought of as the resource manager or service fabric for the other frameworks. Apache Mesos is an open-source cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It enables fine-grained resource sharing, which improves cluster utilization.
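
Mesos shares a cluster by offering each machine's free resources to the frameworks running on it. The toy model below is a rough sketch of that idea in plain Python; it is not the Mesos API, and the class names, task sizes, and the simple "offer every agent to every framework in turn" policy are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Agent:
    name: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    name: str
    pending: List[Tuple[str, float, int]]  # (task, cpus, mem_mb) waiting to run
    placed: Dict[str, str] = field(default_factory=dict)  # task -> agent name

    def consider(self, agent: Agent) -> None:
        """Use as much of the offer as pending tasks need; leave the rest unclaimed."""
        still_pending = []
        for task, cpus, mem in self.pending:
            if cpus <= agent.cpus and mem <= agent.mem_mb:
                agent.cpus -= cpus
                agent.mem_mb -= mem
                self.placed[task] = agent.name
            else:
                still_pending.append((task, cpus, mem))
        self.pending = still_pending


def offer_round(agents: List[Agent], frameworks: List[Framework]) -> None:
    """Offer each agent's free resources to every framework in turn."""
    for agent in agents:
        for fw in frameworks:
            fw.consider(agent)


agents = [Agent("agent-1", cpus=4.0, mem_mb=8192), Agent("agent-2", cpus=2.0, mem_mb=4096)]
spark = Framework("spark", [("executor-1", 2.0, 4096), ("executor-2", 2.0, 4096)])
kafka = Framework("kafka", [("broker-1", 1.0, 2048)])
offer_round(agents, [spark, kafka])
```

In this run both Spark executors land on `agent-1`, and the Kafka broker takes part of `agent-2`, leaving one CPU and 2048 MB free there for a later offer. A real scheduler also handles fairness, declined offers, and task failure, all of which this sketch omits.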

Learn more about Apache Mesos

Mesos videos on Channel 9

On-demand webcast: Mesosphere and Azure Container Service

Akka

Akka is the ingestion layer: an open-source toolkit and runtime that simplifies building concurrent and distributed applications on the JVM. Akka is message-driven, emphasizes actor-based concurrency, and is similar in spirit to Azure Service Fabric.

Fault tolerant
Hierarchical supervision
Customizable failure strategies and detection
Asynchronous data passing
Parallelization
Adaptive/predictive
Load-balanced across cluster nodes
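
Several of the properties above (asynchronous message passing, supervision, and a configurable failure strategy) can be illustrated with a toy actor in plain Python. This is not Akka's API, which runs on the JVM; the class names and the single "restart on failure" strategy are assumptions for illustration, and the mailbox is drained on one thread to keep the example deterministic.

```python
from collections import deque
from typing import Callable, Deque


class CounterActor:
    """Keeps private state; the only way to change it is to send a message."""

    def __init__(self) -> None:
        self.total = 0

    def receive(self, message: object) -> None:
        if not isinstance(message, int):
            raise ValueError(f"unsupported message: {message!r}")
        self.total += message


class Supervisor:
    """Owns a child actor's mailbox and restarts the child when it fails."""

    def __init__(self, factory: Callable[[], CounterActor]) -> None:
        self.factory = factory
        self.child = factory()
        self.mailbox: Deque[object] = deque()
        self.restarts = 0

    def tell(self, message: object) -> None:
        """Fire-and-forget send: enqueue the message and return immediately."""
        self.mailbox.append(message)

    def run(self) -> None:
        """Drain the mailbox one message at a time."""
        while self.mailbox:
            message = self.mailbox.popleft()
            try:
                self.child.receive(message)
            except ValueError:
                # "Restart" strategy: discard the failed child and its state.
                self.child = self.factory()
                self.restarts += 1


supervisor = Supervisor(CounterActor)
for msg in (1, 2, "corrupt", 5):
    supervisor.tell(msg)
supervisor.run()
```

The bad message fails only the child, which is replaced with a fresh instance, and processing continues with the next message. The restart discards the child's earlier total, which is the trade-off a restart strategy makes; other strategies, such as resuming with the existing state, would behave differently.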

Learn more about Akka

Akka videos on Channel 9

Cassandra

Apache Cassandra is the storage layer of the stack, an open source distributed database management system designed to handle large amounts of data across many commodity servers. Cassandra is used to persist distributed events, providing high availability with no single point of failure.

Massively scalable
High performance
Always on
Masterless
Multiple datacenter cluster support
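
"Masterless" and "no single point of failure" follow from how rows are placed: each node owns a position on a hash ring, and each row is stored on several consecutive nodes. The sketch below shows that placement idea in plain Python. It is not Cassandra's implementation; the MD5 hash, the node names, and the `Ring` class are assumptions for illustration.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes: List[str], replication_factor: int) -> None:
        self.replication_factor = replication_factor
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str) -> List[str]:
        """Walk clockwise from the key's token and take the next N distinct nodes."""
        start = bisect.bisect_right(self.tokens, token(partition_key))
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("sensor-42")
survivors = [n for n in owners if n != owners[0]]  # simulate losing one replica
```

Any node can compute `replicas` for any key, so no coordinator has to be consulted, and losing one of the three owners still leaves two copies of the row available.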

Learn more about Apache Cassandra

Cassandra videos on Channel 9

December 16: Building geo-distributed public cloud apps on Cassandra

Kafka

Apache Kafka is the transport layer and buffer for event streams. It provides:

High-throughput distributed messaging
Decoupled data pipelines
Handling of massive data loads
Support for a massive number of consumers
Distribution and partitioning across cluster nodes
Automatic recovery from broker failures
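
The partitioning and decoupling points above can be sketched as a partitioned, append-only log with per-consumer offsets. This is plain Python, not the Kafka client API; the `Topic` and `Consumer` classes, the CRC32 key hash, and the sample records are assumptions for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


class Topic:
    """Append-only log split into partitions; records with the same key stay ordered."""

    def __init__(self, partitions: int) -> None:
        self.logs: List[List[Tuple[str, str]]] = [[] for _ in range(partitions)]

    def send(self, key: str, value: str) -> int:
        """Pick a partition from the key hash and append; returns the partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.logs)
        self.logs[partition].append((key, value))
        return partition


class Consumer:
    """Tracks its own offset per partition, so a slow reader never blocks producers."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offsets: Dict[int, int] = defaultdict(int)

    def poll(self, partition: int, max_records: int = 10) -> List[Tuple[str, str]]:
        start = self.offsets[partition]
        records = self.topic.logs[partition][start:start + max_records]
        self.offsets[partition] = start + len(records)
        return records


topic = Topic(partitions=3)
p = topic.send("truck-7", "lat=47.6")
topic.send("truck-7", "lat=47.7")
analytics, archiver = Consumer(topic), Consumer(topic)
first = analytics.poll(p, max_records=1)
rest = analytics.poll(p)
everything = archiver.poll(p)
```

Both consumers read the same records independently and at their own pace, because reading only advances that consumer's offset and never removes data from the log. That is the sense in which the log acts as a buffer between producers and consumers.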

Learn more about Apache Kafka

The SMACK stack simplifies streaming analytics, but there is a need for partners with full-stack knowledge and domain expertise. For example, ESRI, a Microsoft Gold Application Development Partner, recently demonstrated a fantastic partner-created solution with its forthcoming ArcGIS service, which uses DC/OS by way of Azure Container Service. ArcGIS takes advantage of data systems such as Spark Streaming, Kafka, and Elasticsearch, as well as Azure IoT Hub, to analyze and visualize geospatial data in real time. This is a packaged offering that ESRI can provide to its customers as a managed service.

Whether your partner business focuses on data platform, advanced analytics, IoT, or application development, understanding SMACK is critical for your architects. The attributes of each of the frameworks that make up the SMACK stack act as patterns for reactive, Highly-Available Redundantly-Distributed systems.

Demand for SMACK expertise is growing rapidly. The stack provides deep business value and is a natural fit for the hyperscale properties of Microsoft Azure.

Microsoft Ignite sessions on demand

Streaming in the Cloud: We've Got It All Covered

Azure Container Service sessions

Training recommendations

Webcast series about open source on Microsoft Azure

Training and certification for Azure
