Ganga Reddy, Nov 7, 2018

People who are familiar with the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka) often find themselves working across the big data spectrum. It has proven to be a useful stack, as each of these frameworks has been tested and proven to scale.

Spark is an open-source, distributed, general-purpose computing framework. It is compatible with Hadoop storage but relies on in-memory computation, so it can handle both iterative and exploratory data processing. It is built on resilient data structures such as RDDs (higher-level APIs like Datasets and GraphFrames ultimately run on RDDs) and builds a DAG of transformations before executing a job.

Research paper: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

Mesos is a cluster management tool designed to:
- Abstract data center resources
- Improve cluster utilization by colocating diverse workloads
- Manage apps: deployment, self-healing, scaling, and upgrades
- Provide evergreen extensibility
- Scale elastically from ten to thousands of nodes and more

The Mesos research paper is a must-read to understand the design principles of Mesos.

However, in my experience, I find SKACK (Spark, Kubernetes, Akka, Cassandra, and Kafka) to be more flexible. A direct comparison between Kubernetes and Mesos is not apt, as k8s is a container orchestration tool while Mesos is a full-fledged cluster management tool. But in service-oriented architectures, where teams manage their systems on IaaS (infrastructure as a service) or PaaS (platform as a service), k8s has gained ground easily, thanks to its simple and elegant abstractions.
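
Much of the appeal of these abstractions is that they are declarative: you state how many instances of something should exist, and a control loop works out how to get there. The following is only a toy sketch of that idea in plain Python. It does not use the Kubernetes API, and the names `reconcile`, `apply_actions`, and the workload names are hypothetical, chosen for illustration.

```python
def reconcile(desired, observed):
    """Return the actions needed to move observed replica counts toward desired ones."""
    actions = []
    for name, want in sorted(desired.items()):
        have = observed.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything still running that is no longer declared gets stopped entirely.
    for name, have in sorted(observed.items()):
        if name not in desired and have > 0:
            actions.append(("stop", name, have))
    return actions


def apply_actions(actions, observed):
    """Apply actions to a copy of the observed state and return the new state."""
    state = dict(observed)
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        state[name] = state.get(name, 0) + delta
    # Drop workloads that have no instances left.
    return {name: n for name, n in state.items() if n > 0}


# Declared intent versus what is actually running (illustrative values only).
desired = {"kafka": 3, "cassandra": 3}
observed = {"kafka": 1, "legacy-job": 2}

actions = reconcile(desired, observed)
observed = apply_actions(actions, observed)
```

After one pass the observed state matches the declared one, and a second call to `reconcile` returns no actions. The user only ever edits `desired`; the loop owns the "how".
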
Although k8s does not offer data locality across different stateful applications the way Mesos does, its ease of use has led to widespread adoption in the industry.

k8s is an open-source cluster management tool for container orchestration. It handles:
- Provisioning and deployment
- Service discovery and DNS resolution
- Scaling
- Monitoring (health and liveness)
- Management (rollouts and rollbacks)

Everything in k8s is designed around its RESTful API server, which is responsible for carrying out what a developer intends, usually described using declarative abstractions. There are no private privileged APIs or other magic system-only calls. Understanding abstractions such as Pods, Jobs, Services, ReplicaSets, Deployments, and StatefulSets gives a good mental model of Kubernetes.

Cassandra is a free and open-source, distributed, wide-column store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.

Apache Kafka is a distributed publish-subscribe (pub-sub) messaging system that can handle a high volume of data. It is suitable for both offline and online message consumption. Data (in the form of messages) is persisted on disk and replicated within the cluster to prevent data loss in the event of node or network failure. It integrates very well with Spark for real-time streaming data analysis.

SKACK

- Spark: general-purpose data processing framework (batch processing).
- Akka: the actor system, a toolkit designed for parallelism and concurrency at scale
  (streaming and other actor models).
- Kafka: general-purpose publish-subscribe messaging system (intermediate storage).
- Cassandra: horizontally scalable NoSQL database for data persistence (permanent storage).

All of the above-mentioned applications can be containerized and run on k8s.

[Image: SKACK architecture]

[Image: k8s pods for SKACK]

- Spark: Deployment with master and worker pods.
- Akka: Deployment with pods acting as seed nodes and worker nodes.
- Kafka: StatefulSet with each pod acting as a broker.
- Cassandra: StatefulSet with pods forming a ring.

Helm charts for installing the SKACK stack can be found at SKACK charts.
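
To make the pod layout listed above a little more concrete, here is a minimal sketch that builds skeleton `apps/v1` manifests for each component as Python dicts and prints them as JSON. It is not taken from the SKACK charts: the helper `workload`, the image names, the replica counts, and the `-headless` service naming are all assumptions for illustration, and a real deployment would need ports, volumes, configuration, and the Services themselves.

```python
import json


def workload(kind, name, image, replicas):
    """Build a minimal apps/v1 Deployment or StatefulSet manifest as a dict."""
    labels = {"app": name}
    spec = {
        "replicas": replicas,
        "selector": {"matchLabels": labels},
        "template": {
            "metadata": {"labels": labels},
            "spec": {"containers": [{"name": name, "image": image}]},
        },
    }
    if kind == "StatefulSet":
        # A StatefulSet names its governing Service; pods then get stable
        # identities such as kafka-0, kafka-1, ... (service name is assumed here).
        spec["serviceName"] = name + "-headless"
    return {
        "apiVersion": "apps/v1",
        "kind": kind,
        "metadata": {"name": name},
        "spec": spec,
    }


# Placeholder images and replica counts, not the values used by the author.
SKACK = [
    workload("Deployment", "spark-master", "example/spark:latest", 1),
    workload("Deployment", "spark-worker", "example/spark:latest", 3),
    workload("Deployment", "akka-seed", "example/akka-app:latest", 2),
    workload("Deployment", "akka-worker", "example/akka-app:latest", 3),
    workload("StatefulSet", "kafka", "example/kafka:latest", 3),
    workload("StatefulSet", "cassandra", "example/cassandra:latest", 3),
]

if __name__ == "__main__":
    for manifest in SKACK:
        print(json.dumps(manifest, indent=2))
```

The split mirrors the list: Spark and Akka pods are interchangeable, so a Deployment is enough, while Kafka brokers and Cassandra ring members need stable identities and storage, which is what a StatefulSet is for.
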