Realtime Data Pipeline with Spark Streaming and Cassandra with Mesos (Rahul Kumar, Sigmoid) | C* Summit 2016

  1. Real-Time Data Pipeline with Spark Streaming and Cassandra with Mesos. Rahul Kumar, Technical Lead, Sigmoid
  2. About Sigmoid: We build reactive real-time big data systems. (© DataStax, All Rights Reserved.)
  3. (1) Data Management (2) Cassandra Introduction (3) Apache Spark Streaming (4) Reactive Data Pipelines (5) Use cases
  4. Data Management: Managing and analyzing data have always offered the greatest benefits and posed the greatest challenges for organizations.
  5. Three V’s of Big Data
  6. Scale Vertically
  7. Scale Horizontally
  8. Understanding Distributed Application: “A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.”
  9. Principles of Distributed Application Design
     • Availability
     • Performance
     • Reliability
     • Scalability
     • Manageability
     • Cost
  10. Reactive Application
  11. Reactive libraries, tools and frameworks
  12. Cassandra Introduction: Cassandra is an open-source, distributed store for structured data that scales out on cheap, commodity hardware. Born at Facebook, it builds on ideas from Amazon’s Dynamo and Google’s BigTable.
  13. Why Cassandra
  14. Highly scalable NoSQL database
      • Cassandra supplies linear scalability
      • Cassandra is a partitioned row store database
      • Automatic data distribution
      • Built-in and customizable replication
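
To make "partitioned row store" and "automatic data distribution" concrete, here is a minimal sketch of a token ring in plain Scala. It is an illustration only and not Cassandra's implementation: the `TokenRing` class, the node names, and the use of `MurmurHash3.stringHash` as the token function are all assumptions made for this example.

```scala
import scala.util.hashing.MurmurHash3

// Simplified token ring (hypothetical): every node gets one token, and a
// partition key is stored on the first `replicationFactor` nodes found when
// walking the ring clockwise from the key's token.
final case class TokenRing(nodes: Seq[String], replicationFactor: Int) {
  require(nodes.nonEmpty, "ring needs at least one node")
  require(replicationFactor >= 1 && replicationFactor <= nodes.distinct.size,
    "replication factor must be between 1 and the number of nodes")

  // Assumed token function; Cassandra's real partitioner differs.
  private def token(s: String): Int = MurmurHash3.stringHash(s)

  // (token, node) pairs sorted by token: the "ring".
  private val ring: Vector[(Int, String)] =
    nodes.distinct.map(n => token(n) -> n).sortBy(_._1).toVector

  def replicasFor(partitionKey: String): Seq[String] = {
    val t = token(partitionKey)
    // Nodes at or after the key's token come first, then wrap around.
    val (before, fromToken) = ring.partition { case (nodeToken, _) => nodeToken < t }
    (fromToken ++ before).map(_._2).take(replicationFactor)
  }
}

object TokenRingDemo {
  def main(args: Array[String]): Unit = {
    val ring = TokenRing(Seq("node-a", "node-b", "node-c", "node-d"), replicationFactor = 3)
    Seq("id-001-1", "id-001-2", "id-001-3").foreach { key =>
      println(s"$key -> ${ring.replicasFor(key).mkString(", ")}")
    }
  }
}
```

Because placement depends only on the key's token, any node can work out where a row lives without asking a master, which is the property the next slides rely on.
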
  15. High Availability
      • In a Cassandra cluster all nodes are equal.
      • There are no masters or coordinators at the cluster level.
      • The gossip protocol lets nodes stay aware of each other.
  16. Read/Write Anywhere
      • Cassandra is a read/write-anywhere architecture, so any user or application can connect to any node in any data center and read or write data.
  17. High Performance
      • All disk writes are sequential, append-only operations.
      • Writes do not require a read first.
  18. Cassandra & CAP
      • Cassandra is classified as an AP system.
      • The system remains available under a network partition.
  19. CQL
          CREATE KEYSPACE MyAppSpace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
          USE MyAppSpace;
          CREATE TABLE AccessLog (id text, ts timestamp, ip text, port text, status text, PRIMARY KEY (id));
          INSERT INTO AccessLog (id, ts, ip, status) VALUES ('id-001-1', '2016-01-01 00:00:00+0200', '10.20.30.1', '200');
          SELECT * FROM AccessLog;
  20. Apache Spark: Introduction
      • Apache Spark is a fast and general execution engine for large-scale data processing.
      • Organizes computation as concurrent tasks
      • Handles fault tolerance and load balancing
      • Built on the actor model (Spark releases before 2.0 relied on Akka for internal communication)
  21. RDD Introduction: Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. An RDD shares data across a cluster, like a virtualized, distributed collection. Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects such as a List or Map.
  22. RDD Operations: Two kinds of operations
      • Transformation
      • Action
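
The difference between the two kinds of operations can be sketched as follows. This is a minimal example, assuming Spark is on the classpath and running in `local[2]` mode; the object name `RddOperationsSketch`, the helper `countOk`, and the sample status codes are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOperationsSketch {
  // Counts "200" responses in a small, hypothetical list of HTTP status codes.
  def countOk(sc: SparkContext, statuses: Seq[String]): Long = {
    // Create an RDD by distributing a local collection.
    // (An external dataset could be loaded with sc.textFile(path) instead.)
    val rdd = sc.parallelize(statuses)

    // Transformation: lazy, it only describes a new RDD; no job runs yet.
    val okOnly = rdd.filter(_ == "200")

    // Action: triggers the computation and returns a value to the driver.
    okOnly.count()
  }

  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside this JVM, so no cluster is needed here.
    val conf = new SparkConf().setMaster("local[2]").setAppName("RddOperationsSketch")
    val sc = new SparkContext(conf)
    try {
      val statuses = Seq("200", "404", "200", "500", "200")
      println(s"OK responses: ${countOk(sc, statuses)}")
    } finally {
      sc.stop()
    }
  }
}
```
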
  23. What is Spark Streaming? A framework for large-scale stream processing
      ➔ Created at UC Berkeley
      ➔ Scales to 100s of nodes
      ➔ Can achieve second-scale latencies
      ➔ Provides a simple batch-like API for implementing complex algorithms
      ➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.
  24. Spark Streaming: Introduction
      • Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
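
The "batch-like API" can be tried locally without Kafka or another external source. The sketch below is one possible setup, assuming Spark Streaming is on the classpath: it feeds two hand-made batches through `queueStream` and counts status codes per batch. The object name and sample data are hypothetical, and a production pipeline would read from one of the sources listed above instead.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
    // Incoming data is grouped into one micro-batch per second.
    val ssc = new StreamingContext(conf, Seconds(1))

    // queueStream turns each queued RDD into one micro-batch.
    val batches = mutable.Queue[RDD[String]]()
    val statuses = ssc.queueStream(batches)

    // Same style as the RDD API: count occurrences of each status per batch.
    val perStatus = statuses.map(status => (status, 1)).reduceByKey(_ + _)
    perStatus.print()

    ssc.start()
    // Hypothetical input: two small batches of HTTP status codes.
    batches.synchronized {
      batches += ssc.sparkContext.parallelize(Seq("200", "200", "404"))
      batches += ssc.sparkContext.parallelize(Seq("500", "200"))
    }
    // Let a few batch intervals run, then shut everything down.
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```
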
  25. Spark Streaming over a HA Mesos Cluster: To use Mesos from Spark, you need a Spark binary package available in a place accessible (http/s3/hdfs) by Mesos, and a Spark driver program configured to connect to Mesos. Configuring the driver program to connect to Mesos:
          import org.apache.spark.{SparkConf, SparkContext}
          import org.apache.spark.streaming.{Seconds, StreamingContext}

          val sconf = new SparkConf()
            // ZooKeeper quorum used to find the current Mesos master
            .setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
            .setAppName("HAStreamingApp")
            // Spark binary package that Mesos fetches for executors
            .set("spark.executor.uri", "hdfs://Sigmoid/executors/spark-1.6.0-bin-hadoop2.6.tgz")
            .set("spark.mesos.coarse", "true")
            .set("spark.cores.max", "30")
            .set("spark.executor.memory", "10g")
          val sc = new SparkContext(sconf)
          val ssc = new StreamingContext(sc, Seconds(1)) // 1-second batches
  26. Spark Cassandra Connector
      • Exposes Cassandra tables as Spark RDDs
      • Writes Spark RDDs to Cassandra tables
      • Executes arbitrary CQL queries in your Spark applications
      • Compatible with Apache Spark 1.0 through 2.0
      • Maps table rows to CassandraRow objects or tuples
      • Joins with a subset of Cassandra data
      • Partitions RDDs according to Cassandra replication
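
As an illustration of executing CQL from a Spark application, the connector's `CassandraConnector` can hand out a session. The sketch below assumes a Cassandra node reachable at `127.0.0.1` and invents a schema for the `applog` keyspace (the deck does not show one); the table name is quoted so that its mixed case is preserved.

```scala
import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector

object CqlFromSparkSketch {
  // Hypothetical schema; replication factor 1 is only suitable for a local test node.
  val createKeyspace: String =
    "CREATE KEYSPACE IF NOT EXISTS applog WITH REPLICATION = " +
      "{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"

  // The double quotes keep the mixed-case table name as written.
  val createTable: String =
    "CREATE TABLE IF NOT EXISTS applog.\"accessTable\" (city text PRIMARY KEY, count int)"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("CqlFromSparkSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point

    // withSessionDo borrows a session and releases it afterwards.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(createKeyspace)
      session.execute(createTable)
    }
  }
}
```
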
  27. build.sbt should include:
          resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"

          libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.0-s_2.10"
      The application code imports the connector:
          import com.datastax.spark.connector._
  28. Get a Spark RDD that represents a Cassandra table:
          val rdd = sc.cassandraTable("applog", "accessTable")
          println(rdd.count)
          println(rdd.first)
          println(rdd.map(_.getInt("value")).sum)
      Save data back to Cassandra:
          // collection: an existing RDD whose elements map to the city and count columns
          collection.saveToCassandra("applog", "accessTable", SomeColumns("city", "count"))
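
The same `saveToCassandra` call is also available on DStreams, which is what connects Spark Streaming to Cassandra in a pipeline. The sketch below is one way this might look; it assumes a text source on `localhost:9999`, a made-up `city,ip,status` line format, a Cassandra node at `127.0.0.1`, and an `applog."accessTable"` table with `city` and `count` columns. None of these details come from the deck.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

object StreamToCassandraSketch {
  // Pure helper: extracts the city from a hypothetical "city,ip,status" line.
  def cityOf(line: String): Option[String] =
    line.split(",", -1).headOption.map(_.trim).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("StreamToCassandraSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed source: plain text lines from a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count requests per city within each one-second batch.
    val cityCounts = lines
      .flatMap(line => cityOf(line).toList)
      .map(city => (city, 1))
      .reduceByKey(_ + _)

    // Each batch is written as (city, count) rows.
    cityCounts.saveToCassandra("applog", "accessTable", SomeColumns("city", "count"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that with `city` as the primary key each batch overwrites the previous count for that city; keeping running totals would need a different table design or stateful stream operations.
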
  29. Many more higher-order functions:
      • repartitionByCassandraReplica: Can be used to relocate data in an RDD to match the replication strategy of a given table and keyspace.
      • joinWithCassandraTable: The connector supports using any RDD as the source of a direct join with a Cassandra table.
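
Used together, the two functions might look like the sketch below. It assumes the connector is on the classpath, a Cassandra node at `127.0.0.1`, and an `applog."accessTable"` table whose partition key is a `city` column with an `int` column `count`; the `CityKey` class and the sample cities are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// The field name must match the partition key column of the table (assumed: city).
final case class CityKey(city: String)

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("JoinSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
    val sc = new SparkContext(conf)
    try {
      // Keys to look up; in a real job these would come from another dataset.
      val keys = sc.parallelize(Seq(CityKey("Pune"), CityKey("Delhi")))

      // Move each key to a Spark partition on a node that holds a replica of it.
      val colocated = keys.repartitionByCassandraReplica("applog", "accessTable")

      // Fetch only the matching partitions instead of scanning the whole table.
      val joined = colocated.joinWithCassandraTable("applog", "accessTable")

      joined.collect().foreach { case (key, row) =>
        println(s"${key.city} -> ${row.getInt("count")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```
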
  30. Hints for a scalable pipeline
      • Figure out the bottleneck: CPU, memory, IO, or network.
      • If parsing is involved, use the parser that gives the best performance.
      • Proper data modeling
      • Compression, serialization
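
For the compression and serialization hint, one common starting point in Spark is Kryo serialization plus compressed serialized RDD partitions. The snippet below is a sketch of such settings rather than a recommendation from the deck; the `AccessEvent` class and the `tunedConf` helper are hypothetical, and whether the settings help depends on where the bottleneck actually is.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type that flows through the pipeline.
final case class AccessEvent(id: String, ip: String, status: String)

object TuningSketch {
  def tunedConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Kryo is usually more compact and faster than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Compress serialized RDD partitions: less memory and IO, more CPU.
      .set("spark.rdd.compress", "true")
      // Registering classes lets Kryo avoid writing full class names.
      .registerKryoClasses(Array[Class[_]](classOf[AccessEvent]))
}
```
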
  31. Thank You. @rahul_kumar_aws