6/27/2018

Reading time:13 mins

Streaming Analytics with Spark, Kafka, Cassandra and Akka

by Helena Edelson

Streaming Analytics with Spark, Kafka, Cassandra and Akka SlideShare Explore You Successfully reported this slideshow.Streaming Analytics with Spark, Kafka, Cassandra and AkkaUpcoming SlideShareLoading in …5× 3 Comments 80 Likes Statistics Notes ffch1996 Jasonmao5 Raphaël Bacconnier , Backend Developer at Backend / big data developer Jhanak Parajuli , Data Scientist and AI Engineer at msg at Data Scientist and AI Engineer Raja Rp at Infosys Show More No DownloadsNo notes for slide 1. Streaming Analytics with Spark,Kafka, Cassandra, and AkkaHelena EdelsonVP of Product Engineering @Tuplejump 2. • Committer / Contributor: Akka, FiloDB, Spark CassandraConnector, Spring Integration• VP of Product Engineering @Tuplejump• Previously: Sr Cloud Engineer / Architect at VMware,CrowdStrike, DataStax and SpringSourceWho@helenaedelsongithub.com/helena 3. TuplejumpTuplejump Data Blender combines sophisticated data collectionwith machine learning and analytics, to understand the intention ofthe analyst, without disrupting workﬂow.• Ingest streaming and static data from disparate data sources• Combine them into a uniﬁed, holistic view • Easily enable fast, ﬂexible and advanced data analysis3 4. Tuplejump Open Sourcegithub.com/tuplejump• FiloDB - distributed, versioned, columnar analytical db for modernstreaming workloads• Calliope - the first Spark-Cassandra integration• Stargate - Lucene indexer for Cassandra• SnackFS - HDFS-compatible file system for Cassandra4 5. What Will We Talk About• The Problem Domain• Example Use Case• Rethinking Architecture– We don't have to look far to look back– Streaming– Revisiting the goal and the stack– Simplification 6. THE PROBLEM DOMAINDelivering Meaning From A Flood Of Data6 7. The Problem DomainNeed to build scalable, fault tolerant, distributed dataprocessing systems that can handle massive amounts ofdata from disparate sources, with different data structures.7 8. TranslationHow to build adaptable, elegant systemsfor complex analytics and learning tasksto run as large-scale clustered dataflows8 9. How Much DataYottabyte = quadrillion gigabytes or septillionbytes9We all have a lot of data• Terabytes• Petabytes...https://en.wikipedia.org/wiki/Yottabyte 10. Delivering Meaning• Deliver meaning in sec/sub-sec latency• Disparate data sources & schemas• Billions of events per second• High-latency batch processing• Low-latency stream processing• Aggregation of historical from the stream 11. While We Monitor, Predict & Proactively Handle• Massive event spikes• Bursty traffic• Fast producers / slow consumers• Network partitioning & Out of sync systems• DC down• Wait, we've DDOS'd ourselves from fast streams?• Autoscale issues– When we scale down VMs how do we not lose data? 12. And stay within ourAWS / Rackspace budget 13. EXAMPLE CASE:CYBER SECURITYHunting The Hunter13 14. 14• Track activities of international threat actor groups,nation-state, criminal or hactivist• Intrusion attempts• Actual breaches• Profile adversary activity• Analysis to understand their motives, anticipate actionsand prevent damageAdversary Profiling & Hunting 15. 15• Machine events• Endpoint intrusion detection• Anomalies/indicators of attack or compromise• Machine learning• Training models based on patterns from historical data• Predict potential threats• profiling for adversary Identification•Stream Processing 16. Data Requirements & Description• Streaming event data• Log messages• User activity records• System ops & metrics data• Disparate data sources• Wildly differing data structures16 17. Massive Amounts Of Data17• One machine can generate 2+ TB per day• Tracking millions of devices• 1 million writes per second - bursty• High % writes, lower % reads• TTL 18. RETHINKINGARCHITECTURE18 19. WE DON'T HAVE TO LOOKFAR TO LOOK BACK19Rethinking Architecture 20. 20Most batch analytics flow fromseveral years ago looked like... 21. STREAMING & DATA SCIENCE21Rethinking Architecture 22. StreamingI need fast access to historical data on the fly forpredictive modeling with real time data from the stream.22 23. Not A Stream, A Flood• Data emitters• Netflix: 1 - 2 million events per second at peak• 750 billion events per day• LinkedIn: > 500 billion events per day• Data ingesters• Netflix: 50 - 100 billion events per day• LinkedIn: 2.5 trillion events per day• 1 Petabyte of streaming data23 24. Which Translates To• Do it fast• Do it cheap• Do it at scale24 25. Challenges• Code changes at runtime• Distributed Data Consistency• Ordering guarantees• Complex compute algorithms25 26. Oh, and don't lose data26 27. Strategies• Partition For Scale & Data Locality• Replicate For Resiliency• Share Nothing• Fault Tolerance• Asynchrony• Async Message Passing• Memory Management27• Data lineage and reprocessing inruntime• Parallelism• Elastically Scale• Isolation• Location Transparency 28. AND THEN WE GREEKED OUT28Rethinking Architecture 29. Lambda ArchitectureA data-processing architecture designed to handle massivequantities of data by taking advantage of both batch andstream processing methods.29 30. Lambda ArchitectureA data-processing architecture designed to handle massivequantities of data by taking advantage of both batch andstream processing methods.• An approach• Coined by Nathan Marz• This was a huge stride forward30 31. 31https://www.mapr.com/developercentral/lambda-architecture 32. Implementing Is Hard33• Real-time pipeline backed by KV store for updates• Many moving parts - KV store, real time, batch• Running similar code in two places• Still ingesting data to Parquet/HDFS• Reconcile queries against two different places 33. Performance Tuning & Monitoring:on so many systems34Also hard 34. Lambda ArchitectureAn immutable sequence of records is captured and fedinto a batch system and a stream processingsystem in parallel.35 35. WAIT, DUAL SYSTEMS?36Challenge Assumptions 36. Which Translates To• Performing analytical computations & queries in dualsystems• Implementing transformation logic twice• Duplicate Code• Spaghetti Architecture for Data Flows• One Busy Network37 37. Why Dual Systems?• Why is a separate batch system needed?• Why support code, machines and running services oftwo analytics systems?38Counter productive on some level? 38. YES39• A unified system for streaming and batch• Real-time processing and reprocessing• Code changes• Fault tolerancehttp://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps 39. ANOTHER ASSUMPTION:ETL40Challenge Assumptions 40. Extract, Transform, Load (ETL)41"Designing and maintaining the ETL process is oftenconsidered one of the most difficult and resource-intensive portions of a data warehouse project."http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm 41. Extract, Transform, Load (ETL)42ETL involves• Extraction of data from one system into another• Transforming it• Loading it into another system 42. Extract, Transform, Load (ETL)"Designing and maintaining the ETL process is oftenconsidered one of the most difficult and resource-intensive portions of a data warehouse project."http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm43Also unnecessarily redundant and often typeless 43. ETL44• Each ETL step can introduce errors and risk• Can duplicate data after failover• Tools can cost millions of dollars• Decreases throughput• Increased complexity 44. ETL• Writing intermediary files• Parsing and re-parsing plain text45 45. And let's duplicate the patternover all our DataCenters46 46. 47These are not the solutions you're looking for 47. REVISITING THE GOAL& THE STACK48 48. Removing The 'E' in ETLThanks to technologies like Avro and Protobuf we don’t need the“E” in ETL. Instead of text dumps that you need to parse overmultiple systems:Scala & Avro (e.g.)• Can work with binary data that remains strongly typed• A return to strong typing in the big data ecosystem49 49. Removing The 'L' in ETLIf data collection is backed by a distributed messagingsystem (e.g. Kafka) you can do real-time fanout of theingested data to all consumers. No need to batch "load".• From there each consumer can do their own transformations50 50. #NoMoreGreekLetterArchitectures51 51. NoETL52 52. Strategy TechnologiesScalable Infrastructure / Elastic Spark, Cassandra, KafkaPartition For Scale, Network Topology Aware Cassandra, Spark, Kafka, Akka ClusterReplicate For Resiliency Spark,Cassandra, Akka Cluster all hash the node ringShare Nothing, Masterless Cassandra, Akka Cluster both Dynamo styleFault Tolerance / No Single Point of Failure Spark, Cassandra, KafkaReplay From Any Point Of Failure Spark, Cassandra, Kafka, Akka + Akka PersistenceFailure Detection Cassandra, Spark, Akka, KafkaConsensus & Gossip Cassandra & Akka ClusterParallelism Spark, Cassandra, Kafka, AkkaAsynchronous Data Passing Kafka, Akka, SparkFast, Low Latency, Data Locality Cassandra, Spark, KafkaLocation Transparency Akka, Spark, Cassandra, KafkaMy Nerdy Chart53 53. SMACK• Scala/Spark• Mesos• Akka• Cassandra• Kafka54 54. Spark Streaming55 55. Spark Streaming• One runtime for streaming and batch processing• Join streaming and static data sets• No code duplication• Easy, flexible data ingestion from disparate sources todisparate sinks• Easy to reconcile queries against multiple sources• Easy integration of KV durable storage56 56. How do I merge historical data with datain the stream?57 57. Join Streams With Static Dataval ssc = new StreamingContext(conf, Milliseconds(500))ssc.checkpoint("checkpoint")val staticData: RDD[(Int,String)] =ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)val stream: DStream[(Int,String)] =KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n)).transform { events => events.join(staticData)).saveToCassandra(keyspace,table)ssc.start()58 58. TrainingDataFeatureExtractionModelTrainingModelTestingTest DataYour Data Extract Data To AnalyzeTrain your model to predictSpark MLLib59 59. Spark Streaming & ML60val context = new StreamingContext(conf, Milliseconds(500))val model = KMeans.train(dataset, ...) // learn offlineval stream = KafkaUtils.createStream(ssc, zkQuorum, group,..).map(event => model.predict(event.feature)) 60. Apache MesosOpen-source cluster manager developed at UC Berkeley.Abstracts CPU, memory, storage, and other compute resourcesaway from machines (physical or virtual), enabling fault-tolerantand elastic distributed systems to easily be built and runeffectively.61 61. AkkaHigh performance concurrency framework for Scala andJava• Fault Tolerance• Asynchronous messaging and data processing• Parallelization• Location Transparency• Local / Remote Routing• Akka: Cluster / Persistence / Streams62 62. Akka ActorsA distribution and concurrency abstraction• Compute Isolation• Behavioral Context Switching• No Exposed Internal State• Event-based messaging• Easy parallelism• Configurable fault tolerance63 63. 64Akka Actor Hierarchyhttp://www.slideshare.net/jboner/building-reactive-applications-with-akka-in-scala 64. import akka.actor._class NodeGuardianActor(args...) extends Actor with SupervisorStrategy {val temperature = context.actorOf(Props(new TemperatureActor(args)), "temperature")val precipitation = context.actorOf(Props(new PrecipitationActor(args)), "precipitation")override def preStart(): Unit = { /* lifecycle hook: init */ }def receive : Actor.Receive = {case Initialized => context become initialized}def initialized : Actor.Receive = {case e: SomeEvent => someFunc(e)case e: OtherEvent => otherFunc(e)}}65 65. Apache Cassandra• Extremely Fast• Extremely Scalable• Multi-Region / Multi-Datacenter• Always On• No single point of failure• Survive regional outages• Easy to operate• Automatic & configurable replication 66 66. Apache Cassandra• Very flexible data modeling (collections, user definedtypes) and changeable over time• Perfect for ingestion of real time / machine data• Huge community67 67. Spark Cassandra Connector• NOSQL JOINS!• Write & Read data between Spark and Cassandra• Compatible with Spark 1.4• Handles Data Locality for Speed• Implicit type conversions• Server-Side Filtering - SELECT, WHERE, etc.• Natural Timeseries Integration68http://github.com/datastax/spark-cassandra-connector 68. KillrWeather69http://github.com/killrweather/killrweatherA reference application showing how to easily integrate streaming andbatch data processing with Apache Spark Streaming, ApacheCassandra, Apache Kafka and Akka for fast, streaming computationson time series data in asynchronous event-driven environments.http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather 69. 70• High Throughput Distributed Messaging• Decouples Data Pipelines• Handles Massive Data Load• Support Massive Number of Consumers• Distribution & partitioning across cluster nodes• Automatic recovery from broker failures 70. Spark Streaming & Kafkaval context = new StreamingContext(conf, Seconds(1))val wordCount = KafkaUtils.createStream(context, ...).flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)wordCount.saveToCassandra(ks,table)context.start() // start receiving and computing71 71. 72class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext)extends AggregationActor(settings: Settings) { import settings._ val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2) .map(_._2.split(",")) .map(RawWeatherData(_))  kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw) /** RawWeatherData: wsid, year, month, day, oneHourPrecip */ kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip)) .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)  /** Now the [[StreamingContext]] can be started. */ context.parent ! OutputStreamInitialized  def receive : Actor.Receive = {…}}Gets the partition key: Data LocalitySpark C* Connector feeds this to SparkCassandra Counter column in our schema,no expensive `reduceByKey` needed. Simplylet C* do it: not expensive and fast. 72. 73/** For a given weather station, calculates annual cumulative precip - or year to date. */ class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {  def receive : Actor.Receive = { case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender) case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender) }  /** Computes annual aggregation.Precipitation values are 1 hour deltas from the previous. */ def cumulative(wsid: String, year: Int, requester: ActorRef): Unit = ssc.cassandraTable[Double](keyspace, dailytable) .select("precipitation") .where("wsid = ? AND year = ?", wsid, year) .collectAsync() .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester  /** Returns the 10 highest temps for any station in the `year`. */ def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = { val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year, ssc.sparkContext.parallelize(aggregate).top(k).toSeq)  ssc.cassandraTable[Double](keyspace, dailytable) .select("precipitation") .where("wsid = ? AND year = ?", wsid, year) .collectAsync().map(toTopK) pipeTo requester } } 73. A New Approach• One Runtime: streaming, scheduled• Simplified architecture• Allows us to• Write different types of applications• Write more type safe code• Write more reusable code74 74. Need daily analytics aggregate reports? Do it in the stream, saveresults in Cassandra for easy reporting as needed - with datalocality not offered by S3. 75. FiloDBDistributed, columnar database designed to run very fastanalytical queries• Ingest streaming data from many streaming sources• Row-level, column-level operations and built in versioningoffer greater flexibility than file-based technologies• Currently based on Apache Cassandra & Spark• github.com/tuplejump/FiloDB76 76. FiloDB• Breakthrough performance levels for analytical queries• Performance comparable to Parquet• One to two orders of magnitude faster than Spark onCassandra 2.x• Versioned - critical for reprocessing logic/code changes• Can simplify your infrastructure dramatically• Queries run in parallel in Spark for scale-out ad-hoc analysis• Space-saving techniques77 77. WRAPPING UP78 78. Architectyr?79"This is a giant mess"- Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps 79. 80Simplified 80. 81 81. 82www.tuplejump.cominfo@tuplejump.com@tuplejump 82. 83@helenaedelsongithub.com/helenaslideshare.net/helenaedelsonTHANK YOU! 83. I'm speaking at QCon SF on the broadertopic of Streaming at Scalehttp://qconsf.com/sf2015/track/streaming-data-scale84 Recommended Bruce Heavin The Thinkable PresentationOnline Course - LinkedIn Learning Gaining Skills with LinkedIn LearningOnline Course - LinkedIn Learning PowerPoint: Designing Better SlidesOnline Course - LinkedIn Learning Reactive app using actor model & apache sparkRahul Kumar Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino Busa Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil How to deploy Apache Spark  to Mesos/DCOSLegacy Typesafe (now Lightbend) About Blog Terms Privacy Copyright LinkedIn Corporation © 2018 Public clipboards featuring this slideNo public clipboards found for this slideSelect another clipboard ×Looks like you’ve clipped this slide to already.Create a clipboardYou just clipped your first slide! Clipping is a handy way to collect important slides you want to go back to later. Now customize the name of a clipboard to store your clips. Description Visibility Others can see my Clipboard

Read this article if you want to know more about Streaming Analytics with Spark, Kafka, Cassandra and Akka

Streaming Analytics with Spark, Kafka, Cassandra and Akka

SlideShare Explore You

Successfully reported this slideshow.

Streaming Analytics with Spark, Kafka, Cassandra and Akka

Upcoming SlideShare

Loading in …5

3 Comments

1. Streaming Analytics with Spark, Kafka, Cassandra, and Akka Helena Edelson VP of Product Engineering @Tuplejump
2. • Committer / Contributor: Akka, FiloDB, Spark Cassandra Connector, Spring Integration • VP of Product Engineering @Tuplejump • Previously: Sr Cloud Engineer / Architect at VMware, CrowdStrike, DataStax and SpringSource Who @helenaedelson github.com/helena
3. Tuplejump Tuplejump Data Blender combines sophisticated data collection with machine learning and analytics, to understand the intention of the analyst, without disrupting workﬂow. • Ingest streaming and static data from disparate data sources • Combine them into a uniﬁed, holistic view • Easily enable fast, ﬂexible and advanced data analysis 3
4. Tuplejump Open Source github.com/tuplejump • FiloDB - distributed, versioned, columnar analytical db for modern streaming workloads • Calliope - the first Spark-Cassandra integration • Stargate - Lucene indexer for Cassandra • SnackFS - HDFS-compatible file system for Cassandra 4
5. What Will We Talk About • The Problem Domain • Example Use Case • Rethinking Architecture – We don't have to look far to look back – Streaming – Revisiting the goal and the stack – Simplification
6. THE PROBLEM DOMAIN Delivering Meaning From A Flood Of Data 6
7. The Problem Domain Need to build scalable, fault tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures. 7
8. Translation How to build adaptable, elegant systems for complex analytics and learning tasks to run as large-scale clustered dataflows 8
9. How Much Data Yottabyte = quadrillion gigabytes or septillion bytes 9 We all have a lot of data • Terabytes • Petabytes... https://en.wikipedia.org/wiki/Yottabyte
10. Delivering Meaning • Deliver meaning in sec/sub-sec latency • Disparate data sources & schemas • Billions of events per second • High-latency batch processing • Low-latency stream processing • Aggregation of historical from the stream
11. While We Monitor, Predict & Proactively Handle • Massive event spikes • Bursty traffic • Fast producers / slow consumers • Network partitioning & Out of sync systems • DC down • Wait, we've DDOS'd ourselves from fast streams? • Autoscale issues – When we scale down VMs how do we not lose data?
12. And stay within our AWS / Rackspace budget
13. EXAMPLE CASE: CYBER SECURITY Hunting The Hunter 13
14. 14 • Track activities of international threat actor groups, nation-state, criminal or hactivist • Intrusion attempts • Actual breaches • Profile adversary activity • Analysis to understand their motives, anticipate actions and prevent damage Adversary Profiling & Hunting
15. 15 • Machine events • Endpoint intrusion detection • Anomalies/indicators of attack or compromise • Machine learning • Training models based on patterns from historical data • Predict potential threats • profiling for adversary Identification • Stream Processing
16. Data Requirements & Description • Streaming event data • Log messages • User activity records • System ops & metrics data • Disparate data sources • Wildly differing data structures 16
17. Massive Amounts Of Data 17 • One machine can generate 2+ TB per day • Tracking millions of devices • 1 million writes per second - bursty • High % writes, lower % reads • TTL
18. RETHINKING ARCHITECTURE 18
19. WE DON'T HAVE TO LOOK FAR TO LOOK BACK 19 Rethinking Architecture
20. 20 Most batch analytics flow from several years ago looked like...
21. STREAMING & DATA SCIENCE 21 Rethinking Architecture
22. Streaming I need fast access to historical data on the fly for predictive modeling with real time data from the stream. 22
23. Not A Stream, A Flood • Data emitters • Netflix: 1 - 2 million events per second at peak • 750 billion events per day • LinkedIn: > 500 billion events per day • Data ingesters • Netflix: 50 - 100 billion events per day • LinkedIn: 2.5 trillion events per day • 1 Petabyte of streaming data 23
24. Which Translates To • Do it fast • Do it cheap • Do it at scale 24
25. Challenges • Code changes at runtime • Distributed Data Consistency • Ordering guarantees • Complex compute algorithms 25
26. Oh, and don't lose data 26
27. Strategies • Partition For Scale & Data Locality • Replicate For Resiliency • Share Nothing • Fault Tolerance • Asynchrony • Async Message Passing • Memory Management 27 • Data lineage and reprocessing in runtime • Parallelism • Elastically Scale • Isolation • Location Transparency
28. AND THEN WE GREEKED OUT 28 Rethinking Architecture
29. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. 29
30. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. • An approach • Coined by Nathan Marz • This was a huge stride forward 30
31. 31 https://www.mapr.com/developercentral/lambda-architecture
32. Implementing Is Hard 33 • Real-time pipeline backed by KV store for updates • Many moving parts - KV store, real time, batch • Running similar code in two places • Still ingesting data to Parquet/HDFS • Reconcile queries against two different places
33. Performance Tuning & Monitoring: on so many systems 34 Also hard
34. Lambda Architecture An immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. 35
35. WAIT, DUAL SYSTEMS? 36 Challenge Assumptions
36. Which Translates To • Performing analytical computations & queries in dual systems • Implementing transformation logic twice • Duplicate Code • Spaghetti Architecture for Data Flows • One Busy Network 37
37. Why Dual Systems? • Why is a separate batch system needed? • Why support code, machines and running services of two analytics systems? 38 Counter productive on some level?
38. YES 39 • A unified system for streaming and batch • Real-time processing and reprocessing • Code changes • Fault tolerance http://radar.oreilly.com/2014/07/questioning-the-lambda- architecture.html - Jay Kreps
39. ANOTHER ASSUMPTION: ETL 40 Challenge Assumptions
40. Extract, Transform, Load (ETL) 41 "Designing and maintaining the ETL process is often considered one of the most difficult and resource- intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
41. Extract, Transform, Load (ETL) 42 ETL involves • Extraction of data from one system into another • Transforming it • Loading it into another system
42. Extract, Transform, Load (ETL) "Designing and maintaining the ETL process is often considered one of the most difficult and resource- intensive portions of a data warehouse project." http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm 43 Also unnecessarily redundant and often typeless
43. ETL 44 • Each ETL step can introduce errors and risk • Can duplicate data after failover • Tools can cost millions of dollars • Decreases throughput • Increased complexity
44. ETL • Writing intermediary files • Parsing and re-parsing plain text 45
45. And let's duplicate the pattern over all our DataCenters 46
46. 47 These are not the solutions you're looking for
47. REVISITING THE GOAL & THE STACK 48
48. Removing The 'E' in ETL Thanks to technologies like Avro and Protobuf we don’t need the “E” in ETL. Instead of text dumps that you need to parse over multiple systems: Scala & Avro (e.g.) • Can work with binary data that remains strongly typed • A return to strong typing in the big data ecosystem 49
49. Removing The 'L' in ETL If data collection is backed by a distributed messaging system (e.g. Kafka) you can do real-time fanout of the ingested data to all consumers. No need to batch "load". • From there each consumer can do their own transformations 50
50. #NoMoreGreekLetterArchitectures 51
51. NoETL 52
52. Strategy Technologies Scalable Infrastructure / Elastic Spark, Cassandra, Kafka Partition For Scale, Network Topology Aware Cassandra, Spark, Kafka, Akka Cluster Replicate For Resiliency Spark,Cassandra, Akka Cluster all hash the node ring Share Nothing, Masterless Cassandra, Akka Cluster both Dynamo style Fault Tolerance / No Single Point of Failure Spark, Cassandra, Kafka Replay From Any Point Of Failure Spark, Cassandra, Kafka, Akka + Akka Persistence Failure Detection Cassandra, Spark, Akka, Kafka Consensus & Gossip Cassandra & Akka Cluster Parallelism Spark, Cassandra, Kafka, Akka Asynchronous Data Passing Kafka, Akka, Spark Fast, Low Latency, Data Locality Cassandra, Spark, Kafka Location Transparency Akka, Spark, Cassandra, Kafka My Nerdy Chart 53
53. SMACK • Scala/Spark • Mesos • Akka • Cassandra • Kafka 54
54. Spark Streaming 55
55. Spark Streaming • One runtime for streaming and batch processing • Join streaming and static data sets • No code duplication • Easy, flexible data ingestion from disparate sources to disparate sinks • Easy to reconcile queries against multiple sources • Easy integration of KV durable storage 56
56. How do I merge historical data with data in the stream? 57
57. Join Streams With Static Data val ssc = new StreamingContext(conf, Milliseconds(500)) ssc.checkpoint("checkpoint") val staticData: RDD[(Int,String)] = ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func) val stream: DStream[(Int,String)] = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n)) .transform { events => events.join(staticData)) .saveToCassandra(keyspace,table) ssc.start() 58
58. Training Data Feature Extraction Model Training Model Testing Test Data Your Data Extract Data To Analyze Train your model to predict Spark MLLib 59
59. Spark Streaming & ML 60 val context = new StreamingContext(conf, Milliseconds(500)) val model = KMeans.train(dataset, ...) // learn offline val stream = KafkaUtils .createStream(ssc, zkQuorum, group,..) .map(event => model.predict(event.feature))
60. Apache Mesos Open-source cluster manager developed at UC Berkeley. Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. 61
61. Akka High performance concurrency framework for Scala and Java • Fault Tolerance • Asynchronous messaging and data processing • Parallelization • Location Transparency • Local / Remote Routing • Akka: Cluster / Persistence / Streams 62
62. Akka Actors A distribution and concurrency abstraction • Compute Isolation • Behavioral Context Switching • No Exposed Internal State • Event-based messaging • Easy parallelism • Configurable fault tolerance 63
63. 64 Akka Actor Hierarchy http://www.slideshare.net/jboner/building-reactive-applications-with-akka-in-scala
64. import akka.actor._ class NodeGuardianActor(args...) extends Actor with SupervisorStrategy { val temperature = context.actorOf( Props(new TemperatureActor(args)), "temperature") val precipitation = context.actorOf( Props(new PrecipitationActor(args)), "precipitation") override def preStart(): Unit = { /* lifecycle hook: init */ } def receive : Actor.Receive = { case Initialized => context become initialized } def initialized : Actor.Receive = { case e: SomeEvent => someFunc(e) case e: OtherEvent => otherFunc(e) } } 65
65. Apache Cassandra • Extremely Fast • Extremely Scalable • Multi-Region / Multi-Datacenter • Always On • No single point of failure • Survive regional outages • Easy to operate • Automatic & configurable replication 66
66. Apache Cassandra • Very flexible data modeling (collections, user defined types) and changeable over time • Perfect for ingestion of real time / machine data • Huge community 67
67. Spark Cassandra Connector • NOSQL JOINS! • Write & Read data between Spark and Cassandra • Compatible with Spark 1.4 • Handles Data Locality for Speed • Implicit type conversions • Server-Side Filtering - SELECT, WHERE, etc. • Natural Timeseries Integration 68 http://github.com/datastax/spark-cassandra-connector
68. KillrWeather 69 http://github.com/killrweather/killrweather A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments. http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/ databricks/apps/weather
69. 70 • High Throughput Distributed Messaging • Decouples Data Pipelines • Handles Massive Data Load • Support Massive Number of Consumers • Distribution & partitioning across cluster nodes • Automatic recovery from broker failures
70. Spark Streaming & Kafka val context = new StreamingContext(conf, Seconds(1)) val wordCount = KafkaUtils.createStream(context, ...) .flatMap(_.split(" ")) .map(x => (x, 1)) .reduceByKey(_ + _) wordCount.saveToCassandra(ks,table) context.start() // start receiving and computing 71
71. 72 class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext) extends AggregationActor(settings: Settings) {  import settings._   val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](  ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)  .map(_._2.split(","))  .map(RawWeatherData(_))    kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)  /** RawWeatherData: wsid, year, month, day, oneHourPrecip */  kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))  .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)    /** Now the [[StreamingContext]] can be started. */  context.parent ! OutputStreamInitialized    def receive : Actor.Receive = {…} } Gets the partition key: Data Locality Spark C* Connector feeds this to Spark Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast.
72. 73 /** For a given weather station, calculates annual cumulative precip - or year to date. */  class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {    def receive : Actor.Receive = {  case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)  case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)  }    /** Computes annual aggregation.Precipitation values are 1 hour deltas from the previous. */  def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =  ssc.cassandraTable[Double](keyspace, dailytable)  .select("precipitation")  .where("wsid = ? AND year = ?", wsid, year)  .collectAsync()  .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester    /** Returns the 10 highest temps for any station in the `year`. */  def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {  val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,  ssc.sparkContext.parallelize(aggregate).top(k).toSeq)    ssc.cassandraTable[Double](keyspace, dailytable)  .select("precipitation")  .where("wsid = ? AND year = ?", wsid, year)  .collectAsync().map(toTopK) pipeTo requester  }  }
73. A New Approach • One Runtime: streaming, scheduled • Simplified architecture • Allows us to • Write different types of applications • Write more type safe code • Write more reusable code 74
74. Need daily analytics aggregate reports? Do it in the stream, save results in Cassandra for easy reporting as needed - with data locality not offered by S3.
75. FiloDB Distributed, columnar database designed to run very fast analytical queries • Ingest streaming data from many streaming sources • Row-level, column-level operations and built in versioning offer greater flexibility than file-based technologies • Currently based on Apache Cassandra & Spark • github.com/tuplejump/FiloDB 76
76. FiloDB • Breakthrough performance levels for analytical queries • Performance comparable to Parquet • One to two orders of magnitude faster than Spark on Cassandra 2.x • Versioned - critical for reprocessing logic/code changes • Can simplify your infrastructure dramatically • Queries run in parallel in Spark for scale-out ad-hoc analysis • Space-saving techniques 77
77. WRAPPING UP 78
78. Architectyr? 79 "This is a giant mess" - Going Real-time - Data Collection and Stream Processing with Apache Kafka, Jay Kreps
79. 80 Simplified
80. 81
81. 82 www.tuplejump.com info@tuplejump.com@tuplejump
82. 83 @helenaedelson github.com/helena slideshare.net/helenaedelson THANK YOU!
83. I'm speaking at QCon SF on the broader topic of Streaming at Scale http://qconsf.com/sf2015/track/streaming-data-scale 84

Visibility Others can see my Clipboard

sstable

cassandra

spark

Spark and Cassandra’s SSTable loader

Arunkumar

11/1/2024

analytics

cassandra

spark

GitHub - apache/cassandra-analytics: Apache cassandra

apache

9/4/2024

cassandra

event.driven

spark

Build an Event-Driven Architecture with Apache Kafka, Apache Spark, and Apache Cassandra

DataStax

8/3/2024

analytics

streaming

visualization

Keen - Event Streaming Platform

John Doe

2/3/2024

mongo

cassandra

kafka

Top 10 Real-Time Databases to Use in 2024

K.sabreena

1/5/2024

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

andreia-negreira

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

airscholar

12/2/2023

cassandra

python

kafka

GitHub - princebhatt9588/Stock_Market_Real_Time_Data_Pipeline_Project_with-Apache-Kafka-and-Cassandra: This app utilizes Python, Apache Kafka, and Cassandra to fetch and process real-time stock market data, providing valuable insights for investors and traders.

princebhatt9588

12/2/2023

flink

beam

dataflow

• Google Dataflow - Awesome-Astra

John Doe

5/10/2023

data.modeling

cassandra

spark

Dealing with Large Spark Partitions

John Doe

2/17/2023

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt!  We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Contact Info

Resources

Properties

Follow Us