Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

by Helena Edelson

@helenaedelson

• Spark Cassandra Connector committer
• Akka contributor - 2 new features in Akka Cluster
• Big Data & Scala conference speaker
• Currently Sr Software Engineer, Analytics @ DataStax
• Sr Cloud Engineer, VMware, CrowdStrike, SpringSource…
• Prev Spring committer - Spring AMQP, Spring Integration

Talk Roadmap
What: Lambda Architecture & Delivering Meaning
Why: Spark, Kafka, Cassandra & Akka integration
How: Composable Pipelines - Code

I need fast access to historical data on the fly for predictive modeling with real time data from the stream

Lambda Architecture
A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.
• Spark is one of the few data processing frameworks that allows you to seamlessly integrate batch and stream processing
• Of petabytes of data
• In the same application
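The combination of batch and stream processing described above can be sketched with plain Scala collections. This is only an illustration of one common way to combine the two (a batch view recomputed from accumulated data, an incremental view for newly arrived data, and a query that merges both); it does not use Spark, and the names `LambdaSketch`, `Event`, `batchView`, `updateRealtime` and `query` are hypothetical, not taken from the talk.

```scala
object LambdaSketch {
  // Hypothetical event type: a key and a count contributed by one event.
  final case class Event(key: String, count: Long)

  // Batch side: recompute a complete view from all accumulated events.
  def batchView(master: Seq[Event]): Map[String, Long] =
    master.groupBy(_.key).map { case (k, es) => k -> es.map(_.count).sum }

  // Streaming side: fold one newly arrived event into an incremental view.
  def updateRealtime(view: Map[String, Long], e: Event): Map[String, Long] =
    view.updated(e.key, view.getOrElse(e.key, 0L) + e.count)

  // Query: merge the batch view with the incremental view.
  def query(batch: Map[String, Long], realtime: Map[String, Long], key: String): Long =
    batch.getOrElse(key, 0L) + realtime.getOrElse(key, 0L)
}
```

In a real pipeline the batch view would be recomputed periodically and the incremental view reset to cover only the events that arrived since the last batch run.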

Your Code

Moving Data Between Systems Is Difficult, Risky and Expensive

How Do We Approach This?

Strategies
• Scalable Infrastructure
• Partition For Scale
• Replicate For Resiliency
• Share Nothing
• Asynchronous Message Passing
• Parallelism
• Isolation
• Data Locality
• Location Transparency

Strategy: Technologies
• Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
• Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
• Replicate For Resiliency: Spark, Cassandra, Akka Cluster all hash the node ring
• Share Nothing, Masterless: Cassandra, Akka Cluster both Dynamo style
• Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
• Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
• Failure Detection: Cassandra, Spark, Akka, Kafka
• Consensus & Gossip: Cassandra & Akka Cluster
• Parallelism: Spark, Cassandra, Kafka, Akka
• Asynchronous Data Passing: Kafka, Akka, Spark
• Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
• Location Transparency: Akka, Spark, Cassandra, Kafka

Apache Spark
• Fast, distributed, scalable and fault tolerant cluster compute system
• Enables low latency with complex analytics
• Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010
• Became a top-level Apache project in February 2014

Apache Kafka
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Supports Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures

Speaking Of Fault Tolerance…

The one thing in your infrastructure you can always rely on.

Availability
"During Hurricane Sandy, we lost an entire data center. Completely. Lost. It. Our data in Cassandra never went offline."

Apache Cassandra
• Massively Scalable
• High Performance
• Always On
• Masterless

Akka
• Fault tolerant
• Hierarchical Supervision
• Customizable Failure Strategies & Detection
• Asynchronous Data Passing
• Parallelization - Balancing Pool Routers
• Akka Cluster
• Adaptive / Predictive
• Load-Balanced Across Cluster Nodes

I've used Scala with these every single time.

Integration
• Stream data from Kafka to Cassandra
• Stream data from Kafka to Spark and write to Cassandra
• Stream from Cassandra to Spark - coming soon!
• Read data from Spark/Spark Streaming Source and write to C*
• Read data from Cassandra to Spark

Apache Spark
• Distributed Analytics Platform
• Easy Abstraction for Datasets
• Support in several languages
• Streaming
• Machine Learning
• Graph
• Integrated SQL Queries
• Has Generalized DAG execution
All in one package
And it uses Akka

Most Active OSS In Big Data

Apache Spark - Easy to Use API

Returns the top (k) highest temps for any location in the year

def topK(aggregate: Seq[Double]): Seq[Double] =
  sc.parallelize(aggregate).top(k).collect

Returns the top (k) highest temps … in a Future

def topK(aggregate: Seq[Double]): Future[Seq[Double]] =
  sc.parallelize(aggregate).top(k).collectAsync
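For readers without a cluster at hand, the same top-k idea can be tried on a local collection. The sketch below is a plain-Scala stand-in, not the Spark API: `TopKSketch`, `topKAsync` and the explicit `k` parameter are assumptions made for illustration, and the `Future` variant simply runs the local computation asynchronously.

```scala
import scala.concurrent.{ExecutionContext, Future}

object TopKSketch {
  // Local equivalent of "top k": sort descending and keep the first k values.
  def topK(aggregate: Seq[Double], k: Int): Seq[Double] =
    aggregate.sortWith(_ > _).take(k)

  // Same computation wrapped in a Future, so callers are not blocked.
  def topKAsync(aggregate: Seq[Double], k: Int)(implicit ec: ExecutionContext): Future[Seq[Double]] =
    Future(topK(aggregate, k))
}
```

A caller would supply an `ExecutionContext`, for example `scala.concurrent.ExecutionContext.Implicits.global`, and compose on the returned `Future` instead of waiting for it.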

Use the Spark Shell to quickly try out code samples
Available in Scala and PySpark

Collection To RDD

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distributedData = sc.parallelize(data)
distributedData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e

© 2014 DataStax, All Rights Reserved Company Confidential

Not Just MapReduce

Spark Basic Word Count

val conf = new SparkConf().setMaster(host).setAppName(app)
val sc = new SparkContext(conf)
sc.textFile(words)
  .flatMap(_.split("\\s+"))
  .map(word => (word.toLowerCase, 1))
  .reduceByKey(_ + _)
  .collect

RDDs Can be Generated from a Variety of Sources
• Text files
• Scala Collections

RDD Operations
• Transformation
• Action

Setting up C* and Spark
• DSE > 4.5.0: just start your nodes with dse cassandra -k
• Apache Cassandra: follow the excellent guide by Al Tobey

When Batch Is Not Enough

Your Data Is Like Candy
Delicious: you want it now

Batch Analytics
• Analysis after data has accumulated
• Decreases the weight of the data by the time it is processed

Streaming Analytics
• Analytics as data arrives
• The data won't be stale and neither will our analytics

Both in same app = Lambda

Spark Streaming
• I want results continuously in the event stream
• I want to run computations in my event-driven async apps
• Exactly once message guarantees

DStream (Discretized Stream)
• Continuous stream of micro batches
• Complex processing models with minimal effort
• Streaming computations on small time intervals
• RDD (time 0 to time 1), RDD (time 1 to time 2), RDD (time 2 to time 3)
• A transformation on a DStream = transformations on its RDDs

Basic Streaming: FileInputDStream

val conf = new SparkConf().setMaster(SparkMaster).setAppName(AppName)
// The batch streaming interval
val ssc = new StreamingContext(conf, Milliseconds(500))
ssc.textFileStream("s3n://raw_data_bucket/")
  .flatMap(_.split("\\s+"))
  .map(_.toLowerCase)
  .countByValue()
  .saveToCassandra(keyspace, table)
ssc.checkpoint(checkpointDir)
// Starts the streaming application piping raw incoming data to a sink
ssc.start()
ssc.awaitTermination

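The micro-batch idea behind DStreams, shown above, can be sketched without Spark: records are bucketed into fixed time intervals, and a transformation on the stream is just the same transformation applied to each bucket. This is a simplified model under stated assumptions; `MicroBatchSketch`, `Record`, `discretize` and `wordCounts` are hypothetical names, and a real DStream is distributed and unbounded rather than an in-memory sequence.

```scala
object MicroBatchSketch {
  // Hypothetical input record: arrival time in milliseconds and a line of text.
  final case class Record(timestampMs: Long, line: String)

  // Discretize: bucket records into consecutive intervals of batchMs milliseconds,
  // returned in time order as (interval index, records in that interval).
  def discretize(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records.groupBy(_.timestampMs / batchMs).toSeq.sortBy(_._1)

  // Per-batch transformation: lower-cased word counts for one micro batch.
  def wordCounts(batch: Seq[Record]): Map[String, Int] =
    batch.flatMap(_.line.split("\\s+"))
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Mapping `wordCounts` over the output of `discretize` gives one result per interval, which mirrors how a DStream transformation yields one RDD per batch.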
ReceiverInputDStreams
DStreams - the stream of raw data received from streaming sources:
• Basic Source - in the StreamingContext API
• Advanced Source - in external modules and separate Spark artifacts
Receivers
• Reliable Receivers - for data sources supporting acks (like Kafka)
• Unreliable Receivers - for data sources not supporting acks

Spark Streaming External Source/Sink

Streaming Window Operations

kvStream
  .map { case (k, v) => (k, v.value) }
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
  .saveToCassandra(keyspace, table)

• Window length: duration of the window (30 s in the code above)
• Sliding interval: interval at which the window operation is performed (every 10 s)

Scale - Apache Cassandra
• Scales linearly to as many nodes as you need
• Scales whenever you need

Performance - Apache Cassandra
• It's fast
• Built to sustain massive data insertion rates in irregular pattern spikes

Fault Tolerance & Availability - Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes

Fault Tolerance & Replication
How many copies of a piece of data should exist in the cluster? ReplicationFactor=3
[Diagram: one Cassandra cluster spanning two data centers, US-East and Europe. Each data center is a ring of four nodes A, B, C, D, and the nodes hold the replica sets ACD, ABC, ABD and BCD.]

Strategies - Apache Cassandra
• Consensus - Paxos Protocol
• Sequential Read / Write - Timeseries
• Tunable Consistency
• Gossip: "Did you hear node 1 was down??"

Architecture - Apache Cassandra
• Distributed, Masterless Ring Architecture
• Network Topology Aware
• Flexible, Schemaless - your data structure can evolve seamlessly over time

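The replication slides above (replication factor 3 on a four-node ring) can be illustrated with a small placement function. This is a deliberately simplified sketch: it assumes each range is stored on its owning node plus the next nodes clockwise, and it ignores racks, data centers and virtual nodes. `ReplicaPlacement`, `replicasFor` and `rangesHeldBy` are hypothetical names, not Cassandra APIs.

```scala
object ReplicaPlacement {
  // Replicas for the range owned by ring(ownerIndex): the owner plus the
  // next rf - 1 nodes clockwise, capped at the ring size.
  def replicasFor(ring: Vector[String], ownerIndex: Int, rf: Int): Vector[String] =
    Vector.tabulate(math.min(rf, ring.size))(i => ring((ownerIndex + i) % ring.size))

  // For each node, the set of owners whose ranges it stores a copy of.
  def rangesHeldBy(ring: Vector[String], rf: Int): Map[String, Set[String]] =
    ring.map { node =>
      val owners = ring.indices.filter(i => replicasFor(ring, i, rf).contains(node)).map(ring)
      node -> owners.toSet
    }.toMap
}
```

With the ring `Vector("A", "B", "C", "D")` and a replication factor of 3, every node ends up holding three of the four ranges, so any single node can be lost without losing data.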
C* At CERN: Large Hadron Collider
• ATLAS - Largest of several detectors along the Large Hadron Collider
• Measures particle production when protons collide at a very high center of mass energy
  - Bursty traffic
  - Volume of data from sensors requires
    - Very large trigger and data acquisition system
    - 30,000 applications on 2,000 nodes

Genetics / Biological Computations

IoT

CQL - Easy

CREATE TABLE users (
  username varchar,
  firstname varchar,
  lastname varchar,
  email list<varchar>,
  password varchar,
  created_date timestamp,
  PRIMARY KEY (username)
);

INSERT INTO users (username, firstname, lastname, email, password, created_date)
VALUES ('hedelson', 'Helena', 'Edelson', [''], 'ba27e03fd95e507daf2937c937d499ab', '2014-11-15 13:50:00')
IF NOT EXISTS;

• Familiar syntax
• Many Tools & Drivers
• Many Languages
• Friendly to programmers
• Paxos for locking

Timeseries Data

CREATE TABLE weather.raw_data (
  wsid text, year int, month int, day int, hour int,
  temperature double, dewpoint double, pressure double,
  wind_direction int, wind_speed double, one_hour_precip
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

• C* Clustering Columns
• Writes by most recent
• Reads return most recent first
• Cassandra will automatically sort by most recent for both write and read

val multipleStreams = (1 to numDstreams).map { i =>
  streamingContext.receiverStream[HttpRequest](new HttpReceiver(port))
}
streamingContext.union(multipleStreams)
  .map { httpRequest => TimelineRequestEvent(httpRequest) }
  .saveToCassandra("requests_ks", "timeline")

A record of every event, in order in which it happened, per URL:

CREATE TABLE IF NOT EXISTS requests_ks.timeline (
  timesegment bigint, url text, t_uuid timeuuid, method text,
  headers map <text, text>, body text,
  PRIMARY KEY ((url, timesegment), t_uuid)
);

timeuuid protects from simultaneous events over-writing one another.
timesegment protects from writing unbounded partitions.

Spark Cassandra Connector

Spark Cassandra Connector
• NOSQL JOINS!
• Write & Read data between Spark and Cassandra
• Compatible with Spark 1.3
• Handles Data Locality for Speed
• Implicit type conversions
• Server-Side Filtering - SELECT, WHERE, etc.
• Natural Timeseries Integration

Spark Cassandra Connector
[Diagram: User Application, Spark-Cassandra Connector, Spark Executor, C* Driver, and the Cassandra (C*) nodes]

Writing and Reading
SparkContext:
import com.datastax.spark.connector._
StreamingContext:
import com.datastax.spark.connector.streaming._

Write from Spark to Cassandra

// SparkContext, keyspace, table
sc.parallelize(Seq(0, 1, 2)).saveToCassandra("keyspace", "raw_data")

Spark RDD JOIN with NOSQL!

predictionsRdd.join(music).saveToCassandra("music", "predictions")

Read From C* to Spark

// SparkContext, keyspace, table; server-side column and row filtering
// Result type: CassandraRDD[CassandraRow]
val rdd = sc.cassandraTable("github", "commits")
  .select("user", "count", "year", "month")
  .where("commits >= ? and year = ?", 1000, 2015)

Rows: Custom Objects

// StreamingContext, keyspace, table
val rdd = ssc.cassandraTable[MonthlyCommits]("github", "commits_aggregate")
  .where("user = ? and project_name = ? and year = ?",
    "helena", "spark-cassandra-connector", 2015)

Rows

val tuplesRdd = sc.cassandraTable[(Int, Date, String)](db, tweetsTable)
  .select("cluster_id", "time", "cluster_name")
  .where("time > ? and time < ?",
    "2014-07-12 20:00:01", "2014-07-12 20:00:03")

val rdd = ssc.cassandraTable[MyDataType]("stats", "clustering_time")
  .where("key = 1").limit(10).collect

val rdd = ssc.cassandraTable[(Int, DateTime, String)]("stats", "clustering_time")
  .where("key = 1").withDescOrder.collect

Cassandra User Defined Types

CREATE TYPE address (
  street text,
  city text,
  zip_code int,
  country text,
  cross_streets set<text>
);

UDT = Your Custom Field Type In Cassandra

Cassandra UDTs With JSON

{
  "productId": 2,
  "name": "Kitchen Table",
  "price": 249.99,
  "description": "Rectangular table with oak finish",
  "dimensions": { "units": "inches", "length": 50.0, "width": 66.0, "height": 32 },
  "categories": {
    "Home Furnishings": { "catalogPage": 45, "url": "/home/furnishings" },
    "Kitchen Furnishings": { "catalogPage": 108, "url": "/kitchen/furnishings" }
  }
}

CREATE TYPE dimensions (units text, length float, width float, height float);
CREATE TYPE category (catalogPage int, url text);
CREATE TABLE product (
  productId int,
  name text,
  price float,
  description text,
  dimensions frozen <dimensions>,
  categories map <text, frozen <category>>,
  PRIMARY KEY (productId)
);

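On the application side, the product schema above maps naturally onto nested Scala case classes, one per user defined type. The sketch below only mirrors the shape of the CQL definitions and the sample JSON; `ProductModel` and the case class names are assumptions for illustration, and no connector mapping is shown.

```scala
object ProductModel {
  // Mirrors CREATE TYPE dimensions
  final case class Dimensions(units: String, length: Float, width: Float, height: Float)

  // Mirrors CREATE TYPE category
  final case class Category(catalogPage: Int, url: String)

  // Mirrors CREATE TABLE product; the map key is the category name.
  final case class Product(
    productId: Int,
    name: String,
    price: Float,
    description: String,
    dimensions: Dimensions,
    categories: Map[String, Category])

  // The sample document from the slide, expressed with the types above.
  val kitchenTable: Product = Product(
    productId = 2,
    name = "Kitchen Table",
    price = 249.99f,
    description = "Rectangular table with oak finish",
    dimensions = Dimensions("inches", 50.0f, 66.0f, 32.0f),
    categories = Map(
      "Home Furnishings" -> Category(45, "/home/furnishings"),
      "Kitchen Furnishings" -> Category(108, "/kitchen/furnishings")))
}
```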
Data Locality
● Spark asks an RDD for a list of its partitions (splits)
● Each split consists of one or more token-ranges
● For every partition
  ● Spark asks RDD for a list of preferred nodes to process on
  ● Spark creates a task and sends it to one of the nodes for execution

Every Spark task uses a CQL-like query to fetch data for the given token range:

SELECT "key", "value"
FROM "test"."kv"
WHERE
  token("key") > 595597420921139321 AND
  token("key") <= 595597431194200132 ALLOW FILTERING

Cassandra Locates a Row Based on Partition Key and Token Range
• All of the rows in a Cassandra Cluster are stored based on their location in the Token Range.
• Each of the Nodes in a Cassandra Cluster is primarily responsible for one set of Tokens.
• The CQL Schema designates at least one column to be the Partition Key. (Example row: Jacek 514 Red)
• The hash of the Partition Key tells us where a row should be stored. (Example row: Helena 514 Red)
[Diagram: a token ring labeled 0, 500 and 999 with four nodes: New York City/Manhattan (Helena), Warsaw (Piotr & Jacek), San Francisco (Brian, Russell & Alex) and St. Petersburg (Artem). Token ranges shown: 750 - 99, 350 - 749, 100 - 349.]

The Spark Executor uses the Connector to Pull Rows from the Local Cassandra Instance
• Node: Amsterdam, running a Spark Executor
• The C* Driver pages rows at a time
SELECT * FROM keyspace.table WHERE pk =

DataStax Enterprise Enables This Same Machinery with Solr Pushdown
• Node: Amsterdam, Spark Executor (Superman), DataStax Enterprise
• Tokens 780 - 830
SELECT * FROM keyspace.table
WHERE solr_query = 'title:b' AND token(pk) > 780 and token(pk) <= 830

Composable Pipelines With Spark, Kafka & Cassandra

Spark SQL with Cassandra

import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sparkContext)
cc.setKeyspace(keyspaceName)
cc.sql("""
  SELECT table1.a, table1.b, table1.c, table2.a
  FROM table1 AS table1
  JOIN table2 AS table2 ON table1.a = table2.a
  AND table1.b = table2.b
  AND table1.c = table2.c
  """)
  .map(Data(_))
  .saveToCassandra(keyspace1, table3)

Spark SQL with Cassandra & JSON

val sql = new SQLContext(sparkContext)
val json = Seq(
  """{"user":"helena","commits":98, "month":3, "year":2015}""",
  """{"user":"jacek-lewandowski", "commits":72, "month":3, "year":2015}""",
  """{"user":"pkolaczk", "commits":42, "month":3, "year":2015}""")
// write
sql.jsonRDD(json)
  .map(CommitStats(_))
  .flatMap(compute)
  .saveToCassandra("stats", "monthly_commits")
// read
val rdd = sc.cassandraTable[MonthlyCommits]("stats", "monthly_commits")

cqlsh> CREATE TABLE github_stats.commits_aggr(user VARCHAR PRIMARY KEY, commits INT…);

Spark Streaming, Kafka, C* and JSON

cqlsh> select * from github_stats.commits_aggr;

 user              | commits | month | year
-------------------+---------+-------+------
 pkolaczk          |      42 |     3 | 2015
 jacek-lewandowski |      43 |     3 | 2015
 helena            |      98 |     3 | 2015

(3 rows)

KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
.map { case (_,json) => JsonParser.parse(json).extract[MonthlyCommits]}
.saveToCassandra("github_stats","commits_aggr")

Kafka Streaming Word Count

sparkConf.set("", "")
val streamingContext = new StreamingContext(conf, Seconds(30))
KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
streamingContext, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
.map(_._2)
.countByValue()
.saveToCassandra("my_keyspace","wordcount")

Spark Streaming, Twitter & Cassandra

/** Cassandra is doing the sorting for you here. */
TwitterUtils.createStream(ssc, auth, tags, StorageLevel.MEMORY_ONLY_SER_2)
.countByValueAndWindow(Seconds(5), Seconds(5))
.transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time)) })
.saveToCassandra(keyspace, table)

CREATE TABLE IF NOT EXISTS keyspace.table (
topic text, interval text, mentions counter,
PRIMARY KEY(topic, interval)
) WITH CLUSTERING ORDER BY (interval DESC)

Spark MLlib
[Diagram: Your Data is split into Training Data and Test Data, then flows through Feature Extraction, Model Training and Model Testing. Captions: "Extract Data To Analyze" and "Train your model to predict".]

Spark Streaming ML, Kafka & C*

val ssc = new StreamingContext(new SparkConf()…, Seconds(5))
val testData = ssc.cassandraTable[String](keyspace,table).map(LabeledPoint.parse)
val trainingStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
.map(_._2).map(LabeledPoint.parse)
trainingStream.saveToCassandra("ml_keyspace", "raw_training_data")
val model = new StreamingLinearRegressionWithSGD()
.trainOn(trainingStream)
// Making predictions on testData
model.predictOnValues(testData.map(lp => (lp.label, lp.features)))
.saveToCassandra("ml_keyspace", "predictions")

KillrWeather
• Global sensors & satellites collect data
• Cassandra stores in sequence
• Application reads in sequence

Data model should look like your queries

Design Data Model to support queries
• Store raw data per ID
• Store time series data in order: most recent to oldest
• Compute and store aggregate data in the stream
• Set TTLs on historic data

Queries I Need
• Get data by ID
• Get data for a single date and time
• Get data for a window of time
• Compute, store and retrieve daily, monthly, annual aggregations

Data Model
• Weather Station Id and Time are unique
• Store as many as needed

CREATE TABLE daily_temperature (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  PRIMARY KEY (weather_station, year, month, day, hour)
);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,8,-5.1);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,9,-4.9);
INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,10,-5.3);

class HttpNodeGuardian extends ClusterAwareNodeGuardianActor {
cluster.joinSeedNodes(Vector(..))
context.actorOf(BalancingPool(PoolSize).props(Props(
  new KafkaPublisherActor(KafkaHosts, KafkaBatchSendSize))))
Cluster(context.system) registerOnMemberUp {
  context.actorOf(BalancingPool(PoolSize).props(Props(
    new HttpReceiverActor(KafkaHosts, KafkaBatchSendSize))))
}
def initialized: Actor.Receive = { … }
}

Load-Balanced Data Ingestion

class HttpDataIngestActor(kafka: ActorRef) extends Actor with ActorLogging {
implicit val system = context.system
implicit val askTimeout: Timeout = settings.timeout
implicit val materializer = ActorFlowMaterializer(
val requestHandler: HttpRequest => HttpResponse = {
case HttpRequest(HttpMethods.POST, Uri.Path("/weather/data"), headers, entity, _) =>
headers.toSource collect { case s: Source =>
kafka ! KafkaMessageEnvelope[String, String](topic, group,*)
HttpResponse(200, entity = HttpEntity(MediaTypes.`text/html`)
}.getOrElse(HttpResponse(404, entity = "Unsupported request"))
case _: HttpRequest =>
HttpResponse(400, entity = "Unsupported request")
Http(system).bind(HttpHost, HttpPort).map { case connection =>
log.info("Accepted new connection from " + connection.remoteAddress)
connection.handleWithSyncHandler(requestHandler) }
def receive : Actor.Receive = {
case e =>
}
}

Client: HTTP Receiver Akka Actor

class KafkaProducerActor[K, V](config: ProducerConfig) extends Actor {
override val supervisorStrategy =
OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
case _: ActorInitializationException => Stop
case _: FailedToSendMessageException => Restart
case _: ProducerClosedException => Restart
case _: NoBrokersForPartitionException => Escalate
case _: KafkaException => Escalate
case _: Exception => Escalate
}
private val producer = new KafkaProducer[K, V](config)
override def postStop(): Unit = producer.close()
def receive = {
case e: KafkaMessageEnvelope[K,V] => producer.send(e)
}
}

Client: Kafka Producer Akka Actor

Store raw data on ingestion

val kafkaStream = KafkaUtils.createStream[K, V, KDecoder, VDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.DISK_ONLY_2)
/** Saves the raw data to Cassandra. */
kafkaStream.saveToCassandra(keyspace, raw_ws_data)
/** Now proceed with computations from the same stream.. */
kafkaStream…

Store Raw Data From Kafka Stream To C*
Now we can replay on failure, for later computation, etc

Let's See Our Data Model Again

CREATE TABLE weather.raw_data (
  wsid text, year int, month int, day int, hour int,
  temperature double, dewpoint double, pressure double,
  wind_direction int, wind_speed double, one_hour_precip
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

CREATE TABLE daily_aggregate_precip (
  wsid text,
  year int,
  month int,
  day int,
  precipitation counter,
  PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Efficient Stream Computation
• Gets the partition key: Data Locality. Spark C* Connector feeds this to Spark.
• Cassandra Counter column in our schema, no expensive `reduceByKey` needed. Simply let C* do it: not expensive and fast.

class KafkaStreamingActor(kafkaPm: Map[String, String], ssc: StreamingContext, ws: WeatherSettings)
  extends AggregationActor {
import settings._
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)
/** RawWeatherData: wsid, year, month, day, oneHourPrecip */
kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
.saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)
/** Now the [[StreamingContext]] can be started. */
context.parent ! OutputStreamInitialized
def receive : Actor.Receive = {…}
}

/** For a given weather station, calculates annual cumulative precip - or year to date. */
class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {
def receive : Actor.Receive = {
case GetPrecipitation(wsid, year) => cumulative(wsid, year, sender)
case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
}
/** Computes annual aggregation. Precipitation values are 1 hour deltas from the previous. */
def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
ssc.cassandraTable[Double](keyspace, dailytable)
.where("wsid = ? AND year = ?", wsid, year)
.collectAsync().map(AnnualPrecipitation(_, wsid, year)) pipeTo requester
/** Returns the 10 highest temps for any station in the `year`. */
def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
ssc.cassandraTable[Double](keyspace, dailytable)
.where("wsid = ? AND year = ?", wsid, year)
.collectAsync().map(toTopK) pipeTo requester
}
}

Efficient Batch Analytics

class TemperatureActor(sc: SparkContext, settings: WeatherSettings)
  extends AggregationActor {
import akka.pattern.pipe
def receive: Actor.Receive = {
case e: GetMonthlyHiLowTemperature => highLow(e, sender)
}
def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
sc.cassandraTable[DailyTemperature](keyspace, daily_temperature_aggr)
.where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
.collectAsync().map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}

C* data is automatically sorted by most recent - due to our data model.
Additional Spark or collection sort not needed.

Learn More Online and at Cassandra Summit
