Illustration Image

Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

12/2/2020

Reading time:8 min

Cassandra and Spark, closing the gap between no sql and analytics c…

by Duyhai Doan

Cassandra and Spark, closing the gap between no sql and analytics c… SlideShare Explore You Successfully reported this slideshow.Cassandra and Spark, closing the gap between no sql and analytics codemotion berlin 2015Upcoming SlideShareLoading in …5× 6 Comments 3 Likes Statistics Notes Victor Coustenoble , Technical regional manager EMEA at Trifacta jean pasqualini , Développeur Certifié Symfony 3 chez Wynd at Wynd Mehdi TAZI , Consultant Architecte Big Data - Java JEE / PhD - IoT Virtualization on Cloud at AXA DIL No DownloadsNo notes for slide 1. @doanduyhaiCassandra & Spark, closing the gapbetween NoSQL and analyticsDuyHai DOAN, Technical Advocate 2. @doanduyhaiWho Am I ?Duy Hai DOANCassandra technical advocate•  talks, meetups, confs•  open-source devs (Achilles, …)•  OSS Cassandra point of contact☞ duy_hai.doan@datastax.com☞ @doanduyhai2 3. @doanduyhaiDatastax•  Founded in April 2010•  We contribute a lot to Apache Cassandra™•  400+ customers (25 of the Fortune 100), 400+ employees•  Headquarter in San Francisco Bay area•  EU headquarter in London, offices in France and Germany•  Datastax Enterprise = OSS Cassandra + extra features3 4. @doanduyhaiSpark – Cassandra Use CasesLoad data from varioussourcesAnalytics (join, aggregate, transform, …)Sanitize, validate, normalize, transform dataSchema migration,Data conversion4 5. Spark & Cassandra PresentationSpark & its eco-systemCassandra Quick Recap 6. @doanduyhaiWhat is Apache Spark ?Created atApache Project since 2010General data processing frameworkFaster than Hadoop, in memoryOne-framework-many-components approach6 7. @doanduyhaiPartitions transformationsmap(tuple => (tuple._3, tuple))Direct transformationShuffle (expensive !)groupByKey()countByKey()partitionRDDFinal action7 8. @doanduyhaiSpark eco-systemLocal Standalone cluster YARN MesosSpark Core Engine (Scala/Java/Python)Spark Streaming MLLibGraphXSpark SQLPersistenceCluster Manageretc…8 9. @doanduyhaiSpark eco-systemLocal Standalone cluster YARN MesosSpark Core Engine (Scala/Java/Python)Spark Streaming MLLibGraphXSpark SQLPersistenceCluster Manageretc…9 10. @doanduyhaiWhat is Apache Cassandra?Created atApache Project since 2009Distributed NoSQL databaseEventual consistencyDistributed table abstraction10 11. @doanduyhaiCassandra data distribution reminderRandom: hash of #partition → token = hash(#p)Hash: ]-X, X]X = huge number (264/2)n1n2n3n4n5n6n7n811 12. @doanduyhaiCassandra token rangesA: ]0, X/8]B: ] X/8, 2X/8]C: ] 2X/8, 3X/8]D: ] 3X/8, 4X/8]E: ] 4X/8, 5X/8]F: ] 5X/8, 6X/8]G: ] 6X/8, 7X/8]H: ] 7X/8, X]Murmur3 hash functionn1n2n3n4n5n6n7n8ABCDEFGH12 13. @doanduyhaiLinear scalabilityn1n2n3n4n5n6n7n8ABCDEFGHuser_id1user_id2user_id3user_id4user_id513 14. @doanduyhaiLinear scalabilityn1n2n3n4n5n6n7n8ABCDEFGHuser_id1user_id2user_id3user_id4user_id514 15. @doanduyhaiCassandra Query Language (CQL)INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);UPDATE users SET age = 34 WHERE login = ‘jdoe’;DELETE age FROM users WHERE login = ‘jdoe’;SELECT age FROM users WHERE login = ‘jdoe’;15 16. Spark & Cassandra ConnectorSpark Core APISparkSQL/DataFrameSpark Streaming 17. @doanduyhaiSpark/Cassandra connector architectureAll Cassandra types supported and converted to Scala typesServer side data filtering (SELECT … WHERE …)Use Java-driver underneathScala and Java support. Python support via PySpark (exp.)17 18. @doanduyhaiConnector architecture – Core APICassandra tables exposed as Spark RDDsRead from and write to CassandraMapping of C* tables and rows to Scala objects•  CassandraRDD and CassandraRow•  Scala case class (object mapper)•  Scala tuples18 19. @doanduyhaiSpark Corehttps://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData 20. @doanduyhaiConnector architecture – DataFrameMapping of Cassandra table to DataFrame•  CassandraSQLContext à org.apache.spark.sql.SQLContext•  CassandraSQLRow à org.apache.spark.sql.catalyst.expressions.Row•  Mapping of Cassandra types to Catalyst types•  CassandraCatalog à Catalog (used by Catalyst Analyzer)20 21. @doanduyhaiConnector architecture – DataFrameMapping of Cassandra table to SchemaRDD•  CassandraSourceRelation•  extends BaseRelation with InsertableRelation with PruntedFilteredScan•  custom query plan•  push predicates to CQL for early filtering (if possible)SELECT * FROM user_emails WHERE login = ‘jdoe’;21 22. @doanduyhaiSpark SQLhttps://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData 23. @doanduyhaiConnector architecture – Spark StreamingStreaming data INTO Cassandra table•  trivial setup•  be careful about your Cassandra data model when having an infinitestream !!!Streaming data OUT of Cassandra tables (CASSANDRA-8844) ?•  notification system (publish/subscribe)•  at-least-once delivery semantics•  work in progress …23 24. @doanduyhaiSpark Streaminghttps://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData 25. @doanduyhaiQ & R! " 26. Spark/Cassandra operationsCluster deployment & job lifecycleData locality 27. @doanduyhaiCluster deploymentC*SparkMSparkWC*SparkWC*SparkWC*SparkWC*SparkWStand-alone cluster27 28. @doanduyhaiCassandra – Spark placementSpark Worker Spark Worker Spark Worker Spark Worker1 Cassandra process ⟷ 1 Spark workerC* C* C* C*28Spark Master 29. @doanduyhaiCassandra – Spark job lifecycleSpark Worker Spark Worker Spark Worker Spark WorkerC* C* C* C*29Spark MasterSpark ClientDriver ProgramSpark Context1D e f i n e y o u rbusiness logichere ! 30. @doanduyhaiCassandra – Spark job lifecycleSpark Worker Spark Worker Spark Worker Spark WorkerC* C* C* C*30Spark MasterSpark ClientDriver ProgramSpark Context2 31. @doanduyhaiCassandra – Spark job lifecycleSpark Worker Spark Worker Spark Worker Spark WorkerC* C* C* C*31Spark MasterSpark ClientDriver ProgramSpark Context3 333 32. @doanduyhaiCassandra – Spark job lifecycleSpark Worker Spark Worker Spark Worker Spark WorkerC* C* C* C*32Spark MasterSpark ClientDriver ProgramSpark ContextSpark Executor Spark Executor Spark Executor Spark Executor444 4 33. @doanduyhaiCassandra – Spark job lifecycleSpark Worker Spark Worker Spark Worker Spark WorkerC* C* C* C*33Spark MasterSpark ClientDriver ProgramSpark ContextSpark Executor Spark Executor Spark Executor Spark Executor5 5 5 5 34. @doanduyhaiData Locality – Cassandra token rangesA: ]0, X/8]B: ] X/8, 2X/8]C: ] 2X/8, 3X/8]D: ] 3X/8, 4X/8]E: ] 4X/8, 5X/8]F: ] 5X/8, 6X/8]G: ] 6X/8, 7X/8]H: ] 7X/8, X]n1n2n3n4n5n6n7n8ABCDEFGH34 35. @doanduyhaiData Locality – How ToC*SparkMSparkWC*SparkWC*SparkWC*SparkWC*SparkWSpark partition RDDCassandratokens ranges35 36. @doanduyhaiData Locality – How ToC*SparkMSparkWC*SparkWC*SparkWC*SparkWC*SparkWUse Murmur3Partitioner36 37. @doanduyhaiRead data localityRead from Cassandra37 38. @doanduyhaiRead data localitySpark shuffle operations38 39. @doanduyhaiWrite to Cassandra without data localityAsync batches fan-out writes to CassandraBecause of shuffle, original data locality is lost39 40. @doanduyhaiWrite to Cassandra with data localityWrite to Cassandrardd.repartitionByCassandraReplica("keyspace","table")40 41. @doanduyhaiWrite data locality•  either stream data in Spark layer using repartitionByCassandraReplica()•  or flush data to Cassandra by async batches•  in any case, there will be data movement on network (sorry no magic)41 42. @doanduyhaiJoins with data localityCREATE TABLE artists(name text, style text, … PRIMARY KEY(name));CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title));val join: CassandraJoinRDD[(String,Int), (String,String)] =sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)// Select only useful columns for join and processing.select("artist","year").as((_:String, _:Int))// Repartition RDDs by "artists" PK, which is "name".repartitionByCassandraReplica(KEYSPACE, ARTISTS)// Join with "artists" table, selecting only "name" and "country" columns.joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country")).on(SomeColumns("name"))42 43. @doanduyhaiJoins pipeline with data localityLOCAL READFROM CASSANDRA43 44. @doanduyhaiJoins pipeline with data localityREPARTITION TO MAPCASSANDRA REPLICA44 45. @doanduyhaiJoins pipeline with data localityJOIN WITHDATA LOCALITY45 46. @doanduyhaiPerfect data locality scenario•  read localy from Cassandra•  use operations that do not require shuffle in Spark (map, filter, …)•  repartitionbyCassandraReplica()•  à to a table having same partition key as original table•  save back into this Cassandra tableSanitize, validate, normalize, transform dataUSE CASE46 47. Spark/Cassandra use-case demosData cleaningSchema migrationAnalytics 48. @doanduyhaiUse CasesLoad data from varioussourcesAnalytics (join, aggregate, transform, …)Sanitize, validate, normalize, transform dataSchema migration,Data conversion48 49. @doanduyhaiData cleaning use-casesBug in your application ?Dirty input data ?☞ Spark job to clean it up! (perfect data locality)Sanitize, validate, normalize, transform data49 50. @doanduyhaiData Cleaninghttps://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData 51. @doanduyhaiSchema migration use-casesBusiness requirements change with time ?Current data model no longer relevant ?☞ Spark job to migrate data !Schema migration,Data conversion51 52. @doanduyhaiData Migrationhttps://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData 53. @doanduyhaiAnalytics use-casesGiven existing tables of performers and albums, I want to :•  count the number of albums releases by decade (70’s, 80’s, 90’s, …)☞ Spark job to compute analytics !Analytics (join, aggregate, transform, …)53 54. @doanduyhaiAnalytics pipeline①  Read from production transactional tables②  Perform aggregation with Spark③  Save back data into dedicated tables for fast visualization④  Repeat step ①54 55. @doanduyhaiAnalyticshttps://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData 56. @doanduyhaiThank You@doanduyhaiduy_hai.doan@datastax.comhttps://academy.datastax.com/ Recommended Spark cassandra integration, theory and practiceDuyhai Doan Apache Zeppelin @DevoxxFR 2016Duyhai Doan Spark Cassandra 2016Duyhai Doan KillrChat presentationDuyhai Doan Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016Duyhai Doan Cassandra introduction @ NantesJUGDuyhai Doan Fast track to getting started with DSE Max @ INGDuyhai Doan Cassandra introduction @ ParisJUGDuyhai Doan Cassandra drivers and librariesDuyhai Doan Cassandra introduction mars jugDuyhai Doan About Blog Terms Privacy Copyright × Public clipboards featuring this slideNo public clipboards found for this slideSelect another clipboard ×Looks like you’ve clipped this slide to already.Create a clipboardYou just clipped your first slide! Clipping is a handy way to collect important slides you want to go back to later. Now customize the name of a clipboard to store your clips. Description Visibility Others can see my Clipboard

Illustration Image
Cassandra and Spark, closing the gap between no sql and analytics c…

Successfully reported this slideshow.

Cassandra and Spark, closing the gap between no sql and analytics codemotion berlin 2015
@doanduyhai
Cassandra & Spark, closing the gap
between NoSQL and analytics
DuyHai DOAN, Technical Advocate
@doanduyhai
Who Am I ?
Duy Hai DOAN
Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …...
@doanduyhai
Datastax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the For...
@doanduyhai
Spark – Cassandra Use Cases
Load data from various
sources
Analytics (join, aggregate, transform, …)
Sanitize,...
Spark & Cassandra Presentation
Spark & its eco-system
Cassandra Quick Recap
@doanduyhai
What is Apache Spark ?
Created at
Apache Project since 2010
General data processing framework
Faster than Hado...
@doanduyhai
Partitions transformations
map(tuple => (tuple._3, tuple))
Direct transformation
Shuffle (expensive !)
groupBy...
@doanduyhai
Spark eco-system
Local Standalone cluster YARN Mesos
Spark Core Engine (Scala/Java/Python)
Spark Streaming MLL...
@doanduyhai
Spark eco-system
Local Standalone cluster YARN Mesos
Spark Core Engine (Scala/Java/Python)
Spark Streaming MLL...
@doanduyhai
What is Apache Cassandra?
Created at
Apache Project since 2009
Distributed NoSQL database
Eventual consistency...
@doanduyhai
Cassandra data distribution reminder
Random: hash of #partition → token = hash(#p)
Hash: ]-X, X]
X = huge numb...
@doanduyhai
Cassandra token ranges
A: ]0, X/8]
B: ] X/8, 2X/8]
C: ] 2X/8, 3X/8]
D: ] 3X/8, 4X/8]
E: ] 4X/8, 5X/8]
F: ] 5X/...
@doanduyhai
Linear scalability
n1
n2
n3
n4
n5
n6
n7
n8
A
B
C
D
E
F
G
H
user_id1
user_id2
user_id3
user_id4
user_id5
13
@doanduyhai
Linear scalability
n1
n2
n3
n4
n5
n6
n7
n8
A
B
C
D
E
F
G
H
user_id1
user_id2
user_id3
user_id4
user_id5
14
@doanduyhai
Cassandra Query Language (CQL)
INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);
UPDATE user...
Spark & Cassandra Connector
Spark Core API
SparkSQL/DataFrame
Spark Streaming
@doanduyhai
Spark/Cassandra connector architecture
All Cassandra types supported and converted to Scala types
Server side ...
@doanduyhai
Connector architecture – Core API
Cassandra tables exposed as Spark RDDs
Read from and write to Cassandra
Mapp...
@doanduyhai
Spark Core
https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
@doanduyhai
Connector architecture – DataFrame
Mapping of Cassandra table to DataFrame
•  CassandraSQLContext à org.apache...
@doanduyhai
Connector architecture – DataFrame
Mapping of Cassandra table to SchemaRDD
•  CassandraSourceRelation
•  exten...
@doanduyhai
Spark SQL
https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
@doanduyhai
Connector architecture – Spark Streaming
Streaming data INTO Cassandra table
•  trivial setup
•  be careful ab...
@doanduyhai
Spark Streaming
https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
@doanduyhai
Q & R
! "
Spark/Cassandra operations
Cluster deployment & job lifecycle
Data locality
@doanduyhai
Cluster deployment
C*
SparkM
SparkW
C*
SparkW
C*
SparkW
C*
SparkW
C*
SparkW
Stand-alone cluster
27
@doanduyhai
Cassandra – Spark placement
Spark Worker Spark Worker Spark Worker Spark Worker
1 Cassandra process ⟷ 1 Spark ...
@doanduyhai
Cassandra – Spark job lifecycle
Spark Worker Spark Worker Spark Worker Spark Worker
C* C* C* C*
29
Spark Maste...
@doanduyhai
Cassandra – Spark job lifecycle
Spark Worker Spark Worker Spark Worker Spark Worker
C* C* C* C*
30
Spark Maste...
@doanduyhai
Cassandra – Spark job lifecycle
Spark Worker Spark Worker Spark Worker Spark Worker
C* C* C* C*
31
Spark Maste...
@doanduyhai
Cassandra – Spark job lifecycle
Spark Worker Spark Worker Spark Worker Spark Worker
C* C* C* C*
32
Spark Maste...
@doanduyhai
Cassandra – Spark job lifecycle
Spark Worker Spark Worker Spark Worker Spark Worker
C* C* C* C*
33
Spark Maste...
@doanduyhai
Data Locality – Cassandra token ranges
A: ]0, X/8]
B: ] X/8, 2X/8]
C: ] 2X/8, 3X/8]
D: ] 3X/8, 4X/8]
E: ] 4X/8...
@doanduyhai
Data Locality – How To
C*
SparkM
SparkW
C*
SparkW
C*
SparkW
C*
SparkW
C*
SparkW
Spark partition RDD
Cassandra
...
@doanduyhai
Data Locality – How To
C*
SparkM
SparkW
C*
SparkW
C*
SparkW
C*
SparkW
C*
SparkW
Use Murmur3Partitioner
36
@doanduyhai
Read data locality
Read from Cassandra
37
@doanduyhai
Read data locality
Spark shuffle operations
38
@doanduyhai
Write to Cassandra without data locality
Async batches fan-out writes to Cassandra
Because of shuffle, origina...
@doanduyhai
Write to Cassandra with data locality
Write to Cassandra
rdd.repartitionByCassandraReplica("keyspace","table")...
@doanduyhai
Write data locality
•  either stream data in Spark layer using repartitionByCassandraReplica()
•  or flush dat...
@doanduyhai
Joins with data locality
CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));
CREATE TABLE albums...
@doanduyhai
Joins pipeline with data locality
LOCAL READ
FROM CASSANDRA
43
@doanduyhai
Joins pipeline with data locality
REPARTITION TO MAP
CASSANDRA REPLICA
44
@doanduyhai
Joins pipeline with data locality
JOIN WITH
DATA LOCALITY
45
@doanduyhai
Perfect data locality scenario
•  read localy from Cassandra
•  use operations that do not require shuffle in ...
Spark/Cassandra use-case demos
Data cleaning
Schema migration
Analytics
@doanduyhai
Use Cases
Load data from various
sources
Analytics (join, aggregate, transform, …)
Sanitize, validate, normali...
@doanduyhai
Data cleaning use-cases
Bug in your application ?
Dirty input data ?
☞ Spark job to clean it up! (perfect data...
@doanduyhai
Data Cleaning
https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
@doanduyhai
Schema migration use-cases
Business requirements change with time ?
Current data model no longer relevant ?
☞ ...
@doanduyhai
Data Migration
https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
@doanduyhai
Analytics use-cases
Given existing tables of performers and albums, I want to :
•  count the number of albums ...
@doanduyhai
Analytics pipeline
①  Read from production transactional tables
②  Perform aggregation with Spark
③  Save back...
@doanduyhai
Analytics
https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
@doanduyhai
Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/

Upcoming SlideShare

Loading in …5

×

  1. 1. @doanduyhai Cassandra & Spark, closing the gap between NoSQL and analytics DuyHai DOAN, Technical Advocate
  2. 2. @doanduyhai Who Am I ? Duy Hai DOAN Cassandra technical advocate •  talks, meetups, confs •  open-source devs (Achilles, …) •  OSS Cassandra point of contact ☞ duy_hai.doan@datastax.com ☞ @doanduyhai 2
  3. 3. @doanduyhai Datastax •  Founded in April 2010 •  We contribute a lot to Apache Cassandra™ •  400+ customers (25 of the Fortune 100), 400+ employees •  Headquarter in San Francisco Bay area •  EU headquarter in London, offices in France and Germany •  Datastax Enterprise = OSS Cassandra + extra features 3
  4. 4. @doanduyhai Spark – Cassandra Use Cases Load data from various sources Analytics (join, aggregate, transform, …) Sanitize, validate, normalize, transform data Schema migration, Data conversion 4
  5. 5. Spark & Cassandra Presentation Spark & its eco-system Cassandra Quick Recap
  6. 6. @doanduyhai What is Apache Spark ? Created at Apache Project since 2010 General data processing framework Faster than Hadoop, in memory One-framework-many-components approach 6
  7. 7. @doanduyhai Partitions transformations map(tuple => (tuple._3, tuple)) Direct transformation Shuffle (expensive !) groupByKey() countByKey() partition RDD Final action 7
  8. 8. @doanduyhai Spark eco-system Local Standalone cluster YARN Mesos Spark Core Engine (Scala/Java/Python) Spark Streaming MLLibGraphXSpark SQL Persistence Cluster Manager … etc… 8
  9. 9. @doanduyhai Spark eco-system Local Standalone cluster YARN Mesos Spark Core Engine (Scala/Java/Python) Spark Streaming MLLibGraphXSpark SQL Persistence Cluster Manager … etc… 9
  10. 10. @doanduyhai What is Apache Cassandra? Created at Apache Project since 2009 Distributed NoSQL database Eventual consistency Distributed table abstraction 10
  11. 11. @doanduyhai Cassandra data distribution reminder Random: hash of #partition → token = hash(#p) Hash: ]-X, X] X = huge number (264/2) n1 n2 n3 n4 n5 n6 n7 n8 11
  12. 12. @doanduyhai Cassandra token ranges A: ]0, X/8] B: ] X/8, 2X/8] C: ] 2X/8, 3X/8] D: ] 3X/8, 4X/8] E: ] 4X/8, 5X/8] F: ] 5X/8, 6X/8] G: ] 6X/8, 7X/8] H: ] 7X/8, X] Murmur3 hash function n1 n2 n3 n4 n5 n6 n7 n8 A B C D E F G H 12
  13. 13. @doanduyhai Linear scalability n1 n2 n3 n4 n5 n6 n7 n8 A B C D E F G H user_id1 user_id2 user_id3 user_id4 user_id5 13
  14. 14. @doanduyhai Linear scalability n1 n2 n3 n4 n5 n6 n7 n8 A B C D E F G H user_id1 user_id2 user_id3 user_id4 user_id5 14
  15. 15. @doanduyhai Cassandra Query Language (CQL) INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33); UPDATE users SET age = 34 WHERE login = ‘jdoe’; DELETE age FROM users WHERE login = ‘jdoe’; SELECT age FROM users WHERE login = ‘jdoe’; 15
  16. 16. Spark & Cassandra Connector Spark Core API SparkSQL/DataFrame Spark Streaming
  17. 17. @doanduyhai Spark/Cassandra connector architecture All Cassandra types supported and converted to Scala types Server side data filtering (SELECT … WHERE …) Use Java-driver underneath Scala and Java support. Python support via PySpark (exp.) 17
  18. 18. @doanduyhai Connector architecture – Core API Cassandra tables exposed as Spark RDDs Read from and write to Cassandra Mapping of C* tables and rows to Scala objects •  CassandraRDD and CassandraRow •  Scala case class (object mapper) •  Scala tuples 18
  19. 19. @doanduyhai Spark Core https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
  20. 20. @doanduyhai Connector architecture – DataFrame Mapping of Cassandra table to DataFrame •  CassandraSQLContext à org.apache.spark.sql.SQLContext •  CassandraSQLRow à org.apache.spark.sql.catalyst.expressions.Row •  Mapping of Cassandra types to Catalyst types •  CassandraCatalog à Catalog (used by Catalyst Analyzer) 20
  21. 21. @doanduyhai Connector architecture – DataFrame Mapping of Cassandra table to SchemaRDD •  CassandraSourceRelation •  extends BaseRelation with InsertableRelation with PruntedFilteredScan •  custom query plan •  push predicates to CQL for early filtering (if possible) SELECT * FROM user_emails WHERE login = ‘jdoe’; 21
  22. 22. @doanduyhai Spark SQL https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
  23. 23. @doanduyhai Connector architecture – Spark Streaming Streaming data INTO Cassandra table •  trivial setup •  be careful about your Cassandra data model when having an infinite stream !!! Streaming data OUT of Cassandra tables (CASSANDRA-8844) ? •  notification system (publish/subscribe) •  at-least-once delivery semantics •  work in progress … 23
  24. 24. @doanduyhai Spark Streaming https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
  25. 25. @doanduyhai Q & R ! "
  26. 26. Spark/Cassandra operations Cluster deployment & job lifecycle Data locality
  27. 27. @doanduyhai Cluster deployment C* SparkM SparkW C* SparkW C* SparkW C* SparkW C* SparkW Stand-alone cluster 27
  28. 28. @doanduyhai Cassandra – Spark placement Spark Worker Spark Worker Spark Worker Spark Worker 1 Cassandra process ⟷ 1 Spark worker C* C* C* C* 28 Spark Master
  29. 29. @doanduyhai Cassandra – Spark job lifecycle Spark Worker Spark Worker Spark Worker Spark Worker C* C* C* C* 29 Spark Master Spark Client Driver Program Spark Context 1 D e f i n e y o u r business logic here !
  30. 30. @doanduyhai Cassandra – Spark job lifecycle Spark Worker Spark Worker Spark Worker Spark Worker C* C* C* C* 30 Spark Master Spark Client Driver Program Spark Context 2
  31. 31. @doanduyhai Cassandra – Spark job lifecycle Spark Worker Spark Worker Spark Worker Spark Worker C* C* C* C* 31 Spark Master Spark Client Driver Program Spark Context 3 333
  32. 32. @doanduyhai Cassandra – Spark job lifecycle Spark Worker Spark Worker Spark Worker Spark Worker C* C* C* C* 32 Spark Master Spark Client Driver Program Spark Context Spark Executor Spark Executor Spark Executor Spark Executor 444 4
  33. 33. @doanduyhai Cassandra – Spark job lifecycle Spark Worker Spark Worker Spark Worker Spark Worker C* C* C* C* 33 Spark Master Spark Client Driver Program Spark Context Spark Executor Spark Executor Spark Executor Spark Executor 5 5 5 5
  34. 34. @doanduyhai Data Locality – Cassandra token ranges A: ]0, X/8] B: ] X/8, 2X/8] C: ] 2X/8, 3X/8] D: ] 3X/8, 4X/8] E: ] 4X/8, 5X/8] F: ] 5X/8, 6X/8] G: ] 6X/8, 7X/8] H: ] 7X/8, X] n1 n2 n3 n4 n5 n6 n7 n8 A B C D E F G H 34
  35. 35. @doanduyhai Data Locality – How To C* SparkM SparkW C* SparkW C* SparkW C* SparkW C* SparkW Spark partition RDD Cassandra tokens ranges 35
  36. 36. @doanduyhai Data Locality – How To C* SparkM SparkW C* SparkW C* SparkW C* SparkW C* SparkW Use Murmur3Partitioner 36
  37. 37. @doanduyhai Read data locality Read from Cassandra 37
  38. 38. @doanduyhai Read data locality Spark shuffle operations 38
  39. 39. @doanduyhai Write to Cassandra without data locality Async batches fan-out writes to Cassandra Because of shuffle, original data locality is lost 39
  40. 40. @doanduyhai Write to Cassandra with data locality Write to Cassandra rdd.repartitionByCassandraReplica("keyspace","table") 40
  41. 41. @doanduyhai Write data locality •  either stream data in Spark layer using repartitionByCassandraReplica() •  or flush data to Cassandra by async batches •  in any case, there will be data movement on network (sorry no magic) 41
  42. 42. @doanduyhai Joins with data locality CREATE TABLE artists(name text, style text, … PRIMARY KEY(name)); CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title)); val join: CassandraJoinRDD[(String,Int), (String,String)] = sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS) // Select only useful columns for join and processing .select("artist","year") .as((_:String, _:Int)) // Repartition RDDs by "artists" PK, which is "name" .repartitionByCassandraReplica(KEYSPACE, ARTISTS) // Join with "artists" table, selecting only "name" and "country" columns .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country")) .on(SomeColumns("name")) 42
  43. 43. @doanduyhai Joins pipeline with data locality LOCAL READ FROM CASSANDRA 43
  44. 44. @doanduyhai Joins pipeline with data locality REPARTITION TO MAP CASSANDRA REPLICA 44
  45. 45. @doanduyhai Joins pipeline with data locality JOIN WITH DATA LOCALITY 45
  46. 46. @doanduyhai Perfect data locality scenario •  read localy from Cassandra •  use operations that do not require shuffle in Spark (map, filter, …) •  repartitionbyCassandraReplica() •  à to a table having same partition key as original table •  save back into this Cassandra table Sanitize, validate, normalize, transform data USE CASE 46
  47. 47. Spark/Cassandra use-case demos Data cleaning Schema migration Analytics
  48. 48. @doanduyhai Use Cases Load data from various sources Analytics (join, aggregate, transform, …) Sanitize, validate, normalize, transform data Schema migration, Data conversion 48
  49. 49. @doanduyhai Data cleaning use-cases Bug in your application ? Dirty input data ? ☞ Spark job to clean it up! (perfect data locality) Sanitize, validate, normalize, transform data 49
  50. 50. @doanduyhai Data Cleaning https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
  51. 51. @doanduyhai Schema migration use-cases Business requirements change with time ? Current data model no longer relevant ? ☞ Spark job to migrate data ! Schema migration, Data conversion 51
  52. 52. @doanduyhai Data Migration https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
  53. 53. @doanduyhai Analytics use-cases Given existing tables of performers and albums, I want to : •  count the number of albums releases by decade (70’s, 80’s, 90’s, …) ☞ Spark job to compute analytics ! Analytics (join, aggregate, transform, …) 53
  54. 54. @doanduyhai Analytics pipeline ①  Read from production transactional tables ②  Perform aggregation with Spark ③  Save back data into dedicated tables for fast visualization ④  Repeat step ① 54
  55. 55. @doanduyhai Analytics https://github.com/doanduyhai/incubator-zeppelin/tree/ApacheBigData
  56. 56. @doanduyhai Thank You @doanduyhai duy_hai.doan@datastax.com https://academy.datastax.com/

×

Related Articles

python
cassandra
spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

andreia-negreira

12/2/2023

cassandra
spark

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt! 
We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Sign up below to receive email updates and see what's going on with our company

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

cassandra