12/2/2020

Reading time:6 min

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, D…

by Spark Summit



Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)

Cassandra and Spark:
Optimizing for Data
Locality
Russell Spitzer 
Software Engineer @ DataStax


1. Cassandra and Spark: Optimizing for Data Locality. Russell Spitzer, Software Engineer @ DataStax
2. Lex Luthor Was Right: Location is Important. The value of many things is based upon their location. Developed land near the beach is valuable, but desert and farmland are generally much cheaper. Unfortunately, moving land is generally impossible.
3. Lex Luthor Was Wrong: Don't ETL the Data Ocean (or lake, or swamp, or whatever body of water is "Data" at the time this slide is viewed). Diagram label: My House.
4. Spark is Our Hero: Giving us the Ability to Do Our Analytics Without the ETL.
5-6. Moving Data Between Machines Is Expensive: Do Work Where the Data Lives! Our Cassandra nodes are like cities and our Spark executors are like superheroes. We'd rather they spend their time locally than fly back and forth all the time. Diagram: the nodes Metropolis and Gotham, with the Spark executors Superman and Batman.
7. The DataStax open source Spark Cassandra Connector is available on GitHub: https://github.com/datastax/spark-cassandra-connector
   • Compatible with Spark 1.3
   • Read and Map C* Data Types
   • Saves To Cassandra
   • Intelligent write batching
   • Supports Collections
   • Secondary index pushdown
   • Arbitrary CQL Execution!
8. How the Spark Cassandra Connector Reads Data Node Local
9. Cassandra Locates a Row Based on Partition Key and Token Range. All of the rows in a Cassandra cluster are stored based on their location in the token range.
10-11. Cassandra Locates a Row Based on Partition Key and Token Range. Each of the nodes in a Cassandra cluster is primarily responsible for one set of tokens. Diagram: a token ring labeled 0, 500, and 999 across the nodes Metropolis, Gotham, and Coast City, with the ranges 750 - 99, 350 - 749, and 100 - 349.
12. Cassandra Locates a Row Based on Partition Key and Token Range. The CQL schema designates at least one column to be the partition key. Example row: Jacek, 514, Red. Nodes: Metropolis, Gotham, Coast City.
13. Cassandra Locates a Row Based on Partition Key and Token Range. The hash of the partition key tells us where a row should be stored. Example row: Jacek, 514, Red, shown at token 830.
14. Cassandra Locates a Row Based on Partition Key and Token Range. With VNodes the ranges are not contiguous, but the same mechanism controls row location. Example row: Jacek, 514, Red, shown at token 830.
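
The lookup described on slides 9-14 can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation. The 0-999 ring and the three ranges come from the slides, but assigning 350 - 749 to Gotham and 100 - 349 to Coast City is an assumption (the later 780 - 830 query only suggests that Metropolis owns the wrap-around range), and `toy_token` is a hypothetical stand-in for Cassandra's real partitioner.

```python
import hashlib

# (first_token, last_token, node), both ends inclusive, as drawn on the slides.
# Which of Gotham / Coast City owns which range is an assumption.
RING = [
    (750, 99, "Metropolis"),   # wraps around past 999 back to 0
    (350, 749, "Gotham"),
    (100, 349, "Coast City"),
]

def toy_token(partition_key: str) -> int:
    """Hypothetical hash: map a partition key onto the 0-999 toy ring."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 1000

def owner_of(token: int) -> str:
    """Return the node primarily responsible for a token."""
    for first, last, node in RING:
        if first <= last:
            if first <= token <= last:
                return node
        elif token >= first or token <= last:  # wrap-around range
            return node
    raise ValueError(f"token {token} is outside the ring")

print(owner_of(830))                 # Metropolis
print(owner_of(toy_token("Jacek")))  # whichever node the toy hash picks
```

Because every client can compute the same hash, any of them can work out where a row lives without asking a central directory, which is what makes node-local reads possible.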
15. Loading Huge Amounts of Data: table scans involve loading most of the data in Cassandra.
16-20. Cassandra RDDs Use the Token Range to Create Node-Local Spark Partitions. They are created with sc.cassandraTable or sqlContext.load(org.apache.spark.sql.cassandra). The setting spark.cassandra.input.split.size is the (estimated) number of C* partitions to be placed in a Spark partition. Diagram: token ranges being grouped into the Spark partitions of a CassandraRDD.
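
As a rough illustration of slides 16-20, the sketch below cuts one node's token range into Spark partitions that each hold about spark.cassandra.input.split.size Cassandra partitions. The even split by token width, the function name `split_token_range`, and the 400,000-partition estimate are assumptions made for the example; the slides do not describe the connector's actual splitting logic.

```python
import math

def split_token_range(start, end, estimated_partitions, split_size):
    """Cut the token range (start, end] into sub-ranges that each hold
    roughly `split_size` Cassandra partitions, assuming data is spread
    evenly over the tokens. Only handles non-wrapping ranges (start < end)."""
    if not start < end:
        raise ValueError("expected start < end")
    pieces = max(1, math.ceil(estimated_partitions / split_size))
    pieces = min(pieces, end - start)  # never create an empty sub-range
    width = end - start
    bounds = [start + (width * i) // pieces for i in range(pieces)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

# Tokens 350 - 749, written as the half-open range (349, 749], holding an
# estimated 400,000 Cassandra partitions, with a split size of 100,000.
for lo, hi in split_token_range(349, 749, 400_000, 100_000):
    print(f"token(pk) > {lo} AND token(pk) <= {hi}")
```

A smaller split size yields more, smaller Spark partitions for the same token range; a larger one yields fewer, bigger ones.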
21. Spark Partitions Are Annotated With the Location for the Token Ranges They Span. The Spark driver waits spark.locality.wait for the preferred location to have an open executor. Diagram: the driver assigns a task for a partition annotated "Metropolis"; the executors are Superman (Metropolis), Batman (Gotham), and Green L. (Coast City).
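
The trade-off behind spark.locality.wait on slide 21 can be shown with a deliberately simplified model. This is not Spark's scheduler, which is considerably more involved; `pick_executor` and the `free_at` timings are hypothetical and only capture the idea of waiting a bounded time for the preferred host before settling for a remote one.

```python
def pick_executor(preferred_host, free_at, now, locality_wait):
    """Toy model of locality-aware task assignment.

    free_at maps host -> time at which an executor slot on that host opens.
    Returns (host, is_node_local)."""
    deadline = now + locality_wait
    if free_at.get(preferred_host, float("inf")) <= deadline:
        return preferred_host, True
    # Give up on locality: take whichever executor frees up first.
    fallback = min(free_at, key=free_at.get)
    return fallback, False

free_at = {"Metropolis": 2.0, "Gotham": 0.0, "Coast City": 5.0}
print(pick_executor("Metropolis", free_at, now=0.0, locality_wait=3.0))
print(pick_executor("Metropolis", free_at, now=0.0, locality_wait=1.0))
```

With the longer wait the task stays on Metropolis; with the shorter one it goes to Gotham and its rows have to cross the network.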
22. The Spark Executor Uses the Java Driver to Pull Rows from the Local Cassandra Instance. On the executor the task is transformed into CQL queries, which are executed via the Java Driver. Example for tokens 780 - 830 on Metropolis (Spark executor Superman): SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
23. The Spark Executor Uses the Java Driver to Pull Rows from the Local Cassandra Instance. The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time. Same query as on the previous slide: SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
24. The Spark Executor Uses the Java Driver to Pull Rows from the Local Cassandra Instance. Because we are using CQL, we can also push down predicates that C* can handle: SELECT * FROM keyspace.table WHERE (Pushed Down Clauses) AND token(pk) > 780 AND token(pk) <= 830
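
The query shape on slides 22-24 is regular enough to generate. The helper below is a minimal sketch, assuming plain string assembly for readability (real code would use bound parameters); `token_range_query` is a hypothetical name, and `color` is a hypothetical column that Cassandra is assumed to be able to filter on, for example through a secondary index.

```python
def token_range_query(keyspace, table, partition_key, start, end, pushed_down=()):
    """Build the CQL for one token sub-range, with optional pushed-down
    predicates placed before the token bounds as on slide 24."""
    clauses = list(pushed_down)
    clauses.append(f"token({partition_key}) > {start}")
    clauses.append(f"token({partition_key}) <= {end}")
    return f"SELECT * FROM {keyspace}.{table} WHERE " + " AND ".join(clauses)

print(token_range_query("keyspace", "table", "pk", 780, 830))
print(token_range_query("keyspace", "table", "pk", 780, 830,
                        pushed_down=["color = 'Red'"]))
```

Pushing the predicate into the query means the filtering happens inside Cassandra, so rows that do not match never reach the Spark executor.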
25. Loading Sizable But Defined Amounts of Data: retrieving sets of partition keys can be done in parallel.
26-27. joinWithCassandraTable Provides an Interface for Obtaining a Set of C* Partitions. Generic RDDs can be joined, but the Spark tasks will not be node local. Diagram: a generic RDD alongside the nodes Metropolis (Superman), Gotham (Batman), and Coast City (Green L.).
28. repartitionByCassandraReplica Repartitions RDDs to Be C* Local. It turns a generic RDD into a CassandraPartitionedRDD. This operation requires a shuffle.
29. joinWithCassandraTable on CassandraPartitionedRDDs (or CassandraTableScanRDDs) Will Be Node Local. CassandraPartitionedRDDs are partitioned to be executed node local. Diagram: a CassandraPartitionedRDD alongside the nodes Metropolis (Superman), Gotham (Batman), and Coast City (Green L.).
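
The idea behind repartitionByCassandraReplica on slides 28-29 is to group keys by the node that owns them before looking them up. The sketch below models only that grouping step in plain Python; it does not call the connector. The ring layout repeats the assumption that Gotham owns 350 - 749 and Coast City owns 100 - 349, and every key and token other than Jacek at 830 is made up for the example.

```python
from collections import defaultdict

# Toy ring from the slides; the Gotham / Coast City assignment is assumed.
RING = [
    (750, 99, "Metropolis"),   # wrap-around range
    (350, 749, "Gotham"),
    (100, 349, "Coast City"),
]

def owner_of(token):
    """Return the node primarily responsible for a token."""
    for first, last, node in RING:
        if first <= last:
            if first <= token <= last:
                return node
        elif token >= first or token <= last:
            return node
    raise ValueError(f"token {token} is outside the ring")

def group_by_replica(tokens_by_key):
    """Group partition keys by the node that owns their token, so that
    each group can be looked up by a task running on that node."""
    groups = defaultdict(list)
    for key, token in tokens_by_key.items():
        groups[owner_of(token)].append(key)
    return dict(groups)

# Hypothetical keys with hypothetical tokens (only Jacek's 830 is from the slides).
keys = {"Jacek": 830, "Alice": 120, "Bob": 400, "Carol": 20}
print(group_by_replica(keys))
```

In a real cluster this regrouping moves keys between executors, which is why slide 28 notes that the operation requires a shuffle.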
30. The Spark Executor Uses the Java Driver to Pull Rows from the Local Cassandra Instance. The C* Java Driver pages spark.cassandra.input.page.row.size CQL rows at a time. Query shown on Metropolis (Spark executor Superman): SELECT * FROM keyspace.table WHERE pk =
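
Paging, as mentioned on slides 23 and 30, means the executor never has to hold a whole result set at once. The generator below is a minimal sketch of that behavior over any iterable of rows; `fetch_in_pages` is a hypothetical helper, not part of the Java Driver, and `page_row_size` simply plays the role of spark.cassandra.input.page.row.size.

```python
from itertools import islice

def fetch_in_pages(rows, page_row_size):
    """Yield lists of at most `page_row_size` rows until `rows` is exhausted."""
    it = iter(rows)
    while True:
        page = list(islice(it, page_row_size))
        if not page:
            return
        yield page

pages = list(fetch_in_pages(range(7), page_row_size=3))
print(pages)  # [[0, 1, 2], [3, 4, 5], [6]]
```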
31. DataStax Enterprise Comes Bundled with Spark and the Connector. DataStax delivers Apache Cassandra in a database platform, together with Apache Spark and Apache Solr.
32. DataStax Enterprise Enables This Same Machinery with Solr Pushdown. Example for tokens 780 - 830 on Metropolis (Spark executor Superman): SELECT * FROM keyspace.table WHERE solr_query = 'title:b' AND token(pk) > 780 AND token(pk) <= 830
33. Learn More Online and at Cassandra Summit: https://academy.datastax.com/
