Spark + Cassandra

Running Spark in the cloud

Running Spark on premises

Spark + Cassandra in different clusters

Spark + Cassandra in the same cluster

Spark with Cassandra. Source: https://opencredo.com/blogs/deploy-spark-apache-cassandra/
val df = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()
import org.apache.spark.sql.cassandra._

val df = spark
  .read
  .cassandraFormat("words", "test")
  .load()
df.write
  .cassandraFormat("words_copy", "test")
  .mode("append")
  .save()

Cassandra + Spark high performance cluster

Wide vs Narrow Dependencies. Source: https://medium.com/@dvcanton/wide-and-narrow-dependencies-in-apache-spark-21acf2faf031
Source: https://luminousmen.com/

Filter Early
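A minimal sketch of the idea, assuming hypothetical events and users DataFrames: apply predicates and column pruning before the join so less data is shuffled across the cluster.

import org.apache.spark.sql.functions.col

// Filter and drop unused columns first; only the survivors cross the shuffle.
val filtered = events
  .filter(col("country") === "US")
  .select("user_id", "amount")

val joined = filtered.join(users, Seq("user_id"))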

Spark Partitions and Spark Joins

Spark partitions. Source: https://luminousmen.com/post/spark-tips-partition-tuning
Sort Merge Join. Source: https://towardsdatascience.com/the-art-of-joining-in-spark-dcbd33d693c
Sort merge join when exchange is required. Source: https://towardsdatascience.com/should-i-repartition-836f7842298c

import org.apache.spark.sql.functions.{broadcast, col}

val df = fact_table.join(broadcast(dimension_table), fact_table.col("dimension_id") === dimension_table.col("id"))
Broadcast Join. Source: https://towardsdatascience.com/the-art-of-joining-in-spark-dcbd33d693c
// Pre-partition both sides on the join key so the join avoids an extra exchange
df.repartition(100, col("id")).join(..., "id")

// Number of partitions used when shuffling data for joins or aggregations
spark.sql.shuffle.partitions // default 200

// Partition output files on disk by a column value
df.write.partitionBy("key").json("/path...")
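Putting these together, a hedged sketch assuming hypothetical ordersDf/customersDf DataFrames (the 300-partition figure is illustrative, not a recommendation):

import org.apache.spark.sql.functions.col

// Match shuffle parallelism to the job before the heavy join.
spark.conf.set("spark.sql.shuffle.partitions", "300")

// Repartition both sides by the join key so the sort-merge join reuses the layout.
val left   = ordersDf.repartition(300, col("customer_id"))
val right  = customersDf.repartition(300, col("customer_id"))
val joined = left.join(right, Seq("customer_id"))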

Set the right number of executors, cores and memory

spark.executor.cores // number of cores per executor
# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
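The same sizing can also be set programmatically; a sketch with placeholder figures (spark.executor.instances applies to YARN, while standalone mode caps cores via spark.cores.max):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "20g")
  .set("spark.executor.instances", "10") // YARN only; standalone uses spark.cores.max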

Broadcast Variables

// Returns a Broadcast[T] handle, not a DataFrame
val broadcastData = spark.sparkContext.broadcast(data)
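A fuller sketch, assuming a small in-driver lookup map and a hypothetical codesRdd: the map is shipped to each executor once instead of being re-serialized with every task.

val countryNames = Map("US" -> "United States", "DE" -> "Germany")
val bcNames = spark.sparkContext.broadcast(countryNames)

// Tasks read the shared copy through .value rather than capturing the map in the closure.
val resolved = codesRdd.map(code => bcNames.value.getOrElse(code, "unknown"))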

Spark Serialization

Kryo Performance vs Java. Source: https://docs.mulesoft.com/mule-runtime/3.9/improving-performance-with-the-kryo-serializer
// Plain Kryo usage in Java: registering a class ahead of time lets Kryo
// write a small numeric ID instead of the full class name.
Kryo kryo = new Kryo();
kryo.register(SomeClass.class);
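In Spark itself, Kryo is enabled through configuration rather than by instantiating Kryo directly; a sketch, with SomeClass standing in for your own types:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front keeps Kryo from writing full class names per object.
  .registerKryoClasses(Array(classOf[SomeClass]))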

Spark Catalyst

Catalyst workflow
spark.sql.autoBroadcastJoinThreshold // default 10 MB
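For example, to let a dimension table of up to ~50 MB be broadcast (the value is in bytes; setting it to -1 disables automatic broadcast joins):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)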

Project Tungsten

Spark APIs comparison. Source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Other Spark Optimizations

Spark and Cassandra Partitions

joinWithCassandraTable()
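A minimal sketch against the words table from the earlier examples (the sample keys are illustrative): joinWithCassandraTable fetches only the Cassandra rows whose partition keys appear in the RDD, instead of scanning the whole table.

import com.datastax.spark.connector._

// Tuple1 maps each value onto the table's partition key column.
val keys = spark.sparkContext.parallelize(Seq("cat", "dog")).map(Tuple1(_))
val joined = keys.joinWithCassandraTable("test", "words")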

Write settings
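A few of the connector's write-side knobs as a hedged sketch; the names follow the Spark Cassandra Connector 2.x reference, and the values are placeholders, not recommendations:

spark.conf.set("spark.cassandra.output.batch.size.rows", "100")      // rows per batch
spark.conf.set("spark.cassandra.output.concurrent.writes", "10")     // batches in flight per task
spark.conf.set("spark.cassandra.output.throughput_mb_per_sec", "50") // throttle write bandwidth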

Read settings
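And the read side, again hedged (parameter names vary slightly across connector versions):

spark.conf.set("spark.cassandra.input.split.size_in_mb", "64")     // target size of each Spark partition
spark.conf.set("spark.cassandra.input.fetch.size_in_rows", "1000") // rows fetched per round trip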

Performance Tests

Spark Tuning. Source: https://luminousmen.com/post/spark-tips-dataframe-api