Integrating Apache Cassandra with Apache Spark is one of the emerging trends in the big data world today, and together the two products offer several advantages. Much has already been said about Cassandra and Spark integration, and there are several enterprise-grade offerings in the marketplace today.
This article aims to provide a few modeling suggestions for when you need to join two or more Cassandra tables using Spark. The ability to join Cassandra tables with Spark gives you several data modeling advantages: simpler ETL/ELT processes, a way to balance data redundancy against query flexibility, and data analysis using the Spark DataFrame API and Spark SQL. A minimal sketch of such a join follows.
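As a rough sketch (the keyspace store and the tables orders and customers are hypothetical, and it assumes the spark-cassandra-connector is on the classpath), joining two Cassandra tables with the DataFrame API can look like this:

```scala
import org.apache.spark.sql.SparkSession

object CassandraJoinSketch {
  def main(args: Array[String]): Unit = {
    // Assumes spark.cassandra.connection.host points at your cluster
    val spark = SparkSession.builder()
      .appName("cassandra-join-sketch")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Helper to load a Cassandra table as a DataFrame;
    // keyspace and table names below are placeholders
    def cassandraTable(keyspace: String, table: String) =
      spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> keyspace, "table" -> table))
        .load()

    val orders    = cassandraTable("store", "orders")
    val customers = cassandraTable("store", "customers")

    // Spark performs the join in its own engine; Cassandra itself
    // never joins tables
    val orderDetails = orders.join(customers, Seq("customer_id"))
    orderDetails.show(20)

    spark.stop()
  }
}
```

The join column, customer_id, is only an example; the point is that the join logic lives entirely in Spark, so the Cassandra tables can stay in their OLTP-friendly shapes.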
Apache Spark is a distributed processing framework with a SQL engine that can join several data sources, such as Hadoop files, Hive, Cassandra, JDBC/ODBC sources, and others; the list keeps growing.
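To illustrate the cross-source angle, here is another hedged sketch, not a complete program: it registers one Cassandra table and one JDBC table as temporary views and joins them with Spark SQL. The connection details, keyspace, table, and column names are all placeholders.

```scala
// Assumes an existing SparkSession named `spark`, plus the
// spark-cassandra-connector and a JDBC driver on the classpath.

// Hypothetical Cassandra table registered as a temporary view
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "store", "table" -> "orders"))
  .load()
  .createOrReplaceTempView("orders")

// Hypothetical relational table pulled in over JDBC
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/sales")
  .option("dbtable", "customers")
  .option("user", "etl_user")
  .option("password", "secret")
  .load()
  .createOrReplaceTempView("customers")

// Spark SQL joins across the two source systems
val report = spark.sql(
  """SELECT c.customer_name, o.order_id, o.order_total
    |FROM orders o
    |JOIN customers c ON o.customer_id = c.customer_id""".stripMargin)
report.show()
```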
Cassandra is a popular NoSQL database widely used in OLTP applications. It exposes a CQL interface that looks similar to SQL, but it is not quite the same.
While traditional relational data sources store their data as rows, Cassandra stores its data in partitions using column families. The arrangement of data inside a partition is very similar to a pivoted spreadsheet.
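To make that layout concrete, here is a small sketch under stated assumptions: the sensor_readings table and its columns are invented for illustration, the CQL schema appears only in comments, and an existing SparkSession named spark is assumed.

```scala
// Hypothetical CQL schema illustrating the wide-partition layout:
//
//   CREATE TABLE store.sensor_readings (
//     sensor_id   text,        -- partition key: all readings for one
//     reading_ts  timestamp,   -- sensor live in one partition, ordered
//     value       double,      -- by this clustering column
//     PRIMARY KEY ((sensor_id), reading_ts)
//   );
//
// Inside a partition the data resembles one "pivoted" spreadsheet row
// per sensor, with a column group per reading_ts value.

// Read into Spark, each (sensor_id, reading_ts) pair becomes an
// ordinary row, so the DataFrame view is flat again.
val readings = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "store", "table" -> "sensor_readings"))
  .load()

// A filter on the partition key is pushed down to Cassandra by the
// connector, so only the matching partition is read.
readings.filter("sensor_id = 'sensor-42'").show()
```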
Until now, the denormalized data model has been heavily emphasized because of Cassandra's inability to join tables. This is still the case for pure Cassandra-based OLTP applications. However, new options are opening up for enterprises that plan to integrate Apache Spark and Cassandra.
Since this is a heavy topic, I want to release it in multiple installments. First, here is an explanation of how Spark SQL works.
Here is the link to Cassandra modeling best practices for Spark SQL joins.