9/7/2018

Reading time: 2 min

Apache Cassandra, Part 4: Data Flow and Core Concepts

by Haris Hasan



This series of posts presents an introduction to Apache Cassandra. It discusses key Cassandra features, its core concepts, how it works under the hood, how it differs from other data stores, data modelling best practices with examples, and some tips and tricks.

Data flow in Cassandra looks like this:

https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html

Incoming data is written to a commit log as well as to an in-memory store (the memtable). Once the in-memory store is full, its data is flushed to disk as SSTables. The purpose of writing to the commit log is fault tolerance: if a node goes down while the in-memory table still holds data that has not yet been persisted, then on restart the node restores its in-memory state from the commit log and resumes normal operation.
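The write path described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how Cassandra is implemented: the class `ToyNode`, the `memtable_limit` threshold, and the in-process lists standing in for on-disk files are all hypothetical.

```python
class ToyNode:
    """Toy model of the write path: commit log + memtable + SSTables."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.commit_log = []  # stands in for the durable, append-only log
        self.memtable = {}    # in-memory store, lost on a crash
        self.sstables = []    # stands in for immutable files on disk

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. update the in-memory store
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable table, then drop the
        # commit log entries whose data is now safely on disk.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []

    def crash(self):
        # Memory is lost; the commit log and SSTables survive.
        self.memtable = {}

    def restart(self):
        # Replay the commit log to rebuild the in-memory state.
        for key, value in self.commit_log:
            self.memtable[key] = value


node = ToyNode()
node.write("A", 1)
node.write("B", 2)
node.crash()    # unflushed data disappears from memory...
node.restart()  # ...and is recovered from the commit log
print(node.memtable)  # {'A': 1, 'B': 2}
```

The point of the sketch is the ordering: every write reaches the log before it is acknowledged, so anything that was only in memory at the time of a crash can be replayed.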

Core Cassandra Concepts

Two concepts are essential to understanding Cassandra and its data modelling.

Partition Key: determines on which node in a Cassandra cluster data is going to be stored.

The Cassandra partitioner, a hash function, decides which node each piece of data is sent to, based on its partition key.

For example, say you want to store data for four cities, A, B, C and D, on a two-node Cassandra cluster, and you choose the city name as the partition key. Cassandra will create four partitions, one per unique city. With an even distribution, two partitions land on each node, although the actual placement depends on how the keys hash.
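A minimal sketch of the idea, assuming a two-node cluster with the hypothetical names `node-1` and `node-2`. It uses MD5 and a simple modulo purely as a deterministic stand-in; Cassandra's actual partitioners assign token ranges to nodes rather than taking a modulo, so treat this only as an illustration of "hash the partition key, then pick a node".

```python
import hashlib

NODES = ["node-1", "node-2"]  # hypothetical two-node cluster


def token(partition_key: str) -> int:
    # Stand-in hash: MD5 is used here only because it is deterministic.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(partition_key: str) -> str:
    # Simplification: modulo placement instead of token ranges on a ring.
    return NODES[token(partition_key) % len(NODES)]


placement = {city: node_for(city) for city in ["A", "B", "C", "D"]}
print(placement)
```

The same city always maps to the same node, which is what lets a read go straight to the right place. Because the hash decides placement, a handful of keys will not necessarily split evenly across nodes.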

Clustering Key: determines how data is sorted within a partition. Continuing the previous example, let's say each incoming data point has a city name (A, B, C or D) and an associated timestamp. You can tell Cassandra to use the timestamp as the clustering key, so that data is stored in sorted order within a partition. This enables efficient data lookup within a partition.
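To see why sorted storage makes lookups efficient, here is a small sketch of a single partition (city A) kept in timestamp order. The function names and sample readings are made up for illustration. Because the rows are sorted, a time-range query can binary-search for its boundaries instead of scanning every row.

```python
import bisect

# One partition (city "A"): rows kept sorted by the clustering key (timestamp).
partition_a = []  # list of (timestamp, reading) tuples


def insert(partition, timestamp, reading):
    # insort keeps the list ordered no matter the arrival order.
    bisect.insort(partition, (timestamp, reading))


def time_range(partition, start, end):
    # Binary search for the slice boundaries; returns start <= timestamp < end.
    lo = bisect.bisect_left(partition, (start,))
    hi = bisect.bisect_left(partition, (end,))
    return partition[lo:hi]


for ts, reading in [(30, 21.5), (10, 19.0), (20, 20.2), (40, 22.1)]:
    insert(partition_a, ts, reading)

print(time_range(partition_a, 15, 35))  # [(20, 20.2), (30, 21.5)]
```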

In short, the partition key helps Cassandra locate the node that holds specific data, while the clustering key helps it find data efficiently within a partition.

Both the partition key and the clustering key can be composite, that is, made up of more than one column.

Column Family Store

As discussed earlier, Cassandra is neither a row-based store nor a column-based store; rather, it is a column family store. But what does that mean?

One way to understand how Cassandra stores its data is this data structure:

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

Each row key is mapped to a node using the partition key, and the value is a set of key-value pairs sorted by the clustering key.

https://www.ebayinc.com/stories/blogs/tech/cassandra-data-modeling-best-practices-part-1/

In other words, Cassandra is a partitioned row store, where data is partitioned by partition key and, within a partition, sorted by clustering key.
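The `Map<RowKey, SortedMap<ColumnKey, ColumnValue>>` structure above can be approximated in a few lines. This is a rough sketch, not Cassandra's storage format: Python has no built-in sorted map, so the columns are sorted at read time, and the tuple-based composite keys are a hypothetical schema chosen for the example.

```python
# Map<RowKey, SortedMap<ColumnKey, ColumnValue>> approximated with dicts.
store = {}


def put(row_key, column_key, value):
    store.setdefault(row_key, {})[column_key] = value


def read_partition(row_key):
    # Return the columns in clustering (column key) order.
    return sorted(store.get(row_key, {}).items())


# Composite keys are modelled as tuples:
#   row key    = (city, month)
#   column key = (timestamp, metric)
put(("A", "2018-09"), (7, "temp"), 21.5)
put(("A", "2018-09"), (3, "temp"), 19.0)
put(("A", "2018-09"), (3, "humidity"), 0.4)
put(("B", "2018-09"), (1, "temp"), 25.0)

print(read_partition(("A", "2018-09")))
# [((3, 'humidity'), 0.4), ((3, 'temp'), 19.0), ((7, 'temp'), 21.5)]
```

Whatever the insertion order, reading a partition returns its columns sorted by the column key, first by timestamp and then by metric name.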

Next: Apache Cassandra, Part 5: Data Modelling with Examples
