9/7/2018

Apache Cassandra, Part 4: Data Flow and Core Concepts

by Haris Hasan

This series of posts presents an introduction to Apache Cassandra. It discusses key Cassandra features, its core concepts, how it works under the hood, how it differs from other data stores, data modelling best practices with examples, and some tips and tricks.

Data flow in Cassandra looks like this:

[Figure: Cassandra write path. Source: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html]

Incoming data is written to a commit log as well as to an in-memory store (the memtable). Once the memtable is full, its data is flushed to disk as SSTables. The purpose of writing to the commit log is fault tolerance: if a node goes down while the memtable still holds data that has not yet been persisted, the node uses the commit log on restart to restore its in-memory state and then resumes normal operation.
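The write path described above can be sketched as a toy model. This is a minimal illustration, not Cassandra's implementation: the class name `ToyNode`, the `memtable_limit` parameter, and the use of Python lists and dicts in place of on-disk files are all assumptions made for the example.

```python
class ToyNode:
    """Toy model of the write path: commit log + in-memory store + SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the durable on-disk log
        self.memtable = {}     # in-memory store, lost when the node crashes
        self.sstables = []     # immutable tables flushed to "disk"
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write durably
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed writes no longer need replaying

    def crash(self):
        self.memtable = {}    # memory is lost; commit log and SSTables survive

    def restart(self):
        # Replay the commit log to rebuild the in-memory state.
        for key, value in self.commit_log:
            self.memtable[key] = value


node = ToyNode(memtable_limit=2)
node.write("A", 1)
node.write("B", 2)   # memtable is full, so it is flushed to an SSTable
node.write("C", 3)   # held only in the commit log and the memtable
node.crash()
node.restart()
print(node.memtable)  # {'C': 3}
```

The write to "C" survives the crash only because it was appended to the commit log before the in-memory store was updated.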

Core Cassandra Concepts

Two concepts are essential to understanding Cassandra and its data modelling.

Partition Key: determines which node in a Cassandra cluster stores a given piece of data.


The Cassandra partitioner, a hash function, decides which node each piece of data is sent to, based on its partition key.

For example, say you want to store data for four cities: A, B, C and D. You have a two-node Cassandra cluster and you choose the city name as the partition key. To store this data, Cassandra will create four partitions (one per unique city), ideally two on each node, although the exact placement depends on how each key hashes.
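The city example can be sketched in a few lines. This is an illustration under stated assumptions: MD5 is used only as a stand-in hash (Cassandra's default partitioner is Murmur3-based), the node names are hypothetical, and a simple modulo replaces the token ranges a real cluster uses.

```python
import hashlib

NODES = ["node-1", "node-2"]  # hypothetical two-node cluster


def token(partition_key: str) -> int:
    """Hash the partition key to an integer token (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(partition_key: str) -> str:
    """Pick the owning node. Real clusters assign token ranges, not a modulo."""
    return NODES[token(partition_key) % len(NODES)]


# One partition per unique city; each lands on whichever node its hash selects.
placement = {city: node_for(city) for city in ["A", "B", "C", "D"]}
for city, owner in placement.items():
    print(f"partition {city} -> {owner}")
```

The same key always maps to the same node, which is what lets a read go straight to the node holding the partition. The split across nodes is decided by the hash, so it is not guaranteed to be exactly even.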

Clustering Key: determines how data is sorted within a partition. Continuing the previous example, let's say each incoming data point has a city name (A, B, C or D) and an associated timestamp. You can tell Cassandra to use the timestamp as the clustering key so that data is stored in sorted order within a partition. This enables efficient data lookup within a partition.

In short, the partition key helps Cassandra locate the node that holds specific data, while the clustering key helps it find data efficiently within a partition.

Both the partition key and the clustering key can be composite.
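To see why sorted order makes lookups efficient, here is a minimal sketch of a single partition kept sorted by timestamp. The names `partition_a`, `insert`, and `range_query` are hypothetical, and a Python list with binary search stands in for Cassandra's on-disk layout.

```python
import bisect

# One partition (city "A") holding (timestamp, reading) pairs,
# kept sorted by the clustering key (the timestamp).
partition_a = []


def insert(partition, timestamp, reading):
    """Insert a row while keeping the partition sorted by timestamp."""
    bisect.insort(partition, (timestamp, reading))


def range_query(partition, start_ts, end_ts):
    """Return rows with start_ts <= timestamp < end_ts using binary search."""
    lo = bisect.bisect_left(partition, (start_ts,))
    hi = bisect.bisect_left(partition, (end_ts,))
    return partition[lo:hi]


# Rows arrive out of order but are stored in clustering-key order.
for ts, reading in [(30, 21.5), (10, 19.0), (20, 20.2), (40, 22.1)]:
    insert(partition_a, ts, reading)

print(range_query(partition_a, 15, 35))  # [(20, 20.2), (30, 21.5)]
```

Because the rows are sorted, a time-range query needs only two binary searches and one contiguous slice rather than a scan of the whole partition.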

Column Family Store

As discussed earlier, Cassandra is neither a row-based store nor a column-based store; rather, it is a column family store. But what does that mean?

One way to understand how Cassandra stores its data is this data structure:

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

Each row key is mapped to a node using the partition key, and the value is a set of key-value pairs sorted by the clustering key.
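The nested-map structure can be approximated in Python as a rough sketch. The names `store`, `put`, and `read_partition` are hypothetical, and because plain Python dicts are not sorted, the columns are sorted at read time, which is a shortcut taken only for illustration.

```python
from collections import defaultdict

# Map<RowKey, SortedMap<ColumnKey, ColumnValue>> approximated with dicts:
# outer key = row (partition) key, inner key = column (clustering) key.
store = defaultdict(dict)


def put(row_key, column_key, value):
    """Write one column value under a row key."""
    store[row_key][column_key] = value


def read_partition(row_key):
    """Return the (column_key, value) pairs of one row in column-key order."""
    return sorted(store.get(row_key, {}).items())


put("A", 1700000200, 20.2)
put("B", 1700000100, 15.0)
put("A", 1700000100, 19.0)

print(read_partition("A"))       # [(1700000100, 19.0), (1700000200, 20.2)]
print(store["A"][1700000200])    # 20.2
```

The outer lookup corresponds to finding the partition, and the inner lookup corresponds to finding a value within it by clustering key.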

[Figure: Cassandra data layout as a map of row keys to sorted columns. Source: https://www.ebayinc.com/stories/blogs/tech/cassandra-data-modeling-best-practices-part-1/]

In other words, Cassandra is a partitioned row store, where data is partitioned by the partition key and, within a partition, sorted by the clustering key.

Next: Apache Cassandra, Part 5: Data Modelling with Examples
