9/7/2018

Apache Cassandra, Part 6: Time Series Modelling

by Haris Hasan

This series of posts presents an introduction to Apache Cassandra. It discusses key Cassandra features, its core concepts, how it works under the hood, how it is different from other data stores, data modelling best practices with examples, and some tips & tricks.

Consider a scenario where weather data arrives from various weather stations, and we have to store key-value pairs of time and temperature for each station.

Cassandra supports up to 2 billion key-value pairs (cells) per partition, so we can start with a data model like this.

https://academy.datastax.com/resources/getting-started-time-series-data-modeling

In this design the station ID is the partition key, so we can easily find the partition that holds the data from a particular station. Each station transmits data points at the same rate, so every partition holds about the same amount of data. Hence the proposed data model satisfies both of Cassandra's data modelling goals: data is spread evenly across partitions, and a query for one station reads a single partition.
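The layout described above can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not Cassandra itself: the table name `temperature`, the column names, and the helper functions are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Roughly equivalent CQL (table and column names are assumptions):
#   CREATE TABLE temperature (
#       station_id  text,
#       event_time  timestamp,
#       temperature double,
#       PRIMARY KEY (station_id, event_time)
#   );

# partition key (station_id) -> {clustering key (event_time): temperature}
partitions = defaultdict(dict)


def insert_reading(station_id, event_time, temperature):
    """Write one time/temperature pair into the station's partition."""
    partitions[station_id][event_time] = temperature


def readings_for_station(station_id):
    """Read one partition, returning its readings in ascending time order."""
    return sorted(partitions.get(station_id, {}).items())


insert_reading("station-1", datetime(2018, 7, 9, 12, 0, 1), 21.5)
insert_reading("station-1", datetime(2018, 7, 9, 12, 0, 0), 21.4)
insert_reading("station-2", datetime(2018, 7, 9, 12, 0, 0), 18.0)

for event_time, temperature in readings_for_station("station-1"):
    print(event_time.isoformat(), temperature)
```

Every reading from a station lands in the same partition, so fetching a station's data touches exactly one partition, and nothing yet limits how large that partition can grow.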

But there is a problem: if a weather station transmits a new entry every second, we will end up with huge partitions pretty soon. An improvement is to create a composite partition key from the station ID and the date.

This way we achieve manageably sized partitions and efficient date-wise access to weather data.
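As a rough sketch of the composite key, the Python below buckets readings by station and calendar day. The names `temperature_by_day`, `station_id`, and `day` are hypothetical, and the dictionary only imitates how Cassandra would group rows into partitions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Roughly equivalent CQL (table and column names are assumptions):
#   CREATE TABLE temperature_by_day (
#       station_id  text,
#       day         date,
#       event_time  timestamp,
#       temperature double,
#       PRIMARY KEY ((station_id, day), event_time)
#   );

# At one reading per second, a daily partition holds at most this many rows.
SECONDS_PER_DAY = 24 * 60 * 60

# composite partition key (station_id, day) -> {event_time: temperature}
partitions = defaultdict(dict)


def partition_key(station_id, event_time):
    """Build the composite partition key from the station and the calendar day."""
    return (station_id, event_time.date().isoformat())


def insert_reading(station_id, event_time, temperature):
    """Write one reading into the partition for its station and day."""
    partitions[partition_key(station_id, event_time)][event_time] = temperature


def readings_for_day(station_id, day):
    """Read one station's data for one day, a single-partition query."""
    return sorted(partitions.get((station_id, day), {}).items())


# Four readings that straddle midnight end up in two separate partitions.
start = datetime(2018, 7, 9, 23, 59, 58)
for i in range(4):
    insert_reading("station-1", start + timedelta(seconds=i), 20.0 + i)

print(sorted(partitions.keys()))
print(len(readings_for_day("station-1", "2018-07-09")))
```

The trade-off in this sketch is that a query spanning several days has to read one partition per day instead of one partition in total.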

In the weather domain, most queries involve the latest data. Our current model appends each new entry at the end of the partition, so to find the latest N records Cassandra has to go to the end of the partition and read the last N records. A better approach is to store the data in reverse timestamp order.

This approach reduces our read cost by keeping fresh data at the start of the partition. Older data can be deleted from the end to keep the partition size manageable.
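The effect of reverse ordering can be illustrated with a small Python model of a single partition. The table name `latest_temperatures` and the helpers `insert_reading`, `latest`, and `trim_oldest` are assumptions for this sketch; in CQL the ordering itself would be declared with `CLUSTERING ORDER BY`.

```python
from datetime import datetime, timedelta

# Roughly equivalent CQL (table and column names are assumptions):
#   CREATE TABLE latest_temperatures (
#       station_id  text,
#       event_time  timestamp,
#       temperature double,
#       PRIMARY KEY (station_id, event_time)
#   ) WITH CLUSTERING ORDER BY (event_time DESC);
#
#   SELECT * FROM latest_temperatures WHERE station_id = 'station-1' LIMIT 2;


def insert_reading(partition, event_time, temperature):
    """Add a reading and keep the partition ordered newest first."""
    partition.append((event_time, temperature))
    partition.sort(key=lambda reading: reading[0], reverse=True)


def latest(partition, n):
    """The newest n readings are simply the first n entries."""
    return partition[:n]


def trim_oldest(partition, keep):
    """Drop everything past the newest `keep` readings (the tail of the partition)."""
    del partition[keep:]


partition = []
start = datetime(2018, 7, 9, 12, 0, 0)
for i in range(5):
    insert_reading(partition, start + timedelta(seconds=i), 20.0 + i)

print(latest(partition, 2))
trim_oldest(partition, keep=3)
print(len(partition))
```

Because the newest readings sit at the front, fetching the latest N is a read from the start of the partition, and removing old data only touches the tail.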

Cassandra also provides a data type, timeuuid, that comes in handy when multiple events might arrive with the same timestamp. To avoid timestamp collisions, a timeuuid (a version 1 UUID) combines the timestamp with additional clock-sequence and node bits, so records with the same timestamp still get unique identifiers.
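To see why this matters, the sketch below compares keying readings by a plain timestamp with keying them by a version 1 UUID from Python's standard `uuid` module, which is the same UUID variant that timeuuid is based on. The dictionaries stand in for a partition, and note that `uuid.uuid1()` embeds the time of the call rather than `event_time`, so this only illustrates the uniqueness property.

```python
import uuid
from datetime import datetime

# Two events that carry exactly the same timestamp.
event_time = datetime(2018, 7, 9, 12, 0, 0)

# Keyed by timestamp alone: the second write silently replaces the first.
by_timestamp = {}
by_timestamp[event_time] = 21.4
by_timestamp[event_time] = 21.9

# Keyed by a time-based (version 1) UUID: each event gets its own identifier.
by_timeuuid = {}
first_id = uuid.uuid1()
second_id = uuid.uuid1()
by_timeuuid[first_id] = (event_time, 21.4)
by_timeuuid[second_id] = (event_time, 21.9)

print(len(by_timestamp), "row kept when keyed by timestamp")
print(len(by_timeuuid), "rows kept when keyed by time-based UUID")
print("UUID version:", first_id.version)
```

In a table, the timeuuid column would take the place of the timestamp as the clustering column, so both events survive as separate rows.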

Next: Apache Cassandra, Part 7: Secondary Index, Replication, Tips
