This series of posts presents an introduction to Apache Cassandra. It discusses key Cassandra features, its core concepts, how it works under the hood, how it differs from other data stores, data modelling best practices with examples, and some tips & tricks.
Consider a scenario where weather data comes in from various weather stations and we have to store key-value pairs of time and temperature against each station.
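Cassandra supports up to 2 billion key-value pairs (cells) in a row, which means we can keep all readings from a station in a single row: station Id identifies the row, and each reading is stored as a (time, temperature) pair sorted by time.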
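A minimal sketch of this design, assuming Python as the illustration language. The table and column names (`temperature_by_station`, `station_id`, `event_time`, `temperature`) are hypothetical, and the dictionary is only a toy stand-in for how Cassandra groups rows by partition key and keeps them sorted by clustering column.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime

# Hypothetical CQL for this design (names are assumptions, not from the post).
CREATE_TABLE_CQL = """
CREATE TABLE temperature_by_station (
    station_id  text,
    event_time  timestamp,
    temperature double,
    PRIMARY KEY (station_id, event_time)
);
"""

# Toy model: one partition per station, readings kept sorted by event_time.
partitions = defaultdict(list)

def record(station_id, event_time, temperature):
    """Insert a reading into the station's partition, preserving time order."""
    insort(partitions[station_id], (event_time, temperature))

# Readings may arrive out of order; the partition still ends up sorted.
record("station-1", datetime(2024, 1, 1, 0, 0, 1), 21.5)
record("station-1", datetime(2024, 1, 1, 0, 0, 0), 21.4)
record("station-2", datetime(2024, 1, 1, 0, 0, 0), 18.0)
```

Looking up `partitions["station-1"]` touches a single partition, which mirrors the "find one partition by station" access pattern.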
In this design, station Id is the partition key, so we can easily find the partition containing data from a particular station. Each station transmits data points at the same rate, so each partition holds roughly the same amount of data. Hence the proposed data model satisfies both of Cassandra’s data modelling goals.
But there is a problem: if a weather station transmits a new entry every second, we will end up with huge partitions pretty soon. An improvement could be to create a composite partition key from station Id and date, so that each station starts a new partition every day.
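The composite key can be sketched as follows. The names (`temperature_by_station_day`, `reading_date`) are hypothetical, and the dictionary again only imitates partitioning. At one reading per second, a daily partition is capped at 24 × 60 × 60 = 86,400 readings.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical CQL: the extra parentheses make (station_id, reading_date)
# the composite partition key; event_time remains the clustering column.
CREATE_TABLE_CQL = """
CREATE TABLE temperature_by_station_day (
    station_id   text,
    reading_date date,
    event_time   timestamp,
    temperature  double,
    PRIMARY KEY ((station_id, reading_date), event_time)
);
"""

# Toy model: one partition per (station, day) pair.
partitions = defaultdict(list)

def record(station_id, event_time, temperature):
    """Route a reading to the partition for its station and calendar day."""
    key = (station_id, event_time.date())
    partitions[key].append((event_time, temperature))

# Simulate two full days of one reading per second from a single station.
start = datetime(2024, 1, 1)
for second in range(2 * 24 * 60 * 60):
    record("station-1", start + timedelta(seconds=second), 20.0)

# Number of readings held by each partition.
sizes = {key: len(rows) for key, rows in partitions.items()}
```

Each day lands in its own partition, so partition size stays bounded no matter how long the station keeps transmitting.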
This way we achieve manageably sized partitions and efficient date-wise access to weather data.
In the weather domain, most queries involve the latest data. Our current model appends each new entry at the end of the partition, so to find the latest N records Cassandra has to go to the end of the partition and read the last N records. A better approach could be to store the data in reverse timestamp order, newest first.
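A rough sketch of the newest-first layout. The table name is hypothetical; `WITH CLUSTERING ORDER BY (event_time DESC)` is the CQL clause that asks for descending order within a partition. The Python list models a single partition and is re-sorted on every write only to keep the example short.

```python
from datetime import datetime, timedelta

# Hypothetical CQL: same composite key, but rows are stored newest first.
CREATE_TABLE_CQL = """
CREATE TABLE latest_temperature_by_station_day (
    station_id   text,
    reading_date date,
    event_time   timestamp,
    temperature  double,
    PRIMARY KEY ((station_id, reading_date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
"""

# Toy model of one partition, kept in descending event_time order.
rows = []

def record(event_time, temperature):
    """Add a reading and keep the partition sorted newest first."""
    rows.append((event_time, temperature))
    rows.sort(key=lambda row: row[0], reverse=True)

def latest(n):
    """The latest n readings are simply the first n rows of the partition."""
    return rows[:n]

start = datetime(2024, 1, 1)
for i in range(5):
    record(start + timedelta(seconds=i), 20.0 + i)

newest_two = latest(2)
```

With this ordering, a "latest N" query reads from the front of the partition instead of its end.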
This approach reduces our read cost by keeping fresh data at the start of the partition. Older data can be deleted from the end to keep the partition size manageable.
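Cassandra also provides a data type, timeuuid, that comes in handy when multiple events might arrive with the same timestamp. A timeuuid is a version 1 UUID: it combines the timestamp with extra identifying bits (a clock sequence and a node identifier), so two records with the same timestamp still get distinct values and do not overwrite each other.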
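To get a feel for how such identifiers behave, here is a small sketch using Python's standard `uuid.uuid1()`, which generates version 1 (time-based) UUIDs. It is only an illustration of the idea; it is an assumption of this example, not a statement about how Cassandra generates timeuuid values internally.

```python
import uuid

# Generate many time-based UUIDs in a tight loop, where timestamps
# taken from the clock alone could easily repeat.
ids = [uuid.uuid1() for _ in range(1000)]

# Every identifier is a version 1 UUID ...
all_version_1 = all(u.version == 1 for u in ids)

# ... and all of them are distinct, even when generated back to back.
all_unique = len(set(ids)) == len(ids)

# The embedded timestamp (100-nanosecond intervals since 1582-10-15)
# is still recoverable from each identifier.
first_timestamp = ids[0].time
```

Used as a clustering column, such a value keeps time-based ordering while letting simultaneous events coexist in the same partition.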
Next: Apache Cassandra, Part 7: Secondary Index, Replication, Tips