CQL, the Cassandra Query Language, gives an almost SQL-like interface to Apache Cassandra. I have often found that many who use it do not know some important points about Cassandra that make it different from SQL databases like Postgres. The same goes for operations teams: there are aspects related to storage and GC settings that many are not aware of. I am not an expert in Cassandra internals and don’t aspire to be if I can avoid it. This is mostly a note to myself, and something I can ask others to refer to instead of repeating it over email a gazillion times. There are a lot of other topics, like repair, which I have left out. The intention here is to keep this as short as possible, but if you feel something should be added, please comment.
Cassandra has Tunable Consistency, not just eventual consistency
Many who consider Cassandra as a replacement for a SQL database like Postgres, MySQL or Oracle shy away, thinking that the eventual consistency of NoSQL does not meet their requirements. In Cassandra, however, consistency is configurable. This means that, at some cost in write and read speed, you can have strong consistency as well as high availability. Cassandra can be used for small data as well as big data; depending on your use case you can set the replication factor per keyspace and tune the consistency level per operation.
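As a rough illustration of what this tuning means, the sketch below checks the commonly cited rule that a read is guaranteed to see the latest write when the read and write replica counts overlap, that is R + W > RF. The function names are my own and only a few consistency levels are modelled; this is plain arithmetic, not a call into any Cassandra driver.

```python
# Minimal sketch: which write/read consistency level pairs overlap enough
# to give strong consistency. Helper names are hypothetical, and only the
# ONE, TWO, QUORUM and ALL levels are modelled.

def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation at this level."""
    if level == "ONE":
        return 1
    if level == "TWO":
        return 2
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of the replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def is_strongly_consistent(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3
    pairs = [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]
    for write_level, read_level in pairs:
        verdict = is_strongly_consistent(write_level, read_level, rf)
        print(f"RF={rf} write={write_level} read={read_level} strong={verdict}")
```

With a replication factor of 3, writing and reading at QUORUM touches 2 + 2 replicas, so at least one replica in every read has the latest write. Writing and reading at ONE is faster but gives no such overlap.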
At this point it may be a good idea to have a short recap of the CAP theorem, as there is a lot of confusion in translating the theory to the practical world.
In 2000, Dr. Eric Brewer gave a keynote at the Proceedings of the Annual ACM Symposium on Principles of Distributed Computing in which he laid out his famous CAP Theorem: a shared-data system can have at most two of the three following properties: Consistency, Availability, and tolerance to network Partitions.
This applies to any distributed database, not just Cassandra. So can Cassandra provide C and A but not P? Is that a big problem?
Short answer: it is not. Skip the rest of the section if you are in a hurry.
Long answer: read on. Here is an excerpt from the Cassandra docs (DataStax’s docs).
… You can tune Cassandra’s consistency level per-operation, or set it globally for a cluster or datacenter. You can vary the consistency for individual read or write operations so that the data returned is more or less consistent, as required by the client application. This allows you to make Cassandra act more like a CP (consistent and partition tolerant) or AP (highly available and partition tolerant) system according to the CAP theorem, depending on the application requirements.
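To make the CP versus AP trade-off in the excerpt a little more concrete, here is a small sketch of whether a request can be served when some replicas are unreachable. The table of levels and the `can_serve` helper are assumptions of mine for illustration; a real cluster also involves timeouts, hinted handoff and datacenter-aware levels, which are ignored here.

```python
# Minimal sketch: does a request at a given consistency level succeed
# when only some replicas of the data are reachable? Names are hypothetical.

REQUIRED = {
    "ONE": lambda rf: 1,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def can_serve(level: str, replication_factor: int, live_replicas: int) -> bool:
    """A request succeeds only if enough replicas are alive to acknowledge it."""
    return live_replicas >= REQUIRED[level](replication_factor)


if __name__ == "__main__":
    rf = 3
    for live in (3, 2, 1):
        outcome = {level: can_serve(level, rf, live) for level in REQUIRED}
        print(f"live replicas={live}: {outcome}")
```

With a replication factor of 3 and only one replica reachable, a request at ONE is still answered (leaning AP), while a request at QUORUM is refused rather than risking a stale answer (leaning CP).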
Here is an excerpt from the article linked from DataStax’s Cassandra documentation page.
Of the CAP theorem’s Consistency, Availability, and Partition Tolerance, Partition Tolerance is mandatory in distributed systems. You cannot not choose it. Instead of CAP, you should think about your availability in terms of yield (percent of requests answered successfully) and harvest (percent of required data actually included in the responses) and which of these two your system will sacrifice when failures happen. -https://codahale.com/you-cant-sacrifice-partition-tolerance/
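The two measures in the excerpt are simple ratios. The sketch below works them out for made-up numbers; the function names and the sample figures are mine, not from the article.

```python
# Minimal sketch of the yield and harvest ratios described in the excerpt.
# All figures below are invented purely for illustration.

def yield_ratio(answered_requests: int, total_requests: int) -> float:
    """Fraction of requests that were answered successfully."""
    return answered_requests / total_requests


def harvest_ratio(rows_returned: int, rows_required: int) -> float:
    """Fraction of the required data actually included in a response."""
    return rows_returned / rows_required


if __name__ == "__main__":
    # Say 990 of 1000 requests were answered during a failure window,
    # and one response contained 80 of the 100 rows it should have had.
    print(f"yield   = {yield_ratio(990, 1000):.2%}")
    print(f"harvest = {harvest_ratio(80, 100):.2%}")
```

A system that rejects requests it cannot answer fully gives up yield; one that answers with whatever data it can reach gives up harvest.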
What the above article explains in depth is that Availability is tied to Network Partitioning, or Partition Tolerance. Worst-case network partitions are quite rare inside a data center network. At the same time, network partitions cannot be prevented from happening; they are ever present, though mostly transient and intermittent. The risk of a partition spanning enough nodes to disrupt Availability for a multi-node cluster is very low.
So with Cassandra you can have as good a C and A system as practically possible.
Give Importance to modelling the Partition key
If there is only one thing that you should read, maybe it is the link below.
1. Spread data evenly around the cluster: model the Partition Key
2. Minimize the number of partitions read: model the Partition Key and Clustering Keys
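The first rule can be illustrated with a toy simulation. Cassandra hashes the partition key to decide which nodes own a row; the sketch below imitates that with an MD5 hash taken modulo the node count. That is a deliberate simplification of mine: real clusters use token ranges, virtual nodes and replication, and the default partitioner is Murmur3, not MD5. The key names are invented.

```python
# Toy simulation: how the choice of partition key spreads rows over nodes.
# hash-modulo placement is a stand-in for real token ownership (assumption).
import hashlib
from collections import Counter


def node_for(partition_key: str, node_count: int) -> int:
    """Map a partition key to a node index by hashing it."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count: int) -> Counter:
    """Count how many rows land on each node for the given partition keys."""
    return Counter(node_for(key, node_count) for key in partition_keys)


if __name__ == "__main__":
    nodes = 4
    # High-cardinality key: every event id is its own partition.
    by_event = rows_per_node((f"event-{i}" for i in range(10_000)), nodes)
    # Low-cardinality key: every row shares one of two country codes.
    by_country = rows_per_node((("IN", "US")[i % 2] for i in range(10_000)), nodes)
    print("partitioned by event id:", sorted(by_event.items()))
    print("partitioned by country: ", sorted(by_country.items()))
```

With a high-cardinality key the rows end up roughly evenly divided among the nodes. With only two distinct key values, all rows pile up on at most two nodes, however many nodes the cluster has.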
Let us take an example. Below is an initial modelling of a table where the data is a set of events (say political rallies, speeches, etc.) that have occurred in a particular location, centered on a latitude and longitude with a radius of, say, 250 meters. Each location has an influential candidate for that area. Sometimes the same area can have multiple influential candidates. I have chosen a slightly complex example to show the flexibility of the data types available in Cassandra and all the aspects to consider when modelling the key. The actual cell could be a telecom cell with overlapping coverage by different operators or different technologies. The example I give here is a bit contrived and for illustration only.