Here are some key terms to know in order to understand the write path:
Partition Key — each table has a partition key. It determines which node in the cluster stores the data.
Commit Log — the transactional log. It is used for recovery after system failures. It is an append-only file and provides durability.
Memtable — an in-memory cache that holds a copy of recently written data. Each node has a memtable for each CQL table. The memtable accumulates writes and serves reads for data that has not yet been flushed to disk.
SSTable — the final destination of data in C*. SSTables are actual files on disk and are immutable.
Compaction — the periodic process of merging multiple SSTables into a single SSTable. It is done primarily to optimize read operations.
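To make the partition key idea concrete, here is a minimal sketch of how a key could be mapped to a node. It assumes a toy ring of three nodes with evenly spaced tokens, and it hashes with MD5 only because that ships with Python. The names `token_for`, `TokenRing` and the node names are made up for illustration and are not Cassandra APIs.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # toy token space (assumption); real token ranges are far larger


def token_for(partition_key):
    """Hash a partition key to a token on the toy ring."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class TokenRing:
    def __init__(self, node_tokens):
        # Sort the nodes by the token each one owns.
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner(self, partition_key):
        """Return the first node whose token is >= the key's token, wrapping around."""
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._entries[idx % len(self._entries)][1]


ring = TokenRing({
    "node-a": 0,
    "node-b": RING_SIZE // 3,
    "node-c": 2 * RING_SIZE // 3,
})
print(ring.owner("customer-5"))
```

The same key always hashes to the same token, so every coordinator agrees on which node owns it without any central lookup.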
How is data written?
- Cassandra appends writes to the commit log on disk. The commit log receives every write made to a Cassandra node and these durable writes survive permanently even if power fails on a node.
- Cassandra also stores the data in an in-memory structure called the memtable. Together with the commit log, this provides configurable durability. The memtable is a write-back cache of data partitions that Cassandra looks up by key.
- The memtable stores writes in sorted order until it reaches a configurable limit.
- The memtable is then flushed to a sorted strings table called an SSTable. Writes are atomic at the row level, meaning that either all columns are written or updated, or none are.
Side note — Atomicity is one of the ACID (Atomicity, Consistency, Isolation, Durability) transaction properties. Atomicity means that an operation cannot be split into smaller parts: it either happens in full or not at all.
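The write steps above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, the `memtable_limit` parameter (counted in keys rather than megabytes) and the list standing in for the commit log file are all assumptions made for illustration.

```python
class Node:
    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stand-in for the append-only file on disk
        self.memtable = {}     # in-memory writes, keyed by partition key
        self.sstables = []     # each flush appends one immutable SSTable
        self.memtable_limit = memtable_limit

    def write(self, key, columns, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, dict(columns), timestamp))
        # 2. Apply the write to the memtable.
        self.memtable[key] = (timestamp, dict(columns))
        # 3. Flush once the memtable reaches its configured limit.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # An SSTable is sorted by key and never modified after it is written.
        sstable = tuple(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.sstables.append(sstable)
        self.memtable = {}
        # In this sketch, flushed data no longer needs the log for recovery.
        self.commit_log.clear()


node = Node(memtable_limit=2)
node.write("b", {"total_purchases": 10}, timestamp=1)
node.write("a", {"total_purchases": 20}, timestamp=2)  # reaches the limit, triggers a flush
print(len(node.sstables), [key for key, _ in node.sstables[0]])
```

Calling `node.flush()` directly plays the role of an explicit flush request in this model.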
When does Cassandra flush the memtable to disk (SSTable)?
Cassandra flushes the memtable to disk (SSTable) when any of the following events occurs:
- When the memtable contents exceed a configurable threshold, controlled by memtable_total_space_in_mb
- When the commit log is full, as controlled by commitlog_total_space_in_mb
- When the nodetool flush command is issued
Each flush results in a new SSTable. If we end up with a huge number of SSTables over time, read performance can suffer, because we might have to walk through multiple SSTables to read a single record. So how do we fix this?
Cassandra periodically performs a background operation called compaction to merge the SSTables in order to optimize the read operation.
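As a rough illustration of what merging SSTables means, the sketch below combines several SSTables and keeps only the newest version of each key. It assumes each SSTable is a tuple of `(key, (timestamp, value))` pairs; the `compact` function is a hypothetical helper, and real compaction strategies do considerably more than this.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest version of each key."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable:
            # Keep this version only if it is newer than what we have so far.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # The result is again sorted by key, like any other SSTable.
    return tuple(sorted(merged.items(), key=lambda kv: kv[0]))


older = (("a", (1, "v1")), ("b", (1, "v1")))
newer = (("a", (2, "v2")), ("c", (2, "v1")))
compacted = compact([older, newer])
print(compacted)
```

After compaction a read for key "a" touches one SSTable instead of two, which is where the read speed-up comes from.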
On occasion, a node becomes unresponsive while data is being written due to hardware problems, network issues or overloaded nodes.
Hinted Handoff is an optional part of writes whose primary purpose is to provide extreme write availability when consistency is not required. It can reduce the time required for a temporarily failed node to become consistent again with live ones.
How does it work?
If a write is made and a replica node for the row is known to be down, C* will write a hint to a live replica node, letting it know that the write needs to be replayed to the unavailable node. If no replica nodes are alive for this row key, the coordinating node will write the hint locally. Once a node holding a hint discovers via Gossip that the target node has recovered, it sends the data row corresponding to each hint to that target.
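Here is a minimal sketch of that flow, assuming an in-process model where replicas are plain objects and an `up` flag stands in for Gossip. `Replica`, `coordinate_write` and `replay_hints` are hypothetical names, and the case where no replica is alive (the coordinator keeps the hint itself) is left out to keep the example short.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target replica, key, value) for writes to replay later


def coordinate_write(replicas, key, value):
    """Write to live replicas and store a hint for each replica that is down."""
    live = [r for r in replicas if r.up]
    for replica in live:
        replica.data[key] = value
    if live:
        for target in replicas:
            if not target.up:
                # The first live replica keeps the hint on behalf of the target.
                live[0].hints.append((target, key, value))
    return len(live)  # number of replicas that acknowledged the write


def replay_hints(holder):
    """Deliver hints whose target has come back up and keep the rest."""
    remaining = []
    for target, key, value in holder.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    holder.hints = remaining


a, b, c = Replica("a"), Replica("b"), Replica("c")
c.up = False
acks = coordinate_write([a, b, c], "customer-5", 5000)
c.up = True          # the node recovers
replay_hints(a)
print(acks, c.data)
```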
Cassandra supports the following write consistency levels:
- ANY — a write must succeed on any available node
- ONE — a write must succeed on any node responsible for that row (either primary or replica)
- QUORUM — a write must succeed on a quorum of replica nodes (replication_factor / 2 + 1, rounded down)
- LOCAL_QUORUM — a write must succeed on a quorum of replica nodes in the same data center as the coordinator node
- EACH_QUORUM — a write must succeed on a quorum of replica nodes in all data centers
- ALL — a write must succeed on all replica nodes for a row key
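The quorum formula is easy to get wrong for even replication factors, so here is a small worked sketch. It assumes a single data center and integer division; `quorum` and `required_acks` are illustrative helpers, not part of any driver.

```python
def quorum(replication_factor):
    """Quorum size: replication_factor / 2 + 1, using integer division."""
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Replica acknowledgements needed for a write in a single data center."""
    if level in ("ANY", "ONE"):
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError("level not covered by this sketch: " + level)


for rf in (3, 5, 6):
    print(rf, quorum(rf))
```

With a replication factor of 3, a QUORUM write can therefore tolerate one replica being down, while an ALL write cannot tolerate any.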
The read path is a little more complicated than the write path. In addition to the write path key terms, here are some additional terms that are key to understanding the read path.
- Row Cache — a memory cache which stores recently read rows (records). It’s an optional component in C*.
- Bloom Filters — help determine whether a partition key may exist in the corresponding SSTable.
- (Partition) Key Cache — maps recently read partition keys to specific SSTable offsets.
- Partition Indexes — sorted partition keys mapped to their SSTable offsets. Partition Indexes are created as part of SSTable creation and reside on disk.
- Partition Summaries — an off-heap, in-memory sampling of the Partition Indexes, used to speed up access to the index on disk.
- Compression Offsets — keep the offset mapping information for compressed blocks. By default all tables in C* are compressed, and when C* needs to read data, it looks into the in-memory compression offset maps and unpacks the data chunks.
How does Cassandra read data?
- Cassandra first checks whether the in-memory memtable still contains the data. The memtable is an in-memory read/write cache for each column family.
- If not found, Cassandra will read all SSTables for that Column Family.
- To optimize reads:
  - Cassandra uses a bloom filter for each SSTable to determine whether that SSTable may contain the key
  - Cassandra uses the index in each SSTable to locate the data quickly
  - Cassandra compaction merges SSTables when the number of SSTables reaches a certain threshold
  - Cassandra reads are slower than writes, but still very fast
- Cassandra depends on the OS to cache SSTable files
  - Do not configure C* to use most of the physical memory
  - Some deployments configure C* to use 50% of the physical memory so the rest can be used for the file cache
  - However, memory configuration is sensitive to data access patterns and volume
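The sketch below follows the simplified read order described above: check the memtable, then consult each SSTable's bloom filter before touching its data. The bloom filter here is a deliberately tiny one built on SHA-256. The sizes, the `disk_reads` counter and all class and function names are assumptions for illustration only.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=256, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from one key by salting the hash input.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)  # key -> (timestamp, value)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.disk_reads = 0     # counts how often the data was actually read

    def get(self, key):
        self.disk_reads += 1
        return self.rows.get(key)


def read(key, memtable, sstables):
    if key in memtable:
        return memtable[key][1]
    hits = []
    for sstable in sstables:
        if not sstable.bloom.might_contain(key):
            continue  # skip SSTables that cannot hold the key
        row = sstable.get(key)
        if row is not None:
            hits.append(row)
    if not hits:
        return None
    # Several SSTables may hold the key; the newest timestamp wins.
    return max(hits, key=lambda row: row[0])[1]
```

A bloom filter can report false positives but never false negatives, so skipping an SSTable on a negative answer is always safe.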
Cassandra ensures that frequently read data remains consistent. Once a read is done, the coordinator node compares, in the background, the data from all the remaining replicas that own the row. If they are inconsistent, it issues writes to the out-of-date replicas to update the row to reflect the most recently written values.
Read repair can be configured per column family and is enabled by default.
Read repair is important because every time a read request occurs, it provides an opportunity for consistency improvement. As a background process, read repair generally puts little strain on the cluster.
How does a read repair work?
When a query is made against a given key…
- Cassandra performs a Read Repair
- Read Repair performs a digest query on all replicas for that key. A digest query asks a replica to return a hash digest value and the timestamp for the key’s data. Digest queries verify whether replicas possess the same data without sending the data over the network.
- Cassandra pushes the most recent data to any out-of-date replicas to make the queried data consistent again. The next query will therefore return consistent data.
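The read repair steps above can be sketched as follows. Replicas are modeled as plain dictionaries mapping a key to `(timestamp, value)`, every replica is assumed to hold the key, and MD5 is used for the digest purely for illustration. `digest_query` and `read_with_repair` are hypothetical names.

```python
import hashlib


def digest_query(replica, key):
    """Return (digest, timestamp) for the key, without returning the data itself."""
    timestamp, value = replica[key]
    digest = hashlib.md5(repr(value).encode("utf-8")).hexdigest()
    return digest, timestamp


def read_with_repair(replicas, key):
    digests = [digest_query(replica, key) for replica in replicas]
    if len(set(digests)) > 1:
        # The replicas disagree: pick the version with the newest timestamp ...
        newest = max(replicas, key=lambda replica: replica[key][0])[key]
        # ... and push it to every replica that holds something else.
        for replica in replicas:
            if replica[key] != newest:
                replica[key] = newest
    return replicas[0][key][1]


r1 = {"customer-5": (2, 5000)}
r2 = {"customer-5": (1, 4000)}   # out of date
r3 = {"customer-5": (2, 5000)}
print(read_with_repair([r1, r2, r3], "customer-5"), r2)
```

Comparing digests first means the full row only has to travel over the network when the replicas actually disagree.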
Cassandra supports the following read consistency levels:
- ONE — reads from the closest node holding the data
- QUORUM — returns a result from a quorum of servers with the most recent timestamp of data
- LOCAL_QUORUM — returns a result from a quorum of servers with the most recent timestamp for the data in the same data center as the coordinator node
- EACH_QUORUM — returns a result from a quorum of servers with the most recent timestamp in all data centers
- ALL — returns a result from all replica nodes for a row key
Consistency, Availability, Partition Tolerance
The CAP theorem states that it’s impossible for a distributed system to guarantee all three of these properties at the same time.
- Consistency — all nodes see the same data at the same time
- Availability — a guarantee that every request receives a response about whether it was successful or failed
- Partition tolerance — the system continues to operate despite arbitrary message loss or failure of part of the system
Cassandra is an AP system, meaning it prioritizes availability and partition tolerance. In C* you can choose between strong and eventual consistency. The choice can be made on a per-operation basis, for BOTH reads and writes. Cassandra also handles multi-data center operations.
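One common rule of thumb for telling strong from eventual consistency, which the text above does not state but which follows from the quorum idea, is that a read sees the latest write whenever the read and write replica sets must overlap, that is, when R + W > RF. A minimal sketch, with a hypothetical helper name:

```python
def is_strongly_consistent(write_acks, read_acks, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_acks + read_acks > replication_factor


rf = 3
quorum = rf // 2 + 1
print(is_strongly_consistent(quorum, quorum, rf))  # QUORUM writes, QUORUM reads
print(is_strongly_consistent(1, 1, rf))            # ONE writes, ONE reads
print(is_strongly_consistent(rf, 1, rf))           # ALL writes, ONE reads
```

Under this rule, QUORUM for both reads and writes gives strong consistency at a replication factor of 3, while ONE for both trades that away for lower latency and higher availability.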
In cqlsh, providing an argument to the CONSISTENCY command overrides the default consistency level of ONE for both reads and writes and sets the consistency level for future requests.
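In older CQL 2 releases, the consistency level could be set on a per-query basis by adding the USING CONSISTENCY <level> clause, as shown below. CQL 3 removed this clause; there, the consistency level is set through the client driver or the cqlsh CONSISTENCY command instead.
SELECT total_purchases FROM SALES USING CONSISTENCY QUORUM WHERE customer_id = 5;
UPDATE SALES USING CONSISTENCY ONE SET total_purchases = 5000 WHERE customer_id = 4;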