This series of posts presents an introduction to Apache Cassandra. It discusses key Cassandra features, its core concepts, how it works under the hood, how it differs from other data stores, data modelling best practices with examples, and some tips and tricks.
Secondary Index in Cassandra
The purpose of secondary indexes in Cassandra is not to provide fast access to data by attributes other than the partition key; rather, they are a convenience for writing queries and fetching data. A secondary index lookup takes two disk passes, one to read the secondary index file and a second to access the actual data, which makes it inefficient in terms of performance. You can read more on this here.
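The two-pass cost can be illustrated with a toy in-memory model. This is only a sketch and not how Cassandra implements indexes internally: the `table`, `city_index`, and both query functions are hypothetical names, and each dictionary access stands in for one disk read.

```python
# Toy model: every dictionary access stands in for one disk read.
# Real Cassandra keeps both structures on disk on each node.
table = {  # partition key -> row
    "u1": {"name": "Ann", "city": "Oslo"},
    "u2": {"name": "Bob", "city": "Lima"},
    "u3": {"name": "Cid", "city": "Oslo"},
}


def build_index(rows, column):
    """Map each value of `column` to the partition keys that contain it."""
    index = {}
    for key, row in rows.items():
        index.setdefault(row[column], []).append(key)
    return index


city_index = build_index(table, "city")


def query_by_city(city):
    """Index lookup: one read for the index entry, then one read per matching row."""
    reads = 1  # pass 1: read the index entry
    rows = []
    for key in city_index.get(city, []):
        rows.append(table[key])  # pass 2: fetch the actual row
        reads += 1
    return rows, reads


def query_by_key(key):
    """Partition key lookup: a single read, no index involved."""
    return table[key], 1


print(query_by_city("Oslo")[1])  # 3 reads for 2 rows
print(query_by_key("u1")[1])     # 1 read for 1 row
```

In this model the indexed query always pays one extra read on top of the row reads, which is the overhead the paragraph above describes.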
Replication in Cassandra
Cassandra supports async replication based on a specified replication factor. Consider a scenario where you have 99 nodes with a replication factor of 3. Cassandra stores each partition on three nodes: the node that owns it and two others. As a result, a query that reads all data can, in principle, find it by reading only 33 nodes, which reduces the number of nodes to contact. Replication not only plays an important role in read optimization but, more importantly, enables fault tolerance by keeping data accessible when a node in the cluster goes down.
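The 99-and-3 arithmetic above can be checked with a toy ring model. This is a simplified sketch, assuming 99 equally sized token ranges, each owned by one node and copied to the next two nodes on the ring; real placement depends on the replication strategy and token assignment, and the function names here are hypothetical.

```python
def replicas(range_id, num_nodes, rf):
    """Nodes holding a copy of a token range: its owner plus the next rf - 1 nodes."""
    return [(range_id + k) % num_nodes for k in range(rf)]


def covers_all_ranges(nodes, num_nodes, rf):
    """True if every token range has at least one replica among `nodes`."""
    chosen = set(nodes)
    return all(
        chosen.intersection(replicas(r, num_nodes, rf))
        for r in range(num_nodes)
    )


NUM_NODES, RF = 99, 3

# Read optimization: every third node is enough to see all the data.
every_third = list(range(0, NUM_NODES, RF))
print(len(every_third), covers_all_ranges(every_third, NUM_NODES, RF))  # 33 True

# Fault tolerance: with one node down, every range is still readable.
survivors = [n for n in range(NUM_NODES) if n != 42]
print(covers_all_ranges(survivors, NUM_NODES, RF))  # True
```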
Cassandra Tips
Here are some tips that may come in handy during your journey through Cassandra.
- Test your data model as early as possible. Build a prototype, insert data, write queries, and make sure your workflow works end to end.
- Use the cassandra-stress tool to generate reads and writes and to measure performance.
- Do not try to minimize writes; extra writes that improve reads are worth it.
- Data duplication is a fact of life in Cassandra, so don’t be afraid of it. Disk space is the cheapest available resource.
- Always use async writes to keep your code non-blocking.
- Batch inserts are an anti-pattern unless the batched data belongs to the same partition.
- Regularly view Cassandra logs to look for warnings and suggestions.
- Benchmark to measure performance against your needs, e.g. data ingestion rate (events/sec) and query execution time.
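To make the batching tip concrete, here is a minimal sketch that groups rows by partition key before they are batched, so that each batch touches a single partition. The `sensor_id` column and the event rows are hypothetical, and the actual batch call is left out because it depends on the driver you use.

```python
from collections import defaultdict


def group_by_partition(rows, partition_key):
    """Group rows so that each group contains a single partition key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    return dict(groups)


# Hypothetical events; sensor_id is assumed to be the partition key.
events = [
    {"sensor_id": "s1", "ts": 1, "value": 20.5},
    {"sensor_id": "s2", "ts": 1, "value": 18.0},
    {"sensor_id": "s1", "ts": 2, "value": 20.7},
]

batches = group_by_partition(events, "sensor_id")
for key, rows in batches.items():
    # Each group would be sent as one single-partition batch.
    print(key, len(rows))
```

Rows that do not share a partition are better sent as individual async writes than forced into one multi-partition batch.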
References and Further Reading
Here are some resources that helped me in writing this series.