Cassandra.Link

7/24/2018

Data Modelling Recommended Practices - Instaclustr

by John Doe

Schema

Item: Is it properly denormalised? Does it require multiple queries to fetch information, or could the table just include info from the other table? Is there potential to consolidate data from multiple tables?
Rationale: Developers from a relational background may tend towards normalised models, resulting in inefficient use of Cassandra.

Item: Does partition key cardinality allow a high number of partitions (minimum 100,000 possible preferred)?
Rationale: A low number of partitions will lead to inefficient reads and writes and increase the risk of unevenly sized partitions.

Item: Does the partition key prevent substantial skewing of partitions?
Rationale: If it is possible for a small number of partitions to have vastly higher numbers of rows than average (say 100x), this can cause significantly uneven performance and disk usage.

Item: Using collections (map, list, set)?
Rationale: The maximum number of elements is 64k. Keep the total size of the collection small (<1MB), as the collection is not paged. Very large collections can negatively impact read/write performance.

Item: Is gc_grace_seconds changed from the default (864000)? If so, is that appropriate and has the impact been considered?
Rationale: Lowering gc_grace_seconds results in space being reclaimed more quickly after deletes but runs a small risk of “resurrected deletes” given we only run repairs weekly.

Item: Is caching set to KEYS_ONLY or NONE?
Rationale: Row caching for Cassandra 2.0 is often not effective. 2.1 row caching features may be effective if tuned correctly (see row-caching-in-cassandra-2-1 and Cassandra Docs).

Item: Is the chosen compaction strategy appropriate?
Rationale:
  • SizeTieredCompactionStrategy: the default, and suitable as a starting point for most use cases with a balance of reads and writes.
  • LeveledCompactionStrategy: does more compaction work to improve read performance. Generally used if there is a high ratio of reads to writes.
  • DateTieredCompactionStrategy: useful for data that is “hot” when first written but sees less access over time.
Check that the compaction strategy is appropriately tuned (see Cassandra Docs). Defaults are usually OK, but DTCS requires specific compaction options to be set to be effective.
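The gc_grace_seconds item above comes down to simple arithmetic: the default of 864000 seconds is 10 days, which leaves some margin over a weekly repair cycle. The sketch below only makes that comparison explicit. The function names are hypothetical, and the 7-day default for the repair interval is taken from the "repairs weekly" remark above; it is an illustration of the reasoning, not a tuning tool.

```python
SECONDS_PER_DAY = 86_400
DEFAULT_GC_GRACE_SECONDS = 864_000  # default quoted in the checklist (10 days)


def gc_grace_days(gc_grace_seconds: int) -> float:
    """Convert a gc_grace_seconds value to days."""
    return gc_grace_seconds / SECONDS_PER_DAY


def covers_repair_interval(gc_grace_seconds: int,
                           repair_interval_days: float = 7.0) -> bool:
    """Return True if tombstones are kept longer than the gap between repairs.

    If tombstones can be purged before a repair has run, a replica that
    missed the delete may bring the data back ("resurrected deletes").
    The 7-day default is an assumption based on weekly repairs.
    """
    return gc_grace_days(gc_grace_seconds) > repair_interval_days


if __name__ == "__main__":
    print(gc_grace_days(DEFAULT_GC_GRACE_SECONDS))            # 10.0
    print(covers_repair_interval(DEFAULT_GC_GRACE_SECONDS))   # True
    print(covers_repair_interval(5 * SECONDS_PER_DAY))        # False
```

A value that fails this check is not necessarily wrong, but it is the case where the "is that appropriate and impact considered?" question needs an explicit answer.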
Item: Are counters used?
Rationale: Instaclustr only supports the use of counters with Cassandra 2.1, as Cassandra 2.0 counters are unreliable in many circumstances.

Secondary Indexes

Item: Is the cardinality of the secondary index low?
Rationale: The cardinality of the index should be at least an order of magnitude lower, and preferably at least 100x lower, than that of the indexed table. Also, secondary indexes on boolean columns are not effective. See Cassandra Docs and http://www.wentnet.com/blog/?p=77
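The cardinality rule of thumb above can be written down as a small classifier. This is a minimal sketch under two assumptions that are not in the original: the "cardinality of the indexed table" is taken to be its row count, and a column with two or fewer distinct values is treated as boolean-like. The function name and the returned labels are hypothetical.

```python
def index_cardinality_check(distinct_values: int, table_rows: int) -> str:
    """Classify a secondary index candidate using the checklist's ratios.

    distinct_values: number of distinct values in the indexed column.
    table_rows: number of rows in the indexed table (assumed to be the
                table's cardinality for this sketch).
    """
    if distinct_values <= 0 or table_rows <= 0:
        raise ValueError("both counts must be positive")
    if distinct_values <= 2:
        # Boolean-like columns: the checklist says these are not effective.
        return "ineffective"
    ratio = table_rows / distinct_values
    if ratio >= 100:
        return "preferred"   # at least 100x lower than the table
    if ratio >= 10:
        return "acceptable"  # at least an order of magnitude lower
    return "too high"


if __name__ == "__main__":
    print(index_cardinality_check(1_000, 1_000_000))    # preferred
    print(index_cardinality_check(50_000, 1_000_000))   # acceptable
    print(index_cardinality_check(500_000, 1_000_000))  # too high
    print(index_cardinality_check(2, 1_000_000))        # ineffective
```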

Item: Is the indexed column frequently updated/deleted?
Rationale: The overhead of maintaining the index will be incurred on each update/delete and may also result in excessive tombstones in the index table.

Queries

Item: Are logged batches used? If so, are they relatively small (<100)?
Rationale: Logged batches require the coordinator node to control all operations and can result in very high load on the coordinator node for large batches. Logged batches are only required for atomic operations across multiple rows/tables (not for performance).

Item: Are there unlogged batches? If so, are they small (<100) or on the same partition key?
Rationale: Unlogged batches can improve performance but need to either be small or on a single partition key; otherwise they can negatively impact performance. Note that unlogged batches do not provide atomic operations.

Item: For large range queries, is the client paging through results?
Rationale: Paging is necessary to read large result sets without running into memory constraints. Most drivers have inbuilt paging support, but it needs to be explicitly turned on in query code.
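The batch guidance in the items above (keep batches under 100, or keep an unlogged batch on a single partition key) can be sketched on the client side as follows. This is an illustration only: statements are represented as plain Python values rather than driver objects, the helper names are hypothetical, and reading "<100" as a count of statements is an assumption, since the checklist does not give a unit.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")

MAX_BATCH_SIZE = 99  # "<100", assumed to mean statements per batch


def chunk(statements: Sequence[T],
          max_size: int = MAX_BATCH_SIZE) -> Iterator[List[T]]:
    """Split statements into consecutive batches of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(statements), max_size):
        yield list(statements[start:start + max_size])


def group_by_partition(statements: Sequence[T],
                       partition_key_of: Callable[[T], Hashable]
                       ) -> Dict[Hashable, List[T]]:
    """Group statements so each group touches a single partition key."""
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for statement in statements:
        groups[partition_key_of(statement)].append(statement)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical writes: (partition key, payload) tuples.
    writes = [("user-1", "a"), ("user-2", "b"), ("user-1", "c")]
    print(group_by_partition(writes, lambda w: w[0]))
    print([len(batch) for batch in chunk(list(range(250)))])
```

Each group or chunk would then be submitted as its own batch through whatever driver is in use; that step is deliberately left out here.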

Item: Does the query on the index look up a row in a large partition?
Rationale: The whole partition will be scanned to find matching rows – potentially expensive reads.
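Large partitions matter both for the index lookup above and for the partition key checks in the Schema section (at least 100,000 possible partitions, no partition around 100x the average). A minimal sketch of those two checks follows, assuming row counts per partition key have already been gathered by some other means, for example from a sample of the data. The function names and the input dictionary are hypothetical.

```python
from typing import Dict, Hashable, List

MIN_PARTITIONS = 100_000  # "minimum 100,000 possible preferred"
SKEW_FACTOR = 100         # "say 100x" the average


def has_enough_partitions(possible_partitions: int) -> bool:
    """True if the key allows at least the preferred number of partitions."""
    return possible_partitions >= MIN_PARTITIONS


def skewed_partitions(rows_per_partition: Dict[Hashable, int],
                      factor: float = SKEW_FACTOR) -> List[Hashable]:
    """Return keys whose row count is at least factor times the mean."""
    if not rows_per_partition:
        return []
    average = sum(rows_per_partition.values()) / len(rows_per_partition)
    return [key for key, rows in rows_per_partition.items()
            if rows >= factor * average]


if __name__ == "__main__":
    # 999 ordinary partitions with 10 rows each, plus one very large one.
    counts = {f"key-{i}": 10 for i in range(999)}
    counts["hot-key"] = 100_000
    print(skewed_partitions(counts))        # ['hot-key']
    print(has_enough_partitions(1_000))     # False
```

Note that the mean includes the outliers themselves, so with only a handful of partitions nothing can reach 100x the average; the check is only meaningful across a large number of partitions.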
