Cassandra.Link

3/16/2019

Cassandra Data Modeling Notes

by Miguel Perez

Modeling principles

  • Know your data
  • Know your query
  • Nest data
  • Duplicate data

Modeling Good to know

  • Query-driven data modeling. Application queries should be defined before creating a data model.
  • Every query should have its own table.
  • Name tables after the queries they serve, e.g. videos_by_tag_added_year.
  • It’s okay to have duplicated data across multiple tables. We don’t care about space, we care about speed.
  • Cassandra performs an upsert when a write uses a primary key (partition key plus clustering columns) that already exists: the existing row is overwritten rather than rejected.
  • Primary keys, clustering keys, and clustering order can’t be changed after a table is created. You will need to create a new table and move the old data into it.
  • You can only do range queries on clustering keys.
  • Keep the duplication factor constant. If data is duplicated 25 x N times, put a limit on N so the duplication factor stays bounded.
  • A partition can only have 2 billion cells. You are likely to hit performance issues before hitting this limit.
  • A partition should be at most hundreds of megabytes on disk.
  • A cell or column is a key-value pair.
  • There is a trade-off between efficiency and space. If space matters more, use client-side joins instead of duplicating data.
  • There are no JOINs in Cassandra.
  • Use a BATCH statement to update or insert all copies of duplicated data.
  • Cassandra has a SQL-like syntax called CQL for creating and manipulating data.
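
Two of the points above, upserts on an existing primary key and range queries on clustering keys, can be illustrated with a small in-memory toy. This is only a sketch in Python, not how Cassandra stores data; the class name ToyTable and the sample tags, years, and titles are hypothetical.

```python
from bisect import bisect_left, bisect_right


class ToyTable:
    """Toy stand-in for a table: partition key -> {clustering key -> row}."""

    def __init__(self):
        self.partitions = {}

    def write(self, partition_key, clustering_key, **columns):
        rows = self.partitions.setdefault(partition_key, {})
        # Same primary key: update the existing row in place (upsert), no error.
        rows.setdefault(clustering_key, {}).update(columns)

    def range_query(self, partition_key, low, high):
        # Range over clustering keys inside a single partition, inclusive bounds.
        rows = self.partitions.get(partition_key, {})
        keys = sorted(rows)
        start = bisect_left(keys, low)
        end = bisect_right(keys, high)
        return [(key, rows[key]) for key in keys[start:end]]


videos_by_tag = ToyTable()
videos_by_tag.write("cassandra", 2015, title="Intro")
videos_by_tag.write("cassandra", 2017, title="Modeling")
videos_by_tag.write("cassandra", 2017, title="Modeling, revised")  # upsert
videos_by_tag.write("cassandra", 2019, title="Operations")

print(videos_by_tag.range_query("cassandra", 2016, 2019))
```

The second write for 2017 replaces the title instead of adding a row, so the partition still holds three rows, and the range query only needs the partition key plus bounds on the clustering key.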

Calculate partition size

The formula below can be used to estimate how large a partition will grow over time. A partition cannot hold more than 2 billion cells, and performance usually suffers well before that limit.

  • Nr -> Number of rows
  • Nc -> Number of columns
  • Npk -> Number of primary key columns
  • Ns -> Number of static columns
  • Nv -> Number of values
  • Nv = Nr x (Nc - Npk - Ns) + Ns
  • Nv = 40000 x (7 - 3 - 0) + 0
  • Nv = 40000 x 4 + 0
  • Nv = 160,000

Estimate table disk space

  • Ck -> Partition key columns
  • Cs -> Static columns
  • Cr -> Regular columns
  • Cc -> Clustering columns
  • Nr -> Number of rows
  • Nv -> See formula above
  • St -> Size of a table
  • sizeOf -> Estimated size of a column
  • St = 16 + 0 + 40000 x ((55 + (8 + 16)) + (12 + (8 + 16)) + (30 + (8 + 16)) + (2340 + (8 + 16))) + 8 x 160000
  • St = 16 + 0 + 40000 x 2533 + 8 x 160000
  • St = 16 + 0 + 101320000 + 1280000
  • St = 102,600,016 bytes