Cassandra advanced data modeling

  1. Cassandra: Advanced data modeling. Lyon Cassandra Users, Romain Hardouin, 2016-05-31
  2. $ who → Romain; $ pgrep -fl work → Cassandra architect; $ whatis teads → No.1 Video Advertising Marketplace
  3. Data modeling: I. Introduction, II. Key principles, III. Chebotko methodology, IV. Time handling
  4. I. Introduction
  5. Theory
  6. Theory: Chebotko diagrams, E&R
  7. II. Key principles
  8. Key principles: Know your data; Know your queries; Denormalize (Nest data, Duplicate data)
  9. Know your domain. Conceptual Data Model, E&R: ● Entities ● Relationships ● Attributes / Keys ● Cardinalities ● Constraints (principles: Know your data)
  10. Entities & relationships (principles: Know your data)
  11. Query-driven model. Application workflow. New needs? ● New queries => new tables ● Alter table possible? (principles: Know your data, Know your queries)
  12. Goal: one partition per query. Anti-patterns: ● Table scan ● Client joins (a.k.a. multi-table) ● Secondary index ● Allow filtering (principles: Know your data, Know your queries)
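
      As a rough illustration of this goal, the sketch below uses the actors_by_video table from slide 14. The first query restricts the whole partition key, so it reads a single partition; the other two show what the listed anti-patterns can look like. The UUID and the actor name are made-up placeholders.

      ```cql
      -- One partition per query: the partition key (video_id) is fully restricted.
      SELECT actor_name, character_name
      FROM actors_by_video
      WHERE video_id = 123e4567-e89b-12d3-a456-426614174000;

      -- Anti-pattern: table scan, no restriction on the partition key.
      SELECT * FROM actors_by_video;

      -- Anti-pattern: restricting only a clustering column is rejected
      -- unless ALLOW FILTERING is added, and then every partition is scanned.
      SELECT * FROM actors_by_video
      WHERE actor_name = 'Some Actor'
      ALLOW FILTERING;
      ```
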
  13. Nest data: clustering columns, collection columns, UDT columns (principles: Know your data, Denormalize)
  14. Nest data (principles: Know your data, Denormalize):

          CREATE TABLE actors_by_video (
              video_id uuid,
              actor_name text,
              character_name text,
              PRIMARY KEY ((video_id), actor_name, character_name)
          );
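
      The table above nests rows through clustering columns. Slide 13 also lists collection and UDT columns; a minimal sketch of those two follows. The videos table, its columns, and the video_encoding type are hypothetical names chosen for illustration, not part of the deck.

      ```cql
      -- Hypothetical user-defined type grouping related fields into one value.
      CREATE TYPE IF NOT EXISTS video_encoding (
          codec  text,
          width  int,
          height int
      );

      -- Hypothetical table that nests data inside a single row.
      CREATE TABLE IF NOT EXISTS videos (
          video_id  uuid,
          title     text,
          tags      set<text>,               -- collection column
          encoding  frozen<video_encoding>,  -- UDT column
          PRIMARY KEY ((video_id))
      );
      ```
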
  15. Duplicate data. Writes are cheap: « Joins on write ». Duplication occurs at different levels: ● Table: Materialized views ● Partition ● Rows (principles: Know your data, Denormalize)
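
      Table-level duplication with a materialized view could look like the sketch below, which assumes Cassandra 3.0 or later and reuses the actors_by_video table from slide 14. The view name videos_by_actor is a hypothetical choice; the view serves the reverse lookup (videos for a given actor) from a server-maintained copy of the data.

      ```cql
      -- Every primary key column of the base table must appear in the view's
      -- primary key and be restricted with IS NOT NULL.
      CREATE MATERIALIZED VIEW videos_by_actor AS
          SELECT actor_name, video_id, character_name
          FROM actors_by_video
          WHERE actor_name IS NOT NULL
            AND video_id IS NOT NULL
            AND character_name IS NOT NULL
          PRIMARY KEY ((actor_name), video_id, character_name);

      -- Reads one partition of the view.
      SELECT video_id, character_name
      FROM videos_by_actor
      WHERE actor_name = 'Some Actor';
      ```
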
  16. III. Chebotko Methodology
  17. Application workflow: query workflow, query list. From « A Big Data Modeling Methodology for Apache Cassandra »
  18. Chebotko diagram. From « A Big Data Modeling Methodology for Apache Cassandra »
  19. Chebotko diagram for actors_by_video (K: partition key, C↑: clustering column, ascending) and the matching CQL:

          actors_by_video
            video_id        uuid  K
            actor_name      text  C↑
            character_name  text  C↑

          CREATE TABLE actors_by_video (
              video_id uuid,
              actor_name text,
              character_name text,
              PRIMARY KEY ((video_id), actor_name, character_name)
          );
  20. Chebotko mapping rules: MR 1 Entities & relationships; MR 2 Equality search attributes; MR 3 Inequality search attributes; MR 4 Ordering attributes; MR 5 Key attributes, uniqueness. Symbols on the slide: <>= ↑↓
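
      One way to read the five rules is to apply them to a single query. The sketch below does that for a hypothetical query, "videos uploaded by a given user since a given date, newest first"; the videos_by_user table and all of its columns are assumptions made for illustration, and the comments reflect one interpretation of each rule rather than the paper's wording.

      ```cql
      CREATE TABLE videos_by_user (
          user_id      uuid,       -- MR 2: equality search attribute, partition key
          uploaded_at  timestamp,  -- MR 3: inequality search attribute, clustering column
          video_id     uuid,       -- MR 5: key attribute, keeps each row unique
          title        text,       -- MR 1: remaining attribute of the entity
          PRIMARY KEY ((user_id), uploaded_at, video_id)
      ) WITH CLUSTERING ORDER BY (uploaded_at DESC, video_id ASC);  -- MR 4: ordering

      -- The query the table was designed for; the UUID is a placeholder.
      SELECT video_id, title
      FROM videos_by_user
      WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
        AND uploaded_at > '2016-01-01';
      ```
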
  21. Chebotko mapping rules. From « A Big Data Modeling Methodology for Apache Cassandra »
  22. Internet of Things demo: Kashlev Data Modeler
  23. IV. Time handling: Tombstones, TTL, UPSERTs
  24. IV. Time handling: Tombstones, TTL, UPSERTs
  25. Eventual consistency. No instant deletes. Deletes are writes. SSTables are immutable files. Writes are spread across many files.
  26. Goal: avoid reading too many* tombstones ... ... (* see tombstone_warn_threshold & tombstone_failure_threshold)
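
      To make "deletes are writes" concrete, here is a small sketch against the actors_by_video table from slide 14. The UUID and actor name are placeholders, and the threshold values in the comments are the usual cassandra.yaml defaults, which a given cluster may override.

      ```cql
      -- This DELETE does not remove anything in place: it writes a tombstone
      -- covering all rows of this actor within the partition.
      DELETE FROM actors_by_video
      WHERE video_id = 123e4567-e89b-12d3-a456-426614174000
        AND actor_name = 'Some Actor';

      -- Until compaction purges the tombstone (no earlier than gc_grace_seconds,
      -- 864000 s by default), reads of the partition still have to scan it.
      -- A single read that scans more than tombstone_warn_threshold (default 1000)
      -- tombstones logs a warning; more than tombstone_failure_threshold
      -- (default 100000) aborts the query.
      SELECT actor_name, character_name
      FROM actors_by_video
      WHERE video_id = 123e4567-e89b-12d3-a456-426614174000;
      ```
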
  27. IV. Time handling: Tombstones, TTL, UPSERTs
  28. TTLs. Data must be designed to be TTL'ed. Tombstones.
  29. Why? What do we add?
  30. TIME dimension
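
      Expired TTL data turns into tombstones, so one common way to "add a time dimension" is to put a time bucket in the partition key: rows written in the same period expire together, and queries for recent data never touch old partitions. The sketch below is an assumption about what the slides intend, not a reconstruction of the talk; the sensor_readings_by_day table, its columns, and the TTL values are hypothetical.

      ```cql
      -- Hypothetical time-bucketed table: one partition per sensor per day.
      CREATE TABLE sensor_readings_by_day (
          sensor_id    uuid,
          day          date,
          read_at      timestamp,
          temperature  double,
          PRIMARY KEY ((sensor_id, day), read_at)
      ) WITH CLUSTERING ORDER BY (read_at DESC)
        AND default_time_to_live = 604800;  -- 7 days, in seconds

      -- A per-write TTL overrides the table default (here: 1 day).
      INSERT INTO sensor_readings_by_day (sensor_id, day, read_at, temperature)
      VALUES (123e4567-e89b-12d3-a456-426614174000, '2016-05-31',
              '2016-05-31 10:00:00', 21.5)
      USING TTL 86400;

      -- Reads target one day bucket, so older expired buckets are never scanned.
      SELECT read_at, temperature
      FROM sensor_readings_by_day
      WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
        AND day = '2016-05-31';
      ```
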
  31. IV. Time handling: Tombstones, TTL, UPSERTs
  32. UPSERTs. Same INSERT over and over again? UPSERTs hide this behavior. What if… one day you want to add time?
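
      A minimal sketch of the behavior described on this slide, using hypothetical tables (last_login_by_user, logins_by_user) and placeholder values. In the first table, repeated INSERTs for the same key silently overwrite each other. Keeping history means adding time to the primary key, and since a primary key cannot be altered on an existing table, that requires a new table.

      ```cql
      -- Keyed by user only: every INSERT for the same user_id is an upsert.
      CREATE TABLE last_login_by_user (
          user_id   uuid,
          login_ip  inet,
          PRIMARY KEY ((user_id))
      );

      INSERT INTO last_login_by_user (user_id, login_ip)
      VALUES (123e4567-e89b-12d3-a456-426614174000, '10.0.0.1');
      INSERT INTO last_login_by_user (user_id, login_ip)
      VALUES (123e4567-e89b-12d3-a456-426614174000, '10.0.0.2');
      -- One row remains for this user; the first IP was overwritten without error.

      -- With time in the primary key, each login becomes its own row.
      CREATE TABLE logins_by_user (
          user_id   uuid,
          login_at  timestamp,
          login_ip  inet,
          PRIMARY KEY ((user_id), login_at)
      ) WITH CLUSTERING ORDER BY (login_at DESC);
      ```
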
  33. Questions?
  34. Resources: « A Big Data Modeling Methodology for Apache Cassandra », Artem Chebotko, Andrey Kashlev & Shiyong Lu, www.cs.wayne.edu/andrey/papers/TR-BIGDATA-05-2015-CKL.pdf. KDM, Andrey Kashlev, kdm.dataview.org
