Illustration Image

Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

10/5/2018

Reading time:6 min

Intro to Cassandra

by Tyler Hobbs

Intro to Cassandra SlideShare Explore You Successfully reported this slideshow.Intro to CassandraUpcoming SlideShareLoading in …5× 0 Comments 6 Likes Statistics Notes Radha Krishna Proddaturi , Principal Software Engineer at Coupons.com at Coupons.com Will Hayworth Yuvaraj Loganathan , Technical Architect at Owler at Owler Colin Kuo , Staff Engineer at Brocade Communications Systems at Brocade Tags cassandra Dave Kincaid , Managing Architect - Data Services/Data Science at IDEXX Laboratories at IDEXX Laboratories Show More No DownloadsNo notes for slide 1. Intro toCassandra Tyler Hobbs 2. HistoryDynamo BigTable(clustering) (data model) Cassandra 3. Users 4. Clustering Every node plays the same role – No masters, slaves, or special nodes – No single point of failure 5. Consistent Hashing 0 50 10 40 20 30 6. Consistent Hashing Key: “www.google.com” 0 50 10 40 20 30 7. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 8. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 9. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 10. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 Replication Factor = 3 11. Clustering Client can talk to any node 12. ScalingRF = 2 0 50 10The node at50 owns thered portion 20 30 13. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30 14. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30 15. Node FailuresRF = 2 0 50 10 Replicas 40 20 30 16. Node FailuresRF = 2 0 50 10 Replicas 40 20 30 17. Node FailuresRF = 2 0 50 10 40 20 30 18. Consistency, Availability Consistency – Can I read stale data? Availability – Can I write/read at all? Tunable Consistency 19. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success) 20. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success) W + R > N gives strong consistency 21. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent 22. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must be available. 23. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation 24. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation – Quorum: N/2 + 1 • R = W = Quorum • Strong consistency • Tolerate the loss of N – Quorum replicas – R, W can also be 1 or N 25. Availability Can tolerate the loss of: – N – R replicas for reads – N – W replicas for writes 26. CAP TheoremDuring node or network failure: 100% Not Possible Availability Possible Consistency 100% 27. CAP TheoremDuring node or network failure: 100% Not Ca Possible ss an dr Availability a Possible Consistency 100% 28. Clustering No single point of failure Replication that works Scales linearly – 2x nodes = 2x performance • For both writes and reads – Up to 100s of nodes Operationally simple Multi-Datacenter Replication 29. Data Model Comes from Google BigTable Goals – Minimize disk seeks – High throughput – Low latency – Durable 30. Data Model Keyspace – A collection of Column Families – Controls replication settings Column Family – Kinda resembles a table 31. Column Families Static – Object data – Similar to a table in a relational database Dynamic – Pre-calculated query results – Materialized views 32. Static Column Families Users zznate password: * name: Nate driftx password: * name: Brandon thobbs password: * name: Tyler jbellis password: * name: Jonathan site: riptano.com 33. Dynamic Column Families Rows – Each row has a unique primary key – Sorted list of (name, value) tuples • Like a sorted map or dictionary – The (name, value) tuple is called a “column” 34. Dynamic Column Families Followingzznate driftx: thobbs:driftxthobbs zznate:jbellis driftx: mdennis: pcmanus thobbs: xedin: zznate 35. Dynamic Column Families Column Timestamps – Each column (tuple) has a timestamp – In the case of a collision, the latest timestamp wins – Client specifies timestamp with write – Writes are idempotent • Infinite retries allowed 36. Dynamic Column Families Other Examples: – Timeline of tweets by a user – Timeline of tweets by all of the people a user is following – List of comments sorted by score – List of friends grouped by state 37. The Data API Two choices – RPC-based API – CQL • Cassandra Query Language 38. Inserting Data INSERT INTO users (KEY, “name”, “age”) VALUES (“thobbs”, “Tyler”, 24); 39. Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”) VALUES (“thobbs”, 34); Or UPDATE users SET “age” = 34 WHERE KEY = “thobbs”; 40. Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”; 41. Fetching Data Explicit column select: SELECT “name”, “age” FROM users WHERE KEY = “thobbs”; 42. Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e WHERE KEY = “key”; SELECT 1..3 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b), (3, c)] 43. Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST 2 REVERSED FROM letters WHERE KEY = “key”; Returns [(5, e), (4, d)] 44. Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = “key”; Returns [(4, d), (3, c)] 45. Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users WHERE KEY = “thobbs”; 46. Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS WHERE age = 24 AND state = “TX”; 47. Performance Writes – 10k – 30k per second per node – Sub-millisecond latency Reads – 1k – 10k per second per node – Depends on data set, caching – Usually 0.1 to 10ms latency 48. Other Features Distributed Counters – Can support millions of high-volume counters Excellent Multi-datacenter Support – Disaster recovery – Locality Hadoop Integration – Isolation of resources – Hive and Pig drivers Compression 49. What Cassandra Cant Do Transactions – Unless you use a distributed lock – Atomicity, Isolation – These arent needed as often as youd think Limited support for ad-hoc queries – Know what you want to do with the data 50. Not One-size-fits-all Use alongside an RDBMS – Use the RDBMS for highly-transactional or highly- relational data • Usually a small set of data – Let Cassandra scale to handle the rest 51. Language Support Good: – Java – Python – Ruby – PHP – C# Coming Soon: – Everything else, now that we have CQL 52. Questions? Tyler Hobbs @tylhobbs tyler@datastax.com Recommended Visual Thinking StrategiesOnline Course - LinkedIn Learning Creative Inspirations: Duarte Design, Presentation Design StudioOnline Course - LinkedIn Learning Gaining Skills with LinkedIn LearningOnline Course - LinkedIn Learning SC 2015: Thinking Fast and Slow with Software DevelopmentDaniel Bryant Cassandra for Python DevelopersTyler Hobbs Detect all memory leaks with LeakCanary!Pierre-Yves Ricau How Yelp Uses Sensu to Monitor Services in a SOA WorldKyle Anderson Evolving the Netflix APIKatharina Probst Datomic – A Modern Database - StampedeCon 2014StampedeCon 7 Common Mistakes in Go (2015)Steven Francia About Blog Terms Privacy Copyright LinkedIn Corporation © 2018 Public clipboards featuring this slideNo public clipboards found for this slideSelect another clipboard ×Looks like you’ve clipped this slide to already.Create a clipboardYou just clipped your first slide! Clipping is a handy way to collect important slides you want to go back to later. Now customize the name of a clipboard to store your clips. Description Visibility Others can see my Clipboard

Illustration Image
Intro to Cassandra

Successfully reported this slideshow.

Intro to Cassandra
Intro toCassandra  Tyler Hobbs
HistoryDynamo                     BigTable(clustering)               (data model)               Cassandra
Users
Clustering    Every node plays the same role    – No masters, slaves, or special nodes    – No single point of failure
Consistent Hashing           0     50          10     40          20           30
Consistent Hashing                      Key: “www.google.com”           0     50          10     40          20           30
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                      Key: “www.google.com”           0                      md5(“www.google.com”)     5...
Consistent Hashing                        Key: “www.google.com”           0                        md5(“www.google.com”)  ...
Clustering    Client can talk to any node
ScalingRF = 2             0              50        10The node at50 owns thered portion             20                   30
ScalingRF = 2               0                50        10   Add a new    40        20   node at 40                     30
ScalingRF = 2               0                50        10   Add a new    40        20   node at 40                     30
Node FailuresRF = 2               0                50        10   Replicas                40        20                    ...
Node FailuresRF = 2               0                50        10   Replicas                40        20                    ...
Node FailuresRF = 2               0                50        10                40        20                     30
Consistency, Availability    Consistency    – Can I read stale data?    Availability    – Can I write/read at all?    T...
Consistency    N = Total number of replicas    R = Number of replicas read from    – (before the response is returned) ...
Consistency    N = Total number of replicas    R = Number of replicas read from    – (before the response is returned) ...
Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must...
Consistency    Tunable Consistency    – Specify N (Replication Factor) per data set    – Specify R, W per operation
Consistency    Tunable Consistency    – Specify N (Replication Factor) per data set    – Specify R, W per operation    – ...
Availability    Can tolerate the loss of:    – N – R replicas for reads    – N – W replicas for writes
CAP TheoremDuring node or network failure:          100%                                          Not                     ...
CAP TheoremDuring node or network failure:          100%                                                 Not              ...
Clustering    No single point of failure    Replication that works    Scales linearly    – 2x nodes = 2x performance   ...
Data Model    Comes from Google BigTable    Goals    – Minimize disk seeks    – High throughput    – Low latency    – Du...
Data Model    Keyspace    – A collection of Column Families    – Controls replication settings    Column Family    – Kin...
Column Families    Static    – Object data    – Similar to a table in a relational database    Dynamic    – Pre-calculat...
Static Column Families                   Users   zznate    password: *    name: Nate   driftx    password: *   name: Brand...
Dynamic Column Families    Rows    – Each row has a unique primary key    – Sorted list of (name, value) tuples       • L...
Dynamic Column Families                     Followingzznate    driftx:   thobbs:driftxthobbs    zznate:jbellis   driftx:  ...
Dynamic Column Families    Column Timestamps    – Each column (tuple) has a timestamp    – In the case of a collision, th...
Dynamic Column Families    Other Examples:    – Timeline of tweets by a user    – Timeline of tweets by all of the people...
The Data API    Two choices    – RPC-based API    – CQL       • Cassandra Query Language
Inserting Data INSERT INTO users (KEY, “name”, “age”)     VALUES (“thobbs”, “Tyler”, 24);
Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”)     VALUES (“thobbs”, 34); Or UPDATE users S...
Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
Fetching Data Explicit column select: SELECT “name”, “age” FROM users     WHERE KEY = “thobbs”;
Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e     WHERE KEY = “key”; SELECT 1..3 FROM le...
Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST ...
Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FI...
Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users...
Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS     WHERE age = 24 AN...
Performance    Writes    – 10k – 30k per second per node    – Sub-millisecond latency    Reads    – 1k – 10k per second ...
Other Features    Distributed Counters    – Can support millions of high-volume counters    Excellent Multi-datacenter S...
What Cassandra Cant Do    Transactions    – Unless you use a distributed lock    – Atomicity, Isolation    – These arent ...
Not One-size-fits-all    Use alongside an RDBMS    – Use the RDBMS for highly-transactional or highly-      relational da...
Language Support    Good:    – Java    – Python    – Ruby    – PHP    – C#    Coming Soon:    – Everything else, now tha...
Questions?          Tyler Hobbs               @tylhobbs       tyler@datastax.com

Upcoming SlideShare

Loading in …5

×

  1. 1. Intro toCassandra Tyler Hobbs
  2. 2. HistoryDynamo BigTable(clustering) (data model) Cassandra
  3. 3. Users
  4. 4. Clustering Every node plays the same role – No masters, slaves, or special nodes – No single point of failure
  5. 5. Consistent Hashing 0 50 10 40 20 30
  6. 6. Consistent Hashing Key: “www.google.com” 0 50 10 40 20 30
  7. 7. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  8. 8. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  9. 9. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
  10. 10. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 Replication Factor = 3
  11. 11. Clustering Client can talk to any node
  12. 12. ScalingRF = 2 0 50 10The node at50 owns thered portion 20 30
  13. 13. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
  14. 14. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
  15. 15. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
  16. 16. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
  17. 17. Node FailuresRF = 2 0 50 10 40 20 30
  18. 18. Consistency, Availability Consistency – Can I read stale data? Availability – Can I write/read at all? Tunable Consistency
  19. 19. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success)
  20. 20. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success) W + R > N gives strong consistency
  21. 21. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
  22. 22. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must be available.
  23. 23. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation
  24. 24. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation – Quorum: N/2 + 1 • R = W = Quorum • Strong consistency • Tolerate the loss of N – Quorum replicas – R, W can also be 1 or N
  25. 25. Availability Can tolerate the loss of: – N – R replicas for reads – N – W replicas for writes
  26. 26. CAP TheoremDuring node or network failure: 100% Not Possible Availability Possible Consistency 100%
  27. 27. CAP TheoremDuring node or network failure: 100% Not Ca Possible ss an dr Availability a Possible Consistency 100%
  28. 28. Clustering No single point of failure Replication that works Scales linearly – 2x nodes = 2x performance • For both writes and reads – Up to 100s of nodes Operationally simple Multi-Datacenter Replication
  29. 29. Data Model Comes from Google BigTable Goals – Minimize disk seeks – High throughput – Low latency – Durable
  30. 30. Data Model Keyspace – A collection of Column Families – Controls replication settings Column Family – Kinda resembles a table
  31. 31. Column Families Static – Object data – Similar to a table in a relational database Dynamic – Pre-calculated query results – Materialized views
  32. 32. Static Column Families Users zznate password: * name: Nate driftx password: * name: Brandon thobbs password: * name: Tyler jbellis password: * name: Jonathan site: riptano.com
  33. 33. Dynamic Column Families Rows – Each row has a unique primary key – Sorted list of (name, value) tuples • Like a sorted map or dictionary – The (name, value) tuple is called a “column”
  34. 34. Dynamic Column Families Followingzznate driftx: thobbs:driftxthobbs zznate:jbellis driftx: mdennis: pcmanus thobbs: xedin: zznate
  35. 35. Dynamic Column Families Column Timestamps – Each column (tuple) has a timestamp – In the case of a collision, the latest timestamp wins – Client specifies timestamp with write – Writes are idempotent • Infinite retries allowed
  36. 36. Dynamic Column Families Other Examples: – Timeline of tweets by a user – Timeline of tweets by all of the people a user is following – List of comments sorted by score – List of friends grouped by state
  37. 37. The Data API Two choices – RPC-based API – CQL • Cassandra Query Language
  38. 38. Inserting Data INSERT INTO users (KEY, “name”, “age”) VALUES (“thobbs”, “Tyler”, 24);
  39. 39. Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”) VALUES (“thobbs”, 34); Or UPDATE users SET “age” = 34 WHERE KEY = “thobbs”;
  40. 40. Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
  41. 41. Fetching Data Explicit column select: SELECT “name”, “age” FROM users WHERE KEY = “thobbs”;
  42. 42. Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e WHERE KEY = “key”; SELECT 1..3 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b), (3, c)]
  43. 43. Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST 2 REVERSED FROM letters WHERE KEY = “key”; Returns [(5, e), (4, d)]
  44. 44. Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = “key”; Returns [(4, d), (3, c)]
  45. 45. Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users WHERE KEY = “thobbs”;
  46. 46. Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS WHERE age = 24 AND state = “TX”;
  47. 47. Performance Writes – 10k – 30k per second per node – Sub-millisecond latency Reads – 1k – 10k per second per node – Depends on data set, caching – Usually 0.1 to 10ms latency
  48. 48. Other Features Distributed Counters – Can support millions of high-volume counters Excellent Multi-datacenter Support – Disaster recovery – Locality Hadoop Integration – Isolation of resources – Hive and Pig drivers Compression
  49. 49. What Cassandra Cant Do Transactions – Unless you use a distributed lock – Atomicity, Isolation – These arent needed as often as youd think Limited support for ad-hoc queries – Know what you want to do with the data
  50. 50. Not One-size-fits-all Use alongside an RDBMS – Use the RDBMS for highly-transactional or highly- relational data • Usually a small set of data – Let Cassandra scale to handle the rest
  51. 51. Language Support Good: – Java – Python – Ruby – PHP – C# Coming Soon: – Everything else, now that we have CQL
  52. 52. Questions? Tyler Hobbs @tylhobbs tyler@datastax.com

Related Articles

windows
cassandra
tutorial

Install Cassandra on Windows 10: Tutorial With Simple Steps

Vladimir Kaplarevic

12/2/2023

cassandra
tutorial

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt! 
We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Sign up below to receive email updates and see what's going on with our company

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

cassandra