Successfully reported this slideshow.
Intro to Cassandra
Upcoming SlideShare
Loading in …5
×
No Downloads
No notes for slide
- 1. Intro toCassandra Tyler Hobbs
- 2. HistoryDynamo BigTable(clustering) (data model) Cassandra
- 3. Users
- 4. Clustering Every node plays the same role – No masters, slaves, or special nodes – No single point of failure
- 5. Consistent Hashing 0 50 10 40 20 30
- 6. Consistent Hashing Key: “www.google.com” 0 50 10 40 20 30
- 7. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
- 8. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
- 9. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30
- 10. Consistent Hashing Key: “www.google.com” 0 md5(“www.google.com”) 50 10 14 40 20 30 Replication Factor = 3
- 11. Clustering Client can talk to any node
- 12. ScalingRF = 2 0 50 10The node at50 owns thered portion 20 30
- 13. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
- 14. ScalingRF = 2 0 50 10 Add a new 40 20 node at 40 30
- 15. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
- 16. Node FailuresRF = 2 0 50 10 Replicas 40 20 30
- 17. Node FailuresRF = 2 0 50 10 40 20 30
- 18. Consistency, Availability Consistency – Can I read stale data? Availability – Can I write/read at all? Tunable Consistency
- 19. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success)
- 20. Consistency N = Total number of replicas R = Number of replicas read from – (before the response is returned) W = Number of replicas written to – (before the write is considered a success) W + R > N gives strong consistency
- 21. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent
- 22. Consistency W + R > N gives strong consistency N=3 W=2 R=2 2 + 2 > 3 ==> strongly consistent Only 2 of the 3 replicas must be available.
- 23. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation
- 24. Consistency Tunable Consistency – Specify N (Replication Factor) per data set – Specify R, W per operation – Quorum: N/2 + 1 • R = W = Quorum • Strong consistency • Tolerate the loss of N – Quorum replicas – R, W can also be 1 or N
- 25. Availability Can tolerate the loss of: – N – R replicas for reads – N – W replicas for writes
- 26. CAP TheoremDuring node or network failure: 100% Not Possible Availability Possible Consistency 100%
- 27. CAP TheoremDuring node or network failure: 100% Not Ca Possible ss an dr Availability a Possible Consistency 100%
- 28. Clustering No single point of failure Replication that works Scales linearly – 2x nodes = 2x performance • For both writes and reads – Up to 100s of nodes Operationally simple Multi-Datacenter Replication
- 29. Data Model Comes from Google BigTable Goals – Minimize disk seeks – High throughput – Low latency – Durable
- 30. Data Model Keyspace – A collection of Column Families – Controls replication settings Column Family – Kinda resembles a table
- 31. Column Families Static – Object data – Similar to a table in a relational database Dynamic – Pre-calculated query results – Materialized views
- 32. Static Column Families Users zznate password: * name: Nate driftx password: * name: Brandon thobbs password: * name: Tyler jbellis password: * name: Jonathan site: riptano.com
- 33. Dynamic Column Families Rows – Each row has a unique primary key – Sorted list of (name, value) tuples • Like a sorted map or dictionary – The (name, value) tuple is called a “column”
- 34. Dynamic Column Families Followingzznate driftx: thobbs:driftxthobbs zznate:jbellis driftx: mdennis: pcmanus thobbs: xedin: zznate
- 35. Dynamic Column Families Column Timestamps – Each column (tuple) has a timestamp – In the case of a collision, the latest timestamp wins – Client specifies timestamp with write – Writes are idempotent • Infinite retries allowed
- 36. Dynamic Column Families Other Examples: – Timeline of tweets by a user – Timeline of tweets by all of the people a user is following – List of comments sorted by score – List of friends grouped by state
- 37. The Data API Two choices – RPC-based API – CQL • Cassandra Query Language
- 38. Inserting Data INSERT INTO users (KEY, “name”, “age”) VALUES (“thobbs”, “Tyler”, 24);
- 39. Updating Data Updates are the same as inserts: INSERT INTO users (KEY, “age”) VALUES (“thobbs”, 34); Or UPDATE users SET “age” = 34 WHERE KEY = “thobbs”;
- 40. Fetching Data Whole row select: SELECT * FROM users WHERE KEY = “thobbs”;
- 41. Fetching Data Explicit column select: SELECT “name”, “age” FROM users WHERE KEY = “thobbs”;
- 42. Fetching Data Get a slice of columns UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e WHERE KEY = “key”; SELECT 1..3 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b), (3, c)]
- 43. Fetching Data Get a slice of columns SELECT FIRST 2 FROM letters WHERE KEY = “key”; Returns [(1, a), (2, b)] SELECT FIRST 2 REVERSED FROM letters WHERE KEY = “key”; Returns [(5, e), (4, d)]
- 44. Fetching Data Get a slice of columns SELECT 3.. FROM letters WHERE KEY = “key”; Returns [(3, c), (4, d), (5, e)] SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = “key”; Returns [(4, d), (3, c)]
- 45. Deleting Data Delete a whole row: DELETE FROM users WHERE KEY = “thobbs”; Delete specific columns: DELETE “age” FROM users WHERE KEY = “thobbs”;
- 46. Secondary Indexes Builtin basic indexes CREATE INDEX ageIndex ON users (age); SELECT name FROM USERS WHERE age = 24 AND state = “TX”;
- 47. Performance Writes – 10k – 30k per second per node – Sub-millisecond latency Reads – 1k – 10k per second per node – Depends on data set, caching – Usually 0.1 to 10ms latency
- 48. Other Features Distributed Counters – Can support millions of high-volume counters Excellent Multi-datacenter Support – Disaster recovery – Locality Hadoop Integration – Isolation of resources – Hive and Pig drivers Compression
- 49. What Cassandra Cant Do Transactions – Unless you use a distributed lock – Atomicity, Isolation – These arent needed as often as youd think Limited support for ad-hoc queries – Know what you want to do with the data
- 50. Not One-size-fits-all Use alongside an RDBMS – Use the RDBMS for highly-transactional or highly- relational data • Usually a small set of data – Let Cassandra scale to handle the rest
- 51. Language Support Good: – Java – Python – Ruby – PHP – C# Coming Soon: – Everything else, now that we have CQL
- 52. Questions? Tyler Hobbs @tylhobbs tyler@datastax.com
Public clipboards featuring this slide
No public clipboards found for this slide