Introduction to Apache Cassandra

  1. Introduction to Apache Cassandra
  2. Me: Robert Stupp. Freelancer, Coder, Architect. @snazy, snazy@snazy.de. Contributor to Apache Cassandra, 3.0 UDFs (CASSANDRA-7395 + related). Databases, Network, Backend.
  3. Agenda: Apache Cassandra History; Design Principles; Outstanding differences; CQL Intro; Access C*; Clusters; Cassandra Future
  4. Apache Cassandra History
  5. Apache Cassandra: started at Facebook, inspired by [logos not preserved]. Note: Facebook initially had two data centers.
  6. 2.1 released in Sep 2014
  7. Apache Cassandra Design Principles
  8. Hardware failures can and will occur! Cassandra handles failures. From single node to whole data center. From client to server.
  9. The complicated part when learning Cassandra is to understand Cassandra’s simplicity
  10. Keep it simple: all nodes are equal; master-less architecture; no name nodes; no SPOF (single point of failure); no read before modify (prevents race conditions)
  11. Keep it running: no need to take the cluster down, e.g. during maintenance or during a software update. Rolling restart is your friend.
  12. Outstanding Differences
  13. Cassandra: highly scalable; runs with a few nodes, up to 1000+ node clusters! Linear scalability (proven!). Multi datacenter aware (world-wide!). No SPOF.
  14. Cassandra @ Apple
  15. Linear Scalability
  16. Scaling Cassandra: More data? -> add more nodes. Faster access? -> add more nodes.
  17. Read / Write performance: reads are fast; writes are even faster
  18. Durability: writes are durable - period.
  19. Availability @ Netflix: Chaos Monkey kills nodes randomly
  20. Availability @ Netflix: Chaos Gorilla kills regions randomly
  21. Availability @ Netflix: Chaos Kong kills whole data centers
  22. Availability @ Netflix: http://de.slideshare.net/planetcassandra/active-active-c-behind-the-scenes-at-netflix
  23. 32 node cluster (Raspberry Pis) @DataStax
  24. Most outstanding: great documentation; many blog posts; many presentations; many videos; regular webinars; huge, active and healthy community
  25. Data Distribution
  26. DHT: data is organized in a "Distributed Hash Table" (hash over row key)
  27. DHT [ring diagram: nodes 0 to 7]
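
The ring on slides 26 and 27 can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's actual partitioner: it assumes eight equally sized token ranges (nodes 0 to 7, as in the diagram) and uses MD5 from the standard library as a stand-in hash. The names `NUM_NODES`, `TOKEN_SPACE`, `token_for`, and `node_for` are made up for this sketch.

```python
import hashlib

NUM_NODES = 8          # nodes 0..7, as in the ring diagram (assumption)
TOKEN_SPACE = 2 ** 32  # illustrative token range, not Cassandra's real one


def token_for(row_key: str) -> int:
    """Hash the row key to a token on the ring (MD5 is only a stand-in)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes: 0 .. 2**32 - 1


def node_for(row_key: str) -> int:
    """Return the node that owns the token range the key hashes into."""
    range_size = TOKEN_SPACE // NUM_NODES
    return token_for(row_key) // range_size


if __name__ == "__main__":
    for key in ("Row A", "Row B", "Row C"):
        print(key, "-> node", node_for(key))
```

Because the node is derived from the key alone, any client or node can compute where a row lives without asking a central directory.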
  28. Replication
  29. Replication Factor 2 [ring diagram: nodes 0 to 7, Row A, Row B]
  30. Replication Factor 3 [ring diagram: nodes 0 to 7, Row A, Row B]
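
A rough sketch of what the two replication diagrams show, assuming the simplest placement rule: the first replica goes to the node the key hashes to, and the remaining replicas go to the next nodes clockwise on the ring. The hash (MD5 with a modulo) and the names `primary_node` and `replicas_for` are assumptions made for illustration; real clusters can use other placement strategies.

```python
import hashlib
from typing import List

NUM_NODES = 8  # nodes 0..7, as in the diagrams (assumption)


def primary_node(row_key: str) -> int:
    """Pick the first replica from a hash of the row key (simplified to modulo)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES


def replicas_for(row_key: str, replication_factor: int) -> List[int]:
    """Return the primary node plus the next nodes clockwise on the ring."""
    if not 1 <= replication_factor <= NUM_NODES:
        raise ValueError("replication factor must be between 1 and NUM_NODES")
    first = primary_node(row_key)
    return [(first + i) % NUM_NODES for i in range(replication_factor)]


if __name__ == "__main__":
    for rf in (2, 3):
        print("RF", rf, "Row A ->", replicas_for("Row A", rf))
        print("RF", rf, "Row B ->", replicas_for("Row B", rf))
```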
  31. Consistency: consistency is defined per request; several consistency levels (CLs) for different needs
  32. Eventual consistency is not "hopefully consistent". EC means there’s a time gap until updates are consistently readable.
  33. Consistency Levels: ANY (only for writes); ONE, LOCAL_ONE; TWO, THREE (not recommended); ALL (not recommended); QUORUM, LOCAL_QUORUM, EACH_QUORUM; SERIAL, LOCAL_SERIAL
  34. Consistency: data is always replicated; the CL defines how many replicas must fulfill the request
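
To make "how many replicas must fulfill the request" concrete, here is a small sketch for the single-data-center levels only. It assumes the usual definition of a quorum as a majority of the replicas (replication factor divided by two, rounded down, plus one). The function name `required_replicas` is hypothetical, and ANY, the LOCAL_/EACH_ variants, and the SERIAL levels are deliberately left out.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for the given consistency level."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    needed_by_level = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,  # majority of the replicas
        "ALL": replication_factor,
    }
    if consistency_level not in needed_by_level:
        raise ValueError("level not covered by this sketch: " + consistency_level)
    needed = needed_by_level[consistency_level]
    if needed > replication_factor:
        raise ValueError("level needs more replicas than the replication factor")
    return needed


if __name__ == "__main__":
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, "with RF 3 ->", required_replicas(level, 3))
```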
  35. Write [ring diagram: nodes 0 to 7]
  36. Write [ring diagram: nodes 0 to 7]
  37. Multi DC setup [diagram: DC 1, DC 2]
  38. Multi DC replication [diagram: write, DC 1, DC 2]
  39. Multi DC replication [diagram: write, DC 1, DC 2]
  40. Multi DC replication [diagram: write, DC 1, DC 2]
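
With two data centers, the quorum-style consistency levels differ in which replicas they count. The sketch below assumes a per-data-center replication factor (the names `DC1` and `DC2`, the factor 3, and the function `acks_required` are illustrative) and a quorum defined as a majority.

```python
from typing import Dict


def quorum(replication_factor: int) -> int:
    """Majority of the given number of replicas."""
    return replication_factor // 2 + 1


def acks_required(level: str, rf_per_dc: Dict[str, int], local_dc: str) -> int:
    """Total replica acknowledgements needed for a multi-DC request."""
    if level == "LOCAL_ONE":
        return 1  # one replica in the coordinator's data center
    if level == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])  # majority in the local DC only
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_per_dc.values())  # majority in every DC
    if level == "QUORUM":
        return quorum(sum(rf_per_dc.values()))  # majority across all replicas
    raise ValueError("level not covered by this sketch: " + level)


if __name__ == "__main__":
    replication = {"DC1": 3, "DC2": 3}  # hypothetical setup
    for level in ("LOCAL_ONE", "LOCAL_QUORUM", "EACH_QUORUM", "QUORUM"):
        print(level, "->", acks_required(level, replication, "DC1"))
```

The local levels let a request complete without waiting for the remote data center, which is why they are a common choice in multi-DC setups.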
  41. Replication & Consistency: define the number of replicas using the replication factor; define the required consistency per request
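
How the two settings interact can be shown with a well-known rule of thumb: if the number of replicas that acknowledge a write plus the number of replicas consulted by a read is larger than the replication factor, the two sets must overlap, so the read reaches at least one replica holding the latest write. The sketch below only encodes this arithmetic; the function names are hypothetical and it ignores failure scenarios.

```python
def quorum(replication_factor: int) -> int:
    """Majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(write_acks: int, read_replicas: int,
                        replication_factor: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    return write_acks + read_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("write QUORUM, read QUORUM:", read_overlaps_write(q, q, rf))
    print("write ONE,    read ONE:   ", read_overlaps_write(1, 1, rf))
    print("write ALL,    read ONE:   ", read_overlaps_write(rf, 1, rf))
```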
  42. CQL Introduction: CQL = Cassandra Query Language
  43. “CQL is SQL minus joins, minus subqueries, plus collections” (plus user types, plus tuple types)
  44. Why CQL? Introduces a schema to Cassandra; familiar syntax; easy to understand; DML operations are atomic
  45. Data model (hierarchical view): Keyspace (schema) > Table (column family) > Row, with partition key (part of primary key), static columns, clustering key (part of primary key), columns
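
The hierarchy on slide 45 can be pictured as nested maps. The toy `Table` class below is purely a mental model, not how Cassandra stores data: a table maps each partition key to a partition, and a partition holds static columns (shared by all its rows) plus rows ordered by clustering key. All names and the sample data are invented for illustration.

```python
from typing import Any, Dict, List, Tuple


class Table:
    """Toy model: partition key -> {static columns, rows by clustering key}."""

    def __init__(self) -> None:
        self.partitions: Dict[Any, Dict[str, Dict]] = {}

    def _partition(self, partition_key: Any) -> Dict[str, Dict]:
        return self.partitions.setdefault(partition_key, {"static": {}, "rows": {}})

    def set_static(self, partition_key: Any, **columns: Any) -> None:
        """Static columns exist once per partition."""
        self._partition(partition_key)["static"].update(columns)

    def upsert(self, partition_key: Any, clustering_key: Any, **columns: Any) -> None:
        """Insert or update one row inside a partition."""
        rows = self._partition(partition_key)["rows"]
        rows.setdefault(clustering_key, {}).update(columns)

    def select(self, partition_key: Any) -> List[Tuple[Any, Dict[str, Any]]]:
        """Return the partition's rows ordered by clustering key."""
        part = self.partitions.get(partition_key)
        if part is None:
            return []
        ordered = sorted(part["rows"].items(), key=lambda item: item[0])
        return [(ck, {**part["static"], **cols}) for ck, cols in ordered]


if __name__ == "__main__":
    keyspace = {"users_by_city": Table()}  # keyspace -> tables
    table = keyspace["users_by_city"]
    table.set_static("Berlin", country="DE")
    table.upsert("Berlin", "zoe", age=30)
    table.upsert("Berlin", "adam", age=41)
    print(table.select("Berlin"))
```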
  46. CQL / DDL: similar to SQL. CREATE TABLE …, ALTER TABLE …, DROP TABLE …
  47. CQL / DML: similar to SQL. INSERT …, UPDATE …, DELETE …, SELECT …
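
A hedged sketch of the four DML statements issued from Python. It assumes a session object with an `execute(query, parameters)` method, which is the shape of the DataStax Python driver's `Session.execute`, and a hypothetical `users` table with `username` and `city` columns. So that the example runs without a cluster, it uses a small recording stand-in (`RecordingSession`, invented here) in place of a real session.

```python
class RecordingSession:
    """Stand-in for a driver session: records statements instead of sending them."""

    def __init__(self):
        self.statements = []

    def execute(self, query, parameters=None):
        self.statements.append((query, parameters))
        return []  # a real session would return result rows


def run_dml(session, username, city):
    """Issue INSERT, UPDATE, SELECT and DELETE against a hypothetical users table."""
    session.execute(
        "INSERT INTO users (username, city) VALUES (%s, %s)", (username, city))
    session.execute(
        "UPDATE users SET city = %s WHERE username = %s", (city, username))
    rows = session.execute(
        "SELECT username, city FROM users WHERE username = %s", (username,))
    session.execute(
        "DELETE FROM users WHERE username = %s", (username,))
    return list(rows)


if __name__ == "__main__":
    session = RecordingSession()
    run_dml(session, "alice", "Berlin")
    for query, parameters in session.statements:
        print(query, parameters)
```

With the DataStax Python driver installed, a real session would typically be obtained via `Cluster(["127.0.0.1"]).connect("my_keyspace")` from `cassandra.cluster`; the address and keyspace name here are placeholders.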
  48. CQL / BATCH: group related modifications (INSERT, UPDATE, DELETE); atomic operation
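
As an illustration of the batch syntax, the helper below wraps individual modification statements in `BEGIN BATCH … APPLY BATCH;`. The helper name `build_batch` and the tables `users` and `users_by_city` are hypothetical; the sketch only builds the statement text and does not send it anywhere.

```python
from typing import List


def build_batch(statements: List[str]) -> str:
    """Wrap INSERT/UPDATE/DELETE statements in a single CQL batch."""
    if not statements:
        raise ValueError("a batch needs at least one statement")
    # Normalize each statement so it ends with exactly one semicolon.
    body = "\n".join("  " + s.strip().rstrip(";") + ";" for s in statements)
    return "BEGIN BATCH\n" + body + "\nAPPLY BATCH;"


if __name__ == "__main__":
    print(build_batch([
        "INSERT INTO users (username, city) VALUES ('alice', 'Berlin')",
        "INSERT INTO users_by_city (city, username) VALUES ('Berlin', 'alice')",
    ]))
```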
  49. CQL types: boolean, int (32bit), bigint (64bit), float, double, decimal ("BigDecimal"), varint ("BigInteger"), ascii, text (= varchar), blob, inet, timestamp, uuid, timeuuid
  50. CQL collection types: list<foo>, set<foo>, map<foo, bar>. Since C* 2.1 collections can contain any type - even other collections.
  51. CQL composite types: user types (C* 2.1) are composite types with named fields; tuple types (C* 2.1) are unstructured lists of values
  52. CQL / user types: CREATE TYPE address ( street text, zip int, city text); CREATE TABLE users ( username text, addresses map<text, frozen<address>>, ...
  53. Cassandra Data Modeling: access by key; no access by arbitrary WHERE clause; duplicate data (it’s ok!); aggregate data; build application-maintained indexes
  54. RDBMS modeling
  55. C* modeling
  56. Data Modeling with RDBMS: driven by "How can I store something right?" and "What answers do I have?"
  57. Data Modeling with NoSQL: driven by "How can I access something right?" and "What questions do I have?"
  58. Data Modeling Basics: work top-down. Think about: What does the application do? What are the access patterns? Now design the data model.
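
The query-first approach from the data modeling slides can be mimicked with plain dictionaries: one "table" per access pattern, each reachable by its key, with the data written to both. The tables `users_by_name` and `users_by_city`, the functions, and the sample records are invented for this sketch.

```python
from collections import defaultdict

# One structure per question the application asks (access pattern).
users_by_name = {}                 # "Which user has this name?"
users_by_city = defaultdict(dict)  # "Which users live in this city?"


def add_user(username, city, email):
    """Write the same record to every table that serves a query."""
    record = {"username": username, "city": city, "email": email}
    users_by_name[username] = record
    users_by_city[city][username] = dict(record)  # duplicated on purpose


def find_user(username):
    """Lookup by key, no scan over all users."""
    return users_by_name.get(username)


def users_in(city):
    """Lookup by key, rows ordered by username."""
    rows = users_by_city.get(city, {})
    return [rows[name] for name in sorted(rows)]


if __name__ == "__main__":
    add_user("bob", "Berlin", "bob@example.org")
    add_user("alice", "Berlin", "alice@example.org")
    print(find_user("bob"))
    print([user["username"] for user in users_in("Berlin")])
```

The duplication costs extra writes and storage, but every read becomes a single lookup by key, which is the trade-off the slides describe.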
  59. Data Modeling: http://de.slideshare.net/planetcassandra/cassandra-day-sv-2014-fundamentals-of-apache-cassandra-data-modeling http://de.slideshare.net/planetcassandra/data-modeling-with-travis-price
  60. Accessing Cassandra
  61. Command Line: cqlsh (CQL shell); nodetool (node/cluster administration)
  62. GUI: DevCenter, a visual query tool
  63. Stress test? Cassandra 2.1 comes with an improved stress tool: simulates read+write workloads; uses configurable data; works against older C* versions, too
  64. DataStax APLv2 Open Source Drivers: for Java, for Python, for C#, for Scala / Spark. https://github.com/datastax/ or http://www.datastax.com/download
  65. Native protocol: C*’s own network protocol for clients; request multiplexing; schema change notifications; cluster change notifications
  66. Third Party Drivers: for a huge number of languages
  67. Mappers: high-level mappers exist at least for Java. Special case: Scala, due to its strong and complex type model (DataStax OSS Spark driver)
  68. Spark + Hadoop: Yes - works really well. Note: Spark is about 100x faster
  69. Clusters
  70. Cluster sizes: C* works with a few nodes; C* works with several hundred / thousand nodes
  71. Cluster setup: configure for multiple data centers; plan for multi-DC setup :)
  72. Cluster experience. Remember: a single Cassandra cluster works over multiple data centers all over the world. "Disaster proven": hurricanes, Amazon DC outages
  73. Apache Cassandra Future
  74. Cassandra 3.0 (in development): User Defined Functions; aggregate functions; functional indexes; workload recording + playback; better SSTables; fully off-heap row cache; better serial consistency; indexes w/ high cardinality. Subject to change!!!
  75. Get active!
  76. Cassandra Community: http://cassandra.apache.org/ http://planetcassandra.org/ - Blog http://www.slideshare.net/planetcassandra/presentations http://de.slideshare.net/DataStax/presentations
  77. Cassandra Community: https://www.youtube.com/user/PlanetCassandra https://www.youtube.com/user/DataStax http://www.datastax.com/dev/blog/ http://www.datastax.com/docs/ Users Mailing List: user@cassandra.apache.org
  78. Free C* Training! http://planetcassandra.org/cassandra-training/
  79. Get involved! Ask questions, submit RFEs or experiences to the user mailing list user@cassandra.apache.org. Answers arrive quickly!
  80. Live Demo: User Defined Functions
  81. C* 3.0 UDFs: users create functions using CREATE FUNCTION … LANGUAGE … AS …; Java, JavaScript, Scala, Groovy, JRuby, Jython; functions work on all nodes
  82. C* 3.0 UDFs Example: CREATE FUNCTION sin(input double) RETURNS double LANGUAGE javascript AS 'Math.sin(input)'; This is JavaScript!
  83. UDFs for what? Own aggregation code - e.g. SELECT sum(value) FROM table WHERE …; Functional indexes - e.g. CREATE INDEX idx ON table ( myFunction(colname) ); Targeted for C* 3.0
  84. Thanks for your attention. Download Apache Cassandra at http://cassandra.apache.org/ Robert Stupp, @snazy, snazy@snazy.de, de.slideshare.net/RobertStupp
  85. Q & A
  87. BACKUP SLIDES: User-Defined-Functions Demo