Cassandra.Link

9/23/2020

Reading time: 5 min

Cassandra Architecture FTW

by Jeffrey Carpenter

  1. Cassandra Architecture FTW!
  2. Cassandra Architecture FTW!
  3. Cassandra won because it solved a problem that hadn’t yet been solved
  4. What’s wrong with RDBMS?
      • Pros
        – Relational data modeling is well understood
        – SQL is easy to use and ubiquitous
        – ACID transactions - data integrity
      • Cons
        – Scaling is hard
        – Sharding and replication have side-effects (performance, reliability, cost)
        – Denormalize to get performance gains
      For fun: Relational Scaling and the Temple of Gloom
  5. The NoSQL Revolution (~2009)
      • Key-value – Dynamo, Riak, Voldemort, Redis, Memcached
      • Column-oriented – BigTable, HBase
      • Document – MongoDB, DocumentDB, CouchDB
      • Graph – Neo4j, DSE Graph
      • Multi-model – DataStax Enterprise, Cosmos DB
      • Relational counter-revolution – AWS Aurora, Google Spanner
  6. Apache Cassandra Architecture Qualities
      • Distributed, decentralized
      • Elastic scalability
      • Commodity hardware?
      • High performance
      • High availability / fault tolerant
      • Tuneable consistency
      Apache Cassandra®, Apache Software Foundation
  7. Two lies and a truth
      • Cassandra is column-oriented
      • Cassandra is a schemaless database
      • Cassandra is eventually consistent
      They’re all lies, in a way
  8. Problems Cassandra is Especially Good At
      • Large scale storage
        – >10s of TB
      • Lots of writes
        – Time-series data, IoT
      • Statistics and analytics
        – For example, as a Spark data source
      • Geographic distribution
        – Multiple data centers
      Use cases shown: Personalization, Customer 360, Recommendation, Fraud Detection, Inventory Management, Identity Management, Security, Supply Chain
  9. A system must be designed for distribution from the beginning in order to scale the most effectively
      © DataStax, All Rights Reserved.
  10. Cluster Topology
      • Organization
        – Nodes
        – Racks
        – Data Centers
      • Goals
        – Distribute copies (replicas) for high availability
        – Route queries to nearby nodes for high performance
      • Approaches
        – Gossip
        – Snitches
  11. Clusters and Rings
      • Organization
        – Tokens
        – Token Ring
      • Goals
        – Distribute data evenly across nodes
      • Approaches
        – Partitioners
        – Virtual nodes
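
The token ring on slide 11 can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the `TokenRing` class, the `token_for` helper, the node names, and the use of MD5 as a stand-in partitioner are all assumptions made for the example (Cassandra's default partitioner is Murmur3-based).

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Hash a string to a 64-bit token (MD5 is a stand-in partitioner)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy token ring: each node claims several tokens (virtual nodes)."""

    def __init__(self, nodes, vnodes_per_node=8):
        self._owner_by_token = {}
        for node in nodes:
            for i in range(vnodes_per_node):
                self._owner_by_token[token_for(f"{node}#{i}")] = node
        self._tokens = sorted(self._owner_by_token)

    def owner(self, partition_key: str) -> str:
        """Return the node owning the first token at or after the key's token."""
        token = token_for(partition_key)
        # Wrap around to the first token when the key hashes past the last one.
        index = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._owner_by_token[self._tokens[index]]


ring = TokenRing(["node-a", "node-b", "node-c"])
counts = {}
for n in range(3000):
    node = ring.owner(f"customer-{n}")
    counts[node] = counts.get(node, 0) + 1
print(counts)  # keys per node; more vnodes per node generally evens the split
```

Each node claiming several tokens is the idea behind virtual nodes: a node's share of the ring is the sum of many small ranges rather than one large one.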
  12. Replication and Consistency
      • Organization
        – Client (with driver)
        – Coordinator Node
        – Replica nodes
      • Goals
        – Consistent data even when nodes are down/unresponsive
      • Approaches
        – Replication Factor / Strategy
        – Consistency Level
  13. Cassandra works because it leverages proven distributed system design patterns
  14. Wait, who is this guy?
  15. Proven Distributed System Techniques
      • Gossip
      • Failure detection (Phi)
      • Partitioning
      • Leaderless replication
      • Hinted Handoff
      • Bloom filters
      • Consensus algorithms (Paxos)
      • Log Structured Merge Storage Engine
        – Memtables
        – SSTables
        – Commit log
        – Compaction
        – Tombstones
      • Thrift API
      • SEDA
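
To show how the storage-engine pieces listed on slide 15 fit together, here is a toy log-structured merge store in Python. It is a sketch under simplifying assumptions, not Cassandra's engine: `TinyLSM` and its methods are hypothetical names, everything lives in memory, and compaction drops tombstones immediately, whereas a real system must keep them for a grace period so deletes reach every replica.

```python
TOMBSTONE = object()  # marker written in place of a value on delete


class TinyLSM:
    """Toy LSM store: commit log + memtable + immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of unflushed writes
        self.memtable = {}     # newest writes, held in memory
        self.sstables = []     # immutable sorted tables, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))  # log first, for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)  # a delete is just another write

    def flush(self):
        """Freeze the memtable into a new sorted table and reset the log."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
            self.commit_log.clear()

    def get(self, key):
        """Check the memtable, then tables from newest to oldest."""
        for table in [self.memtable, *reversed(self.sstables)]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        """Merge all tables, keeping the newest value and dropping tombstones."""
        merged = {}
        for table in self.sstables:  # oldest first, so newer values win
            merged.update(table)
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.sstables = [live] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # second write fills the memtable and triggers a flush
db.put("a", 3)
db.delete("b")   # tombstone write triggers another flush
print(len(db.sstables), db.get("a"), db.get("b"))
db.compact()
print(db.sstables)
```

Writes never modify an existing table in place; reads reconcile the newest version across tables, and compaction later merges the tables to reclaim space.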
  16. Log Structured Merge Storage Engine
  17. Cassandra’s data model is the key enabler of its high performance and scalability
  18. Terminology - Cassandra’s Data Model
  19. Terminology – Partition and Clustering Keys
  20. Partition and Clustering Key Example
      Labeling: K – partition key, C – clustering key
      PRIMARY KEY ((customer_id), contact_time)
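
The effect of the `PRIMARY KEY ((customer_id), contact_time)` declaration on slide 20 can be modeled in Python: the partition key decides which partition a row belongs to, and the clustering key decides the row's sorted position inside that partition. This is only an in-memory analogy; the `ContactsTable` class and the `note` column are hypothetical and not taken from the slides.

```python
import bisect
from collections import defaultdict


class ContactsTable:
    """In-memory model of PRIMARY KEY ((customer_id), contact_time)."""

    def __init__(self):
        # partition key -> rows kept sorted by clustering key
        self._partitions = defaultdict(list)

    def insert(self, customer_id, contact_time, note):
        rows = self._partitions[customer_id]
        times = [row[0] for row in rows]
        index = bisect.bisect_left(times, contact_time)
        if index < len(rows) and rows[index][0] == contact_time:
            rows[index] = (contact_time, note)  # same primary key: overwrite
        else:
            rows.insert(index, (contact_time, note))

    def select(self, customer_id, since=None):
        """Read one partition, optionally from a clustering-key lower bound."""
        rows = self._partitions.get(customer_id, [])
        if since is None:
            return list(rows)
        times = [row[0] for row in rows]
        return rows[bisect.bisect_left(times, since):]


table = ContactsTable()
table.insert("cust-1", "2020-03-01T10:00", "phone call")
table.insert("cust-1", "2020-01-15T09:30", "email")
table.insert("cust-2", "2020-02-01T12:00", "chat")
table.insert("cust-1", "2020-03-01T10:00", "phone call (updated)")
print(table.select("cust-1"))
print(table.select("cust-1", since="2020-02-01T00:00"))
```

A query that names the partition key touches a single partition, and a range over the clustering key is a slice of already-sorted rows, which is why this layout suits "all contacts for a customer, in time order" lookups.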
  21. Cassandra survived because it didn’t over-extend itself
  22. Tuneable Consistency and Consistency Levels
      • ONE, TWO, THREE
        – Useful for speed
      • ANY (write only, use with care)
      • ALL
        – Number of nodes to respond = RF
        – Overly restrictive?
      • QUORUM
        – (RF / 2) + 1, with the division rounded down
        – Frequently very useful
      • LOCAL_ONE, LOCAL_QUORUM
        – Similar to above, but nodes must be in local data center
      • EACH_QUORUM
        – Quorum of nodes must respond in each data center
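
The replica counts behind the single-data-center levels on slide 22 reduce to a little arithmetic. The sketch below uses hypothetical helper names (`quorum`, `required_replicas`) and deliberately leaves out ANY and the data-center-aware levels, which need more context than a single replication factor.

```python
def quorum(replication_factor: int) -> int:
    """(RF / 2) + 1 with integer division, as on the slide."""
    return replication_factor // 2 + 1


def required_replicas(level: str, replication_factor: int) -> int:
    """Replicas that must respond for a single-data-center consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {level}")


for rf in (1, 2, 3, 5, 6):
    print(f"RF={rf}: QUORUM={quorum(rf)}, ALL={required_replicas('ALL', rf)}")
```

Note that an even replication factor gains no extra failure tolerance at QUORUM: RF = 3 and RF = 4 both tolerate only one unavailable replica.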
  23. Strong Consistency vs. Eventual Consistency
      • Eventual consistency
        – e.g. W = ONE, R = QUORUM
        – Use cases
          · Write heavy
          · Data not read immediately
      • Strong consistency
        – e.g. W = QUORUM, R = QUORUM
        – Use cases
          · Read after write
          · Data loading with validation
      • Strong consistency formula
        – R + W > RF = strong consistency
        – R: the read replica count required by the consistency level
        – W: the write replica count required by the consistency level
        – RF: replication factor
        – Example: W = QUORUM, R = QUORUM, RF = 3
          · 2 + 2 > 3
        – Implication: all client reads will see the most recent write
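
Why R + W > RF gives strong consistency can be checked by brute force: when the inequality holds, every possible set of W written replicas shares at least one replica with every possible set of R read replicas, so a read always reaches a replica holding the latest write. The function names below are hypothetical and the check is a sketch for small replication factors only.

```python
from itertools import combinations


def is_strongly_consistent(r: int, w: int, rf: int) -> bool:
    """The formula from the slide: R + W > RF."""
    return r + w > rf


def read_always_sees_write(r: int, w: int, rf: int) -> bool:
    """Brute force: does every read set overlap every write set?"""
    replicas = range(rf)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# Slide example: QUORUM writes and reads with RF = 3 (2 + 2 > 3).
print(is_strongly_consistent(r=2, w=2, rf=3), read_always_sees_write(2, 2, 3))
# W = ONE, R = QUORUM with RF = 3 (1 + 2 is not greater than 3).
print(is_strongly_consistent(r=2, w=1, rf=3), read_always_sees_write(2, 1, 3))
```

The two functions agree for every combination, which is the pigeonhole argument behind the formula: R + W replicas cannot fit into RF slots without overlapping.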
  24. Inventory and Tuneable Consistency
      Approach | Example | Scope
      QUORUM consistency for reads and writes | Ensuring latest inventory counts are always read | Data Tier
      Lightweight Transaction | Updating inventory counts | Data Tier
      Logged Batch | Writing to multiple denormalized tables | Data Tier
      Retrying failed calls | Data synchronization, reservation processing | Service / Application Tier
      Compensating processes | Verifying reservation processing | System
      Customer service remediation | Rebooking, back order, substitution | System
      Scale labels on the slide: Eventual consistency, Strong consistency
  25. Microservices and Polyglot Persistence
      [Diagram. Services: Service A, Service B, Service C, Service D, Service E. Data store types: Tabular, Key-value, Relational, Document, Graph. Data categories: Core application data, Reference data, Content, Customer relationship data, Legacy, low volume data.]
  26. Cassandra will remain viable for many years because of its extensibility and pluggability
  27. Pluggable components shown:
      • Snitch
      • Partitioner
      • Compaction
      • Replication
      • Secondary Index
      • Storage Engine?
      • Auth/Auth
  28. Cassandra’s deployment flexibility can’t be matched by cloud vendor database offerings
  29. Consistency Level and Deployment
      Active-Active, Multi-Region
        – EACH_QUORUM may limit availability if connection to one region/data center is down
        – LOCAL_QUORUM, relying on Cassandra to complete writes to remote data centers
        – QUORUM is a reasonable middle-ground approach
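
The availability trade-off on slide 29 can be made concrete with a small model. Given a replication factor per data center and the number of replicas currently reachable in each, the sketch below checks whether a request at a given level could be satisfied. The `can_satisfy` function and the data center names are hypothetical, and the model ignores timeouts, hints, and everything else a real coordinator handles.

```python
def quorum(n: int) -> int:
    return n // 2 + 1


def can_satisfy(level, rf_by_dc, live_by_dc, local_dc):
    """Can enough reachable replicas respond for this consistency level?"""
    if level == "LOCAL_QUORUM":
        return live_by_dc[local_dc] >= quorum(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        return all(live_by_dc[dc] >= quorum(rf) for dc, rf in rf_by_dc.items())
    if level == "QUORUM":
        # QUORUM counts replicas across all data centers together.
        return sum(live_by_dc.values()) >= quorum(sum(rf_by_dc.values()))
    raise ValueError(f"level not covered by this sketch: {level}")


# Three regions with RF = 3 each; one region is unreachable.
rf = {"us-east": 3, "eu-west": 3, "ap-south": 3}
live = {"us-east": 3, "eu-west": 3, "ap-south": 0}
results = {
    level: can_satisfy(level, rf, live, local_dc="us-east")
    for level in ("EACH_QUORUM", "LOCAL_QUORUM", "QUORUM")
}
print(results)
```

With three regions, losing one blocks EACH_QUORUM but leaves QUORUM (5 of 9) and LOCAL_QUORUM available. With only two regions of equal replication factor, QUORUM would also fail when one region is down, so the "middle ground" depends on the topology.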
  30. Consistency Level and Deployment
      Separate data center for analytics
        – Writes in “online DC” – LOCAL_QUORUM
          • Writes to analytics DC in background
          • Analytic DC availability/performance decoupled from online DC
          • EACH_QUORUM, QUORUM overkill, unless “real-time” analytics required
        – Reads in “analytics DC” – LOCAL_QUORUM
          • Or even LOCAL_ONE
  31. Cassandra is most powerful when combined with complementary technologies to form a data platform
  32. Spark + Cassandra
      • Access Cassandra from Spark via DataStax connector
      • Co-locate Spark and Cassandra
      [Diagram, top to bottom: Spark SQL, Spark Streaming, MLlib, GraphX, SparkR; Spark Core Engine; DataStax Spark-Cassandra Connector; Cassandra.]
      DataStax is a registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
  33. DSE Analytics
      [Diagram labels: Application, Real Time Operations, Cassandra, Analytics, Analytics Queries, Your Analytics, Real Time Replication, Single DSE Cluster.]
      Streaming, ad-hoc, and batch
      • High-performance
      • Workload management
      • SQL reporting
      Compared to self-managed Spark cluster:
      • No ETL
      • True HA without ZooKeeper
  34. DSE Core - Certified Apache Cassandra
      • The best distribution of Apache Cassandra™
      • Production certified Cassandra
      • Performance improvements
      • Advanced Security
      • Multi-tenancy through row-level access control
      • Advanced Replication
      • Great for retail and IoT use cases
  35. KillrVideo – a reference application: https://github.com/killrvideo
  36. DataStax Academy – a place to learn: https://academy.datastax.com
  37. Contact: jeff.carpenter@datastax.com, @jscarp, jeffreyscarpenter, medium.com/@jscarp
