
Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real-time data platforms.

10/19/2017

Reading time: 7 min

Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* Summit 2016

by Abhishek Verma


  1. Running Cassandra on Apache Mesos across multiple datacenters at Uber. Abhishek Verma (verma@uber.com)
  2. About me
     ● MS (2010) and PhD (2012) in Computer Science from University of Illinois at Urbana-Champaign
     ● 2 years at Google; worked on Borg and Omega, first author of the Borg paper
     ● ~1 year at TCS Research, Mumbai
     ● Currently at Uber, working on running Cassandra on Mesos
  3. “Transportation as reliable as running water, everywhere, for everyone”
  4. “Transportation as reliable as running water, everywhere, for everyone” 99.99%
  5. “Transportation as reliable as running water, everywhere, for everyone” efficient
  6. Cluster Management @ Uber
     ● Statically partitioned machines across different services
     ● Move from custom deployment system to everything running on Mesos
     ● Gain efficiency by increasing machine utilization
       ○ Co-locate services on the same machine
       ○ Can lead to 30% fewer machines [1]
     ● Build stateful service frameworks to run on Mesos
     [1] “Large-scale cluster management at Google with Borg”, EuroSys 2015
  7. Apache Mesos
     ● Mesos abstracts CPU, memory, storage away from machines
       ○ program like it’s a single pool of resources
     ● Linear scalability
     ● High availability
     ● Native support for launching containers
     ● Pluggable resource isolation
     ● Two-level scheduling (see the sketch below)
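In two-level scheduling, Mesos makes resource offers and each framework decides which offers to accept or decline. A minimal framework-side sketch against the Mesos Java API; the class name, CPU threshold, and the omitted task-building step are illustrative, not taken from dcos-cassandra-service:

```java
import java.util.List;
import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

// Sketch of the framework side of Mesos two-level scheduling: the master
// offers resources, the framework picks which offers to use. In a real
// framework this logic lives in Scheduler.resourceOffers().
public class OfferHandler {

    // Called with each batch of resource offers from the Mesos master.
    public void handleOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
        for (Protos.Offer offer : offers) {
            if (cpusIn(offer) >= 4.0) {             // illustrative minimum
                // Build a Protos.TaskInfo for a Cassandra node here and
                // call driver.launchTasks(...); omitted for brevity.
            } else {
                driver.declineOffer(offer.getId()); // hand resources back
            }
        }
    }

    // Sum the scalar "cpus" resource in an offer.
    private double cpusIn(Protos.Offer offer) {
        double cpus = 0;
        for (Protos.Resource r : offer.getResourcesList()) {
            if ("cpus".equals(r.getName())) {
                cpus += r.getScalar().getValue();
            }
        }
        return cpus;
    }
}
```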
  8. Apache Cassandra
     ● Horizontal scalability
       ○ Scales reads and writes linearly as new nodes are added
     ● High availability
       ○ Fault tolerant with tunable consistency levels
     ● Low latency, solid performance
     ● Operational simplicity
       ○ Homogeneous cluster, no SPOF
     ● Rich data model
  9. DC/OS Cassandra Service (https://github.com/mesosphere/dcos-cassandra-service)
     Uber: Abhishek Verma, Karthik Gandhi, Matthias Eichstaedt, Varun Gupta, Zhitao Li
     Mesosphere: Chris Lambert, Gabriel Hartmann, Keith Chambers, Kenneth Owens, Mohit Soni
  10. Cassandra service architecture [diagram]: the dcos-cassandra-service framework registers with the leading Mesos master (a standby master and a ZooKeeper quorum provide HA) and exposes a web interface and a control-plane API; Mesos agents run the C* nodes of multiple clusters (C* Cluster 1, C* Cluster 2); Aurora runs the deployment system in each datacenter (DC1, DC2); client apps talk to the nodes over the CQL interface.
  11. Cassandra Mesos primitives
     ● Mesos containerizer
     ● Override 5 ports in configuration (storage_port, ssl_storage_port, native_transport_port, rpc_port, jmx_port)
     ● Use persistent volumes (sketched below)
       ○ Data stored outside of the sandbox directory
       ○ Offered to the same task if it crashes and restarts
     ● Use dynamic reservation
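A persistent volume is expressed as a disk resource carrying a persistence ID and a volume mount; Mesos then re-offers that volume, data intact, if the task crashes and restarts. A sketch using the Mesos protobuf API, with an illustrative volume ID, size, and container path:

```java
import org.apache.mesos.Protos;

// Sketch: a persistent volume is a "disk" resource with a persistence ID
// and a volume mount. Mesos offers it back to the same role with the data
// intact after a crash. The ID, size, and path below are illustrative.
public class PersistentVolumeExample {
    public static Protos.Resource cassandraDataVolume() {
        return Protos.Resource.newBuilder()
            .setName("disk")
            .setType(Protos.Value.Type.SCALAR)
            .setScalar(Protos.Value.Scalar.newBuilder()
                .setValue(2 * 1024 * 1024))                // MB, i.e. ~2 TB
            .setRole("cassandra")
            .setDisk(Protos.Resource.DiskInfo.newBuilder()
                .setPersistence(Protos.Resource.DiskInfo.Persistence.newBuilder()
                    .setId("cassandra-node-1-data"))       // stable volume ID
                .setVolume(Protos.Volume.newBuilder()
                    .setContainerPath("cassandra-data")    // outside the sandbox
                    .setMode(Protos.Volume.Mode.RW)))
            .build();
    }
}
```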
  12. Custom seed provider (3 nodes, 2 seeds): each node queries the scheduler at http://scheduler/seeds and gets back its seed role plus the current seed list, for example:
     ○ Node 1 (10.0.0.1): { isSeed: true, seeds: [ ] }
     ○ Node 2 (10.0.0.2): { isSeed: true, seeds: [ 10.0.0.1 ] }
     ○ Node 3 (10.0.0.3): { isSeed: false, seeds: [ 10.0.0.1, 10.0.0.2 ] }
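Cassandra loads seeds through the org.apache.cassandra.locator.SeedProvider interface configured in cassandra.yaml, so a provider like the one on this slide can ask the scheduler instead of using a static list. A rough sketch, assuming the response format shown above; the parameter name and the crude string-based JSON parsing are illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.locator.SeedProvider;

// Sketch of a custom seed provider: each node asks the scheduler which
// nodes are currently seeds instead of relying on a static seed list.
// Error handling and proper JSON parsing are omitted for brevity.
public class SchedulerSeedProvider implements SeedProvider {
    private final String endpoint;

    // Cassandra passes the parameters configured under seed_provider
    // in cassandra.yaml to this constructor.
    public SchedulerSeedProvider(Map<String, String> params) {
        this.endpoint = params.getOrDefault("seeds_url", "http://scheduler/seeds");
    }

    @Override
    public List<InetAddress> getSeeds() {
        List<InetAddress> seeds = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(endpoint).openStream()))) {
            StringBuilder body = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) body.append(line);
            // Assumed shape: {"isSeed": true, "seeds": ["10.0.0.1", "10.0.0.2"]}
            String json = body.toString();
            int start = json.indexOf('[') + 1, end = json.indexOf(']');
            for (String ip : json.substring(start, end).split(",")) {
                ip = ip.replaceAll("[\"\\s]", "");
                if (!ip.isEmpty()) seeds.add(InetAddress.getByName(ip));
            }
        } catch (Exception e) {
            throw new RuntimeException("Could not fetch seeds from " + endpoint, e);
        }
        return seeds;
    }
}
```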
  13. Cassandra Service: Features
     ● Custom seed provider
     ● Increasing cluster size
     ● Changing Cassandra configuration
     ● Replacing a dead node
     ● Backup/Restore
     ● Cleanup
     ● Repair
  14. Plan, Phases and Blocks
     ● Plan
       ○ Phases
         ■ Reconciliation
         ■ Deployment
         ■ Backup
         ■ Restore
         ■ Cleanup
         ■ Repair
  15. Spinning up a new Cassandra cluster: https://www.youtube.com/watch?v=gbYmjtDKSzs
  16. Automate Cassandra operations (see the sketch below)
     ● Repair
       ○ Synchronize all data across replicas
         ■ Last write wins
       ○ Anti-entropy mechanism
       ○ Repair primary key range node-by-node
     ● Cleanup
       ○ Remove data whose ownership has changed
         ■ Because of addition or removal of nodes
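Both operations map onto standard nodetool subcommands, run one node at a time. A minimal sketch of that loop; the host list is illustrative, and a real orchestrator would add scheduling, retries, and throttling:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of automated maintenance: run `nodetool repair -pr` (repair only
// each node's primary token range) and `nodetool cleanup` (drop data whose
// ownership moved away) sequentially, node by node.
public class MaintenanceRunner {
    public static void main(String[] args) throws Exception {
        List<String> nodes = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
        for (String host : nodes) {
            run("nodetool", "-h", host, "repair", "-pr");
            run("nodetool", "-h", host, "cleanup");
        }
    }

    private static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("failed: " + String.join(" ", cmd));
        }
    }
}
```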
  17. Cleanup operation: https://www.youtube.com/watch?v=VxRLSl8MpYI
  18. Failure scenarios
     ● Executor failure
       ○ Restarted automatically
     ● Cassandra daemon failure
       ○ Restarted automatically
     ● Node failure
       ○ Manual REST endpoint to replace node (sketched below)
     ● Scheduling framework failure
       ○ Existing nodes keep running, new nodes cannot be added
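Replacing a dead node is a deliberate, manual call against the framework's control-plane API rather than something automatic. The slides don't show the route, so the path below is hypothetical; the point is only that replacement amounts to one HTTP request to the scheduler:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: trigger replacement of a dead node via the scheduler's
// REST control plane. The /v1/nodes/{id}/replace path is HYPOTHETICAL;
// consult the dcos-cassandra-service API docs for the real route.
public class ReplaceNode {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://scheduler/v1/nodes/node-2/replace"); // hypothetical
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```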
  19. Experiments
  20. Cluster startup. For each node in the cluster:
     1. Receive and accept offer
     2. Launch task
     3. Fetch executor, JRE, Cassandra binaries from S3/HDFS
     4. Launch executor
     5. Launch Cassandra daemon
     6. Wait for its mode to transition STARTING -> JOINING -> NORMAL (see the sketch below)
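Step 6 can be implemented by polling Cassandra's StorageService MBean over JMX, whose OperationMode attribute reports those same modes. A sketch, assuming the default JMX port 7199 and an illustrative host:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: poll the node's StorageService MBean until the daemon reports
// NORMAL, i.e. it has finished STARTING -> JOINING -> NORMAL.
public class WaitForNormal {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://10.0.0.1:7199/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmx.getMBeanServerConnection();
            ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
            String mode;
            do {
                mode = (String) mbs.getAttribute(ss, "OperationMode");
                System.out.println("mode = " + mode);
                Thread.sleep(5_000);
            } while (!"NORMAL".equals(mode));
        }
    }
}
```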
  21. Cluster startup time: the framework can start ~ one new node per minute
  22. Tuning JVM garbage collection: changed from the CMS to the G1 garbage collector
     ● Left (CMS): https://github.com/apache/cassandra/blob/cassandra-2.2/conf/cassandra-env.sh#L213
     ● Right (G1): https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_tune_jvm_c.html?scroll=concept_ds_sv5_k4w_dk__tuning-java-garbage-collection
  23. Tuning JVM garbage collection, results (cassandra-stress, 32-thread client):

      Metric                          CMS       G1        G1 : CMS factor
      op rate                         1951      13765      7.06
      latency mean (ms)               3.6       0.4        9.00
      latency median (ms)             0.3       0.3        1.00
      latency 95th percentile (ms)    0.6       0.4        1.50
      latency 99th percentile (ms)    1.0       0.5        2.00
      latency 99.9th percentile (ms)  11.6      0.7       16.57
      latency max (ms)                13496.9   4626.9     2.92

      The G1 garbage collector is much better without any tuning.
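The switch itself is a small edit to conf/cassandra-env.sh (the file linked on the previous slide): comment out the CMS flags and enable G1, roughly as below. The exact flag set and pause target are workload-dependent assumptions, not the tuned production values:

```
# conf/cassandra-env.sh (sketch): disable the default CMS collector...
#JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
#JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
#JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

# ...and enable G1 instead, with a pause-time goal.
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
```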
  24. Cluster setup
     ● 3 nodes
     ● Local DC
     ● 24 cores, 128 GB RAM, 2 TB SAS drives
     ● Cassandra running on bare metal
     ● Cassandra running in a Mesos container
  25. Read latency, bare metal vs. Mesos [charts]
     ● Bare metal: mean 0.38 ms, P95 0.74 ms, P99 0.91 ms
     ● Mesos: mean 0.44 ms, P95 0.76 ms, P99 0.98 ms
  26. Read throughput, bare metal vs. Mesos [charts]
  27. Write latency, bare metal vs. Mesos [charts]
     ● Bare metal: mean 0.43 ms, P95 0.94 ms, P99 1.05 ms
     ● Mesos: mean 0.48 ms, P95 0.93 ms, P99 1.26 ms
  28. Write throughput, bare metal vs. Mesos [charts]
  29. Running across datacenters
     ● Four datacenters
       ○ Each running a dcos-cassandra-service instance
       ○ Sync datacenter phase
         ■ Periodically exchange seeds with external DCs
     ● Cassandra nodes gossip topology
       ○ Discover nodes in other datacenters
  30. Asynchronous cross-DC replication latency (see the sketch below)
     ● Write a row to dc1 using consistency level LOCAL_ONE
       ○ Write timestamp to a file when the operation completes
     ● Spin in a loop reading the same row with consistency LOCAL_ONE in dc2
       ○ Write timestamp to a file when the operation completes
     ● The difference between the two gives the asynchronous replication latency
       ○ p50: 44.69 ms, p95: 46.38 ms, p99: 47.44 ms
     ● Round-trip ping latency: 77.8 ms
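A sketch of that measurement with the DataStax Java driver: write at LOCAL_ONE through a dc1 contact point, then spin-read at LOCAL_ONE through a dc2 contact point until the row appears. Contact points, keyspace, and table are illustrative; the slide's version wrote timestamps to files rather than printing the difference:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Sketch of the cross-DC replication latency measurement: write a row in
// dc1 with LOCAL_ONE, spin reading it in dc2 with LOCAL_ONE, and take the
// difference between the two completion timestamps.
public class ReplicationLatency {
    public static void main(String[] args) throws Exception {
        try (Cluster dc1 = Cluster.builder().addContactPoint("host-in-dc1").build();
             Cluster dc2 = Cluster.builder().addContactPoint("host-in-dc2").build()) {
            Session w = dc1.connect("test_ks");
            Session r = dc2.connect("test_ks");

            // Write completes against dc1 only (LOCAL_ONE).
            w.execute(new SimpleStatement(
                    "INSERT INTO probe (id, v) VALUES (42, 'x')")
                .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE));
            long written = System.nanoTime();

            // Spin until the row is visible in dc2.
            SimpleStatement read = new SimpleStatement(
                    "SELECT v FROM probe WHERE id = 42");
            read.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
            while (r.execute(read).one() == null) { /* not replicated yet */ }
            long seen = System.nanoTime();

            System.out.printf("replication latency: %.2f ms%n",
                (seen - written) / 1e6);
        }
    }
}
```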
  31. Cassandra on Mesos in production
     ● ~20 clusters replicating across two datacenters (west and east coast)
     ● ~300 machines across the two datacenters
     ● Largest 2 clusters: more than a million writes/sec and ~100k reads/sec
     ● Mean read latency: 13 ms; mean write latency: 25 ms
     ● Mostly use the LOCAL_QUORUM consistency level
  32. Questions? verma@uber.com
  33. Cluster startup, revisited. For each node in the cluster:
     1. Receive and accept offer
     2. Launch task
     3. Fetch executor, JRE, Cassandra binaries from S3/HDFS
     4. Launch executor
     5. Launch Cassandra daemon
     6. Wait for its mode to transition STARTING -> JOINING -> NORMAL
     Problem observed: Aurora hogging offers (next slide)
  34. Aurora hogs offers
     ● Aurora is designed to be the only framework running on Mesos, controlling all the machines
     ● It holds on to all received offers
       ○ Does not accept or reject them
     ● Mesos waits for the --offer_timeout duration and then rescinds the offer
     ● --offer_timeout config: “Duration of time before an offer is rescinded from a framework. This helps fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers. If not set, offers do not timeout.”
  35. Long-term solution: dynamic reservations (sketched below)
     ● Dynamically reserve all of a machine’s resources for the “cassandra” role
     ● Resources are offered only to cassandra frameworks
     ● Improves node startup time: 30 s/node
     ● Node failure replacement or updates are much faster
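Dynamic reservation is a RESERVE operation applied when accepting an offer: the resources get tagged with the "cassandra" role, so Mesos re-offers them only to frameworks registered in that role. A protobuf-level sketch with illustrative amounts and principal:

```java
import java.util.Arrays;
import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

// Sketch of a dynamic reservation: accept an offer with a RESERVE
// operation so the resources stay tagged for the "cassandra" role and
// are only re-offered to cassandra frameworks afterwards.
public class ReserveForCassandra {
    public static void reserve(SchedulerDriver driver, Protos.Offer offer) {
        Protos.Resource cpus = Protos.Resource.newBuilder()
            .setName("cpus")
            .setType(Protos.Value.Type.SCALAR)
            .setScalar(Protos.Value.Scalar.newBuilder().setValue(8.0)) // illustrative
            .setRole("cassandra")
            .setReservation(Protos.Resource.ReservationInfo.newBuilder()
                .setPrincipal("cassandra-principal"))      // illustrative principal
            .build();

        Protos.Offer.Operation op = Protos.Offer.Operation.newBuilder()
            .setType(Protos.Offer.Operation.Type.RESERVE)
            .setReserve(Protos.Offer.Operation.Reserve.newBuilder().addResources(cpus))
            .build();

        driver.acceptOffers(Arrays.asList(offer.getId()),
                            Arrays.asList(op),
                            Protos.Filters.getDefaultInstance());
    }
}
```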
  36. Using the Cassandra cluster: https://www.youtube.com/watch?v=qgqO39DteHo
