Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verm…
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* Summit 2016
Running Cassandra on Apache Mesos
across multiple datacenters at Uber
Abhishek Verma (verma@uber.com)
About me
● MS (2010) and PhD (2012) in Computer Science from University of Illinois
at Urbana-Champaign
● 2 years at Googl...
“Transportation as reliable as running water,
everywhere, for everyone”
“Transportation as reliable as running
water, everywhere, for everyone”
99.99%
“Transportation as reliable as running water,
everywhere, for everyone”
efficient
Cluster Management @ Uber
● Statically partitioned machines across different services
● Move from custom deployment system...
Apache Mesos
7
● Mesos abstracts CPU, memory, storage away from machines
○ program like it’s a single pool of resources
● ...
Apache Cassandra
8
● Horizontal scalability
○ Scales reads and writes linearly as new nodes are added
● High availability
...
Uber
● Abhishek Verma
● Karthik Gandhi
● Matthias Eichstaedt
● Varun Gupta
● Zhitao Li
DC/OS Cassandra Service
9
Mesospher...
Cassandra service architecture
10
Framework
dcos-cassandra-service
Mesos agent
Mesos master
(Leader)
Web interface
Control...
Cassandra Mesos primitives
11
● Mesos containerizer
● Override 5 ports in configuration (storage_port,
ssl_storage_port, n...
Custom seed provider
12
Node 1
10.0.0.1
http://scheduler/seeds
{
isSeed: true
seeds: [ ]
}
Node 1
10.0.0.1
Node 2
10.0.0.2...
Cassandra Service: Features
13
● Custom seed provider
● Increasing cluster size
● Changing Cassandra configuration
● Repla...
Plan, Phases and Blocks
14
● Plan
○ Phases
■ Reconciliation
■ Deployment
■ Backup
■ Restore
■ Cleanup
■ Repair
Spinning up a new Cassandra cluster
15
https://www.youtube.com/watch?v=gbYmjtDKSzs
Automate Cassandra operations
16
● Repair
○ Synchronize all data across replicas
■ Last write wins
○ Anti-entropy mechanis...
Cleanup operation
17
https://www.youtube.com/watch?v=VxRLSl8MpYI
Failure scenarios
18
● Executor failure
○ Restarted automatically
● Cassandra daemon failure
○ Restarted automatically
● N...
Experiments
19
Cluster startup
20
For each node in the cluster:
1.Receive and accept offer
2.Launch task
3.Fetch executor, JRE, Cassandra...
Cluster startup time
21
Framework can start ~ one new node per minute
Tuning JVM Garbage collection
22
Changed from CMS to G1 garbage collector
Left: https://github.com/apache/cassandra/blob/c...
Tuning JVM Garbage collection
23
Metric CMS G1
G1 : CMS
Factor
op rate 1951 13765 7.06
latency mean (ms) 3.6 0.4 9.00
late...
Cluster Setup
24
● 3 nodes
● Local DC
● 24 cores, 128 GB RAM, 2TB SAS drives
● Cassandra running on bare metal
● Cassandra...
Bare metal Mesos
Read Latency
25
Mean: 0.38 ms
P95: 0.74 ms
P99: 0.91 ms
Mean: 0.44 ms
P95: 0.76 ms
P99: 0.98 ms
Bare metal Mesos
Read Throughput
26
Bare metal Mesos
Write Latency
27
Mean: 0.43 ms
P95: 0.94 ms
P99: 1.05 ms
Mean: 0.48 ms
P95: 0.93 ms
P99: 1.26 ms
Bare metal Mesos
Write Throughput
28
Running across datacenters
29
● Four datacenters
○ Each running dcos-cassandra-service instance
○ Sync datacenter phase
■ ...
Asynchronous cross-dc replication latency
30
● Write a row to dc1 using consistency level LOCAL_ONE
○ Write timestamp to a...
Cassandra on Mesos in Production
31
● ~20 clusters replicating across two datacenters (west and east coast)
● ~300 machine...
Questions?
32
verma@uber.com
Cluster startup
33
For each node in the cluster:
1.Receive and accept offer
2.Launch task
3.Fetch executor, JRE, Cassandra...
Aurora hogs offers
34
● Aurora designed to be the only framework running on Mesos and
controlling all the machines
● Holds...
Long term solution: dynamic reservations
35
● Dynamically reserve all the machines resources to the “cassandra”
role
● Res...
Using the Cassandra cluster
36
https://www.youtube.com/watch?v=qgqO39DteHo

Upcoming SlideShare

Loading in …5

×

  1. 1. Running Cassandra on Apache Mesos across multiple datacenters at Uber Abhishek Verma (verma@uber.com)
  2. 2. About me ● MS (2010) and PhD (2012) in Computer Science from University of Illinois at Urbana-Champaign ● 2 years at Google, worked on Borg and Omega and first author of the Borg paper ● ~ 1 year at TCS Research, Mumbai ● Currently at Uber working on running Cassandra on Mesos © DataStax, All Rights Reserved. 2
  3. 3. “Transportation as reliable as running water, everywhere, for everyone”
  4. 4. “Transportation as reliable as running water, everywhere, for everyone” 99.99%
  5. 5. “Transportation as reliable as running water, everywhere, for everyone” efficient
  6. 6. Cluster Management @ Uber ● Statically partitioned machines across different services ● Move from custom deployment system to everything running on Mesos ● Gain efficiency by increasing machine utilization ○ Co-locate services on the same machine ○ Can lead to 30% fewer machines1 ● Build stateful service frameworks to run on Mesos © DataStax, All Rights Reserved. 6 “Large-scale cluster management at Google with Borg”, EuroSys 2015
  7. 7. Apache Mesos 7 ● Mesos abstracts CPU, memory, storage away from machines ○ program like it’s a single pool of resources ● Linear scalability ● High availability ● Native support for launching containers ● Pluggable resource isolation ● Two level scheduling
  8. 8. Apache Cassandra 8 ● Horizontal scalability ○ Scales reads and writes linearly as new nodes are added ● High availability ○ Fault tolerant with tunable consistency levels ● Low latency, solid performance ● Operational simplicity ○ Homogeneous cluster, no SPOF ● Rich data model
  9. 9. Uber ● Abhishek Verma ● Karthik Gandhi ● Matthias Eichstaedt ● Varun Gupta ● Zhitao Li DC/OS Cassandra Service 9 Mesosphere ● Chris Lambert ● Gabriel Hartmann ● Keith Chambers ● Kenneth Owens ● Mohit Soni https://github.com/mesosphere/dcos-cassandra-service
  10. 10. Cassandra service architecture 10 Framework dcos-cassandra-service Mesos agent Mesos master (Leader) Web interface Control plane API C*Cluster 1 C*Cluster 2 Aurora (DC1) Mesos master (Standby) C*Node 1a C*Node 2a Mesos agent C*Node 1b C*Node 2b Mesos agent C*Node 1c Aurora (DC2) Deployment system DC2 ZK ZK ZK ZooKeeper quorum Client App uses CQL interface CQL CQL CQL CQL CQL . . .
  11. 11. Cassandra Mesos primitives 11 ● Mesos containerizer ● Override 5 ports in configuration (storage_port, ssl_storage_port, native_transport_port, rpc_port, jmx_port) ● Use persistent volumes ○ Data stored outside of the sandbox directory ○ Offered to the same task if it crashes and restarts ● Use dynamic reservation
  12. 12. Custom seed provider 12 Node 1 10.0.0.1 http://scheduler/seeds { isSeed: true seeds: [ ] } Node 1 10.0.0.1 Node 2 10.0.0.2 Node 3 10.0.0.3 Node 2 10.0.0.2 { isSeed: true seeds: [ 10.0.0.1] } { isSeed: false seeds: [ 10.0.0.1, 10.0.0.2] } Node 3 10.0.0.3 Number of Nodes = 3 Number of Seeds = 2
  13. 13. Cassandra Service: Features 13 ● Custom seed provider ● Increasing cluster size ● Changing Cassandra configuration ● Replacing a dead node ● Backup/Restore ● Cleanup ● Repair
  14. 14. Plan, Phases and Blocks 14 ● Plan ○ Phases ■ Reconciliation ■ Deployment ■ Backup ■ Restore ■ Cleanup ■ Repair
  15. 15. Spinning up a new Cassandra cluster 15 https://www.youtube.com/watch?v=gbYmjtDKSzs
  16. 16. Automate Cassandra operations 16 ● Repair ○ Synchronize all data across replicas ■ Last write wins ○ Anti-entropy mechanism ○ Repair primary key range node-by-node ● Cleanup ○ Remove data whose ownership has changed ■ Because of addition or removal of nodes
  17. 17. Cleanup operation 17 https://www.youtube.com/watch?v=VxRLSl8MpYI
  18. 18. Failure scenarios 18 ● Executor failure ○ Restarted automatically ● Cassandra daemon failure ○ Restarted automatically ● Node failure ○ Manual REST endpoint to replace node ● Scheduling framework failure ○ Existing nodes keep running, new nodes cannot be added
  19. 19. Experiments 19
  20. 20. Cluster startup 20 For each node in the cluster: 1.Receive and accept offer 2.Launch task 3.Fetch executor, JRE, Cassandra binaries from S3/HDFS 4.Launch executor 5.Launch Cassandra daemon 6.Wait for it’s mode to transition STARTING -> JOINING -> NORMAL
  21. 21. Cluster startup time 21 Framework can start ~ one new node per minute
  22. 22. Tuning JVM Garbage collection 22 Changed from CMS to G1 garbage collector Left: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/cassandra-env.sh#L213 Right: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_tune_jvm_c.html?scroll=concept_ds_sv5_k4w_dk__tuning-java-garbage-collection
  23. 23. Tuning JVM Garbage collection 23 Metric CMS G1 G1 : CMS Factor op rate 1951 13765 7.06 latency mean (ms) 3.6 0.4 9.00 latency median (ms) 0.3 0.3 1.00 latency 95th percentile (ms) 0.6 0.4 1.50 latency 99th percentile (ms) 1 0.5 2.00 latency 99.9th percentile (ms) 11.6 0.7 16.57 latency max (ms) 13496.9 4626.9 2.92 G1 garbage collector is much better without any tuning Using cassandra-stress, 32 threads client
  24. 24. Cluster Setup 24 ● 3 nodes ● Local DC ● 24 cores, 128 GB RAM, 2TB SAS drives ● Cassandra running on bare metal ● Cassandra running in a Mesos container
  25. 25. Bare metal Mesos Read Latency 25 Mean: 0.38 ms P95: 0.74 ms P99: 0.91 ms Mean: 0.44 ms P95: 0.76 ms P99: 0.98 ms
  26. 26. Bare metal Mesos Read Throughput 26
  27. 27. Bare metal Mesos Write Latency 27 Mean: 0.43 ms P95: 0.94 ms P99: 1.05 ms Mean: 0.48 ms P95: 0.93 ms P99: 1.26 ms
  28. 28. Bare metal Mesos Write Throughput 28
  29. 29. Running across datacenters 29 ● Four datacenters ○ Each running dcos-cassandra-service instance ○ Sync datacenter phase ■ Periodically exchange seeds with external dcs ● Cassandra nodes gossip topology ○ Discover nodes in other datacenters
  30. 30. Asynchronous cross-dc replication latency 30 ● Write a row to dc1 using consistency level LOCAL_ONE ○ Write timestamp to a file when operation completed ● Spin in a loop to read the same row using consistency LOCAL_ONE in dc2 ○ Write timestamp to a file when operation completed ● Difference between the two gives asynchronous replication latency ○ p50 : 44.69ms, p95 : 46.38ms, p99:47.44ms ● Round trip ping latency ○ 77.8ms
  31. 31. Cassandra on Mesos in Production 31 ● ~20 clusters replicating across two datacenters (west and east coast) ● ~300 machines across two datacenters ● Largest 2 clusters: more than a million writes/sec and ~100k reads/sec ● Mean read latency: 13ms and write latency: 25ms ● Mostly use LOCAL_QUORUM consistency level
  32. 32. Questions? 32 verma@uber.com
  33. 33. Cluster startup 33 For each node in the cluster: 1.Receive and accept offer 2.Launch task 3.Fetch executor, JRE, Cassandra binaries from S3/HDFS 4.Launch executor 5.Launch Cassandra daemon 6.Wait for it’s mode to transition STARTING -> JOINING -> NORMAL Aurora hogging offers
  34. 34. Aurora hogs offers 34 ● Aurora designed to be the only framework running on Mesos and controlling all the machines ● Holds on to all received offers ○ Does not accept or reject them ● Mesos waits for --offer_timeout time duration and rescinds offer ● --offer_timeout config ○ Duration of time before an offer is rescinded from a framework. This helps fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers. If not set, offers do not timeout.
  35. 35. Long term solution: dynamic reservations 35 ● Dynamically reserve all the machines resources to the “cassandra” role ● Resources are offered only to cassandra frameworks ● Improves node startup time: 30s/node ● Node failure replacement or updates are much faster
  36. 36. Using the Cassandra cluster 36 https://www.youtube.com/watch?v=qgqO39DteHo