
Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

2/3/2019

Reading time: 10 min

Cassandra at Teads

by Romain Hardouin


1. Cassandra @ Lyon Cassandra Users. Romain Hardouin, Cassandra architect @ Teads. 2017-02-16
2. Cassandra @ Teads: I. About Teads, II. Architecture, III. Provisioning, IV. Monitoring & alerting, V. Tuning, VI. Tools, VII. C’est la vie, VIII. A light fork
3. I. About Teads
4. Teads is the inventor of native video advertising, with inRead, an award-winning format* (*IPA Media Owner Survey 2014, IAB recognized format)
5. 27 offices in 21 countries, 500+ global employees, 1.2B users global reach, 90+ R&D employees
6. Teads growth: tracking events
7. Advertisers (to name a few)
8. Publishers (to name a few)
9. II. Architecture
10. Apache Cassandra version: custom C* 2.1.16, with C* 3.0 jvm.options, C* 2.2 logback, backports, and patches
11. Usage: up to 940K qps, writes vs reads
12. Topology: 2 regions (EU & US), a 3rd region (APAC) coming soon; 4 clusters, 7 DCs, 110 nodes (up to 150 with temporary DCs); plus HP server blades: 1 cluster, 18 nodes
13. AWS nodes
14. AWS instance types: i2.2xlarge (8 vCPU, 61 GB, 2 x 800 GB attached SSD in RAID0), c3.4xlarge (16 vCPU, 30 GB, 2 x 160 GB attached SSD in RAID0), c4.4xlarge (16 vCPU, 30 GB, EBS 3.4 TB + 1 TB). Workloads: tons of counters; Big Data, wide rows; many billions of keys, LCS with TTL
15. More on EBS nodes: 20 x c4.4xlarge with SSD GP2. 3.4 TB data ⇒ 10,000 IOPS (16 KB); 1 TB commitlog ⇒ 3,000 IOPS (16 KB). 25 tables: batch + real time. Temporary DC. Cheap storage, great for STCS. Snapshots (S3 backup). No coupling between disks and CPU/RAM. High latency => high I/O wait. Throughput: 160 MB/s. Unsteady performance.
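As a side note, those IOPS figures line up with how gp2 volumes were provisioned; a quick arithmetic sketch, assuming the then-current gp2 baseline of 3 IOPS per GB with a 10,000 IOPS per-volume cap:

    // Hedged arithmetic check of the slide's gp2 figures, assuming the
    // then-current baseline of 3 IOPS per GB, capped at 10,000 per volume.
    object Gp2Iops extends App {
      def gp2Iops(sizeGB: Int): Int = math.min(sizeGB * 3, 10000)
      println(gp2Iops(3400)) // 3.4 TB data volume: 10,200 capped to 10,000
      println(gp2Iops(1000)) // 1 TB commitlog volume: 3,000
    }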
16. Physical nodes
17. Hardware nodes: HP Apollo XL170R Gen9, 12 CPU Xeon @ 2.60 GHz, 128 GB RAM, 3 x 1.5 TB high-end SSD in RAID0. For Big Data; supersedes the EBS DC
18. DC/Cluster split
19. Instance type change: counters moved from 20 x i2.2xlarge (DC X) to 20 x c3.4xlarge (DC Y) via rebuild. Cheaper and more CPUs
20. Workload isolation, step 1: DC split. Counters + Big Data on 20 x i2.2xlarge (DC A), rebuilt into counters on 20 x c3.4xlarge (DC B) + Big Data on 20 x c4.4xlarge EBS (DC C)
21. Workload isolation, step 2: cluster split. Big Data (20 x c4.4xlarge, EBS) moves to its own cluster, linked over AWS Direct Connect
22. Data model
23. “KISS” principle: no fancy stuff. No secondary indexes, no list/set/tuple, no UDTs
24. III. Provisioning
25. Now: Capistrano → Chef. Custom cookbooks: C*, C* tools, C* reaper, Datadog wrapper. Chef provisioning to spawn a cluster
26. Future: C* cookbook (michaelklishin/cassandra-chef-cookbook) plus a Teads custom wrapper; Terraform + Chef provisioner
27. IV. Monitoring & alerting
28. Past: OpsCenter (Free)
29. Turnkey dashboards; support is reactive. But main metrics only, and per-host graphs are impossible with many hosts
30. DataStax OpsCenter, free version (v5): ring view, more than monitoring, lots of metrics. Still lacks some metrics; dashboard creation has no templates; the agent is heavy. Free version limitations: data stored in the production cluster, Apache C* <= 2.1 only
31. Datadog: all the metrics you want; dashboard creation with templating (TimeBoard vs ScreenBoard); graph creation with aggregation, trend, rate, anomaly detection. No turnkey dashboards yet (may change: TLP templates). Additional fees if >350 metrics; we need to increase this limit for our use case
32. Now we can easily: find outliers, compare a node to the average, compare two DCs, explore a node’s metrics, create overview dashboards, and create advanced dashboards for troubleshooting
33. Datadog’s cassandra.yaml:

    - include:
        bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
        attribute:
          - Count
    - include:
        bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
        attribute:
          - Count
          - 99thPercentile
    - include:
        bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize
    - include:
        bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
        attribute:
          Value:
            alias: cassandra.ntr.MaxTasksQueued
34. ScreenBoard
35. TimeBoard (per-node graphs; hosts anonymized as alpha, beta, gamma, delta, epsilon, zeta, eta)
36. Example: hints monitoring during maintenance on physical nodes (storage, streaming)
37. Datadog alerting: down node, exceptions, commitlog size, high latency, high GC, high I/O wait, high pendings, many hints, long Thrift connections, clock out of sync (don’t miss this one), disk space (don’t forget /), …
38. V. Tuning
39. Java 8, CMS → G1. cassandra-env.sh:

    -Dcassandra.max_queued_native_transport_requests=4096
    -Dcassandra.fd_initial_value_ms=4000
    -Dcassandra.fd_max_interval_ms=4000
40. jvm.options (backport from C* 3.0), GC logs enabled:

    -XX:MaxGCPauseMillis=200
    -XX:G1RSetUpdatingPauseTimePercent=5
    -XX:G1HeapRegionSize=32m
    -XX:G1HeapWastePercent=25
    -XX:InitiatingHeapOccupancyPercent=?
    -XX:ParallelGCThreads=#CPU
    -XX:ConcGCThreads=#CPU
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+ParallelRefProcEnabled
    -XX:+UseCompressedOops
    -XX:HeapDumpPath=<dir with enough free space>
    -XX:ErrorFile=<custom dir>
    -Djava.io.tmpdir=<custom dir>
    -XX:-UseBiasedLocking
    -XX:+UseTLAB
    -XX:+ResizeTLAB
    -XX:+PerfDisableSharedMem
    -XX:+AlwaysPreTouch
    ...
41. AWS nodes (cassandra.yaml). Heap: c3.4xlarge 15 GB, i2.2xlarge 24 GB.

    num_tokens: 256
    native_transport_max_threads: 256 or 128
    compaction_throughput_mb_per_sec: 64
    concurrent_compactors: 4 or 2
    concurrent_reads: 64
    concurrent_writes: 128 or 64
    concurrent_counter_writes: 128
    hinted_handoff_throttle_in_kb: 10240
    max_hints_delivery_threads: 6 or 4
    memtable_cleanup_threshold: 0.6, 0.5 or 0.4
    memtable_flush_writers: 4 or 2
    trickle_fsync: true
    trickle_fsync_interval_in_kb: 10240
    dynamic_snitch_badness_threshold: 2.0
    internode_compression: dc
42. AWS nodes, EBS volume != disk. Heap: c4.4xlarge 15 GB.

    compaction_throughput_mb_per_sec: 32
    concurrent_compactors: 4
    concurrent_reads: 32
    concurrent_writes: 64
    concurrent_counter_writes: 64
    trickle_fsync_interval_in_kb: 1024
43. Hardware nodes. Heap: 24 GB.

    num_tokens: 8  (more on this later)
    initial_token: ...
    native_transport_max_threads: 512
    compaction_throughput_mb_per_sec: 128
    concurrent_compactors: 4
    concurrent_reads: 64
    concurrent_writes: 128
    concurrent_counter_writes: 128
    hinted_handoff_throttle_in_kb: 10240
    max_hints_delivery_threads: 6
    memtable_cleanup_threshold: 0.6
    memtable_flush_writers: 8
    trickle_fsync: true
    trickle_fsync_interval_in_kb: 10240
44. Hardware nodes: why 8 tokens? Better repair performance, important for Big Data. Evenly distributed tokens, stored in a Chef data bag. Watch out! Know the drawbacks.

    ./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4
    {
      "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901",
      "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202",
      "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503"
    }

    https://github.com/rhardouin/cassandra-scripts
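For illustration, the even layout behind that output can be reproduced in a few lines. This is only a sketch of the interleaving idea, not the actual vnodes_token_generator.py; values can differ from the script’s by a unit in the last place due to rounding:

    // Sketch: evenly distributed vnode tokens over the Murmur3 range
    // [-2^63, 2^63). All tokens together form one even grid; each node
    // takes every Nth grid point, so nodes interleave around the ring.
    object EvenTokens extends App {
      def tokens(numNodes: Int, tokensPerNode: Int): Map[Int, Seq[BigInt]] = {
        val total = numNodes * tokensPerNode
        val span  = BigInt(2).pow(64)
        val start = -BigInt(2).pow(63)
        (0 until numNodes).map { node =>
          node -> (0 until tokensPerNode).map(i => start + span * (i * numNodes + node) / total)
        }.toMap
      }
      tokens(3, 4).toSeq.sortBy(_._1).foreach { case (n, ts) =>
        println(s"node$n: ${ts.mkString(",")}")
      }
    }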
45. Compression: small entries, lots of reads.

    compression = {
      'chunk_length_kb': '4',
      'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'
    }

Plus nodetool scrub (a few GB).
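Applied to a real table, those options look like the following; a minimal sketch assuming DataStax Java driver 3.x and a hypothetical table ks.events (the option names are the C* 2.1 ones from the slide):

    import com.datastax.driver.core.Cluster

    // Minimal sketch: apply the slide's compression settings to a
    // hypothetical table ks.events, then run `nodetool scrub` so existing
    // sstables get rewritten with the new 4 KB chunk length.
    object TuneCompression extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      try {
        cluster.connect().execute(
          "ALTER TABLE ks.events WITH compression = {" +
            " 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'," +
            " 'chunk_length_kb': '4' }")
      } finally cluster.close()
    }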
46. Dynamic snitch: disabled on 2 small clusters (dynamic_snitch: false). Fewer hops
47. Dynamic snitch: client-side latency (graph of P95, P75, and mean)
48. Downscale: which node to decommission?
49. Clients
50. Scala apps with a DataStax driver wrapper; Spark & Spark Streaming with the DataStax Spark Cassandra Connector
51. DataStax driver policy: LatencyAwarePolicy → TokenAwarePolicy. LatencyAwarePolicy: hotspots due to premature node eviction, needs thorough tuning and a steady workload, so we dropped it. TokenAwarePolicy: shuffle replicas depending on CL (see the sketch below)
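A sketch of that switch, assuming the DataStax Java driver 3.x API of the era, where the second constructor argument of TokenAwarePolicy enables replica shuffling:

    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

    // Sketch (DataStax Java driver 3.x): TokenAwarePolicy wraps a child
    // policy; `true` shuffles among replicas, spreading load when the
    // consistency level does not require a fixed replica ordering.
    object DriverPolicy extends App {
      val cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .withLoadBalancingPolicy(
          new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build(), true))
        .build()
    }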
52. For cross-region scheduled jobs: VPN between AWS regions, 20 executors with 6 GB RAM.

    output.consistency.level = (LOCAL_)ONE
    output.concurrent.writes = 50
    connection.compression = LZ4
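Those keys carry an implied spark.cassandra. prefix in the Spark Cassandra Connector; a sketch of the equivalent SparkConf:

    import org.apache.spark.SparkConf

    // Sketch: the slide's connector settings with their full key names
    // (the spark.cassandra. prefix is implied on the slide).
    object CrossRegionJobConf {
      val conf = new SparkConf()
        .setAppName("cross-region-job")
        .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")
        .set("spark.cassandra.output.concurrent.writes", "50")
        .set("spark.cassandra.connection.compression", "LZ4")
    }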
53. Useless writes: 99% of empty unlogged batches on one DC. What an optimization!
54. VI. Tools
55. “Job Scheduler & Runbook Automation” (Rundeck): {parallel SSH + cron} on steroids. Security; history (who/what/when/why, output is kept; we added a “comment” field). CQL migrations, rolling restarts, nodetool or JMX commands, backup and snapshot jobs
56. Scheduled range repair: segments, up to 20,000 for TB tables. Hosted fork for C* 2.1; we will probably switch to TLP’s fork. We do not use incremental repairs (see the fix in C* 4.0)
57. cassandra_snapshotter: backup on S3, scheduled with Rundeck. We created and use a fork; some PRs were merged upstream; the restore PR is still to be merged
58. Logs management:

    "C* " and "out of sync"
    "C* " and "new session: will sync" | count
    ...

Alerts on pattern:

    "C* " and "[ERROR]"
    "C* " and "[WARN]" and not ( … )
    ...
59. VII. C’est la vie
60. Cassandra issues & failures
61. OS reboots without any obvious reason… seems harmless, right? The Cassandra service was enabled. Want a clue? C* 2.0 + counters. The upgrade to C* 2.1 was a relief
62. Upgrade 2.0 → 2.1: the LCS cluster suffered (high load, pending compactions growing). Switching to off-heap memtables helped: less GC => less load. We also reduced client load. Better after the sstable upgrade, which took days
63. Upgrade 2.0 → 2.1: lots of NTR “All time blocked”. The NTR queue was undersized for our workload: 128, hard-coded. We added a property to test CASSANDRA-11363 and set the value higher and higher… up to 4096. The NTR pool needs to be sized accordingly
64. After replacing nodes:

    DELETE FROM system.peers WHERE peer = '<replaced node>';

system.peers is used by the DataStax driver for auto-discovery; a sketch of running the cleanup on every node follows
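Since system.peers is a node-local table, the DELETE has to be executed on every remaining node. A hedged sketch, assuming DataStax Java driver 3.x, using WhiteListPolicy to pin each statement to one coordinator; liveNodes and replacedIp are placeholders:

    import java.net.InetSocketAddress
    import scala.collection.JavaConverters._
    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{RoundRobinPolicy, WhiteListPolicy}

    // Hedged sketch: system.peers is local to each node, so the stale
    // entry must be removed on every live node, one coordinator at a time.
    object PurgeReplacedPeer {
      def purge(liveNodes: Seq[String], replacedIp: String): Unit =
        liveNodes.foreach { host =>
          val cluster = Cluster.builder()
            .addContactPoint(host)
            .withLoadBalancingPolicy(new WhiteListPolicy(
              new RoundRobinPolicy(),
              Seq(new InetSocketAddress(host, 9042)).asJava))
            .build()
          try cluster.connect().execute(
            s"DELETE FROM system.peers WHERE peer = '$replacedIp'")
          finally cluster.close()
        }
    }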
65. AWS issues & failures
66. “When you have to shoot, shoot, don’t talk!” (The Good, the Bad and the Ugly)
67. “We’ll shoot your instance” / “We shot your instance”. EBS volume latency spike. EBS volume unreachable
68. SSD
69. SSD: dstat -tnvlr
70. Hardware issues & failures
71. One SSD failed. CPUs suddenly became slow on one server: a Smart Array battery BIOS bug. Yup, not an SSD...
72. VIII. A light fork
73. Why a fork? 1. The need to add a patch ASAP: high blocked NTR (CASSANDRA-11363) required deploying from source. 2. Why not backport interesting tickets? 3. Why not add small features/fixes? Expose task queue length via JMX (CASSANDRA-12758). You betcha!
74. A tiny fork: we keep it as small as possible to fit our needs. It will be even smaller when we upgrade to C* 3.0, as the backports will become obsolete
75. « Hey, if your fork is tiny, is it that useful? »
76. One example: repair
77. Backport of CASSANDRA-12580 (Paulo Motta): log2 vs ln
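The slide’s “log2 vs ln” point is easy to see numerically: the two logarithms differ by a constant factor of ln 2 ≈ 0.69, so an estimate computed with the wrong one is off by roughly 30%. A pure-math illustration, not the patch itself:

    // Pure-math illustration of "log2 vs ln": the two logarithms give
    // very different estimates for the same input.
    object Log2VsLn extends App {
      val partitions = 1e6
      println(math.log(partitions) / math.log(2)) // log2(1e6) ≈ 19.93
      println(math.log(partitions))               // ln(1e6)   ≈ 13.82
    }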
78. Get this info without the DEBUG burden: the original fix vs a one-liner fix
79. The most impressive result for a set of tables: before, 23 days; with CASSANDRA-12580, 16 hours. Longest repair for a single table: 2.5 days; it was impossible to repair this table before the patch. It now fits in gc_grace_seconds
80. It was a critical fix for us and should have landed in 2.1.17 IMHO*: repair is a mandatory operation in many use cases, Paulo had already made the patch for 2.1, and C* 2.1 is widely used. [*] Full post: http://www.mail-archive.com/user@cassandra.apache.org/msg49344.html
81. « Why bother with backports? Why not upgrade to 3.0? »
82. Because C* is critical for our business. We don’t need fancy stuff (SASI, MV, UDF, ...); we just want a rock-solid, scalable DB. C* 2.1 does the job for the time being. We plan to upgrade to C* 3.0 in 2017; we will do thorough tests ;-)
83. « What about C* 2.2? »
84. C* 2.2 has some nice improvements: bootstrapping with LCS sends the source sstable level [1]; range movement causing CPU & performance impact [2]; resumable bootstrap/rebuild streaming [3]. [1] CASSANDRA-7460 [2] CASSANDRA-9258 [3] CASSANDRA-8838, CASSANDRA-8942, CASSANDRA-10810. But the migration path 2.2 → 3.0 is risky (just my opinion based on the users mailing list; DSE never used 2.2)
85. Questions?
86. Thanks! C* devs for their awesome work, Zenika for hosting this meetup, and you for being here!
