Cassandra at Teads
- 1. Cassandra @ Lyon Cassandra Users. Romain Hardouin, Cassandra architect @ Teads. 2017-02-16
- 2. Cassandra @ Teads: I. About Teads, II. Architecture, III. Provisioning, IV. Monitoring & alerting, V. Tuning, VI. Tools, VII. C’est la vie, VIII. A light fork
- 3. I. About Teads
- 4. Teads is the inventor of native video advertising with inRead, an award-winning format* (*IPA Media Owner Survey 2014, IAB-recognized format)
- 5. 27 offices in 21 countries, 500+ global employees, 1.2B users global reach, 90+ R&D employees
- 6. Teads growth (tracking events chart)
- 7. Advertisers (to name a few)
- 8. Publishers (to name a few)
- 9. II. Architecture
- 10. Apache Cassandra version: custom C* 2.1.16, with backports (jvm.options from C* 3.0, logback config from C* 2.2) and patches
- 11. Usage: up to 940K qps (writes vs. reads)
- 12. Topology: 2 regions (EU & US), 3rd region (APAC) coming soon. 4 clusters, 7 DCs, 110 nodes (up to 150 with temporary DCs). Plus HP server blades: 1 cluster, 18 nodes.
- 13. AWS nodes
- 14. AWS instance types:
    i2.2xlarge: 8 vCPU, 61 GB RAM, 2 x 800 GB attached SSD in RAID0
    c3.4xlarge: 16 vCPU, 30 GB RAM, 2 x 160 GB attached SSD in RAID0
    c4.4xlarge: 16 vCPU, 30 GB RAM, EBS 3.4 TB + 1 TB
    Workloads: tons of counters; Big Data with wide rows; many billions of keys, LCS with TTL
- 15. More on EBS nodes: 20 x c4.4xlarge with SSD GP2
    3.4 TB data volume ⇒ 10,000 IOPS (16 KB); 1 TB commitlog volume ⇒ 3,000 IOPS (16 KB)
    25 tables: batch + real time
    Good for temporary DCs: cheap storage, great for STCS and snapshots (S3 backup), no coupling between disks and CPU/RAM
    Drawbacks: high latency => high I/O wait, throughput of 160 MB/s, unsteady performance
- 16. Physical nodes
- 17. Hardware nodes: HP Apollo XL170R Gen9, 12-core Xeon @ 2.60 GHz, 128 GB RAM, 3 x 1.5 TB high-end SSD in RAID0. For Big Data; supersedes the EBS DC.
- 18. DC/Cluster split
- 19. Instance type change: DC X (20 x i2.2xlarge, counters) rebuilt as DC Y (20 x c3.4xlarge, counters), cheaper and with more CPUs
- 20. Workload isolation, step 1: DC split. DC A (20 x i2.2xlarge, counters + Big Data) rebuilt into DC B (20 x c3.4xlarge, counters) + DC C (20 x c4.4xlarge on EBS, Big Data)
- 21. Workload isolation, step 2: cluster split. The Big Data DC (20 x c4.4xlarge on EBS) moves to its own cluster, linked via AWS Direct Connect
- 22. Data model
- 23. “KISS” principle No fancy stuff No secondary index No list/set/tuple No UDT
- 24. III. Provisioning
- 25. Now: Capistrano → Chef. Custom cookbooks: C*, C* tools, C* reaper, Datadog wrapper. Chef provisioning to spawn a cluster.
- 26. Future: C* cookbook michaelklishin/cassandra-chef-cookbook + Teads custom wrapper, Terraform + Chef provisioner
- 27. IV. Monitoring & alerting
- 28. Past: OpsCenter (Free)
- 29. Turnkey dashboards, responsive support. But: main metrics only, per-host graphs impossible with many hosts.
- 30. DataStax OpsCenter Free version (v5): ring view, more than monitoring, lots of metrics. Still lacks some metrics, no templates for dashboard creation, agent is heavy. Free version limitations: data stored in the production cluster, Apache C* <= 2.1 only.
- 31. Datadog: all the metrics you want. Dashboard creation with templating (TimeBoard vs ScreenBoard), graph creation with aggregation, trend, rate, anomaly detection. No turnkey dashboards yet (may change: TLP templates). Additional fees if >350 metrics; we need to increase this limit for our use case.
- 32. Now we can easily: find outliers, compare a node to the average, compare two DCs, explore a node’s metrics, create overview dashboards, create advanced dashboards for troubleshooting
- 33. Datadog’s cassandra.yaml:
    - include:
        bean_regex: org.apache.cassandra.metrics:type=ReadRepair,name=*
        attribute:
          - Count
    - include:
        bean_regex: org.apache.cassandra.metrics:type=CommitLog,name=(WaitingOnCommit|WaitingOnSegmentAllocation)
        attribute:
          - Count
          - 99thPercentile
    - include:
        bean: org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize
    - include:
        bean: org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=MaxTasksQueued
        attribute:
          Value:
            alias: cassandra.ntr.MaxTasksQueued
- 34. ScreenBoard
- 35. TimeBoard (per-node graphs)
- 36. Example: hints monitoring during maintenance on physical nodes (storage and streaming graphs)
- 37. Datadog alerting: down node, exceptions, commitlog size, high latency, high GC, high I/O wait, high pendings, many hints, long Thrift connections, clock out of sync (don’t miss this one), disk space (don’t forget /), …
- 38. V. Tuning
- 39. Java 8, CMS → G1. cassandra-env.sh:
    -Dcassandra.max_queued_native_transport_requests=4096
    -Dcassandra.fd_initial_value_ms=4000
    -Dcassandra.fd_max_interval_ms=4000
- 40. jvm.options (backport from C* 3.0), GC logs enabled:
    -XX:MaxGCPauseMillis=200
    -XX:G1RSetUpdatingPauseTimePercent=5
    -XX:G1HeapRegionSize=32m
    -XX:G1HeapWastePercent=25
    -XX:InitiatingHeapOccupancyPercent=?
    -XX:ParallelGCThreads=#CPU
    -XX:ConcGCThreads=#CPU
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+ParallelRefProcEnabled
    -XX:+UseCompressedOops
    -XX:HeapDumpPath=<dir with enough free space>
    -XX:ErrorFile=<custom dir>
    -Djava.io.tmpdir=<custom dir>
    -XX:-UseBiasedLocking
    -XX:+UseTLAB
    -XX:+ResizeTLAB
    -XX:+PerfDisableSharedMem
    -XX:+AlwaysPreTouch
    ...
- 41. AWS nodes (cassandra.yaml). Heap: c3.4xlarge 15 GB, i2.2xlarge 24 GB.
    num_tokens: 256
    native_transport_max_threads: 256 or 128
    compaction_throughput_mb_per_sec: 64
    concurrent_compactors: 4 or 2
    concurrent_reads: 64
    concurrent_writes: 128 or 64
    concurrent_counter_writes: 128
    hinted_handoff_throttle_in_kb: 10240
    max_hints_delivery_threads: 6 or 4
    memtable_cleanup_threshold: 0.6, 0.5 or 0.4
    memtable_flush_writers: 4 or 2
    trickle_fsync: true
    trickle_fsync_interval_in_kb: 10240
    dynamic_snitch_badness_threshold: 2.0
    internode_compression: dc
- 42. AWS nodes on EBS (an EBS volume != a disk). Heap: c4.4xlarge 15 GB.
    compaction_throughput_mb_per_sec: 32
    concurrent_compactors: 4
    concurrent_reads: 32
    concurrent_writes: 64
    concurrent_counter_writes: 64
    trickle_fsync_interval_in_kb: 1024
- 43. Hardware nodes. Heap: 24 GB.
    num_tokens: 8 (more on this later)
    initial_token: ...
    native_transport_max_threads: 512
    compaction_throughput_mb_per_sec: 128
    concurrent_compactors: 4
    concurrent_reads: 64
    concurrent_writes: 128
    concurrent_counter_writes: 128
    hinted_handoff_throttle_in_kb: 10240
    max_hints_delivery_threads: 6
    memtable_cleanup_threshold: 0.6
    memtable_flush_writers: 8
    trickle_fsync: true
    trickle_fsync_interval_in_kb: 10240
- 44. Why 8 tokens? Better repair performance, important for Big Data. Evenly distributed tokens, stored in a Chef data bag. Watch out! Know the drawbacks.
    ./vnodes_token_generator.py --json --indent 2 --servers hosts_interleaved_racks.txt 4
    {
      "192.168.1.1": "-9223372036854775808,-4611686018427387905,-2,4611686018427387901",
      "192.168.2.1": "-7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202",
      "192.168.3.1": "-6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503"
    }
    https://github.com/rhardouin/cassandra-scripts
- 45. Compression: small entries, lots of reads ⇒ compression = { 'chunk_length_kb': '4', 'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor' } + nodetool scrub (a few GB)
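A minimal sketch of applying that compression change through the DataStax Java driver from Scala; the seed host, keyspace and table names are hypothetical, and the deck does not say how Teads actually applies schema changes:

    import com.datastax.driver.core.Cluster

    object TuneCompression {
      def main(args: Array[String]): Unit = {
        // Hypothetical seed node; real deployment details are not shown in the deck.
        val cluster = Cluster.builder().addContactPoint("cassandra-seed.example.com").build()
        val session = cluster.connect()
        try {
          // C* 2.1 option names; C* 3.0 renames them to chunk_length_in_kb / class.
          session.execute(
            """ALTER TABLE analytics.small_entries
              |WITH compression = { 'chunk_length_kb': '4',
              |  'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor' }""".stripMargin)
          // Existing SSTables keep the old chunk size until rewritten,
          // hence the nodetool scrub mentioned above (run on each node).
        } finally {
          session.close()
          cluster.close()
        }
      }
    }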
- 46. Dynamic Snitch: disabled on 2 small clusters (dynamic_snitch: false), fewer network hops
- 47. Dynamic Snitch: client-side latency (P95, P75, mean)
- 48. Downscale: which node to decommission?
- 49. Clients
- 50. Scala apps with a DataStax driver wrapper; Spark & Spark Streaming with the DataStax Spark Cassandra Connector
- 51. DataStax driver policy: LatencyAwarePolicy → TokenAwarePolicy. LatencyAwarePolicy caused hotspots due to premature node eviction and needs thorough tuning and a steady workload, so we dropped it. TokenAwarePolicy shuffles replicas depending on CL.
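A minimal sketch of that policy setup with the DataStax Java driver used from Scala, assuming a 2.1/3.x driver; the contact point and local DC name are hypothetical, and Teads’ own driver wrapper (where the CL-dependent shuffle decision would live) is not shown in the deck:

    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.policies.{DCAwareRoundRobinPolicy, TokenAwarePolicy}

    object ClusterFactory {
      def build(): Cluster =
        Cluster.builder()
          .addContactPoint("cassandra-seed.example.com")   // hypothetical seed node
          .withLoadBalancingPolicy(
            new TokenAwarePolicy(                          // route requests to a replica of the partition
              DCAwareRoundRobinPolicy.builder()
                .withLocalDc("eu-west")                    // hypothetical local DC
                .build(),
              true))                                       // shuffle replicas to spread load
          .build()
    }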
- 52. For cross-region scheduled jobs: VPN between AWS regions, 20 executors with 6 GB RAM. Connector settings: output.consistency.level = (LOCAL_)ONE, output.concurrent.writes = 50, connection.compression = LZ4.
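Those keys correspond to the connector’s spark.cassandra.* properties; a minimal sketch of a scheduled Spark job carrying them, where the host, keyspace, table and app name are hypothetical and the actual cross-region read/write topology over the VPN is not shown:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object CrossRegionJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cross-region-scheduled-job")                      // hypothetical app name
          .set("spark.cassandra.connection.host", "10.0.0.10")           // hypothetical contact point
          .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")  // "(LOCAL_)ONE" from the slide
          .set("spark.cassandra.output.concurrent.writes", "50")
          .set("spark.cassandra.connection.compression", "LZ4")
        val sc = new SparkContext(conf)

        // Hypothetical keyspace/table/columns: write a computed aggregate back to C*.
        sc.parallelize(Seq(("2017-02-16", 42L)))
          .saveToCassandra("analytics", "daily_totals", SomeColumns("day", "total"))

        sc.stop()
      }
    }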
- 53. Useless writes: 99% of empty unlogged batches on one DC. What an optimization!
- 54. VI. Tools
- 55. Rundeck (“Job Scheduler & Runbook Automation”): {parallel SSH + cron} on steroids. Security, history (who/what/when/why), output is kept; we added a “comment” field. Jobs: CQL migration, rolling restart, nodetool or JMX commands, backup and snapshot jobs.
- 56. Cassandra Reaper: scheduled range repair, segments up to 20,000 for TB tables. We host a fork for C* 2.1 and will probably switch to TLP’s fork. We do not use incremental repairs (see the fix in C* 4.0).
- 57. cassandra_snapshotter: backups on S3, scheduled with Rundeck. We created and use a fork; some PRs were merged upstream, the restore PR is still to be merged.
- 58. Logs management:
    "C* " and "out of sync"
    "C* " and "new session: will sync" | count
    ...
  Alerts on pattern:
    "C* " and "[ERROR]"
    "C* " and "[WARN]" and not ( … )
    ...
- 59. VII. C’est la vie
- 60. Cassandra issues & failures
- 61. OS reboots, without any obvious reason… seems harmless, right? The Cassandra service was enabled. Want a clue? C* 2.0 + counters. The upgrade to C* 2.1 was a relief.
- 62. Upgrade 2.0 → 2.1: the LCS cluster suffered, with high load and growing pending compactions. We switched to off-heap memtables (less GC => less load) and reduced client load. Things got better after the SSTables upgrade, which took days.
- 63. Upgrade 2.0 → 2.1: lots of NTR “All time blocked”. The NTR queue was undersized for our workload: 128, hard-coded. We added a property to test CASSANDRA-11363 and set the value higher and higher… up to 4096. The NTR pool needs to be sized accordingly.
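Since the queue length is what matters here, it helps to watch the matching gauge; a minimal sketch of reading it over JMX from Scala, using the same MBean the Datadog config on slide 33 aliases to cassandra.ntr.MaxTasksQueued (assumes the default JMX port 7199 and no JMX authentication):

    import javax.management.ObjectName
    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

    object NtrQueueCheck {
      def main(args: Array[String]): Unit = {
        // Connect to the local node's JMX endpoint (default port, no auth assumed).
        val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi")
        val connector = JMXConnectorFactory.connect(url)
        try {
          val mbsc = connector.getMBeanServerConnection
          val bean = new ObjectName(
            "org.apache.cassandra.metrics:type=ThreadPools,path=transport," +
              "scope=Native-Transport-Requests,name=MaxTasksQueued")
          // "Value" is the attribute aliased in the Datadog config shown earlier.
          println(s"NTR MaxTasksQueued: ${mbsc.getAttribute(bean, "Value")}")
        } finally connector.close()
      }
    }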
- 64. After replacing nodes: DELETE FROM system.peers WHERE peer = '<replaced node>'; (system.peers is used by the DataStax driver for auto-discovery)
- 65. AWS issues & failures
- 66. When you have to shoot, shoot, don’t talk! The Good, the Bad and the Ugly
- 67. “We’ll shoot your instance” / “We shot your instance” / EBS volume latency spike / EBS volume unreachable
- 68. SSD
- 69. SSD: dstat -tnvlr
- 70. Hardware issues & failures
- 71. One SSD failed. CPUs suddenly became slow on one server. Smart Array battery. BIOS bug. Yup, not an SSD...
- 72. VIII. A light fork
- 73. Why a fork? 1. Need to add a patch ASAP (high blocked NTR, CASSANDRA-11363), which requires deploying from source. 2. Why not backport interesting tickets? 3. Why not add small features/fixes, e.g. expose tasks queue length via JMX (CASSANDRA-12758)? You betcha!
- 74. A tiny fork: we keep it as small as possible to fit our needs. It will be even smaller once we upgrade to C* 3.0, since the backports will be obsolete.
- 75. « Hey, if your fork is tiny, is it that useful? »
- 76. One example: repair
- 77. Backport of CASSANDRA-12580 (Paulo Motta): log2 vs. ln
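The deck names only the ticket and “log2 vs ln”; a rough, illustrative sketch (not the actual Cassandra code, and ignoring any depth cap the real implementation may apply) of why the log base matters when a repair Merkle tree’s depth is derived from an estimated partition count:

    object LogBaseSketch {
      // Purely illustrative: a depth computed with ln instead of log2 comes out
      // ~1.44x too shallow, and every missing level halves the number of leaves,
      // so each leaf covers far more data and a single mismatch streams much more.
      def main(args: Array[String]): Unit = {
        for (partitions <- Seq(1e6, 1e8, 1e10)) {
          val lnDepth   = math.floor(math.log(partitions)).toInt               // natural log
          val log2Depth = math.floor(math.log(partitions) / math.log(2)).toInt // log base 2
          val leafRatio = 1L << (log2Depth - lnDepth)
          println(s"$partitions partitions: depth $lnDepth (ln) vs $log2Depth (log2), " +
                  s"${leafRatio}x fewer leaves with ln")
        }
      }
    }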
- 78. Get this info without the DEBUG burden: original fix vs. one-liner fix
- 79. The most impressive result for a set of tables: before, 23 days; with CASSANDRA-12580, 16 hours. Longest repair for a single table: 2.5 days, impossible to repair before the patch. Repairs now fit within gc_grace_seconds.
- 80. It was a critical fix for us and should have landed in 2.1.17 IMHO*: repair is a mandatory operation in many use cases, Paulo already made the patch for 2.1, and C* 2.1 is widely used. [*] Full post: http://www.mail-archive.com/user@cassandra.apache.org/msg49344.html
- 81. « Why bother with backports? Why not upgrade to 3.0? »
- 82. Because C* is critical for our business. We don’t need fancy stuff (SASI, MV, UDF, ...), we just want a rock-solid, scalable DB, and C* 2.1 does the job for the time being. We plan to upgrade to C* 3.0 in 2017; we will do thorough tests ;-)
- 83. « What about C* 2.2? »
- 84. C* 2.2 has some nice improvements: bootstrapping with LCS sends the source SSTable level [1], range movement causing CPU & performance impact [2], resumable bootstrap/rebuild streaming [3]. [1] CASSANDRA-7460 [2] CASSANDRA-9258 [3] CASSANDRA-8838, CASSANDRA-8942, CASSANDRA-10810. But the migration path 2.2 → 3.0 is risky (just my opinion based on the users mailing list; DSE never used 2.2).
- 85. Questions?
- 86. Thanks! C* devs for their awesome work, Zenika for hosting this meetup, and you for being here!