Illustration Image

Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

9/14/2017

Reading time:14 mins

Migration Best Practices: From RDBMS to Cassandra without a Hitch

by John Doe

Migration Best Practices: From RDBMS to Cassandra without a Hitch SlideShare Explore You Migration Best Practices: From RDBMS to Cassandra without a Hitch Upcoming SlideShare Loading in …5 × 1 Comment 19 Likes Statistics Notes Anshorimuslim Syuhada , Lead Engineer at eFishery at eFishery Ope Bakare , Senior Director, OCIO, Infrastructure Design, Engineering and Cloud Services at CVS Health at CVS Health benineng Diego Caravana , Self employed consultant, Software Developer at Self employed Leonardo Schmitt Alves , Analista de Programação Pleno na Buhler at Analista de Programação Pleno Show More No Downloads No notes for slide 1. Migration Best Practices: From RDBMS to Cassandra without a Hitch#Cassandra @doanduyhai 2. #Cassandra @doanduyhaiWho am I ?2DuyHai DoanAchillesCassandra Technical Advocate @ DatastaxFormer Java Developer @ Libon 3. #Cassandra @doanduyhaiAgenda•  Libon context•  Migration strategy•  Business code migration•  Data Modeling•  Take Away3 4. #Cassandra @doanduyhaiLibon Context 5. #Cassandra @doanduyhaiWhat is Libon ?•  Messaging app•  VOIP (out)•  Custom voicemail & greetings•  SMS/chat/file transfer•  Contacts matching5 6. #Cassandra @doanduyhaiContact Matching6Libon User 7. #Cassandra @doanduyhaiContact Matching7Libon User Friend 8. #Cassandra @doanduyhaiContact Matching8Libon User FriendContact matching 9. #Cassandra @doanduyhaiContact Matching9Libon User FriendAccept link 10. #Cassandra @doanduyhaiProject Context•  Application grew over the years10 11. #Cassandra @doanduyhaiProject Context•  Application grew over the years•  Already using Cassandra to handle events•  messaging / file sharing / SMS / notifications•  Cassandra R/W latencies ≈ 0,4 ms•  server response time under 10 ms11 12. #Cassandra @doanduyhaiProject Context•  About contacts …12 13. #Cassandra @doanduyhaiProject Context•  About contacts …•  stored as relational model in RDBMS (Oracle)13 14. #Cassandra @doanduyhaiProject Context•  About contacts …•  stored as relational model in RDBMS (Oracle)•  1 user ≈ 300 contacts14 15. #Cassandra @doanduyhaiProject Context•  About contacts …•  stored as relational model in RDBMS (Oracle)•  1 user ≈ 300 contacts•  with millions users ☞ billions of contacts to handle15 16. #Cassandra @doanduyhaiProject Context•  About contacts …•  stored as relational model in RDBMS (Oracle)•  1 user ≈ 300 contacts•  with millions users ☞ billions of contacts to handle•  query latency unpredictable16 17. #Cassandra @doanduyhai17 18. #Cassandra @doanduyhaiFixing the problem•  Tune the RDBMS18 19. #Cassandra @doanduyhaiFixing the problem•  Tune the RDBMS•  indices19 20. #Cassandra @doanduyhaiFixing the problem•  Tune the RDBMS•  indices•  partitioning20 21. #Cassandra @doanduyhaiFixing the problem•  Tune the RDBMS•  indices•  partitioning•  less joins, simplified relational model21 22. #Cassandra @doanduyhaiFixing the problem•  Tune the RDBMS•  indices•  partitioning•  less joins, simplified relational model•  hardware capacity increased22 23. #Cassandra @doanduyhaiFixing the problem•  Tune the RDBMS•  indices•  partitioning•  less joins, simplified relational model•  hardware capacity increased23That worked 24. #Cassandra @doanduyhaiFixing the problem•  Tune the RDBMS•  indices•  partitioning•  less joins, simplified relational model•  hardware capacity increased24That workedbut … 25. #Cassandra @doanduyhaiBack-end applicationRDBMS Cassandra25 26. #Cassandra @doanduyhaiBack-end applicationRDBMS Cassandra26We need tochoose 27. #Cassandra @doanduyhaiNext Challenges•  High Availability (DB failure, site failure …)27 28. #Cassandra @doanduyhaiNext Challenges•  High Availability (DB failure, site failure …)•  Predictable performance at scale28 29. #Cassandra @doanduyhaiNext Challenges•  High Availability (DB failure, site failure …)•  Predictable performance at scale•  Going to multi data-centers29 30. #Cassandra @doanduyhaiNext Challenges•  High Availability (DB failure, site failure …)•  Predictable performance at scale•  Going to multi data-centers☞ Cassandra, what else ?30 31. #Cassandra @doanduyhaiData Migration Strategy 32. #Cassandra @doanduyhaiObjectives•  No downtime32 33. #Cassandra @doanduyhaiObjectives•  No downtime•  No concurrency corner-cases 33 34. #Cassandra @doanduyhaiObjectives•  No downtime•  No concurrency corner-cases •  Safe rollback possible34 35. #Cassandra @doanduyhaiObjectives•  No downtime•  No concurrency corner-cases •  Safe rollback possible•  Replay-ability & resume-ability35 36. #Cassandra @doanduyhaiStrategy•  4 phases36 37. #Cassandra @doanduyhaiStrategy•  4 phases•  Write contacts to both data stores37 38. #Cassandra @doanduyhaiStrategy•  4 phases•  Write contacts to both data stores•  Old contacts migration38 39. #Cassandra @doanduyhaiStrategy•  4 phases•  Write contacts to both data stores•  Old contacts migration•  Switch to Cassandra (but keep RDBMS in case of…)39 40. #Cassandra @doanduyhaiStrategy•  4 phases•  Write contacts to both data stores•  Old contacts migration•  Switch to Cassandra (but keep RDBMS in case of…)•  Remove the RDBMS code40 41. #Cassandra @doanduyhaiMigration Phase 141Back end server···SQLSQLSQLC*C*C*C*C*WritecontactUUIDcontactId … contactUUID129363 123e4567-e89b-12d3…834849contacId(long) + contactUUID 42. #Cassandra @doanduyhaiMigration Phase 142Back end server···SQLSQLSQLC*C*C*C*C*Read 43. #Cassandra @doanduyhaiMigration Phase 2•  On live production, migrate old contacts43SQLSQLSQLC*C*C*C*C*For each batch of usersSELECT * FROM contactsWHERE user_id = …AND contact_uuid IS NULLOld contacts created before phase 1 44. #Cassandra @doanduyhaiMigration Phase 2•  On live production, migrate old contacts44SQLSQLSQLC*C*C*C*C*For each batch of usersSELECT * FROM contactsWHERE user_id = …AND contact_uuid IS NULLLogged batches of INSERT INTO contacts(..)VALUES(…)USING TIMESTAMPnow() - 1 weekOld contacts created before phase 1 45. #Cassandra @doanduyhaiMigration Phase 245USING TIMESTAMP now() - 1 week ???? 46. #Cassandra @doanduyhaiMigration Phase 2•  During data migration …46 47. #Cassandra @doanduyhaiMigration Phase 2•  During data migration …•  … concurrent writes from the migration batch …47 48. #Cassandra @doanduyhaiMigration Phase 2•  During data migration …•  … concurrent writes from the migration batch …•  … and updates from production for the same contact48 49. #Cassandra @doanduyhaiMigration Phase 249contact_uuidname (now -1 week) name (now)Johny … Johnny …Insert from batch(to the past)Update from production 50. #Cassandra @doanduyhaiMigration Phase 250contact_uuidname (now -1 week) name (now)Johny … Johnny …Future reads pick the most up-to-date value 51. #Cassandra @doanduyhaiLast Write Win in action51Case 1 Case 2Batchpast(Johny) t1Prodnow(Johnny) t2t3 Read(Johnny)Batchpast(Johny)t1Prodnow(Johnny)t2t3 Read(Johnny) 52. #Cassandra @doanduyhaiMigration Phase 252"Write to the Past…to save the Future"Libon – 2014/10/08 53. #Cassandra @doanduyhaiMigration Phase 353Back end server···SQLSQLSQLC*C*C*C*C*Write 54. #Cassandra @doanduyhaiMigration Phase 454Back end server···SQLSQLSQLC*C*C*C*C*Write 55. #Cassandra @doanduyhaiBusiness Code Refactoring 56. #Cassandra @doanduyhaiCode Inventory•  Written for RDBMS56 57. #Cassandra @doanduyhaiCode Inventory•  Written for RDBMS•  Lots of joins (no surprise) 57 58. #Cassandra @doanduyhaiCode Inventory•  Written for RDBMS•  Lots of joins (no surprise) •  Designed around transactions58 59. #Cassandra @doanduyhaiCode Inventory•  Written for RDBMS•  Lots of joins (no surprise) •  Designed around transactions•  Spring @Transactional everywhere59 60. #Cassandra @doanduyhaiCode Inventory cont.•  Entities go through Services & Repositories60RepositoriesServicesContactEntity 61. #Cassandra @doanduyhaiCode Inventory cont.•  Hibernate is auto-magic61 62. #Cassandra @doanduyhaiCode Inventory cont.•  Hibernate is auto-magic•  lazy loading•  1st level cache•  N+1 select62RepositoriesServicesContactEntity 63. #Cassandra @doanduyhaiWhich options ?•  Throw existing code …•  … and re-design from scratch for Cassandra63 64. #Cassandra @doanduyhaiWhich options ?•  Throw existing code …•  … and re-design from scratch for Cassandra64No way ! 65. #Cassandra @doanduyhaiCode Quality•  Existing business code has…•  … ≈ 3500 unit tests65 66. #Cassandra @doanduyhaiCode Quality•  Existing business code has…•  … ≈ 3500 unit tests•  and ≈600+ integration tests66 67. #Cassandra @doanduyhaiCode Quality67"The code coverageis one of your mostvaluable technical asset"Libon – since beginning 68. #Cassandra @doanduyhaiRepositoriesServicesRefactoring Strategy68ContactMatchingServiceContactServiceContactSyncContactEntityn1n n 69. #Cassandra @doanduyhaiRepositoriesServicesRefactoring Strategy69ContactMatchingServiceContactServiceContactNoSQLEntityContactSyncContactEntityn1n nProxy 70. #Cassandra @doanduyhaiRepositoriesServicesRefactoring Strategy70ContactMatchingServiceContactServiceContactNoSQLEntityContactSyncContactEntityn1n nDenorm2 DenormNDenorm1Proxy 71. #Cassandra @doanduyhaiRefactoring Strategy•  Use CQRS•  ContactReadRepository•  ContactWriteRepository•  ContactUpdateRepository•  ContactDeleteRepository71 72. #Cassandra @doanduyhaiRefactoring Strategy•  ContactReadRepository•  direct sequential read•  no joins•  1 read ≈ 1 SELECT72 73. #Cassandra @doanduyhaiRefactoring Strategy•  ContactWriteRepository•  write to all denormalized tables•  using CQL logged batches•  use TTLs73 74. #Cassandra @doanduyhaiRefactoring Strategy•  ContactUpdateRepository•  read-before-write most of the time ????•  rare updates ☞ acceptable perf penalty74 75. #Cassandra @doanduyhaiRefactoring Strategy•  ContactDeleteRepository•  delete by partition key75 76. #Cassandra @doanduyhaiOutcome•  5 months of 2 men work76 77. #Cassandra @doanduyhaiOutcome•  5 months of 2 men work•  Many iterations to fix bugs (thanks to IT)77 78. #Cassandra @doanduyhaiOutcome•  5 months of 2 men work•  Many iterations to fix bugs (thanks to IT)•  Lots of performance benchmarks using Gatling78 79. #Cassandra @doanduyhaiGatling Output79 80. #Cassandra @doanduyhaiOutcome•  5 months of 2 men work•  Many iterations to fix bugs (thanks to IT)•  Lots of performance benchmarks using Gatling☞ data model & code validation80 81. #Cassandra @doanduyhaiOutcome•  5 months of 2 men work•  Many iterations to fix bugs (thanks to IT)•  Lots of performance benchmarks using Gatling☞ data model & code validation•  … we are almost there for production81 82. #Cassandra @doanduyhaiData Model 83. #Cassandra @doanduyhaiDenormalization, the good•  Support fast reads•  1 read ≈ 1 SELECT•  Worthy because mostly read, few updates83 84. #Cassandra @doanduyhaiDenormalization, the bad•  Updating mutable data can be nightmare•  Data model bound by existing client-facing API•  Update paths very error-prone without tests84 85. #Cassandra @doanduyhaiData model in detail85Contacts_by_idContacts_by_identifiersContacts_in_profilesContacts_by_modification_dateContacts_by_firstname_lastnameContacts_linked_user 86. #Cassandra @doanduyhaiData model in detail86Contacts_by_idContacts_by_identifiersContacts_in_profilesContacts_by_modification_dateContacts_by_firstname_lastnameContacts_linked_useruser_id always componentof partition key 87. #Cassandra @doanduyhaiScalable design87n1n2n3n4n5n6n7n8ABCDEFGHuser_id1user_id2user_id3user_id4user_id5 88. #Cassandra @doanduyhaiScalable design88n1n2n3n4n5n6n7n8ABCDEFGHuser_id1user_id2user_id3user_id4user_id5 89. #Cassandra @doanduyhaiBloom filters in action•  For some tables, partition key = (user_id, contact_id)☞ fast look-up, leverages Bloom filters☞ touches 1 SSTable most of the time89 90. #Cassandra @doanduyhaiData model in detail90Contacts_by_idContacts_by_identifiersContacts_in_profilesContacts_by_modification_dateContacts_by_firstname_lastnameContacts_linked_userWide partition 91. #Cassandra @doanduyhaiA "queue" story•  contacts_by_modification_date•  queue-like pattern ????91 92. #Cassandra @doanduyhaiA "queue" story•  contacts_by_modification_date•  queue-like pattern ????☞ buckets to the rescue92user_id:2014-12date35 date12 … date47… …user_id:2014-11date11 date12 … date34… … 93. #Cassandra @doanduyhaiData model summary•  7 tables for denormalization93 94. #Cassandra @doanduyhaiData model summary•  7 tables for denormalization•  Normalize some tables because rare access94 95. #Cassandra @doanduyhaiData model summary•  7 tables for denormalization•  Normalize some tables because rare access•  Read-before write in most update scenarios ???? 95 96. #Cassandra @doanduyhaiNotes on contact_id•  In SQL, auto-generated long using sequence•  In Cassandra, auto-generated timeuuid 96 97. #Cassandra @doanduyhaiNotes on contact_id•  How to store both types ?97 98. #Cassandra @doanduyhaiNotes on contact_id•  How to store both types ?•  As text ? ☞ easy solution …98 99. #Cassandra @doanduyhaiNotes on contact_id•  How to store both types ?•  As text ? ☞ easy solution …•  … but waste of space !•  because encoded as UTF-8 or ASCII in Cassandra99 100. #Cassandra @doanduyhaiNotes on contact_id•  Long ☞ 8 bytes•  Long as text(UTF-8: 1 byte) ☞ "digits count" bytes100 101. #Cassandra @doanduyhaiNotes on contact_id•  UUID ☞ 16 bytesE81D4C70-A638-11E4-83CB-DEB70BF9330F•  32 hex chars + 4 hyphens = 36 chars•  UUID as text(UTF-8: 1 byte) ☞ 36 bytes•  Bytes overhead = 36 – 16 = 20 bytes101 102. #Cassandra @doanduyhaiNotes on contact_id•  20 bytes wasted per contact uuid102 103. #Cassandra @doanduyhaiNotes on contact_id•  20 bytes wasted per contact uuid•  × 7 denormalizations = 140 bytes per contact uuid103 104. #Cassandra @doanduyhaiNotes on contact_id•  20 bytes wasted per contact uuid•  × 7 denormalizations = 140 bytes per contact uuid•  × 109 contacts = 140 GB wasted104????not even counting replication factor … 105. #Cassandra @doanduyhaiNotes on contact_id•  ☞ just save contact id as byte[ ]105 106. #Cassandra @doanduyhaiNotes on contact_id•  ☞ just save contact id as byte[ ]•  Achilles @TypeTransformer for automatic conversion(see later)106 107. #Cassandra @doanduyhaiNotes on contact_id•  ☞ just save contact id as byte[ ]•  Achilles @TypeTransformer for automatic conversion(see later)•  Use blobAsBigInt( ) or blobAsUUID( ) to view data107 108. #Cassandra @doanduyhaiAchilles•  Advanced "object mapper"•  Fluent API•  Tons of features•  TDD friendly108 109. #Cassandra @doanduyhaiAchilles•  Dirty checking, what is it ?109 110. #Cassandra @doanduyhaiAchilles•  Dirty checking, what is it ?•  1 contact ≈ 8 mutable fields110 111. #Cassandra @doanduyhaiAchilles•  Dirty checking, what is it ?•  1 contact ≈ 8 mutable fields•  × 7 denormalizations = 56 update combinations …111 112. #Cassandra @doanduyhaiAchilles•  Dirty checking, what is it ?•  1 contact ≈ 8 mutable fields•  × 7 denormalizations = 56 update combinations …•  and not even counting multiple fields updates … 112 113. #Cassandra @doanduyhaiAchilles•  Are you going to manually generate 56+ preparedstatements for all possible updates ?113 114. #Cassandra @doanduyhaiAchilles•  Are you going to manually generate 56+ preparedstatements for all possible updates ?•  Or just use dynamic plain string statements and getsome perf penalty ?114 115. #Cassandra @doanduyhaiAchilles•  Dirty check in action115//No read-before-writeContactEntity proxy = manager.forUpdate(ContactEntity.class, contactId);proxy.setFirstName(…);proxy.setLastName(…); //type-safe updatesproxy.setAddress(…);manager.update(proxy); 116. #Cassandra @doanduyhaiAchilles116EmptyEntityDirtyMapProxy Setters interceptionPrimaryKey 117. #Cassandra @doanduyhaiAchilles•  Dynamic statements generation117UPDATE contacts SET firstname=?, lastname=?,address=?WHERE contact_id=?prepared statements are cached, of course 118. #Cassandra @doanduyhaiAchilles•  Insert strategy, why is it so important ?118 119. #Cassandra @doanduyhaiAchilles•  Simple INSERT prepared statement119INSERT INTO contacts(contact_id,name,age,address,gender,avatar,…) VALUES(?, ?, ?, ? … ?); 120. #Cassandra @doanduyhaiAchilles•  Runtime values binding•  some columns are optional120preparedStatement.bind(49374,’John DOE’,33, null, null, …, null); 121. #Cassandra @doanduyhaiAchilles121Wait … are you saying inserting null in CQL??????? 122. #Cassandra @doanduyhaiAchilles122Inserting null creating tombstones 123. #Cassandra @doanduyhaiAchilles123Inserting null creating tombstones× 7 denormalizations 124. #Cassandra @doanduyhaiAchilles124Inserting null creating tombstones× 7 denormalizations× billions of contacts created????not even counting replication factor … 125. #Cassandra @doanduyhaiAchilles•  Simple annotation125@Entity(table = "contacts_by_id »)@Strategy(insert = InsertStrategy.NOT_NULL_FIELDS)public class ContactById {} 126. #Cassandra @doanduyhaiAchilles•  Runtime dynamic INSERT statement126INSERT INTO contacts(contact_id, name, age, address,) VALUES(:contact_id, :name, :age, :address);prepared statements are cached, of course 127. #Cassandra @doanduyhaiAchilles•  Remember the contactId ⇄ byte[ ] conversion ?127@PartitionKey@Column(name = "contact_id")@TypeTransformer(valueCodecClass = ContactIdToBytes.class)private ContactId contactId;BYOC ☞ Bring Your Own Codec 128. #Cassandra @doanduyhaiAchilles128public interface Codec<FROM, TO> {Class<FROM> sourceType();Class<TO> targetType();TO encode(FROM fromJava)FROM decode(TO fromCassandra);} 129. #Cassandra @doanduyhaiAchilles•  Dynamic logging in action1292014-12-01 14:25:20,554 Bound statement : [INSERT INTOcontacts.contacts_by_modification_date(user_id,month_bucket,modification_date,...) VALUES(:user_id,:month_bucket,:modification_date,...) USING TTL :ttl;] with CONSISTENCY LEVEL [LOCAL_QUORUM]2014-12-01 14:25:20,554 bound values : [222130151, 2014-12, e13d0d50-7965-11e4-af38-90b11c2549e0, ...]2014-12-01 14:25:20,701 Bound statement : [SELECT birthday,middlename,avatar_size,... FROMcontacts.contacts_by_modification_date WHERE user_id=:user_id AND month_bucket=:month_bucket AND(modification_date)>=(:modification_date) ORDER BY modification_date ASC;] with CONSISTENCY LEVEL[LOCAL_QUORUM]2014-12-01 14:25:20,701 bound values : [222130151, 2014-10, be6bc010-6109-11e4-b385-000038377ead] 130. #Cassandra @doanduyhaiAchilles•  Dynamic logging•  runtime activation•  no need to recompile/re-deploy•  save us hours of debugging•  TRACE log level ☞ query tracing130 131. #Cassandra @doanduyhaiTake Away 132. #Cassandra @doanduyhaiConditions for success•  Data modeling is crucial132 133. #Cassandra @doanduyhaiConditions for success•  Data modeling is crucial•  Double-run strategy & timestamp trick FTW133 134. #Cassandra @doanduyhaiConditions for success•  Data modeling is crucial•  Double-run strategy & timestamp trick FTW•  Data type conversion can be tricky134 135. #Cassandra @doanduyhaiConditions for success•  Data modeling is crucial•  Double-run strategy & timestamp trick FTW•  Data type conversion can be tricky•  Benchmark !135 136. #Cassandra @doanduyhaiConditions for success•  Data modeling is crucial•  Double-run strategy & timestamp trick FTW•  Data type conversion can be tricky•  Benchmark !•  Mindset shifts for the team136 137. #Cassandra @doanduyhaiThank You! "" Recommended Teaching Techniques: Classroom Cloud Strategy Online Course - LinkedIn Learning Teacher Tips Online Course - LinkedIn Learning Bruce Heavin The Thinkable Presentation Online Course - LinkedIn Learning Bulk Loading Data into Cassandra DataStax Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay DataStax Academy Cassandra Community Webinar | Make Life Easier - An Introduction to Cassandra... DataStax Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar DataStax Webinar | Target Modernizes Retail with Engaging Digital Experiences DataStax How much money do you lose every time your ecommerce site goes down? DataStax Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop... DataStax About Blog Terms Privacy Copyright LinkedIn Corporation © 2017 Public clipboards featuring this slide No public clipboards found for this slide Select another clipboard × Looks like you’ve clipped this slide to already. Create a clipboard You just clipped your first slide! Clipping is a handy way to collect important slides you want to go back to later. Now customize the name of a clipboard to store your clips. Description Visibility Others can see my Clipboard

Illustration Image
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a Hitch
#Cassandra @doanduyhai
#Cassandra @doanduyhai
Who am I ?
2
DuyHai Doan
Achilles
Cassandra Technical Advocate @ Datastax
Former Java Developer @ L...
#Cassandra @doanduyhai
Agenda
•  Libon context
•  Migration strategy
•  Business code migration
•  Data Modeling
•  Take A...
#Cassandra @doanduyhai
Libon Context
#Cassandra @doanduyhai
What is Libon ?
•  Messaging app
•  VOIP (out)
•  Custom voicemail & greetings
•  SMS/chat/file tran...
#Cassandra @doanduyhai
Contact Matching
6
Libon User
#Cassandra @doanduyhai
Contact Matching
7
Libon User Friend
#Cassandra @doanduyhai
Contact Matching
8
Libon User Friend
Contact matching
#Cassandra @doanduyhai
Contact Matching
9
Libon User Friend
Accept link
#Cassandra @doanduyhai
Project Context
•  Application grew over the years
10
#Cassandra @doanduyhai
Project Context
•  Application grew over the years
•  Already using Cassandra to handle events
•  m...
#Cassandra @doanduyhai
Project Context
•  About contacts …
12
#Cassandra @doanduyhai
Project Context
•  About contacts …
•  stored as relational model in RDBMS (Oracle)
13
#Cassandra @doanduyhai
Project Context
•  About contacts …
•  stored as relational model in RDBMS (Oracle)
•  1 user ≈ 300...
#Cassandra @doanduyhai
Project Context
•  About contacts …
•  stored as relational model in RDBMS (Oracle)
•  1 user ≈ 300...
#Cassandra @doanduyhai
Project Context
•  About contacts …
•  stored as relational model in RDBMS (Oracle)
•  1 user ≈ 300...
#Cassandra @doanduyhai17
#Cassandra @doanduyhai
Fixing the problem
•  Tune the RDBMS
18
#Cassandra @doanduyhai
Fixing the problem
•  Tune the RDBMS
•  indices
19
#Cassandra @doanduyhai
Fixing the problem
•  Tune the RDBMS
•  indices
•  partitioning
20
#Cassandra @doanduyhai
Fixing the problem
•  Tune the RDBMS
•  indices
•  partitioning
•  less joins, simplified relational...
#Cassandra @doanduyhai
Fixing the problem
•  Tune the RDBMS
•  indices
•  partitioning
•  less joins, simplified relational...
#Cassandra @doanduyhai
Fixing the problem
•  Tune the RDBMS
•  indices
•  partitioning
•  less joins, simplified relational...
#Cassandra @doanduyhai
Fixing the problem
•  Tune the RDBMS
•  indices
•  partitioning
•  less joins, simplified relational...
#Cassandra @doanduyhai
Back-end application
RDBMS Cassandra
25
#Cassandra @doanduyhai
Back-end application
RDBMS Cassandra
26
We need to
choose
#Cassandra @doanduyhai
Next Challenges
•  High Availability (DB failure, site failure …)
27
#Cassandra @doanduyhai
Next Challenges
•  High Availability (DB failure, site failure …)
•  Predictable performance at sca...
#Cassandra @doanduyhai
Next Challenges
•  High Availability (DB failure, site failure …)
•  Predictable performance at sca...
#Cassandra @doanduyhai
Next Challenges
•  High Availability (DB failure, site failure …)
•  Predictable performance at sca...
#Cassandra @doanduyhai
Data Migration Strategy
#Cassandra @doanduyhai
Objectives
•  No downtime
32
#Cassandra @doanduyhai
Objectives
•  No downtime
•  No concurrency corner-cases 
33
#Cassandra @doanduyhai
Objectives
•  No downtime
•  No concurrency corner-cases 
•  Safe rollback possible
34
#Cassandra @doanduyhai
Objectives
•  No downtime
•  No concurrency corner-cases 
•  Safe rollback possible
•  Replay-abili...
#Cassandra @doanduyhai
Strategy
•  4 phases
36
#Cassandra @doanduyhai
Strategy
•  4 phases
•  Write contacts to both data stores
37
#Cassandra @doanduyhai
Strategy
•  4 phases
•  Write contacts to both data stores
•  Old contacts migration
38
#Cassandra @doanduyhai
Strategy
•  4 phases
•  Write contacts to both data stores
•  Old contacts migration
•  Switch to C...
#Cassandra @doanduyhai
Strategy
•  4 phases
•  Write contacts to both data stores
•  Old contacts migration
•  Switch to C...
#Cassandra @doanduyhai
Migration Phase 1
41
Back end server
·
·
·
SQLSQLSQL
C*
C*
C*
C*
C*
Write
contactUUID
contactId … c...
#Cassandra @doanduyhai
Migration Phase 1
42
Back end server
·
·
·
SQLSQLSQL
C*
C*
C*
C*
C*
Read
#Cassandra @doanduyhai
Migration Phase 2
•  On live production, migrate old contacts
43
SQLSQLSQL
C*
C*
C*
C*
C*
For each ...
#Cassandra @doanduyhai
Migration Phase 2
•  On live production, migrate old contacts
44
SQLSQLSQL
C*
C*
C*
C*
C*
For each ...
#Cassandra @doanduyhai
Migration Phase 2
45
USING TIMESTAMP now() - 1 week ????
#Cassandra @doanduyhai
Migration Phase 2
•  During data migration …
46
#Cassandra @doanduyhai
Migration Phase 2
•  During data migration …
•  … concurrent writes from the migration batch …
47
#Cassandra @doanduyhai
Migration Phase 2
•  During data migration …
•  … concurrent writes from the migration batch …
•  …...
#Cassandra @doanduyhai
Migration Phase 2
49
contact_uuid
name (now -1 week)
 …
 name (now)
 …
Johny …
 Johnny …
Insert fro...
#Cassandra @doanduyhai
Migration Phase 2
50
contact_uuid
name (now -1 week)
 …
 name (now)
 …
Johny …
 Johnny …
Future rea...
#Cassandra @doanduyhai
Last Write Win in action
51
Case 1 Case 2
Batchpast(Johny)
 t1
Prodnow(Johnny)
 t2
t3
 Read(Johnny)...
#Cassandra @doanduyhai
Migration Phase 2
52
"Write to the Past…
to save the Future"
Libon – 2014/10/08
#Cassandra @doanduyhai
Migration Phase 3
53
Back end server
·
·
·
SQLSQLSQL
C*
C*
C*
C*
C*
Write
#Cassandra @doanduyhai
Migration Phase 4
54
Back end server
·
·
·
SQLSQLSQL
C*
C*
C*
C*
C*
Write
❌
#Cassandra @doanduyhai
Business Code Refactoring
#Cassandra @doanduyhai
Code Inventory
•  Written for RDBMS
56
#Cassandra @doanduyhai
Code Inventory
•  Written for RDBMS
•  Lots of joins (no surprise) 
57
#Cassandra @doanduyhai
Code Inventory
•  Written for RDBMS
•  Lots of joins (no surprise) 
•  Designed around transactions...
#Cassandra @doanduyhai
Code Inventory
•  Written for RDBMS
•  Lots of joins (no surprise) 
•  Designed around transactions...
#Cassandra @doanduyhai
Code Inventory cont.
•  Entities go through Services & Repositories
60
Repositories

Services
Conta...
#Cassandra @doanduyhai
Code Inventory cont.
•  Hibernate is auto-magic
61
#Cassandra @doanduyhai
Code Inventory cont.
•  Hibernate is auto-magic
•  lazy loading
•  1st level cache
•  N+1 select
62...
#Cassandra @doanduyhai
Which options ?
•  Throw existing code …
•  … and re-design from scratch for Cassandra

63
#Cassandra @doanduyhai
Which options ?
•  Throw existing code …
•  … and re-design from scratch for Cassandra

64
No way !
#Cassandra @doanduyhai
Code Quality
•  Existing business code has…
•  … ≈ 3500 unit tests

65
#Cassandra @doanduyhai
Code Quality
•  Existing business code has…
•  … ≈ 3500 unit tests
•  and ≈600+ integration tests
66
#Cassandra @doanduyhai
Code Quality
67
"The code coverage
is one of your most
valuable technical asset"
Libon – since begi...
#Cassandra @doanduyhai
Repositories
Services
Refactoring Strategy
68
ContactMatchingService
ContactService
ContactSync
Con...
#Cassandra @doanduyhai
Repositories
Services
Refactoring Strategy
69
ContactMatchingService
ContactService
ContactNoSQLEnt...
#Cassandra @doanduyhai
Repositories
Services
Refactoring Strategy
70
ContactMatchingService
ContactService
ContactNoSQLEnt...
#Cassandra @doanduyhai
Refactoring Strategy
•  Use CQRS
•  ContactReadRepository
•  ContactWriteRepository
•  ContactUpdat...
#Cassandra @doanduyhai
Refactoring Strategy
•  ContactReadRepository
•  direct sequential read
•  no joins
•  1 read ≈ 1 S...
#Cassandra @doanduyhai
Refactoring Strategy
•  ContactWriteRepository
•  write to all denormalized tables
•  using CQL log...
#Cassandra @doanduyhai
Refactoring Strategy
•  ContactUpdateRepository
•  read-before-write most of the time ????
•  rare upd...
#Cassandra @doanduyhai
Refactoring Strategy
•  ContactDeleteRepository
•  delete by partition key
75
#Cassandra @doanduyhai
Outcome
•  5 months of 2 men work
76
#Cassandra @doanduyhai
Outcome
•  5 months of 2 men work
•  Many iterations to fix bugs (thanks to IT)
77
#Cassandra @doanduyhai
Outcome
•  5 months of 2 men work
•  Many iterations to fix bugs (thanks to IT)
•  Lots of performan...
#Cassandra @doanduyhai
Gatling Output
79
#Cassandra @doanduyhai
Outcome
•  5 months of 2 men work
•  Many iterations to fix bugs (thanks to IT)
•  Lots of performan...
#Cassandra @doanduyhai
Outcome
•  5 months of 2 men work
•  Many iterations to fix bugs (thanks to IT)
•  Lots of performan...
#Cassandra @doanduyhai
Data Model
#Cassandra @doanduyhai
Denormalization, the good
•  Support fast reads
•  1 read ≈ 1 SELECT
•  Worthy because mostly read,...
#Cassandra @doanduyhai
Denormalization, the bad
•  Updating mutable data can be nightmare
•  Data model bound by existing ...
#Cassandra @doanduyhai
Data model in detail
85
Contacts_by_id
Contacts_by_identifiers
Contacts_in_profiles
Contacts_by_modifi...
#Cassandra @doanduyhai
Data model in detail
86
Contacts_by_id
Contacts_by_identifiers
Contacts_in_profiles
Contacts_by_modifi...
#Cassandra @doanduyhai
Scalable design
87
n1
n2
n3
n4
n5
n6
n7
n8
A
B
C
D
E
F
G
H
user_id1
user_id2
user_id3
user_id4
user...
#Cassandra @doanduyhai
Scalable design
88
n1
n2
n3
n4
n5
n6
n7
n8
A
B
C
D
E
F
G
H
user_id1user_id2
user_id3
user_id4
user_...
#Cassandra @doanduyhai
Bloom filters in action
•  For some tables, partition key = (user_id, contact_id)
☞ fast look-up, l...
#Cassandra @doanduyhai
Data model in detail
90
Contacts_by_id
Contacts_by_identifiers
Contacts_in_profiles
Contacts_by_modifi...
#Cassandra @doanduyhai
A "queue" story
•  contacts_by_modification_date
•  queue-like pattern ????
91
#Cassandra @doanduyhai
A "queue" story
•  contacts_by_modification_date
•  queue-like pattern ????
☞ buckets to the rescue
92
...
#Cassandra @doanduyhai
Data model summary
•  7 tables for denormalization
93
#Cassandra @doanduyhai
Data model summary
•  7 tables for denormalization
•  Normalize some tables because rare access
94
#Cassandra @doanduyhai
Data model summary
•  7 tables for denormalization
•  Normalize some tables because rare access
•  ...
#Cassandra @doanduyhai
Notes on contact_id
•  In SQL, auto-generated long using sequence
•  In Cassandra, auto-generated t...
#Cassandra @doanduyhai
Notes on contact_id
•  How to store both types ?
97
#Cassandra @doanduyhai
Notes on contact_id
•  How to store both types ?
•  As text ? ☞ easy solution …
98
#Cassandra @doanduyhai
Notes on contact_id
•  How to store both types ?
•  As text ? ☞ easy solution …
•  … but waste of s...
#Cassandra @doanduyhai
Notes on contact_id
•  Long ☞ 8 bytes
•  Long as text(UTF-8: 1 byte) ☞ "digits count" bytes
100
#Cassandra @doanduyhai
Notes on contact_id
•  UUID ☞ 16 bytes
E81D4C70-A638-11E4-83CB-DEB70BF9330F
•  32 hex chars + 4 hyp...
#Cassandra @doanduyhai
Notes on contact_id
•  20 bytes wasted per contact uuid
102
#Cassandra @doanduyhai
Notes on contact_id
•  20 bytes wasted per contact uuid
•  × 7 denormalizations = 140 bytes per con...
#Cassandra @doanduyhai
Notes on contact_id
•  20 bytes wasted per contact uuid
•  × 7 denormalizations = 140 bytes per con...
#Cassandra @doanduyhai
Notes on contact_id
•  ☞ just save contact id as byte[ ]
105
#Cassandra @doanduyhai
Notes on contact_id
•  ☞ just save contact id as byte[ ]
•  Achilles @TypeTransformer for automatic...
#Cassandra @doanduyhai
Notes on contact_id
•  ☞ just save contact id as byte[ ]
•  Achilles @TypeTransformer for automatic...
#Cassandra @doanduyhai
Achilles
•  Advanced "object mapper"
•  Fluent API
•  Tons of features
•  TDD friendly
108
#Cassandra @doanduyhai
Achilles
•  Dirty checking, what is it ?
109
#Cassandra @doanduyhai
Achilles
•  Dirty checking, what is it ?
•  1 contact ≈ 8 mutable fields
110
#Cassandra @doanduyhai
Achilles
•  Dirty checking, what is it ?
•  1 contact ≈ 8 mutable fields
•  × 7 denormalizations = 5...
#Cassandra @doanduyhai
Achilles
•  Dirty checking, what is it ?
•  1 contact ≈ 8 mutable fields
•  × 7 denormalizations = 5...
#Cassandra @doanduyhai
Achilles
•  Are you going to manually generate 56+ prepared
statements for all possible updates ?
1...
#Cassandra @doanduyhai
Achilles
•  Are you going to manually generate 56+ prepared
statements for all possible updates ?

...
#Cassandra @doanduyhai
Achilles
•  Dirty check in action
115

//No read-before-write

ContactEntity proxy = manager.forUpd...
#Cassandra @doanduyhai
Achilles
116
Empty
Entity
DirtyMap
Proxy Setters interception
PrimaryKey
#Cassandra @doanduyhai
Achilles
•  Dynamic statements generation
117

UPDATE contacts SET firstname=?, lastname=?,address=...
#Cassandra @doanduyhai
Achilles
•  Insert strategy, why is it so important ?
118
#Cassandra @doanduyhai
Achilles
•  Simple INSERT prepared statement
119

INSERT INTO 

 
contacts(contact_id,name,age,addr...
#Cassandra @doanduyhai
Achilles
•  Runtime values binding
•  some columns are optional
120

preparedStatement.bind(49374,’...
#Cassandra @doanduyhai
Achilles
121
Wait … are you saying inserting null in CQL???
????
#Cassandra @doanduyhai
Achilles
122
Inserting null creating tombstones
#Cassandra @doanduyhai
Achilles
123
Inserting null creating tombstones
× 7 denormalizations
#Cassandra @doanduyhai
Achilles
124
Inserting null creating tombstones
× 7 denormalizations
× billions of contacts created...
#Cassandra @doanduyhai
Achilles
•  Simple annotation
125

@Entity(table = "contacts_by_id »)

@Strategy(insert = InsertStr...
#Cassandra @doanduyhai
Achilles
•  Runtime dynamic INSERT statement
126

INSERT INTO 

 
contacts(contact_id, name, age, a...
#Cassandra @doanduyhai
Achilles
•  Remember the contactId ⇄ byte[ ] conversion ?
127
@PartitionKey
@Column(name = "contact...
#Cassandra @doanduyhai
Achilles
128
public interface Codec<FROM, TO> {

Class<FROM> sourceType();

Class<TO> targetType();...
#Cassandra @doanduyhai
Achilles
•  Dynamic logging in action
129

2014-12-01 14:25:20,554 Bound statement : [INSERT INTO
c...
#Cassandra @doanduyhai
Achilles
•  Dynamic logging
•  runtime activation
•  no need to recompile/re-deploy
•  save us hour...
#Cassandra @doanduyhai
Take Away
#Cassandra @doanduyhai
Conditions for success
•  Data modeling is crucial
132
#Cassandra @doanduyhai
Conditions for success
•  Data modeling is crucial
•  Double-run strategy & timestamp trick FTW
133
#Cassandra @doanduyhai
Conditions for success
•  Data modeling is crucial
•  Double-run strategy & timestamp trick FTW
•  ...
#Cassandra @doanduyhai
Conditions for success
•  Data modeling is crucial
•  Double-run strategy & timestamp trick FTW
•  ...
#Cassandra @doanduyhai
Conditions for success
•  Data modeling is crucial
•  Double-run strategy & timestamp trick FTW
•  ...
#Cassandra @doanduyhai
Thank You
! ""

Upcoming SlideShare

Loading in …5

×

  1. 1. Migration Best Practices: From RDBMS to Cassandra without a Hitch #Cassandra @doanduyhai
  2. 2. #Cassandra @doanduyhai Who am I ? 2 DuyHai Doan Achilles Cassandra Technical Advocate @ Datastax Former Java Developer @ Libon
  3. 3. #Cassandra @doanduyhai Agenda •  Libon context •  Migration strategy •  Business code migration •  Data Modeling •  Take Away 3
  4. 4. #Cassandra @doanduyhai Libon Context
  5. 5. #Cassandra @doanduyhai What is Libon ? •  Messaging app •  VOIP (out) •  Custom voicemail & greetings •  SMS/chat/file transfer •  Contacts matching 5
  6. 6. #Cassandra @doanduyhai Contact Matching 6 Libon User
  7. 7. #Cassandra @doanduyhai Contact Matching 7 Libon User Friend
  8. 8. #Cassandra @doanduyhai Contact Matching 8 Libon User Friend Contact matching
  9. 9. #Cassandra @doanduyhai Contact Matching 9 Libon User Friend Accept link
  10. 10. #Cassandra @doanduyhai Project Context •  Application grew over the years 10
  11. 11. #Cassandra @doanduyhai Project Context •  Application grew over the years •  Already using Cassandra to handle events •  messaging / file sharing / SMS / notifications •  Cassandra R/W latencies ≈ 0,4 ms •  server response time under 10 ms 11
  12. 12. #Cassandra @doanduyhai Project Context •  About contacts … 12
  13. 13. #Cassandra @doanduyhai Project Context •  About contacts … •  stored as relational model in RDBMS (Oracle) 13
  14. 14. #Cassandra @doanduyhai Project Context •  About contacts … •  stored as relational model in RDBMS (Oracle) •  1 user ≈ 300 contacts 14
  15. 15. #Cassandra @doanduyhai Project Context •  About contacts … •  stored as relational model in RDBMS (Oracle) •  1 user ≈ 300 contacts •  with millions users ☞ billions of contacts to handle 15
  16. 16. #Cassandra @doanduyhai Project Context •  About contacts … •  stored as relational model in RDBMS (Oracle) •  1 user ≈ 300 contacts •  with millions users ☞ billions of contacts to handle •  query latency unpredictable 16
  17. 17. #Cassandra @doanduyhai17
  18. 18. #Cassandra @doanduyhai Fixing the problem •  Tune the RDBMS 18
  19. 19. #Cassandra @doanduyhai Fixing the problem •  Tune the RDBMS •  indices 19
  20. 20. #Cassandra @doanduyhai Fixing the problem •  Tune the RDBMS •  indices •  partitioning 20
  21. 21. #Cassandra @doanduyhai Fixing the problem •  Tune the RDBMS •  indices •  partitioning •  less joins, simplified relational model 21
  22. 22. #Cassandra @doanduyhai Fixing the problem •  Tune the RDBMS •  indices •  partitioning •  less joins, simplified relational model •  hardware capacity increased 22
  23. 23. #Cassandra @doanduyhai Fixing the problem •  Tune the RDBMS •  indices •  partitioning •  less joins, simplified relational model •  hardware capacity increased 23 That worked
  24. 24. #Cassandra @doanduyhai Fixing the problem •  Tune the RDBMS •  indices •  partitioning •  less joins, simplified relational model •  hardware capacity increased 24 That worked but …
  25. 25. #Cassandra @doanduyhai Back-end application RDBMS Cassandra 25
  26. 26. #Cassandra @doanduyhai Back-end application RDBMS Cassandra 26 We need to choose
  27. 27. #Cassandra @doanduyhai Next Challenges •  High Availability (DB failure, site failure …) 27
  28. 28. #Cassandra @doanduyhai Next Challenges •  High Availability (DB failure, site failure …) •  Predictable performance at scale 28
  29. 29. #Cassandra @doanduyhai Next Challenges •  High Availability (DB failure, site failure …) •  Predictable performance at scale •  Going to multi data-centers 29
  30. 30. #Cassandra @doanduyhai Next Challenges •  High Availability (DB failure, site failure …) •  Predictable performance at scale •  Going to multi data-centers ☞ Cassandra, what else ? 30
  31. 31. #Cassandra @doanduyhai Data Migration Strategy
  32. 32. #Cassandra @doanduyhai Objectives •  No downtime 32
  33. 33. #Cassandra @doanduyhai Objectives •  No downtime •  No concurrency corner-cases 33
  34. 34. #Cassandra @doanduyhai Objectives •  No downtime •  No concurrency corner-cases •  Safe rollback possible 34
  35. 35. #Cassandra @doanduyhai Objectives •  No downtime •  No concurrency corner-cases •  Safe rollback possible •  Replay-ability & resume-ability 35
  36. 36. #Cassandra @doanduyhai Strategy •  4 phases 36
  37. 37. #Cassandra @doanduyhai Strategy •  4 phases •  Write contacts to both data stores 37
  38. 38. #Cassandra @doanduyhai Strategy •  4 phases •  Write contacts to both data stores •  Old contacts migration 38
  39. 39. #Cassandra @doanduyhai Strategy •  4 phases •  Write contacts to both data stores •  Old contacts migration •  Switch to Cassandra (but keep RDBMS in case of…) 39
  40. 40. #Cassandra @doanduyhai Strategy •  4 phases •  Write contacts to both data stores •  Old contacts migration •  Switch to Cassandra (but keep RDBMS in case of…) •  Remove the RDBMS code 40
  41. 41. #Cassandra @doanduyhai Migration Phase 1 41 Back end server · · · SQLSQLSQL C* C* C* C* C* Write contactUUID contactId … contactUUID 129363 123e4567- e89b-12d3… 834849 contacId(long) + contactUUID
  42. 42. #Cassandra @doanduyhai Migration Phase 1 42 Back end server · · · SQLSQLSQL C* C* C* C* C* Read
  43. 43. #Cassandra @doanduyhai Migration Phase 2 •  On live production, migrate old contacts 43 SQLSQLSQL C* C* C* C* C* For each batch of users SELECT * FROM contacts WHERE user_id = … AND contact_uuid IS NULL Old contacts created before phase 1
  44. 44. #Cassandra @doanduyhai Migration Phase 2 •  On live production, migrate old contacts 44 SQLSQLSQL C* C* C* C* C* For each batch of users SELECT * FROM contacts WHERE user_id = … AND contact_uuid IS NULL Logged batches of INSERT INTO contacts(..) VALUES(…) USING TIMESTAMP now() - 1 week Old contacts created before phase 1
  45. 45. #Cassandra @doanduyhai Migration Phase 2 45 USING TIMESTAMP now() - 1 week ????
  46. 46. #Cassandra @doanduyhai Migration Phase 2 •  During data migration … 46
  47. 47. #Cassandra @doanduyhai Migration Phase 2 •  During data migration … •  … concurrent writes from the migration batch … 47
  48. 48. #Cassandra @doanduyhai Migration Phase 2 •  During data migration … •  … concurrent writes from the migration batch … •  … and updates from production for the same contact 48
  49. 49. #Cassandra @doanduyhai Migration Phase 2 49 contact_uuid name (now -1 week) … name (now) … Johny … Johnny … Insert from batch (to the past) Update from production
  50. 50. #Cassandra @doanduyhai Migration Phase 2 50 contact_uuid name (now -1 week) … name (now) … Johny … Johnny … Future reads pick the most up-to-date value
  51. 51. #Cassandra @doanduyhai Last Write Win in action 51 Case 1 Case 2 Batchpast(Johny) t1 Prodnow(Johnny) t2 t3 Read(Johnny) Batchpast(Johny) t1 Prodnow(Johnny) t2 t3 Read(Johnny)
  52. 52. #Cassandra @doanduyhai Migration Phase 2 52 "Write to the Past… to save the Future" Libon – 2014/10/08
  53. 53. #Cassandra @doanduyhai Migration Phase 3 53 Back end server · · · SQLSQLSQL C* C* C* C* C* Write
  54. 54. #Cassandra @doanduyhai Migration Phase 4 54 Back end server · · · SQLSQLSQL C* C* C* C* C* Write ❌
  55. 55. #Cassandra @doanduyhai Business Code Refactoring
  56. 56. #Cassandra @doanduyhai Code Inventory •  Written for RDBMS 56
  57. 57. #Cassandra @doanduyhai Code Inventory •  Written for RDBMS •  Lots of joins (no surprise) 57
  58. 58. #Cassandra @doanduyhai Code Inventory •  Written for RDBMS •  Lots of joins (no surprise) •  Designed around transactions 58
  59. 59. #Cassandra @doanduyhai Code Inventory •  Written for RDBMS •  Lots of joins (no surprise) •  Designed around transactions •  Spring @Transactional everywhere 59
  60. 60. #Cassandra @doanduyhai Code Inventory cont. •  Entities go through Services & Repositories 60 Repositories Services ContactEntity
  61. 61. #Cassandra @doanduyhai Code Inventory cont. •  Hibernate is auto-magic 61
  62. 62. #Cassandra @doanduyhai Code Inventory cont. •  Hibernate is auto-magic •  lazy loading •  1st level cache •  N+1 select 62 Repositories Services ContactEntity
  63. 63. #Cassandra @doanduyhai Which options ? •  Throw existing code … •  … and re-design from scratch for Cassandra 63
  64. 64. #Cassandra @doanduyhai Which options ? •  Throw existing code … •  … and re-design from scratch for Cassandra 64 No way !
  65. 65. #Cassandra @doanduyhai Code Quality •  Existing business code has… •  … ≈ 3500 unit tests 65
  66. 66. #Cassandra @doanduyhai Code Quality •  Existing business code has… •  … ≈ 3500 unit tests •  and ≈600+ integration tests 66
  67. 67. #Cassandra @doanduyhai Code Quality 67 "The code coverage is one of your most valuable technical asset" Libon – since beginning
  68. 68. #Cassandra @doanduyhai Repositories Services Refactoring Strategy 68 ContactMatchingService ContactService ContactSync ContactEntity n 1 n n
  69. 69. #Cassandra @doanduyhai Repositories Services Refactoring Strategy 69 ContactMatchingService ContactService ContactNoSQLEntity ContactSync ContactEntity n 1 n n Proxy
  70. 70. #Cassandra @doanduyhai Repositories Services Refactoring Strategy 70 ContactMatchingService ContactService ContactNoSQLEntity ContactSync ContactEntity n 1 n n Denorm2 … DenormN Denorm1 Proxy
  71. 71. #Cassandra @doanduyhai Refactoring Strategy •  Use CQRS •  ContactReadRepository •  ContactWriteRepository •  ContactUpdateRepository •  ContactDeleteRepository 71
  72. 72. #Cassandra @doanduyhai Refactoring Strategy •  ContactReadRepository •  direct sequential read •  no joins •  1 read ≈ 1 SELECT 72
  73. 73. #Cassandra @doanduyhai Refactoring Strategy •  ContactWriteRepository •  write to all denormalized tables •  using CQL logged batches •  use TTLs 73
  74. 74. #Cassandra @doanduyhai Refactoring Strategy •  ContactUpdateRepository •  read-before-write most of the time ???? •  rare updates ☞ acceptable perf penalty 74
  75. 75. #Cassandra @doanduyhai Refactoring Strategy •  ContactDeleteRepository •  delete by partition key 75
  76. 76. #Cassandra @doanduyhai Outcome •  5 months of 2 men work 76
  77. 77. #Cassandra @doanduyhai Outcome •  5 months of 2 men work •  Many iterations to fix bugs (thanks to IT) 77
  78. 78. #Cassandra @doanduyhai Outcome •  5 months of 2 men work •  Many iterations to fix bugs (thanks to IT) •  Lots of performance benchmarks using Gatling 78
  79. 79. #Cassandra @doanduyhai Gatling Output 79
  80. 80. #Cassandra @doanduyhai Outcome •  5 months of 2 men work •  Many iterations to fix bugs (thanks to IT) •  Lots of performance benchmarks using Gatling ☞ data model & code validation 80
  81. 81. #Cassandra @doanduyhai Outcome •  5 months of 2 men work •  Many iterations to fix bugs (thanks to IT) •  Lots of performance benchmarks using Gatling ☞ data model & code validation •  … we are almost there for production 81
  82. 82. #Cassandra @doanduyhai Data Model
  83. 83. #Cassandra @doanduyhai Denormalization, the good •  Support fast reads •  1 read ≈ 1 SELECT •  Worthy because mostly read, few updates 83
  84. 84. #Cassandra @doanduyhai Denormalization, the bad •  Updating mutable data can be nightmare •  Data model bound by existing client-facing API •  Update paths very error-prone without tests 84
  85. 85. #Cassandra @doanduyhai Data model in detail 85 Contacts_by_id Contacts_by_identifiers Contacts_in_profiles Contacts_by_modification_date Contacts_by_firstname_lastname Contacts_linked_user
  86. 86. #Cassandra @doanduyhai Data model in detail 86 Contacts_by_id Contacts_by_identifiers Contacts_in_profiles Contacts_by_modification_date Contacts_by_firstname_lastname Contacts_linked_user user_id always component of partition key
  87. 87. #Cassandra @doanduyhai Scalable design 87 n1 n2 n3 n4 n5 n6 n7 n8 A B C D E F G H user_id1 user_id2 user_id3 user_id4 user_id5
  88. 88. #Cassandra @doanduyhai Scalable design 88 n1 n2 n3 n4 n5 n6 n7 n8 A B C D E F G H user_id1user_id2 user_id3 user_id4 user_id5
  89. 89. #Cassandra @doanduyhai Bloom filters in action •  For some tables, partition key = (user_id, contact_id) ☞ fast look-up, leverages Bloom filters ☞ touches 1 SSTable most of the time 89
  90. 90. #Cassandra @doanduyhai Data model in detail 90 Contacts_by_id Contacts_by_identifiers Contacts_in_profiles Contacts_by_modification_date Contacts_by_firstname_lastname Contacts_linked_user Wide partition
  91. 91. #Cassandra @doanduyhai A "queue" story •  contacts_by_modification_date •  queue-like pattern ???? 91
  92. 92. #Cassandra @doanduyhai A "queue" story •  contacts_by_modification_date •  queue-like pattern ???? ☞ buckets to the rescue 92 user_id:2014-12 date35 date12 … … date47 … … … … user_id:2014-11 date11 date12 … … date34 … … … …
  93. 93. #Cassandra @doanduyhai Data model summary •  7 tables for denormalization 93
  94. 94. #Cassandra @doanduyhai Data model summary •  7 tables for denormalization •  Normalize some tables because rare access 94
  95. 95. #Cassandra @doanduyhai Data model summary •  7 tables for denormalization •  Normalize some tables because rare access •  Read-before write in most update scenarios ???? 95
  96. 96. #Cassandra @doanduyhai Notes on contact_id •  In SQL, auto-generated long using sequence •  In Cassandra, auto-generated timeuuid 96
  97. 97. #Cassandra @doanduyhai Notes on contact_id •  How to store both types ? 97
  98. 98. #Cassandra @doanduyhai Notes on contact_id •  How to store both types ? •  As text ? ☞ easy solution … 98
  99. 99. #Cassandra @doanduyhai Notes on contact_id •  How to store both types ? •  As text ? ☞ easy solution … •  … but waste of space ! •  because encoded as UTF-8 or ASCII in Cassandra 99
  100. 100. #Cassandra @doanduyhai Notes on contact_id •  Long ☞ 8 bytes •  Long as text(UTF-8: 1 byte) ☞ "digits count" bytes 100
  101. 101. #Cassandra @doanduyhai Notes on contact_id •  UUID ☞ 16 bytes E81D4C70-A638-11E4-83CB-DEB70BF9330F •  32 hex chars + 4 hyphens = 36 chars •  UUID as text(UTF-8: 1 byte) ☞ 36 bytes •  Bytes overhead = 36 – 16 = 20 bytes 101
  102. 102. #Cassandra @doanduyhai Notes on contact_id •  20 bytes wasted per contact uuid 102
  103. 103. #Cassandra @doanduyhai Notes on contact_id •  20 bytes wasted per contact uuid •  × 7 denormalizations = 140 bytes per contact uuid 103
  104. 104. #Cassandra @doanduyhai Notes on contact_id •  20 bytes wasted per contact uuid •  × 7 denormalizations = 140 bytes per contact uuid •  × 109 contacts = 140 GB wasted 104 ???? not even counting replication factor …
  105. 105. #Cassandra @doanduyhai Notes on contact_id •  ☞ just save contact id as byte[ ] 105
  106. 106. #Cassandra @doanduyhai Notes on contact_id •  ☞ just save contact id as byte[ ] •  Achilles @TypeTransformer for automatic conversion (see later) 106
  107. 107. #Cassandra @doanduyhai Notes on contact_id •  ☞ just save contact id as byte[ ] •  Achilles @TypeTransformer for automatic conversion (see later) •  Use blobAsBigInt( ) or blobAsUUID( ) to view data 107
  108. 108. #Cassandra @doanduyhai Achilles •  Advanced "object mapper" •  Fluent API •  Tons of features •  TDD friendly 108
  109. 109. #Cassandra @doanduyhai Achilles •  Dirty checking, what is it ? 109
  110. 110. #Cassandra @doanduyhai Achilles •  Dirty checking, what is it ? •  1 contact ≈ 8 mutable fields 110
  111. 111. #Cassandra @doanduyhai Achilles •  Dirty checking, what is it ? •  1 contact ≈ 8 mutable fields •  × 7 denormalizations = 56 update combinations … 111
  112. 112. #Cassandra @doanduyhai Achilles •  Dirty checking, what is it ? •  1 contact ≈ 8 mutable fields •  × 7 denormalizations = 56 update combinations … •  and not even counting multiple fields updates … 112
  113. 113. #Cassandra @doanduyhai Achilles •  Are you going to manually generate 56+ prepared statements for all possible updates ? 113
  114. 114. #Cassandra @doanduyhai Achilles •  Are you going to manually generate 56+ prepared statements for all possible updates ? •  Or just use dynamic plain string statements and get some perf penalty ? 114
  115. 115. #Cassandra @doanduyhai Achilles •  Dirty check in action 115 //No read-before-write ContactEntity proxy = manager.forUpdate(ContactEntity.class, contactId); proxy.setFirstName(…); proxy.setLastName(…); //type-safe updates proxy.setAddress(…); manager.update(proxy);
  116. 116. #Cassandra @doanduyhai Achilles 116 Empty Entity DirtyMap Proxy Setters interception PrimaryKey
  117. 117. #Cassandra @doanduyhai Achilles •  Dynamic statements generation 117 UPDATE contacts SET firstname=?, lastname=?,address=? WHERE contact_id=? prepared statements are cached, of course
  118. 118. #Cassandra @doanduyhai Achilles •  Insert strategy, why is it so important ? 118
  119. 119. #Cassandra @doanduyhai Achilles •  Simple INSERT prepared statement 119 INSERT INTO contacts(contact_id,name,age,address,gender,avatar,…) VALUES(?, ?, ?, ? … ?);
  120. 120. #Cassandra @doanduyhai Achilles •  Runtime values binding •  some columns are optional 120 preparedStatement.bind(49374,’John DOE’,33, null, null, …, null);
  121. 121. #Cassandra @doanduyhai Achilles 121 Wait … are you saying inserting null in CQL??? ????
  122. 122. #Cassandra @doanduyhai Achilles 122 Inserting null creating tombstones
  123. 123. #Cassandra @doanduyhai Achilles 123 Inserting null creating tombstones × 7 denormalizations
  124. 124. #Cassandra @doanduyhai Achilles 124 Inserting null creating tombstones × 7 denormalizations × billions of contacts created ???? not even counting replication factor …
  125. 125. #Cassandra @doanduyhai Achilles •  Simple annotation 125 @Entity(table = "contacts_by_id ») @Strategy(insert = InsertStrategy.NOT_NULL_FIELDS) public class ContactById { }
  126. 126. #Cassandra @doanduyhai Achilles •  Runtime dynamic INSERT statement 126 INSERT INTO contacts(contact_id, name, age, address,) VALUES(:contact_id, :name, :age, :address); prepared statements are cached, of course
  127. 127. #Cassandra @doanduyhai Achilles •  Remember the contactId ⇄ byte[ ] conversion ? 127 @PartitionKey @Column(name = "contact_id") @TypeTransformer(valueCodecClass = ContactIdToBytes.class) private ContactId contactId; BYOC ☞ Bring Your Own Codec
  128. 128. #Cassandra @doanduyhai Achilles 128 public interface Codec<FROM, TO> { Class<FROM> sourceType(); Class<TO> targetType(); TO encode(FROM fromJava) FROM decode(TO fromCassandra); }
  129. 129. #Cassandra @doanduyhai Achilles •  Dynamic logging in action 129 2014-12-01 14:25:20,554 Bound statement : [INSERT INTO contacts.contacts_by_modification_date(user_id,month_bucket,modification_date,...) VALUES (:user_id,:month_bucket,:modification_date,...) USING TTL :ttl;] with CONSISTENCY LEVEL [LOCAL_QUORUM] 2014-12-01 14:25:20,554 bound values : [222130151, 2014-12, e13d0d50-7965-11e4-af38-90b11c2549e0, ...] 2014-12-01 14:25:20,701 Bound statement : [SELECT birthday,middlename,avatar_size,... FROM contacts.contacts_by_modification_date WHERE user_id=:user_id AND month_bucket=:month_bucket AND (modification_date)>=(:modification_date) ORDER BY modification_date ASC;] with CONSISTENCY LEVEL [LOCAL_QUORUM] 2014-12-01 14:25:20,701 bound values : [222130151, 2014-10, be6bc010-6109-11e4-b385-000038377ead]
  130. 130. #Cassandra @doanduyhai Achilles •  Dynamic logging •  runtime activation •  no need to recompile/re-deploy •  save us hours of debugging •  TRACE log level ☞ query tracing 130
  131. 131. #Cassandra @doanduyhai Take Away
  132. 132. #Cassandra @doanduyhai Conditions for success •  Data modeling is crucial 132
  133. 133. #Cassandra @doanduyhai Conditions for success •  Data modeling is crucial •  Double-run strategy & timestamp trick FTW 133
  134. 134. #Cassandra @doanduyhai Conditions for success •  Data modeling is crucial •  Double-run strategy & timestamp trick FTW •  Data type conversion can be tricky 134
  135. 135. #Cassandra @doanduyhai Conditions for success •  Data modeling is crucial •  Double-run strategy & timestamp trick FTW •  Data type conversion can be tricky •  Benchmark ! 135
  136. 136. #Cassandra @doanduyhai Conditions for success •  Data modeling is crucial •  Double-run strategy & timestamp trick FTW •  Data type conversion can be tricky •  Benchmark ! •  Mindset shifts for the team 136
  137. 137. #Cassandra @doanduyhai Thank You ! ""

Related Articles

windows
cassandra
tutorial

Install Cassandra on Windows 10: Tutorial With Simple Steps

Vladimir Kaplarevic

12/2/2023

cassandra
tutorial

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt! 
We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Sign up below to receive email updates and see what's going on with our company

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

cassandra