Illustration Image

Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

3/27/2019

Reading time:9 min

An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng,…

by DataStax

An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng,… SlideShare Explore You Successfully reported this slideshow.An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian) | C* Summit 2016Upcoming SlideShareLoading in …5× 0 Comments 0 Likes Statistics Notes Be the first to like this No DownloadsNo notes for slide 1. Yabin MengAn Effective Approach to Migrate Thrift to CQL 2. About Pythian• Founded in 1997, a global leader in Enterprise IT transformation and operational excellence•www.pythian.com•https://www.linkedin.com/company/pythian• Love your Data•6000+ database under management•Over 400 DBA’s, in 35 countries•Top 5% of DBA workforce – 9 Oracle ACE’s, 2 Microsoft MVP’s, 1 Cassandra MVP•Partners with Amazon, Cloudera, Oracle, Microsoft, DataStax, and a lot more• Specialities: Databases, Cloud, DevOps, Big Data Infrastructure, Advanced Analytics, Infrastructure Management• Industry Recognition•DBTA1100 2016 - The Companies That Matter Most in Data•IAOP2- GLOBAL OUTSOURCING 100® RISING STAR FOR 2016•Top 25 Canadian ICT3Professional Services Companies•… …DBTA1: Database Trend and Applications (http://www.dbta.com/)IAOP2: International Association of Outsourcing Professionals (https://www.iaop.org/)ICT3: Information & Communication Technology© Pythian, All Rights Reserved. 2 3. About Myself• 15+ years IT experience - relational database/data warehousing, business intelligence andanalytics (BIA), NoSQL database/Big Data• Started my career as a RDBMS client application developer (PowerBuilder, Java)• 5 years as IBM DB2 Kernel Developer, New Development (DB2 v9.5, 9.7, 9.8, 10)• Multi-years data warehousing and BIA experience in insurance and capital market sectors astechnical lead/architect (IBM/Cognos, SAS/SAS 9.x, Information Builders/WebFocus)• DataStax Certified Developer, Administrator, and Partner Architect• https://www.linkedin.com/in/yabinmeng• @yabinmeng© Pythian, All Rights Reserved. 3 4. 1 Problem Overview2 Background Discussion3 Thrift to CQL Transition4 C* Storage Engine (pre 3.0)5 Data Migration6 Conclusion4© Pythian, All Rights Reserved. 5. Problem Overview• C* cluster upgrade from version 1.2 to 2.1• Thrift based application (Hector)•Plan to convert to CQL, but not done yet• Data copied from one C* cluster to another•C* COPY command• Application failed in DEV/TEST•Cannot retrieve some required data•cqlsh output shows NO difference• Schema• Data query result© Pythian, All Rights Reserved. 5 6. Background Disccsion, COPY commands• COPY commands•Syntax: COPY <table_name> (column1, column2, …) TO/FROM <file_name> WITH options•Since 1.1.3 (PostgresSQL COPY commands)•Simple yet with limitation. Continuous improvement•Small datasets (max. a few million rows)•CSV format•No primary key check•New options and better performance in cqlsh copy– 2.1.13, 2.2.5, 3.0.3, 3.2– ex (common): NUMPROCESSES, CONFIGFILE– ex (TO): BEGINTOKEN / ENDTOKEN– ex (FROM): INGESTRATE (rows/second)•Predefined schema© Pythian, All Rights Reserved. 6 7. Background Discussion, Schema-less?• Schema-less or Schema•Many NoSQL DBs is marked as “schema-less”• A lot of debate•C* is schema optional (Thrift)•C* is schema required (CQL)© Pythian, All Rights Reserved. 7 8. Background Discussion, C* Utilities•cassandra-cli utility•Command line interface since very early years of C*•Thrift-based counterpart of “cqlsh”•C* data engine format (pre 3.0)•Deprecated since C* 2.1 and will be removed in 3.0•Comparator and Validator•Validator: row key, column value•Comparator: column name + column sorting order•sstable2json utility•Deprecated and will be removed in 3.0SET users['key_val']['column1_name']='some value';SET users['key_val']['column2_name']='some value';...LIST users;GET users['key_val']['column1_name'] 9. Thrift to CQL Transition• What is Thrift and CQL?•Thrift is a general framework (interface definition language and binary communication protocol)• "scalable cross-language services development"•APIs to interact with C*•CQL is also a query language© Pythian, All Rights Reserved. 9Thrift ThriftCQLBinaryCQL querylanguageCQL querylanguageN/A *C* evolvementProtocolLayerAccessAbstractionLayerC* Data EngineStorageLayer 10. Thrift to CQL Transition, continued• Thrift in C*•Since the beginning as a communication protocol•Low level API•Deprecated• CQL in C*•0.8, CQL as a language•1.2, CQL as a language, version 3 (CQL3)•1.2, CQL as a communication protocol (CQL3 only)•Continuously evolving• Thrift vs. CQL as protocol•Thrift: RPC based; Synchronous•CQL: Server notification (event registration); Asynchronous• CQL as a language•Feature rich: UDF, lightweight transaction, JSON support, materialized view, …• Thrift vs. CQL as client driver•Thrift driver is deprecated and not supported© Pythian, All Rights Reserved. 10 11. Thrift to CQL Transition, continued• What is needed for the transition?•Application•Data Model•Data itself•Understanding C* storage engine ** Pre-3.0 C*© Pythian, All Rights Reserved. 11 12. C* Storage Engine (pre 3.0)• C* ColumnFamily (Thrift) / Table (CQL) data structure•Map of sorted map:Map<RowKey, SortedMap<ColumnKey, ColumnValue>>CREATE TABLE song_tags (id uuid,tag_name text,PRIMARY KEY (id, tag_name)) WITH COMPACT STORAGE© Pythian, All Rights Reserved. 12 13. C* Storage Engine – Thrift vs. CQL Terminology• Table–Thrift: Column Family–CQL: Table• Row–Thrift: Storage Row / Partition–CQL: Logic Representation / Part of apartition• Column–Thrift: Cell (column key + column value)–CQL: Metadata / column key or columnvalue© Pythian, All Rights Reserved. 13 14. C* Storage Engine – Compact StorageCREATE TABLE avg_grade2 (student_id int,class_id int,grade double,PRIMARY KEY (student_id, class_id)) WITH COMPACT STORAGE© Pythian, All Rights Reserved. 14• Storage space vs. table metadata– Backward compatibility– Thrift: always compact– CQL: optional• Caveats:– No adding / dropping column– Single non-key column if compound primary key– Non-frozen collection type 15. C* Storage Engine – Static Table Definitioncreate column family avg_gradewith key_validation_class = Int32Typeand comparator = UTF8Typeand column_metadata = [{column_name: class_id, validation_class: Int32Type},{column_name: grade, validation_class: FloatType}]create table avg_grade (key int primary key,class_id int,grade float) with compact storage•Equivalent definition•Thrift: no name for key -- default name used in CQL:“key”© Pythian, All Rights Reserved. 15Thrift / cassandra_cli 16. C* Storage Engine – Dynamic Table DefinitionThrift / cassandra_clicreate column family sensor_datawith key_validation_class = Int32Typeand comparator = TimeUUIDTypeand default_validation_class = DecimalType;-------------------RowKey: 1=> (name=1171d7e0-14d2-11e6-858b-5f3b22f4f11c, value=21.5, timestamp=1462680314974000)=> (name=23371940-14d2-11e6-858b-5f3b22f4f11c, value=32.1, timestamp=1462680344792000)-------------------RowKey: 2=> (name=7537fcf0-14d2-11e6-858b-5f3b22f4f11c, value=11.0, timestamp=1462680482367000)• Thrift only• No “column_metadata” section© Pythian, All Rights Reserved. 16CREATE TABLE sensor_data (key int,column1 timeuuid,value decimal,PRIMARY KEY (key, column1)) WITH COMPACT STORAGE•Default “key” for key name•Default “column1” for column name/key•Default “value” for column value•Alter Table 17. C* Storage Engine – Mixed Table DefinitionThrift / cassandra_clicreate column family blogswith key_validation_class = Int32Typeand comparator = UTF8Typeand column_metadata = [{column_name: author, validation_class: UTF8Type}]© Pythian, All Rights Reserved. 17CREATE TABLE blogs (key int PRIMARY KEY,author text) WITH COMPACT STORAGEkey | author-----+--------1 | Donald2 | John-------------------RowKey: 1=> (name=author, value=Donald, timestamp=1462720696717000)=> (name=tags:category, value=music, timestamp=1462720526053000)=> (name=tags:rating, value=A, timestamp=1462720564590000)=> (name=tags:recommend, value=Yes, timestamp=1462720640817000)-------------------RowKey: 2=> (name=author, value=John, timestamp=1462720690911000) 18. Migrate Thrift Data – Overview• Do not fully trust C* schema•“show schema” / “describe keyspace”• Sample your data•“cassandra-cli”, yes•“cqlsh”, no• Review application design and code•Schema-less / schema -optional• Determine CQL schema• Data Migration Method© Pythian, All Rights Reserved. 18 19. Migrate Thrift Data – Determine CQL SchemaThrift / cassandra-clicreate column family some_tbl_namewith key_validation_class = BytesTypeand comparator = UTF8Typeand column_metadata = [{column_name: StaticColName1, validation_class: xxx1},{column_name: StaticColName2, validation_class: xxx2}{column_name: StaticColName3, validation_class: xxx3}… …]StaticColNameSet = {StaticColName1, StaticColName2, StaticColName3, …}DynamicColNameSet = {DynamicColName1, DynmaicColName2, DyanmicColName3, …}CQL / cqlsh schema• Application specific• Could it be a little bit more generic?CREATE TABLE some_tbl_name (key blob,StaticColName1 yyy1,StaticColName2 yyy2,StaticColName3 yyy3,… …,DynamicColMapName map<text, blob>;PRIMARY KEY (key))© Pythian, All Rights Reserved. 19 20. Migrate Thrift Data - ETL Method• ETL•COPY command•Commercial product (jasper soft, pentaho, …)•Self developed• Natural Approach•Read every record from the source table and write it into the target table•Slow• C* load utility•sstableloader© Pythian, All Rights Reserved. 20 21. Migrate Thrift Data – Methods, continuedwriter = CQLSSTableWriter.build()thrift_rows = thirft_conn.fetchRows()for each row in thrift_rows() {keyValue = row.getKey()dynColMap = new Map()thirft_columns = row.getColumns()for each column in thirft_columns {colName = column.getName()if colName is in DynamicColNameSet {dynColMap.addKeyValuePair(column.name, column.value)}else {staticColVal1 = column.getValue(StaticColName1)staticColVal2 = column.getValue(StaticColName2)… … … } } }writer.addRow(keyValue, staticColVal1, staticColVal2, …, dynColMap)}writer.close()• Read source data• Thrift API• Writing SSTables Directly• SSTableSimpleUnsortedWriter• CQLSSTableWriter 22. Migrate Thrift Data - ResultFetching result from source table and generate SSTable ...-------------------------> batch_run: 1; batch_size: 20001; batch_runtime: PT5.594S> batch_run: 2; batch_size: 20000; batch_runtime: PT4.368S> batch_run: 3; batch_size: 20000; batch_runtime: PT3.937S> batch_run: 4; batch_size: 20000; batch_runtime: PT3.838S> batch_run: 5; batch_size: 20000; batch_runtime: PT2.015S… … …> batch_run: 59; batch_size: 20000; batch_runtime: PT1.941S> batch_run: 60; batch_size: 12581; batch_runtime: PT2.82S#### Total Records Processed: 1192582; Total Run Time: PT2M35.178S$ sstableloader -d <node_ip><sstable_root>/<keyspace_name>/<target_cql_table_name>/… … …Summary statistics:Connections per host: : 1Total files transferred: : 15Total bytes transferred: : 511566477Total duration (ms): : 47451Average transfer rate (MB/s): : 10Peak transfer rate (MB/s): : 20© Pythian, All Rights Reserved. 22 23. Conclusion• C* is moving from Thrift to CQL–Do it !• C* schema may not tell you all• Effective C* data migration is not easy•C* does have “load” utility (framework)• C* storage engine changes a lot in 3.0• Pythian Blog Space:•https://www.pythian.com/blog/•An Effective Approach to Migrate Dynamic Thrift Data to CQL (Part 1, Part 2, Part 3)© DataStax, All Rights Reserved. 23 24. © DataStax, All Rights Reserved. 24 Recommended Betsy Corcoran on Choosing the Right Technology for Your SchoolOnline Course - LinkedIn Learning Time Management Tips WeeklyOnline Course - LinkedIn Learning Learning to Teach OnlineOnline Course - LinkedIn Learning Migration from Thrift to CQL (Brij Bhushan Ravat, Ericsson) | Cassandra Summi...DataStax CQL Under the HoodRobbie Strickland Migration from MySQL to Cassandra for millions of active usersAndrey Panasyuk Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseDataStax Introduction to Apache Cassandra™ + What’s New in 4.0DataStax Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...DataStax Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesDataStax About Blog Terms Privacy Copyright LinkedIn Corporation © 2019 × Public clipboards featuring this slideNo public clipboards found for this slideSelect another clipboard ×Looks like you’ve clipped this slide to already.Create a clipboardYou just clipped your first slide! Clipping is a handy way to collect important slides you want to go back to later. Now customize the name of a clipboard to store your clips. Description Visibility Others can see my Clipboard

Illustration Image
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng,…

Successfully reported this slideshow.

An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian) | C* Summit 2016
Yabin Meng
An Effective Approach to Migrate Thrift to CQL
About Pythian
• Founded in 1997, a global leader in Enterprise IT transformation and operational excellence
•www.pythian.c...
About Myself
• 15+ years IT experience - relational database/data warehousing, business intelligence and
analytics (BIA), ...
1 Problem Overview
2 Background Discussion
3 Thrift to CQL Transition
4 C* Storage Engine (pre 3.0)
5 Data Migration
6 Con...
Problem Overview
• C* cluster upgrade from version 1.2 to 2.1
• Thrift based application (Hector)
•Plan to convert to CQL,...
Background Disccsion, COPY commands
• COPY commands
•Syntax: COPY <table_name> (column1, column2, …) TO/FROM <file_name> W...
Background Discussion, Schema-less?
• Schema-less or Schema
•Many NoSQL DBs is marked as “schema-less”
• A lot of debate
•...
Background Discussion, C* Utilities
•cassandra-cli utility
•Command line interface since very early years of C*
•Thrift-ba...
Thrift to CQL Transition
• What is Thrift and CQL?
•Thrift is a general framework (interface definition language and binar...
Thrift to CQL Transition, continued
• Thrift in C*
•Since the beginning as a communication protocol
•Low level API
•Deprec...
Thrift to CQL Transition, continued
• What is needed for the transition?
•Application
•Data Model
•Data itself
•Understand...
C* Storage Engine (pre 3.0)
• C* ColumnFamily (Thrift) / Table (CQL) data structure
•Map of sorted map:
Map<RowKey, Sorted...
C* Storage Engine – Thrift vs. CQL Terminology
• Table
–Thrift: Column Family
–CQL: Table
• Row
–Thrift: Storage Row / Par...
C* Storage Engine – Compact Storage
CREATE TABLE avg_grade2 (
student_id int,
class_id int,
grade double,
PRIMARY KEY (stu...
C* Storage Engine – Static Table Definition
create column family avg_grade
with key_validation_class = Int32Type
and compa...
C* Storage Engine – Dynamic Table Definition
Thrift / cassandra_cli
create column family sensor_data
with key_validation_c...
C* Storage Engine – Mixed Table Definition
Thrift / cassandra_cli
create column family blogs
with key_validation_class = I...
Migrate Thrift Data – Overview
• Do not fully trust C* schema
•“show schema” / “describe keyspace”
• Sample your data
•“ca...
Migrate Thrift Data – Determine CQL Schema
Thrift / cassandra-cli
create column family some_tbl_name
with key_validation_c...
Migrate Thrift Data - ETL Method
• ETL
•COPY command
•Commercial product (jasper soft, pentaho, …)
•Self developed
• Natur...
Migrate Thrift Data – Methods, continued
writer = CQLSSTableWriter.build()
thrift_rows = thirft_conn.fetchRows()
for each ...
Migrate Thrift Data - Result
Fetching result from source table and generate SSTable ...
-------------------------
> batch_...
Conclusion
• C* is moving from Thrift to CQL
–Do it !
• C* schema may not tell you all
• Effective C* data migration is no...
© DataStax, All Rights Reserved. 24

Upcoming SlideShare

Loading in …5

×

  • Be the first to like this

  1. 1. Yabin Meng An Effective Approach to Migrate Thrift to CQL
  2. 2. About Pythian • Founded in 1997, a global leader in Enterprise IT transformation and operational excellence •www.pythian.com •https://www.linkedin.com/company/pythian • Love your Data •6000+ database under management •Over 400 DBA’s, in 35 countries •Top 5% of DBA workforce – 9 Oracle ACE’s, 2 Microsoft MVP’s, 1 Cassandra MVP •Partners with Amazon, Cloudera, Oracle, Microsoft, DataStax, and a lot more • Specialities: Databases, Cloud, DevOps, Big Data Infrastructure, Advanced Analytics, Infrastructure Management • Industry Recognition •DBTA1 100 2016 - The Companies That Matter Most in Data •IAOP2 - GLOBAL OUTSOURCING 100® RISING STAR FOR 2016 •Top 25 Canadian ICT3 Professional Services Companies •… … DBTA1 : Database Trend and Applications (http://www.dbta.com/) IAOP2 : International Association of Outsourcing Professionals (https://www.iaop.org/) ICT3 : Information & Communication Technology © Pythian, All Rights Reserved. 2
  3. 3. About Myself • 15+ years IT experience - relational database/data warehousing, business intelligence and analytics (BIA), NoSQL database/Big Data • Started my career as a RDBMS client application developer (PowerBuilder, Java) • 5 years as IBM DB2 Kernel Developer, New Development (DB2 v9.5, 9.7, 9.8, 10) • Multi-years data warehousing and BIA experience in insurance and capital market sectors as technical lead/architect (IBM/Cognos, SAS/SAS 9.x, Information Builders/WebFocus) • DataStax Certified Developer, Administrator, and Partner Architect • https://www.linkedin.com/in/yabinmeng • @yabinmeng © Pythian, All Rights Reserved. 3
  4. 4. 1 Problem Overview 2 Background Discussion 3 Thrift to CQL Transition 4 C* Storage Engine (pre 3.0) 5 Data Migration 6 Conclusion 4© Pythian, All Rights Reserved.
  5. 5. Problem Overview • C* cluster upgrade from version 1.2 to 2.1 • Thrift based application (Hector) •Plan to convert to CQL, but not done yet • Data copied from one C* cluster to another •C* COPY command • Application failed in DEV/TEST •Cannot retrieve some required data •cqlsh output shows NO difference • Schema • Data query result © Pythian, All Rights Reserved. 5
  6. 6. Background Disccsion, COPY commands • COPY commands •Syntax: COPY <table_name> (column1, column2, …) TO/FROM <file_name> WITH options •Since 1.1.3 (PostgresSQL COPY commands) •Simple yet with limitation. Continuous improvement •Small datasets (max. a few million rows) •CSV format •No primary key check •New options and better performance in cqlsh copy – 2.1.13, 2.2.5, 3.0.3, 3.2 – ex (common): NUMPROCESSES, CONFIGFILE – ex (TO): BEGINTOKEN / ENDTOKEN – ex (FROM): INGESTRATE (rows/second) •Predefined schema © Pythian, All Rights Reserved. 6
  7. 7. Background Discussion, Schema-less? • Schema-less or Schema •Many NoSQL DBs is marked as “schema-less” • A lot of debate •C* is schema optional (Thrift) •C* is schema required (CQL) © Pythian, All Rights Reserved. 7
  8. 8. Background Discussion, C* Utilities •cassandra-cli utility •Command line interface since very early years of C* •Thrift-based counterpart of “cqlsh” •C* data engine format (pre 3.0) •Deprecated since C* 2.1 and will be removed in 3.0 •Comparator and Validator •Validator: row key, column value •Comparator: column name + column sorting order •sstable2json utility •Deprecated and will be removed in 3.0 SET users['key_val']['column1_name']='some value'; SET users['key_val']['column2_name']='some value'; ... LIST users; GET users['key_val']['column1_name']
  9. 9. Thrift to CQL Transition • What is Thrift and CQL? •Thrift is a general framework (interface definition language and binary communication protocol) • "scalable cross-language services development" •APIs to interact with C* •CQL is also a query language © Pythian, All Rights Reserved. 9 Thrift Thrift CQL Binary CQL query language CQL query language N/A * C* evolvement Protocol Layer Access Abstraction Layer C* Data Engine Storage Layer
  10. 10. Thrift to CQL Transition, continued • Thrift in C* •Since the beginning as a communication protocol •Low level API •Deprecated • CQL in C* •0.8, CQL as a language •1.2, CQL as a language, version 3 (CQL3) •1.2, CQL as a communication protocol (CQL3 only) •Continuously evolving • Thrift vs. CQL as protocol •Thrift: RPC based; Synchronous •CQL: Server notification (event registration); Asynchronous • CQL as a language •Feature rich: UDF, lightweight transaction, JSON support, materialized view, … • Thrift vs. CQL as client driver •Thrift driver is deprecated and not supported © Pythian, All Rights Reserved. 10
  11. 11. Thrift to CQL Transition, continued • What is needed for the transition? •Application •Data Model •Data itself •Understanding C* storage engine * * Pre-3.0 C* © Pythian, All Rights Reserved. 11
  12. 12. C* Storage Engine (pre 3.0) • C* ColumnFamily (Thrift) / Table (CQL) data structure •Map of sorted map: Map<RowKey, SortedMap<ColumnKey, ColumnValue>> CREATE TABLE song_tags ( id uuid, tag_name text, PRIMARY KEY (id, tag_name) ) WITH COMPACT STORAGE © Pythian, All Rights Reserved. 12
  13. 13. C* Storage Engine – Thrift vs. CQL Terminology • Table –Thrift: Column Family –CQL: Table • Row –Thrift: Storage Row / Partition –CQL: Logic Representation / Part of a partition • Column –Thrift: Cell (column key + column value) –CQL: Metadata / column key or column value © Pythian, All Rights Reserved. 13
  14. 14. C* Storage Engine – Compact Storage CREATE TABLE avg_grade2 ( student_id int, class_id int, grade double, PRIMARY KEY (student_id, class_id) ) WITH COMPACT STORAGE © Pythian, All Rights Reserved. 14 • Storage space vs. table metadata – Backward compatibility – Thrift: always compact – CQL: optional • Caveats: – No adding / dropping column – Single non-key column if compound primary key – Non-frozen collection type
  15. 15. C* Storage Engine – Static Table Definition create column family avg_grade with key_validation_class = Int32Type and comparator = UTF8Type and column_metadata = [ {column_name: class_id, validation_class: Int32Type}, {column_name: grade, validation_class: FloatType} ] create table avg_grade ( key int primary key, class_id int, grade float ) with compact storage •Equivalent definition •Thrift: no name for key -- default name used in CQL: “key” © Pythian, All Rights Reserved. 15 Thrift / cassandra_cli
  16. 16. C* Storage Engine – Dynamic Table Definition Thrift / cassandra_cli create column family sensor_data with key_validation_class = Int32Type and comparator = TimeUUIDType and default_validation_class = DecimalType; ------------------- RowKey: 1 => (name=1171d7e0-14d2-11e6-858b-5f3b22f4f11c, value=21.5, timestamp=1462680314974000) => (name=23371940-14d2-11e6-858b-5f3b22f4f11c, value=32.1, timestamp=1462680344792000) ------------------- RowKey: 2 => (name=7537fcf0-14d2-11e6-858b-5f3b22f4f11c, value=11.0, timestamp=1462680482367000) • Thrift only • No “column_metadata” section © Pythian, All Rights Reserved. 16 CREATE TABLE sensor_data ( key int, column1 timeuuid, value decimal, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE •Default “key” for key name •Default “column1” for column name/key •Default “value” for column value •Alter Table
  17. 17. C* Storage Engine – Mixed Table Definition Thrift / cassandra_cli create column family blogs with key_validation_class = Int32Type and comparator = UTF8Type and column_metadata = [ {column_name: author, validation_class: UTF8Type} ] © Pythian, All Rights Reserved. 17 CREATE TABLE blogs ( key int PRIMARY KEY, author text ) WITH COMPACT STORAGE key | author -----+-------- 1 | Donald 2 | John ------------------- RowKey: 1 => (name=author, value=Donald, timestamp=1462720696717000) => (name=tags:category, value=music, timestamp=1462720526053000) => (name=tags:rating, value=A, timestamp=1462720564590000) => (name=tags:recommend, value=Yes, timestamp=1462720640817000) ------------------- RowKey: 2 => (name=author, value=John, timestamp=1462720690911000)
  18. 18. Migrate Thrift Data – Overview • Do not fully trust C* schema •“show schema” / “describe keyspace” • Sample your data •“cassandra-cli”, yes •“cqlsh”, no • Review application design and code •Schema-less / schema -optional • Determine CQL schema • Data Migration Method © Pythian, All Rights Reserved. 18
  19. 19. Migrate Thrift Data – Determine CQL Schema Thrift / cassandra-cli create column family some_tbl_name with key_validation_class = BytesType and comparator = UTF8Type and column_metadata = [ {column_name: StaticColName1, validation_class: xxx1}, {column_name: StaticColName2, validation_class: xxx2} {column_name: StaticColName3, validation_class: xxx3} … … ] StaticColNameSet = {StaticColName1, StaticColName2, StaticColName3, …} DynamicColNameSet = {DynamicColName1, DynmaicColName2, DyanmicColName3, …} CQL / cqlsh schema • Application specific • Could it be a little bit more generic? CREATE TABLE some_tbl_name ( key blob, StaticColName1 yyy1, StaticColName2 yyy2, StaticColName3 yyy3, … …, DynamicColMapName map<text, blob>; PRIMARY KEY (key) ) © Pythian, All Rights Reserved. 19
  20. 20. Migrate Thrift Data - ETL Method • ETL •COPY command •Commercial product (jasper soft, pentaho, …) •Self developed • Natural Approach •Read every record from the source table and write it into the target table •Slow • C* load utility •sstableloader © Pythian, All Rights Reserved. 20
  21. 21. Migrate Thrift Data – Methods, continued writer = CQLSSTableWriter.build() thrift_rows = thirft_conn.fetchRows() for each row in thrift_rows() { keyValue = row.getKey() dynColMap = new Map() thirft_columns = row.getColumns() for each column in thirft_columns { colName = column.getName() if colName is in DynamicColNameSet { dynColMap.addKeyValuePair(column.name, column.value) } else { staticColVal1 = column.getValue(StaticColName1) staticColVal2 = column.getValue(StaticColName2) … … … } } } writer.addRow(keyValue, staticColVal1, staticColVal2, …, dynColMap) } writer.close() • Read source data • Thrift API • Writing SSTables Directly • SSTableSimpleUnsortedWriter • CQLSSTableWriter
  22. 22. Migrate Thrift Data - Result Fetching result from source table and generate SSTable ... ------------------------- > batch_run: 1; batch_size: 20001; batch_runtime: PT5.594S > batch_run: 2; batch_size: 20000; batch_runtime: PT4.368S > batch_run: 3; batch_size: 20000; batch_runtime: PT3.937S > batch_run: 4; batch_size: 20000; batch_runtime: PT3.838S > batch_run: 5; batch_size: 20000; batch_runtime: PT2.015S … … … > batch_run: 59; batch_size: 20000; batch_runtime: PT1.941S > batch_run: 60; batch_size: 12581; batch_runtime: PT2.82S #### Total Records Processed: 1192582; Total Run Time: PT2M35.178S $ sstableloader -d <node_ip> <sstable_root>/<keyspace_name>/<target_cql_table_name>/ … … … Summary statistics: Connections per host: : 1 Total files transferred: : 15 Total bytes transferred: : 511566477 Total duration (ms): : 47451 Average transfer rate (MB/s): : 10 Peak transfer rate (MB/s): : 20 © Pythian, All Rights Reserved. 22
  23. 23. Conclusion • C* is moving from Thrift to CQL –Do it ! • C* schema may not tell you all • Effective C* data migration is not easy •C* does have “load” utility (framework) • C* storage engine changes a lot in 3.0 • Pythian Blog Space: •https://www.pythian.com/blog/ •An Effective Approach to Migrate Dynamic Thrift Data to CQL (Part 1, Part 2, Part 3) © DataStax, All Rights Reserved. 23
  24. 24. © DataStax, All Rights Reserved. 24

×

Related Articles

cassandra
langchain
llamaindex

GitHub - michelderu/chat-with-your-data-in-cassandra: Chat with your data stored in DataStax Enterprise, Astra DB and Apache Cassandra - In Natural Language!

John Doe

3/26/2024

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt! 
We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Sign up below to receive email updates and see what's going on with our company

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

cassandra