An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng,…

Successfully reported this slideshow.

An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian) | C* Summit 2016
Yabin Meng
An Effective Approach to Migrate Thrift to CQL
About Pythian
• Founded in 1997, a global leader in Enterprise IT transformation and operational excellence
•www.pythian.c...
About Myself
• 15+ years IT experience - relational database/data warehousing, business intelligence and
analytics (BIA), ...
1 Problem Overview
2 Background Discussion
3 Thrift to CQL Transition
4 C* Storage Engine (pre 3.0)
5 Data Migration
6 Con...
Problem Overview
• C* cluster upgrade from version 1.2 to 2.1
• Thrift based application (Hector)
•Plan to convert to CQL,...
Background Disccsion, COPY commands
• COPY commands
•Syntax: COPY <table_name> (column1, column2, …) TO/FROM <file_name> W...
Background Discussion, Schema-less?
• Schema-less or Schema
•Many NoSQL DBs is marked as “schema-less”
• A lot of debate
•...
Background Discussion, C* Utilities
•cassandra-cli utility
•Command line interface since very early years of C*
•Thrift-ba...
Thrift to CQL Transition
• What is Thrift and CQL?
•Thrift is a general framework (interface definition language and binar...
Thrift to CQL Transition, continued
• Thrift in C*
•Since the beginning as a communication protocol
•Low level API
•Deprec...
Thrift to CQL Transition, continued
• What is needed for the transition?
•Application
•Data Model
•Data itself
•Understand...
C* Storage Engine (pre 3.0)
• C* ColumnFamily (Thrift) / Table (CQL) data structure
•Map of sorted map:
Map<RowKey, Sorted...
C* Storage Engine – Thrift vs. CQL Terminology
• Table
–Thrift: Column Family
–CQL: Table
• Row
–Thrift: Storage Row / Par...
C* Storage Engine – Compact Storage
CREATE TABLE avg_grade2 (
student_id int,
class_id int,
grade double,
PRIMARY KEY (stu...
C* Storage Engine – Static Table Definition
create column family avg_grade
with key_validation_class = Int32Type
and compa...
C* Storage Engine – Dynamic Table Definition
Thrift / cassandra_cli
create column family sensor_data
with key_validation_c...
C* Storage Engine – Mixed Table Definition
Thrift / cassandra_cli
create column family blogs
with key_validation_class = I...
Migrate Thrift Data – Overview
• Do not fully trust C* schema
•“show schema” / “describe keyspace”
• Sample your data
•“ca...
Migrate Thrift Data – Determine CQL Schema
Thrift / cassandra-cli
create column family some_tbl_name
with key_validation_c...
Migrate Thrift Data - ETL Method
• ETL
•COPY command
•Commercial product (jasper soft, pentaho, …)
•Self developed
• Natur...
Migrate Thrift Data – Methods, continued
writer = CQLSSTableWriter.build()
thrift_rows = thirft_conn.fetchRows()
for each ...
Migrate Thrift Data - Result
Fetching result from source table and generate SSTable ...
-------------------------
> batch_...
Conclusion
• C* is moving from Thrift to CQL
–Do it !
• C* schema may not tell you all
• Effective C* data migration is no...
© DataStax, All Rights Reserved. 24

Upcoming SlideShare

Loading in …5

×

  • Be the first to like this

  1. 1. Yabin Meng An Effective Approach to Migrate Thrift to CQL
  2. 2. About Pythian • Founded in 1997, a global leader in Enterprise IT transformation and operational excellence •www.pythian.com •https://www.linkedin.com/company/pythian • Love your Data •6000+ database under management •Over 400 DBA’s, in 35 countries •Top 5% of DBA workforce – 9 Oracle ACE’s, 2 Microsoft MVP’s, 1 Cassandra MVP •Partners with Amazon, Cloudera, Oracle, Microsoft, DataStax, and a lot more • Specialities: Databases, Cloud, DevOps, Big Data Infrastructure, Advanced Analytics, Infrastructure Management • Industry Recognition •DBTA1 100 2016 - The Companies That Matter Most in Data •IAOP2 - GLOBAL OUTSOURCING 100® RISING STAR FOR 2016 •Top 25 Canadian ICT3 Professional Services Companies •… … DBTA1 : Database Trend and Applications (http://www.dbta.com/) IAOP2 : International Association of Outsourcing Professionals (https://www.iaop.org/) ICT3 : Information & Communication Technology © Pythian, All Rights Reserved. 2
  3. 3. About Myself • 15+ years IT experience - relational database/data warehousing, business intelligence and analytics (BIA), NoSQL database/Big Data • Started my career as a RDBMS client application developer (PowerBuilder, Java) • 5 years as IBM DB2 Kernel Developer, New Development (DB2 v9.5, 9.7, 9.8, 10) • Multi-years data warehousing and BIA experience in insurance and capital market sectors as technical lead/architect (IBM/Cognos, SAS/SAS 9.x, Information Builders/WebFocus) • DataStax Certified Developer, Administrator, and Partner Architect • https://www.linkedin.com/in/yabinmeng • @yabinmeng © Pythian, All Rights Reserved. 3
  4. 4. 1 Problem Overview 2 Background Discussion 3 Thrift to CQL Transition 4 C* Storage Engine (pre 3.0) 5 Data Migration 6 Conclusion 4© Pythian, All Rights Reserved.
  5. 5. Problem Overview • C* cluster upgrade from version 1.2 to 2.1 • Thrift based application (Hector) •Plan to convert to CQL, but not done yet • Data copied from one C* cluster to another •C* COPY command • Application failed in DEV/TEST •Cannot retrieve some required data •cqlsh output shows NO difference • Schema • Data query result © Pythian, All Rights Reserved. 5
  6. 6. Background Disccsion, COPY commands • COPY commands •Syntax: COPY <table_name> (column1, column2, …) TO/FROM <file_name> WITH options •Since 1.1.3 (PostgresSQL COPY commands) •Simple yet with limitation. Continuous improvement •Small datasets (max. a few million rows) •CSV format •No primary key check •New options and better performance in cqlsh copy – 2.1.13, 2.2.5, 3.0.3, 3.2 – ex (common): NUMPROCESSES, CONFIGFILE – ex (TO): BEGINTOKEN / ENDTOKEN – ex (FROM): INGESTRATE (rows/second) •Predefined schema © Pythian, All Rights Reserved. 6
  7. 7. Background Discussion, Schema-less? • Schema-less or Schema •Many NoSQL DBs is marked as “schema-less” • A lot of debate •C* is schema optional (Thrift) •C* is schema required (CQL) © Pythian, All Rights Reserved. 7
  8. 8. Background Discussion, C* Utilities •cassandra-cli utility •Command line interface since very early years of C* •Thrift-based counterpart of “cqlsh” •C* data engine format (pre 3.0) •Deprecated since C* 2.1 and will be removed in 3.0 •Comparator and Validator •Validator: row key, column value •Comparator: column name + column sorting order •sstable2json utility •Deprecated and will be removed in 3.0 SET users['key_val']['column1_name']='some value'; SET users['key_val']['column2_name']='some value'; ... LIST users; GET users['key_val']['column1_name']
  9. 9. Thrift to CQL Transition • What is Thrift and CQL? •Thrift is a general framework (interface definition language and binary communication protocol) • "scalable cross-language services development" •APIs to interact with C* •CQL is also a query language © Pythian, All Rights Reserved. 9 Thrift Thrift CQL Binary CQL query language CQL query language N/A * C* evolvement Protocol Layer Access Abstraction Layer C* Data Engine Storage Layer
  10. 10. Thrift to CQL Transition, continued • Thrift in C* •Since the beginning as a communication protocol •Low level API •Deprecated • CQL in C* •0.8, CQL as a language •1.2, CQL as a language, version 3 (CQL3) •1.2, CQL as a communication protocol (CQL3 only) •Continuously evolving • Thrift vs. CQL as protocol •Thrift: RPC based; Synchronous •CQL: Server notification (event registration); Asynchronous • CQL as a language •Feature rich: UDF, lightweight transaction, JSON support, materialized view, … • Thrift vs. CQL as client driver •Thrift driver is deprecated and not supported © Pythian, All Rights Reserved. 10
  11. 11. Thrift to CQL Transition, continued • What is needed for the transition? •Application •Data Model •Data itself •Understanding C* storage engine * * Pre-3.0 C* © Pythian, All Rights Reserved. 11
  12. 12. C* Storage Engine (pre 3.0) • C* ColumnFamily (Thrift) / Table (CQL) data structure •Map of sorted map: Map<RowKey, SortedMap<ColumnKey, ColumnValue>> CREATE TABLE song_tags ( id uuid, tag_name text, PRIMARY KEY (id, tag_name) ) WITH COMPACT STORAGE © Pythian, All Rights Reserved. 12
  13. 13. C* Storage Engine – Thrift vs. CQL Terminology • Table –Thrift: Column Family –CQL: Table • Row –Thrift: Storage Row / Partition –CQL: Logic Representation / Part of a partition • Column –Thrift: Cell (column key + column value) –CQL: Metadata / column key or column value © Pythian, All Rights Reserved. 13
  14. 14. C* Storage Engine – Compact Storage CREATE TABLE avg_grade2 ( student_id int, class_id int, grade double, PRIMARY KEY (student_id, class_id) ) WITH COMPACT STORAGE © Pythian, All Rights Reserved. 14 • Storage space vs. table metadata – Backward compatibility – Thrift: always compact – CQL: optional • Caveats: – No adding / dropping column – Single non-key column if compound primary key – Non-frozen collection type
  15. 15. C* Storage Engine – Static Table Definition create column family avg_grade with key_validation_class = Int32Type and comparator = UTF8Type and column_metadata = [ {column_name: class_id, validation_class: Int32Type}, {column_name: grade, validation_class: FloatType} ] create table avg_grade ( key int primary key, class_id int, grade float ) with compact storage •Equivalent definition •Thrift: no name for key -- default name used in CQL: “key” © Pythian, All Rights Reserved. 15 Thrift / cassandra_cli
  16. 16. C* Storage Engine – Dynamic Table Definition Thrift / cassandra_cli create column family sensor_data with key_validation_class = Int32Type and comparator = TimeUUIDType and default_validation_class = DecimalType; ------------------- RowKey: 1 => (name=1171d7e0-14d2-11e6-858b-5f3b22f4f11c, value=21.5, timestamp=1462680314974000) => (name=23371940-14d2-11e6-858b-5f3b22f4f11c, value=32.1, timestamp=1462680344792000) ------------------- RowKey: 2 => (name=7537fcf0-14d2-11e6-858b-5f3b22f4f11c, value=11.0, timestamp=1462680482367000) • Thrift only • No “column_metadata” section © Pythian, All Rights Reserved. 16 CREATE TABLE sensor_data ( key int, column1 timeuuid, value decimal, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE •Default “key” for key name •Default “column1” for column name/key •Default “value” for column value •Alter Table
  17. 17. C* Storage Engine – Mixed Table Definition Thrift / cassandra_cli create column family blogs with key_validation_class = Int32Type and comparator = UTF8Type and column_metadata = [ {column_name: author, validation_class: UTF8Type} ] © Pythian, All Rights Reserved. 17 CREATE TABLE blogs ( key int PRIMARY KEY, author text ) WITH COMPACT STORAGE key | author -----+-------- 1 | Donald 2 | John ------------------- RowKey: 1 => (name=author, value=Donald, timestamp=1462720696717000) => (name=tags:category, value=music, timestamp=1462720526053000) => (name=tags:rating, value=A, timestamp=1462720564590000) => (name=tags:recommend, value=Yes, timestamp=1462720640817000) ------------------- RowKey: 2 => (name=author, value=John, timestamp=1462720690911000)
  18. 18. Migrate Thrift Data – Overview • Do not fully trust C* schema •“show schema” / “describe keyspace” • Sample your data •“cassandra-cli”, yes •“cqlsh”, no • Review application design and code •Schema-less / schema -optional • Determine CQL schema • Data Migration Method © Pythian, All Rights Reserved. 18
  19. 19. Migrate Thrift Data – Determine CQL Schema Thrift / cassandra-cli create column family some_tbl_name with key_validation_class = BytesType and comparator = UTF8Type and column_metadata = [ {column_name: StaticColName1, validation_class: xxx1}, {column_name: StaticColName2, validation_class: xxx2} {column_name: StaticColName3, validation_class: xxx3} … … ] StaticColNameSet = {StaticColName1, StaticColName2, StaticColName3, …} DynamicColNameSet = {DynamicColName1, DynmaicColName2, DyanmicColName3, …} CQL / cqlsh schema • Application specific • Could it be a little bit more generic? CREATE TABLE some_tbl_name ( key blob, StaticColName1 yyy1, StaticColName2 yyy2, StaticColName3 yyy3, … …, DynamicColMapName map<text, blob>; PRIMARY KEY (key) ) © Pythian, All Rights Reserved. 19
  20. 20. Migrate Thrift Data - ETL Method • ETL •COPY command •Commercial product (jasper soft, pentaho, …) •Self developed • Natural Approach •Read every record from the source table and write it into the target table •Slow • C* load utility •sstableloader © Pythian, All Rights Reserved. 20
  21. 21. Migrate Thrift Data – Methods, continued writer = CQLSSTableWriter.build() thrift_rows = thirft_conn.fetchRows() for each row in thrift_rows() { keyValue = row.getKey() dynColMap = new Map() thirft_columns = row.getColumns() for each column in thirft_columns { colName = column.getName() if colName is in DynamicColNameSet { dynColMap.addKeyValuePair(column.name, column.value) } else { staticColVal1 = column.getValue(StaticColName1) staticColVal2 = column.getValue(StaticColName2) … … … } } } writer.addRow(keyValue, staticColVal1, staticColVal2, …, dynColMap) } writer.close() • Read source data • Thrift API • Writing SSTables Directly • SSTableSimpleUnsortedWriter • CQLSSTableWriter
  22. 22. Migrate Thrift Data - Result Fetching result from source table and generate SSTable ... ------------------------- > batch_run: 1; batch_size: 20001; batch_runtime: PT5.594S > batch_run: 2; batch_size: 20000; batch_runtime: PT4.368S > batch_run: 3; batch_size: 20000; batch_runtime: PT3.937S > batch_run: 4; batch_size: 20000; batch_runtime: PT3.838S > batch_run: 5; batch_size: 20000; batch_runtime: PT2.015S … … … > batch_run: 59; batch_size: 20000; batch_runtime: PT1.941S > batch_run: 60; batch_size: 12581; batch_runtime: PT2.82S #### Total Records Processed: 1192582; Total Run Time: PT2M35.178S $ sstableloader -d <node_ip> <sstable_root>/<keyspace_name>/<target_cql_table_name>/ … … … Summary statistics: Connections per host: : 1 Total files transferred: : 15 Total bytes transferred: : 511566477 Total duration (ms): : 47451 Average transfer rate (MB/s): : 10 Peak transfer rate (MB/s): : 20 © Pythian, All Rights Reserved. 22
  23. 23. Conclusion • C* is moving from Thrift to CQL –Do it ! • C* schema may not tell you all • Effective C* data migration is not easy •C* does have “load” utility (framework) • C* storage engine changes a lot in 3.0 • Pythian Blog Space: •https://www.pythian.com/blog/ •An Effective Approach to Migrate Dynamic Thrift Data to CQL (Part 1, Part 2, Part 3) © DataStax, All Rights Reserved. 23
  24. 24. © DataStax, All Rights Reserved. 24

×