Successfully reported this slideshow.
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian) | C* Summit 2016
Upcoming SlideShare
Loading in …5
×
-
Be the first to like this
No Downloads
No notes for slide
- 1. Yabin Meng An Effective Approach to Migrate Thrift to CQL
- 2. About Pythian • Founded in 1997, a global leader in Enterprise IT transformation and operational excellence •www.pythian.com •https://www.linkedin.com/company/pythian • Love your Data •6000+ database under management •Over 400 DBA’s, in 35 countries •Top 5% of DBA workforce – 9 Oracle ACE’s, 2 Microsoft MVP’s, 1 Cassandra MVP •Partners with Amazon, Cloudera, Oracle, Microsoft, DataStax, and a lot more • Specialities: Databases, Cloud, DevOps, Big Data Infrastructure, Advanced Analytics, Infrastructure Management • Industry Recognition •DBTA1 100 2016 - The Companies That Matter Most in Data •IAOP2 - GLOBAL OUTSOURCING 100® RISING STAR FOR 2016 •Top 25 Canadian ICT3 Professional Services Companies •… … DBTA1 : Database Trend and Applications (http://www.dbta.com/) IAOP2 : International Association of Outsourcing Professionals (https://www.iaop.org/) ICT3 : Information & Communication Technology © Pythian, All Rights Reserved. 2
- 3. About Myself • 15+ years IT experience - relational database/data warehousing, business intelligence and analytics (BIA), NoSQL database/Big Data • Started my career as a RDBMS client application developer (PowerBuilder, Java) • 5 years as IBM DB2 Kernel Developer, New Development (DB2 v9.5, 9.7, 9.8, 10) • Multi-years data warehousing and BIA experience in insurance and capital market sectors as technical lead/architect (IBM/Cognos, SAS/SAS 9.x, Information Builders/WebFocus) • DataStax Certified Developer, Administrator, and Partner Architect • https://www.linkedin.com/in/yabinmeng • @yabinmeng © Pythian, All Rights Reserved. 3
- 4. 1 Problem Overview 2 Background Discussion 3 Thrift to CQL Transition 4 C* Storage Engine (pre 3.0) 5 Data Migration 6 Conclusion 4© Pythian, All Rights Reserved.
- 5. Problem Overview • C* cluster upgrade from version 1.2 to 2.1 • Thrift based application (Hector) •Plan to convert to CQL, but not done yet • Data copied from one C* cluster to another •C* COPY command • Application failed in DEV/TEST •Cannot retrieve some required data •cqlsh output shows NO difference • Schema • Data query result © Pythian, All Rights Reserved. 5
- 6. Background Disccsion, COPY commands • COPY commands •Syntax: COPY <table_name> (column1, column2, …) TO/FROM <file_name> WITH options •Since 1.1.3 (PostgresSQL COPY commands) •Simple yet with limitation. Continuous improvement •Small datasets (max. a few million rows) •CSV format •No primary key check •New options and better performance in cqlsh copy – 2.1.13, 2.2.5, 3.0.3, 3.2 – ex (common): NUMPROCESSES, CONFIGFILE – ex (TO): BEGINTOKEN / ENDTOKEN – ex (FROM): INGESTRATE (rows/second) •Predefined schema © Pythian, All Rights Reserved. 6
- 7. Background Discussion, Schema-less? • Schema-less or Schema •Many NoSQL DBs is marked as “schema-less” • A lot of debate •C* is schema optional (Thrift) •C* is schema required (CQL) © Pythian, All Rights Reserved. 7
- 8. Background Discussion, C* Utilities •cassandra-cli utility •Command line interface since very early years of C* •Thrift-based counterpart of “cqlsh” •C* data engine format (pre 3.0) •Deprecated since C* 2.1 and will be removed in 3.0 •Comparator and Validator •Validator: row key, column value •Comparator: column name + column sorting order •sstable2json utility •Deprecated and will be removed in 3.0 SET users['key_val']['column1_name']='some value'; SET users['key_val']['column2_name']='some value'; ... LIST users; GET users['key_val']['column1_name']
- 9. Thrift to CQL Transition • What is Thrift and CQL? •Thrift is a general framework (interface definition language and binary communication protocol) • "scalable cross-language services development" •APIs to interact with C* •CQL is also a query language © Pythian, All Rights Reserved. 9 Thrift Thrift CQL Binary CQL query language CQL query language N/A * C* evolvement Protocol Layer Access Abstraction Layer C* Data Engine Storage Layer
- 10. Thrift to CQL Transition, continued • Thrift in C* •Since the beginning as a communication protocol •Low level API •Deprecated • CQL in C* •0.8, CQL as a language •1.2, CQL as a language, version 3 (CQL3) •1.2, CQL as a communication protocol (CQL3 only) •Continuously evolving • Thrift vs. CQL as protocol •Thrift: RPC based; Synchronous •CQL: Server notification (event registration); Asynchronous • CQL as a language •Feature rich: UDF, lightweight transaction, JSON support, materialized view, … • Thrift vs. CQL as client driver •Thrift driver is deprecated and not supported © Pythian, All Rights Reserved. 10
- 11. Thrift to CQL Transition, continued • What is needed for the transition? •Application •Data Model •Data itself •Understanding C* storage engine * * Pre-3.0 C* © Pythian, All Rights Reserved. 11
- 12. C* Storage Engine (pre 3.0) • C* ColumnFamily (Thrift) / Table (CQL) data structure •Map of sorted map: Map<RowKey, SortedMap<ColumnKey, ColumnValue>> CREATE TABLE song_tags ( id uuid, tag_name text, PRIMARY KEY (id, tag_name) ) WITH COMPACT STORAGE © Pythian, All Rights Reserved. 12
- 13. C* Storage Engine – Thrift vs. CQL Terminology • Table –Thrift: Column Family –CQL: Table • Row –Thrift: Storage Row / Partition –CQL: Logic Representation / Part of a partition • Column –Thrift: Cell (column key + column value) –CQL: Metadata / column key or column value © Pythian, All Rights Reserved. 13
- 14. C* Storage Engine – Compact Storage CREATE TABLE avg_grade2 ( student_id int, class_id int, grade double, PRIMARY KEY (student_id, class_id) ) WITH COMPACT STORAGE © Pythian, All Rights Reserved. 14 • Storage space vs. table metadata – Backward compatibility – Thrift: always compact – CQL: optional • Caveats: – No adding / dropping column – Single non-key column if compound primary key – Non-frozen collection type
- 15. C* Storage Engine – Static Table Definition create column family avg_grade with key_validation_class = Int32Type and comparator = UTF8Type and column_metadata = [ {column_name: class_id, validation_class: Int32Type}, {column_name: grade, validation_class: FloatType} ] create table avg_grade ( key int primary key, class_id int, grade float ) with compact storage •Equivalent definition •Thrift: no name for key -- default name used in CQL: “key” © Pythian, All Rights Reserved. 15 Thrift / cassandra_cli
- 16. C* Storage Engine – Dynamic Table Definition Thrift / cassandra_cli create column family sensor_data with key_validation_class = Int32Type and comparator = TimeUUIDType and default_validation_class = DecimalType; ------------------- RowKey: 1 => (name=1171d7e0-14d2-11e6-858b-5f3b22f4f11c, value=21.5, timestamp=1462680314974000) => (name=23371940-14d2-11e6-858b-5f3b22f4f11c, value=32.1, timestamp=1462680344792000) ------------------- RowKey: 2 => (name=7537fcf0-14d2-11e6-858b-5f3b22f4f11c, value=11.0, timestamp=1462680482367000) • Thrift only • No “column_metadata” section © Pythian, All Rights Reserved. 16 CREATE TABLE sensor_data ( key int, column1 timeuuid, value decimal, PRIMARY KEY (key, column1) ) WITH COMPACT STORAGE •Default “key” for key name •Default “column1” for column name/key •Default “value” for column value •Alter Table
- 17. C* Storage Engine – Mixed Table Definition Thrift / cassandra_cli create column family blogs with key_validation_class = Int32Type and comparator = UTF8Type and column_metadata = [ {column_name: author, validation_class: UTF8Type} ] © Pythian, All Rights Reserved. 17 CREATE TABLE blogs ( key int PRIMARY KEY, author text ) WITH COMPACT STORAGE key | author -----+-------- 1 | Donald 2 | John ------------------- RowKey: 1 => (name=author, value=Donald, timestamp=1462720696717000) => (name=tags:category, value=music, timestamp=1462720526053000) => (name=tags:rating, value=A, timestamp=1462720564590000) => (name=tags:recommend, value=Yes, timestamp=1462720640817000) ------------------- RowKey: 2 => (name=author, value=John, timestamp=1462720690911000)
- 18. Migrate Thrift Data – Overview • Do not fully trust C* schema •“show schema” / “describe keyspace” • Sample your data •“cassandra-cli”, yes •“cqlsh”, no • Review application design and code •Schema-less / schema -optional • Determine CQL schema • Data Migration Method © Pythian, All Rights Reserved. 18
- 19. Migrate Thrift Data – Determine CQL Schema Thrift / cassandra-cli create column family some_tbl_name with key_validation_class = BytesType and comparator = UTF8Type and column_metadata = [ {column_name: StaticColName1, validation_class: xxx1}, {column_name: StaticColName2, validation_class: xxx2} {column_name: StaticColName3, validation_class: xxx3} … … ] StaticColNameSet = {StaticColName1, StaticColName2, StaticColName3, …} DynamicColNameSet = {DynamicColName1, DynmaicColName2, DyanmicColName3, …} CQL / cqlsh schema • Application specific • Could it be a little bit more generic? CREATE TABLE some_tbl_name ( key blob, StaticColName1 yyy1, StaticColName2 yyy2, StaticColName3 yyy3, … …, DynamicColMapName map<text, blob>; PRIMARY KEY (key) ) © Pythian, All Rights Reserved. 19
- 20. Migrate Thrift Data - ETL Method • ETL •COPY command •Commercial product (jasper soft, pentaho, …) •Self developed • Natural Approach •Read every record from the source table and write it into the target table •Slow • C* load utility •sstableloader © Pythian, All Rights Reserved. 20
- 21. Migrate Thrift Data – Methods, continued writer = CQLSSTableWriter.build() thrift_rows = thirft_conn.fetchRows() for each row in thrift_rows() { keyValue = row.getKey() dynColMap = new Map() thirft_columns = row.getColumns() for each column in thirft_columns { colName = column.getName() if colName is in DynamicColNameSet { dynColMap.addKeyValuePair(column.name, column.value) } else { staticColVal1 = column.getValue(StaticColName1) staticColVal2 = column.getValue(StaticColName2) … … … } } } writer.addRow(keyValue, staticColVal1, staticColVal2, …, dynColMap) } writer.close() • Read source data • Thrift API • Writing SSTables Directly • SSTableSimpleUnsortedWriter • CQLSSTableWriter
- 22. Migrate Thrift Data - Result Fetching result from source table and generate SSTable ... ------------------------- > batch_run: 1; batch_size: 20001; batch_runtime: PT5.594S > batch_run: 2; batch_size: 20000; batch_runtime: PT4.368S > batch_run: 3; batch_size: 20000; batch_runtime: PT3.937S > batch_run: 4; batch_size: 20000; batch_runtime: PT3.838S > batch_run: 5; batch_size: 20000; batch_runtime: PT2.015S … … … > batch_run: 59; batch_size: 20000; batch_runtime: PT1.941S > batch_run: 60; batch_size: 12581; batch_runtime: PT2.82S #### Total Records Processed: 1192582; Total Run Time: PT2M35.178S $ sstableloader -d <node_ip> <sstable_root>/<keyspace_name>/<target_cql_table_name>/ … … … Summary statistics: Connections per host: : 1 Total files transferred: : 15 Total bytes transferred: : 511566477 Total duration (ms): : 47451 Average transfer rate (MB/s): : 10 Peak transfer rate (MB/s): : 20 © Pythian, All Rights Reserved. 22
- 23. Conclusion • C* is moving from Thrift to CQL –Do it ! • C* schema may not tell you all • Effective C* data migration is not easy •C* does have “load” utility (framework) • C* storage engine changes a lot in 3.0 • Pythian Blog Space: •https://www.pythian.com/blog/ •An Effective Approach to Migrate Dynamic Thrift Data to CQL (Part 1, Part 2, Part 3) © DataStax, All Rights Reserved. 23
- 24. © DataStax, All Rights Reserved. 24
Public clipboards featuring this slide
No public clipboards found for this slide