Bulk Loading Data into Cassandra

  1. Planet Cassandra 2014: Bulk-Loading Data into Cassandra. Patricia Gorla (@patriciagorla), Cassandra Consultant, www.thelastpickle.com
  2. About Us • Work with clients to deliver and improve Apache Cassandra services • Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer • Based in New Zealand & USA
  3-5. Why is bulk loading useful? • Performance tests • Migrating historical data • Changing topologies
  6. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  7-9. Cassandra Write Path • Writes are written to both the commit log and the memtable. • The memtable is sorted. [Diagram: write[0], commitlog, memtable]
  10. Cassandra Write Path • The memtable is flushed out to sstables. [Diagram: write[0], commitlog, memtable, sstable[0], sstable[1], sstable[2]]
  11. Cassandra Write Path • Compaction helps keep read latency low. [Diagram: write[0], commitlog, memtable, sstable[0], sstable[1], sstable[2], sstable[n]]
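The write path on the slides above (commit log, sorted memtable, flush to sstables, compaction) can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions, not how Cassandra is implemented: the class name `WritePathSketch`, the `flushThreshold` trigger, and the use of `TreeMap` for both the memtable and the sstables are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path: commit log append, sorted memtable, flush, compaction. */
final class WritePathSketch {
    final List<String> commitLog = new ArrayList<>();                  // append-only, arrival order
    final TreeMap<String, String> memtable = new TreeMap<>();          // kept sorted by key
    final List<TreeMap<String, String>> sstables = new ArrayList<>();  // oldest first
    private final int flushThreshold;                                  // hypothetical flush trigger

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // recorded for durability
        memtable.put(key, value);         // and applied to the sorted in-memory table
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(new TreeMap<>(memtable)); // the sorted snapshot becomes a new "sstable"
        memtable.clear();
        commitLog.clear();                     // simplification: flushed writes need no replay
    }

    /** Reads check the memtable first, then sstables from newest to oldest. */
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** Compaction merges every sstable into one, so a read touches fewer tables. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> table : sstables) {
            merged.putAll(table); // iterating oldest first lets newer values overwrite older ones
        }
        sstables.clear();
        if (!merged.isEmpty()) {
            sstables.add(merged);
        }
    }

    public static void main(String[] args) {
        WritePathSketch node = new WritePathSketch(2);
        node.write("b", "1");
        node.write("a", "2"); // reaches the threshold and flushes
        node.write("a", "3");
        node.flush();
        System.out.println("sstables before compaction: " + node.sstables.size());
        node.compact();
        System.out.println("sstables after compaction: " + node.sstables.size());
        System.out.println("a = " + node.read("a"));
    }
}
```

Bulk loading sidesteps the first half of this path: the sstable writer shown later in the deck produces the sorted files directly, and the loader streams them to the cluster.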
  12-17. Sorted String Tables

      mykeyspace-mycf-jb-1-CompressionInfo.db
      mykeyspace-mycf-jb-1-Data.db              Contains all data needed to regenerate components
      mykeyspace-mycf-jb-1-Filter.db            Bloom filter over sstable
      mykeyspace-mycf-jb-1-Index.db             Index of row keys
      mykeyspace-mycf-jb-1-Statistics.db
      mykeyspace-mycf-jb-1-Summary.db           Index summary from Index.db file
      mykeyspace-mycf-jb-1-TOC.txt              Table of contents of all components
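The component file names above follow a `keyspace-columnfamily-version-generation-Component` pattern. A minimal sketch of splitting such a name, assuming that five-part layout and that keyspace and column family names contain no hyphens; the class `SSTableFileName` and its `role()` helper are hypothetical names for this example, and the role descriptions paraphrase the slide annotations.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Splits a component file name such as mykeyspace-mycf-jb-1-Data.db into its parts. */
final class SSTableFileName {
    private static final Map<String, String> ROLES = new LinkedHashMap<>();
    static {
        ROLES.put("Data.db", "row data; the other components can be regenerated from it");
        ROLES.put("Index.db", "index of row keys");
        ROLES.put("Summary.db", "index summary built from Index.db");
        ROLES.put("Filter.db", "bloom filter over the sstable");
        ROLES.put("TOC.txt", "table of contents of all components");
    }

    final String keyspace;
    final String columnFamily;
    final String version;    // on-disk format version, "jb" in the listing
    final int generation;    // sstable number within the column family
    final String component;  // for example "Data.db"

    private SSTableFileName(String keyspace, String columnFamily, String version,
                            int generation, String component) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
        this.version = version;
        this.generation = generation;
        this.component = component;
    }

    static SSTableFileName parse(String fileName) {
        String[] parts = fileName.split("-", 5);
        if (parts.length != 5) {
            throw new IllegalArgumentException("unexpected sstable file name: " + fileName);
        }
        return new SSTableFileName(parts[0], parts[1], parts[2],
                Integer.parseInt(parts[3]), parts[4]);
    }

    /** Role of the component as annotated on the slides, if it was annotated. */
    String role() {
        return ROLES.getOrDefault(component, "not annotated on the slides");
    }

    public static void main(String[] args) {
        SSTableFileName name = parse("mykeyspace-mycf-jb-1-Summary.db");
        System.out.println(name.keyspace + "/" + name.columnFamily
                + " generation " + name.generation + ": " + name.role());
    }
}
```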
  18. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  19. Set up keyspace and column family

      create keyspace test
        with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
        and strategy_options = {replication_factor:1};

      create column family test
        with comparator = 'AsciiType'
        and default_validation_class = 'AsciiType'
        and key_validation_class = 'AsciiType';
  20-22. SStableGen.java

      AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
          directory,
          partitioner,
          keyspace,
          columnFamily,
          AsciiType.instance,
          null, // subcomparator for super columns
          size_per_sstable_mb
      );
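  23.
      ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
      KeyGenerator keyGen = new KeyGenerator();
      long dataSize = 0;
      writer = new SSTableSimpleUnsortedWriter(…);
      while (dataSize < max_data_bytes) {
          // key: the next row key from keyGen (the generating call is not shown on the slide)
          writer.newRow(key);
          for (int j = 0; j < num_cols; j++) {
              ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
              // copy the next 20 bytes of the random pool into the column value
              ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
              randomBytes.get(colValue.array());
              colValue.position(0);
              // count the bytes written so the loop terminates (this update is missing from the slide)
              dataSize += colName.remaining() + colValue.remaining();
              writer.addColumn(colName, colValue, timestamp);
              // rewind the pool when fewer than 20 bytes remain, otherwise skip ahead
              if (randomBytes.remaining() < colValue.limit()) {
                  randomBytes.position(0);
              } else {
                  randomBytes.position(randomBytes.position() + colValue.limit());
              }
          }
      }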
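The generator loop above reuses a small pool of random ASCII bytes instead of generating fresh random data for every column, and stops once a byte budget is reached. The same idea can be sketched with only the JDK, leaving out the Cassandra writer entirely. The names `RandomValueSource` and `rowsForBudget`, the fixed seed, and the way bytes are counted are assumptions made for this example, not the author's code.

```java
import java.nio.ByteBuffer;
import java.util.Random;

/** Hands out fixed-size slices of a small pre-generated pool of printable ASCII bytes. */
final class RandomValueSource {
    private final ByteBuffer pool;

    RandomValueSource(int poolSize, long seed) {
        byte[] bytes = new byte[poolSize];
        Random rnd = new Random(seed);
        for (int i = 0; i < poolSize; i++) {
            bytes[i] = (byte) (' ' + rnd.nextInt(95)); // printable ASCII, 32..126
        }
        pool = ByteBuffer.wrap(bytes);
    }

    /** Returns the next slice of the pool, rewinding when too few bytes remain. */
    ByteBuffer next(int length) {
        if (length > pool.capacity()) {
            throw new IllegalArgumentException("value is longer than the pool");
        }
        if (pool.remaining() < length) {
            pool.position(0);
        }
        byte[] value = new byte[length];
        pool.get(value); // advances the pool position by length
        return ByteBuffer.wrap(value);
    }

    /** Counts how many rows are generated before the byte budget is used up. */
    static long rowsForBudget(long maxDataBytes, int numCols, int valueLength) {
        if (numCols <= 0) {
            throw new IllegalArgumentException("numCols must be positive");
        }
        RandomValueSource source = new RandomValueSource(1024, 42L);
        long dataSize = 0;
        long rows = 0;
        while (dataSize < maxDataBytes) {
            for (int j = 0; j < numCols; j++) {
                String colName = "col_" + j;
                ByteBuffer colValue = source.next(valueLength);
                // a real loader would call writer.addColumn(...) here
                dataSize += colName.length() + colValue.remaining();
            }
            rows++;
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(rowsForBudget(1000, 4, 20) + " rows generated");
    }
}
```

Unlike the slide's loop, this version reads the pool contiguously rather than skipping every other 20-byte window; either way the values repeat once the pool wraps, which is usually acceptable for load testing but makes the data more compressible than truly random bytes.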
  24. Examining sstable output

      patricia@dev:~/../data$ ls -lh mykeyspace/mycf
      total 64
      -rw-r--r-- 1 patricia staff  43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
      -rw-r--r-- 1 patricia staff  79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db
      -rw-r--r-- 1 patricia staff  16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db
      -rw-r--r-- 1 patricia staff  36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db
      -rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db
      -rw-r--r-- 1 patricia staff  80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db
      -rw-r--r-- 1 patricia staff  79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt
  25-28. $ bin/sstableloader Keyspace1/ColFam1

      patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
      Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
      progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  29-31. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server • Throttle command • Parallelise processes
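The "throttle" and "parallelise" tips above could be scripted roughly as follows. This sketch assumes the installed `sstableloader` accepts `-t` as a throttle in Mbit/s alongside the `-d` host list shown on the slides; check `sstableloader --help` for your version before relying on it. The class `LoaderCommands`, its method names, the throttle value and the second directory `mykeyspace/othercf` are hypothetical. `main` only prints the commands; `runAll` shows how they might be executed in parallel from a machine outside the cluster.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class LoaderCommands {
    /** Builds one sstableloader command line; a throttle of 0 means "no -t flag". */
    static List<String> command(String loaderPath, String hosts, int throttleMbits, String sstableDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add(loaderPath);
        cmd.add("-d");
        cmd.add(hosts);
        if (throttleMbits > 0) {
            cmd.add("-t"); // assumed throttle flag, in Mbit/s
            cmd.add(Integer.toString(throttleMbits));
        }
        cmd.add(sstableDir);
        return cmd;
    }

    /** Runs the commands with at most `parallelism` loader processes at a time. */
    static void runAll(List<List<String>> commands, int parallelism) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        for (List<String> cmd : commands) {
            pool.submit(() -> {
                try {
                    int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
                    System.out.println(cmd + " exited with " + exit);
                } catch (IOException | InterruptedException e) {
                    System.err.println(cmd + " failed: " + e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    public static void main(String[] args) {
        List<List<String>> commands = new ArrayList<>();
        commands.add(command("bin/sstableloader", "cass1,cass2,cass3", 100, "mykeyspace/mycf"));
        commands.add(command("bin/sstableloader", "cass1,cass2,cass3", 100, "mykeyspace/othercf"));
        for (List<String> cmd : commands) {
            System.out.println(String.join(" ", cmd)); // dry run: print instead of executing
        }
    }
}
```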
  32. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  33-35.
      // list of orders by user
      customerOrders = new SSTableSimpleUnsortedWriter(…);
      // orders by order id
      orders = new SSTableSimpleUnsortedWriter(…);

      // assume orders are in date order
      for (Order order : oldOrders) {
          // one row per customer, one (valueless) column per order id
          customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
          customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId),
                                   ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

          // one row per order, keyed by order id (the slide keys this row on order.userId)
          orders.newRow(ByteBufferUtil.bytes(order.orderId));
          orders.addColumn(ByteBufferUtil.bytes("customer_id"),
                           ByteBufferUtil.bytes(order.customerId), timestamp);
          orders.addColumn(ByteBufferUtil.bytes("date"),
                           ByteBufferUtil.bytes(order.date), timestamp);
          orders.addColumn(ByteBufferUtil.bytes("total"),
                           ByteBufferUtil.bytes(order.total), timestamp);
      }

      customerOrders.close();
      orders.close();
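The backfill above writes each historical order into two column families: a per-customer list of order ids, and a per-order row of attributes. The resulting data layout can be sketched with plain sorted maps standing in for the two writers. The class `BackfillSketch`, the `Order` fields as strings, and the sample data are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BackfillSketch {
    static final class Order {
        final String orderId;
        final String customerId;
        final String date;
        final String total;

        Order(String orderId, String customerId, String date, String total) {
            this.orderId = orderId;
            this.customerId = customerId;
            this.date = date;
            this.total = total;
        }
    }

    // Each "column family" is modelled as row key -> (column name -> value), both sorted.
    final Map<String, Map<String, String>> customerOrders = new TreeMap<>();
    final Map<String, Map<String, String>> orders = new TreeMap<>();

    void backfill(List<Order> oldOrders) {
        for (Order order : oldOrders) {
            // customer row: the order id is the column name, the value is empty
            customerOrders.computeIfAbsent(order.customerId, k -> new TreeMap<>())
                          .put(order.orderId, "");
            // order row: keyed by order id, one column per attribute
            Map<String, String> row = orders.computeIfAbsent(order.orderId, k -> new TreeMap<>());
            row.put("customer_id", order.customerId);
            row.put("date", order.date);
            row.put("total", order.total);
        }
    }

    public static void main(String[] args) {
        BackfillSketch sketch = new BackfillSketch();
        sketch.backfill(Arrays.asList(
                new Order("o1", "c1", "2014-01-01", "9.99"),
                new Order("o2", "c2", "2014-01-02", "5.00"),
                new Order("o3", "c1", "2014-01-03", "12.50")));
        System.out.println("orders for c1: " + sketch.customerOrders.get("c1").keySet());
        System.out.println("order o2: " + sketch.orders.get("o2"));
    }
}
```

Writing both layouts in one pass over the historical data is what lets each query (orders for a customer, details of one order) be answered from a single row later.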
  36. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  37-39. $ bin/sstableloader Keyspace1/ColFam1

      patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3

      Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]

      progress: [/cass1 3/3 (100)] [/cass2 0/4 (0)] [/cass3 0/0 (0)] [/cass4 0/0 (0)] [/cass5 0/0 (0)] [/cass6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  40. $ bin/sstableloader Keyspace1/ColFam1

      patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
      pending tasks: 30
      Active compaction remaining time : n/a
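After streaming into the new topology, the `nodetool compactionstats` output above shows compactions still queued. A script that waits for the backlog to drain could start from a small parser like this one. It assumes the output contains a line of the form `pending tasks: N` as on the slide; other versions may format the output differently, and the class and method names are hypothetical.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompactionStats {
    private static final Pattern PENDING = Pattern.compile("pending tasks:\\s*(\\d+)");

    /** Extracts the pending task count, or empty if the expected line is absent. */
    static OptionalInt pendingTasks(String nodetoolOutput) {
        Matcher m = PENDING.matcher(nodetoolOutput);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        String output = "pending tasks: 30\nActive compaction remaining time : n/a\n";
        OptionalInt pending = pendingTasks(output);
        if (pending.isPresent()) {
            System.out.println("compactions still pending: " + pending.getAsInt());
        } else {
            System.out.println("could not find a pending task count");
        }
    }
}
```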
  41. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  42. CQL: Keep schema consistent

      cqlsh> CREATE KEYSPACE "test"
             WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };

      cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY ) ;
  43. CQL3 Considerations • Uses CompositeType comparator
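The CompositeType note matters for bulk loading because column names written for a CQL3 table have to be encoded as composites rather than plain strings. As I understand the format, each component is serialised as a two-byte length, the component bytes, and a one-byte end-of-component marker. The sketch below builds such a name with only the JDK; `CompositeNameSketch` is a hypothetical helper, not a Cassandra class, and real code would normally let Cassandra's own CompositeType build the name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CompositeNameSketch {
    /** Encodes each component as [2-byte length][bytes][1-byte end-of-component = 0]. */
    static ByteBuffer compose(String... components) {
        byte[][] encoded = new byte[components.length][];
        int size = 0;
        for (int i = 0; i < components.length; i++) {
            encoded[i] = components[i].getBytes(StandardCharsets.UTF_8);
            if (encoded[i].length > 0xFFFF) {
                throw new IllegalArgumentException("component too long for a 2-byte length");
            }
            size += 2 + encoded[i].length + 1;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (byte[] component : encoded) {
            out.putShort((short) component.length); // length prefix
            out.put(component);                     // component bytes
            out.put((byte) 0);                      // end-of-component marker
        }
        out.flip(); // make the buffer readable from the start
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer name = compose("2014-02-02", "total");
        System.out.println("encoded length: " + name.remaining() + " bytes");
    }
}
```

A plain `ByteBufferUtil.bytes("total")` name, as used in the earlier Thrift-style examples, would not match this layout, which is one reason to keep the schema and the writer's comparator consistent.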
  44. Planet Cassandra 2014: Q&A. Patricia Gorla (@patriciagorla), Cassandra Consultant, www.thelastpickle.com
