If you're trying to load data into Astra from a CSV file or from an existing Cassandra table, then you've come to the right place. This example shows how to quickly load data into Astra using the DataStax Bulk Loader (DSBulk for short).
Contributor(s): Dave Bechberger based on the work of Brian Hess
Objectives
- Show how to load data into Astra from a CSV file on the filesystem or from an existing table in Cassandra
Project Layout
- data.csv - The CSV data to load
- schema.cql - The CQL schema used for this example
How this Works
Loading data into Astra using DSBulk is much like loading data into any other Cassandra database, except that you must also specify the secure connect bundle along with the username and password for your Astra database.
- The secure connect bundle is specified using the -b <INSERT PATH> parameter on the command line.
- The username is specified using the -u <INSERT USERNAME> parameter on the command line.
- The password is specified using the -p <INSERT PASSWORD> parameter on the command line.
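Putting those three options together, a DSBulk load command against Astra generally has the following shape. This is a sketch with placeholder values, not a command to run verbatim:

```
# Sketch: the connection options an Astra-bound DSBulk command needs.
./dsbulk load \
  -url /path/to/data.csv \
  -b /path/to/secure-connect-bundle.zip \
  -u <USERNAME> -p <PASSWORD> \
  -k <KEYSPACE NAME> -t <TABLE NAME>
```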
This example only scratches the surface of DSBulk's functionality. DSBulk can perform the same complex loading operations against Astra as it can against DDAC and DSE clusters. Check out the docs below for details on the other things it can do:
- DataStax Bulk Loader Documentation
- DataStax Bulk Loader: Introduction and Loading
- DataStax Bulk Loader: More loading
- DataStax Bulk Loader: Common Settings
- DataStax Bulk Loader: Counting
Setup and Running
Prerequisites
- DSBulk v1.4.0 or greater
- An Astra instance with the schema (from schema.cql) loaded, along with its credential information. Note: if you need instructions on how to obtain the secure connect bundle for your Astra instance, please refer to the Astra documentation.
- A Cassandra cluster (only needed if you want to load from an existing Cassandra table)
Running
To migrate data into Astra using DSBulk, you first need to ensure that the target Astra keyspace has the schema for the video_ratings_by_user table created. This is done using the DataStax Developer Studio instance that is embedded in your Astra instance; for more information on how to use the embedded Studio, please refer to the Astra documentation.
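For reference, the definition in schema.cql will look something like the sketch below. The column names and types here are illustrative assumptions; the authoritative definition is the one in schema.cql.

```
-- Illustrative sketch only; use the actual definition from schema.cql.
CREATE TABLE IF NOT EXISTS video_ratings_by_user (
    videoid uuid,
    userid  uuid,
    rating  int,
    PRIMARY KEY (videoid, userid)
);
```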
Loading from CSV
Here is an example command that will load the data.csv file into the video_ratings_by_user
table in your Astra instance.
Note: this loads the data from the file stored in the GitHub repo, so the machine running this command will need access to the internet.
```
./dsbulk load \
  -url https://raw.githubusercontent.com/DataStax-Examples/dsbulk-to-astra/master/data.csv \
  -b /path/to/bundle.zip \
  -k <KEYSPACE NAME> -t video_ratings_by_user \
  -u <USERNAME> -p <PASSWORD>
```
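By default, DSBulk's CSV connector treats the first line of the file as a header row and maps fields to columns by name. If you are loading a file without a header, you can disable header parsing and supply an explicit index-to-column mapping instead. The file path and column names below are illustrative assumptions, not necessarily those used by this example:

```
# Hypothetical variant for a headerless CSV: map fields to columns by position.
./dsbulk load \
  -url /path/to/no-header.csv \
  -header false \
  -m '0=videoid,1=userid,2=rating' \
  -b /path/to/bundle.zip \
  -k <KEYSPACE NAME> -t video_ratings_by_user \
  -u <USERNAME> -p <PASSWORD>
```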
Loading from an existing Cassandra table
There are two options for loading data from an existing table in a Cassandra keyspace into Astra.
Option 1 - Unload and Load in Separate Steps
The first option is to unload the data from the Cassandra cluster into local CSV files and then load those files into Astra. The commands to accomplish this look like this:
```
# Unload writes one or more CSV files into the target directory.
./dsbulk unload -h <CASSANDRA CLUSTER IP> -k <KEYSPACE NAME> -t video_ratings_by_user \
  -url /path/to/unload_dir

# Load reads every file in that directory into Astra.
./dsbulk load -url /path/to/unload_dir -b /path/to/bundle.zip \
  -k <KEYSPACE NAME> -t video_ratings_by_user -u <USERNAME> -p <PASSWORD>
```
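Before loading, you can sanity-check the migration by counting the rows in the source table; compare this number with a count against Astra after the load completes (see Validating the Results below):

```
# Count rows in the source Cassandra table; compare with the
# count against Astra once the load has finished.
./dsbulk count -h <CASSANDRA CLUSTER IP> -k <KEYSPACE NAME> -t video_ratings_by_user
```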
Option 2 - Unload and Load by Chaining Steps
The second option is to unload the data from the Cassandra cluster and pipe it directly into a command that loads the data into Astra. This has the advantage of running as a single command, but because it streams over stdin/stdout it runs single threaded. The command to accomplish this looks like this:
```
# With no -url specified, unload writes to stdout and load reads from
# stdin (the default URL is "-"), so the data flows through the pipe.
./dsbulk unload -h <CASSANDRA CLUSTER IP> -k <KEYSPACE NAME> -t video_ratings_by_user | \
  ./dsbulk load -b /path/to/bundle.zip -k <KEYSPACE NAME> -t video_ratings_by_user \
  -u <USERNAME> -p <PASSWORD>
```
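If you would like to rehearse a load without writing any rows to Astra, DSBulk also has a dry-run mode that runs the command without loading data. The sketch below reuses the placeholder paths and credentials from the commands above:

```
# Dry run: exercises the load end to end but does not insert any rows.
./dsbulk load -url /path/to/unload_dir -b /path/to/bundle.zip \
  -k <KEYSPACE NAME> -t video_ratings_by_user -u <USERNAME> -p <PASSWORD> \
  -dryRun true
```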
Validating the Results
After running any of these commands you should see a result printed to the screen similar to the following:
```
total | failed | rows/s | p50ms | p99ms | p999ms | batches
  101 |      0 |     94 | 63.92 | 70.25 |  70.25 |   10.10
Operation LOAD_20191113-185907-331567 completed successfully in 0 seconds.
Last processed positions can be found in positions.txt
```
If you would like to check that all of your data loaded correctly, you can use the count functionality of DSBulk to verify that the data has been loaded, using the command below:
```
./dsbulk count -b /path/to/bundle.zip -k <KEYSPACE NAME> -t video_ratings_by_user \
  -u <USERNAME> -p <PASSWORD>
```
If you were following along with this example, you should get a count of 101 rows.