Druid relies on a distributed filesystem or binary object store for deep storage. The most commonly used implementations are S3 (popular on AWS) and HDFS (popular if you already have a Hadoop deployment). In this post, I will show you how to configure Apache Cassandra as deep storage for a Druid cluster.

Druid can use Cassandra as a deep storage mechanism. Segments and their metadata are stored in Cassandra in two tables: index_storage and descriptor_storage. Under the hood, the integration leverages Astyanax. The index storage table is a Chunked Object repository: it holds the compressed segments that are distributed to historical nodes. Since segments can be large, chunked storage lets the integration multi-thread writes to Cassandra and spreads the data across all the nodes in the cluster. The descriptor storage table is a normal C* table that stores the segment metadata.

I’m assuming you already have Cassandra installed. If not, follow this post to install Apache Cassandra.

Schema

Open a terminal, go to the Cassandra installation directory, and run:

./bin/cqlsh

This opens the Cassandra command-line interface. Now create a new keyspace named druid:

CREATE KEYSPACE IF NOT EXISTS druid WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
USE druid;

Now create the schema. Below are the CREATE TABLE statements for both tables:

CREATE TABLE index_storage(key text,
                           chunk text,
                           value blob,
                           PRIMARY KEY (key, chunk)) WITH COMPACT STORAGE;
CREATE TABLE descriptor_storage(key varchar,
                                lastModified timestamp,
                                descriptor varchar,
                                PRIMARY KEY (key)) WITH COMPACT STORAGE;

Extension

druid-cassandra-storage is a community extension and does not ship with the distribution by default. You need to either download the extension jar files from Maven or build it from source.

Here, we’re going to use the pull-deps tool to download the extension from Maven. From a terminal, go to the dist/druid/ directory and run:

java \
  -cp "lib/*" \
  -Ddruid.extensions.directory="extensions" \
  -Ddruid.extensions.hadoopDependenciesDir="hadoop-dependencies" \
  io.druid.cli.Main tools pull-deps \
  --no-default-hadoop \
  -c "io.druid.extensions.contrib:druid-cassandra-storage:0.12.0"

This downloads the druid-cassandra-storage extension from Maven into the extensions directory.
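To confirm the download succeeded, you can list the extension directory (the path below assumes you ran pull-deps from dist/druid/ as above):

ls extensions/druid-cassandra-storage/

You should see the druid-cassandra-storage jar along with its transitive dependencies, such as the Astyanax client jars.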

Now, in conf/druid/_common/common.runtime.properties, add “druid-cassandra-storage” to druid.extensions.loadList. If, for example, the list already contains “druid-parser-route”, the final property should look like:

druid.extensions.loadList=["druid-parser-route", "druid-cassandra-storage"]

Comment out the local storage configuration under the “Deep storage” section and add the appropriate values for Cassandra. Afterwards, the “Deep storage” section should look like this:

#
# Deep storage
#
# For local disk (only viable in a cluster if this is a network mount):
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
# For HDFS:
#druid.storage.type=hdfs
#druid.storage.storageDirectory=/druid/segments
# For S3:
#druid.storage.type=s3
#druid.storage.bucket=your-bucket
#druid.storage.baseKey=druid/segments
#druid.s3.accessKey=...
#druid.s3.secretKey=...
# For Cassandra
druid.storage.type=c*
druid.storage.host=localhost:9160
druid.storage.keyspace=druid

You’re done. Restart the Druid services for the changes to take effect. To test that it is working, load the sample data into Druid and check for segment data in Cassandra using cqlsh.
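For a quick spot-check from cqlsh, you can query the two tables directly (table and column names are the ones created in the schema above; expect one descriptor_storage row per stored segment):

USE druid;
SELECT key, lastModified FROM descriptor_storage;
SELECT key, chunk FROM index_storage LIMIT 10;

If segments were pushed successfully, descriptor_storage returns the segment identifiers with their timestamps, and index_storage shows the chunked segment payloads.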
