7/22/2021

Apache Cassandra Lunch #57: Using Secondary Indexes in Cassandra - Business Platform Team

by Josh Barnes

In Cassandra Lunch #57: Using Secondary Indexes in Cassandra, guest speaker Anil Mittana presented on using secondary indexes in Cassandra. This blog post gives an overview of the presentation. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST.

Secondary Indexes in Cassandra

The sections below summarize the main points of Anil Mittana’s presentation on secondary indexes in Cassandra.

Query First Approach

In Cassandra, tables are designed around the queries they will need to serve. For example, the following CQL command would create a table with the intention of querying for a rating by movie title. We won’t dive into the details here, as this blog post is about secondary indexes.

Image of a CQL statement creating a table with the intention of searching for ratings of a movie by movie title.

Secondary Indexes

Secondary indexes are meant to help facilitate queries on columns in a table that are not part of the primary key. However, they can cause performance issues, especially if a query needs to access multiple nodes, and should be applied carefully. They are best suited to columns with low cardinality, on tables that do not contain a counter column, are infrequently updated, and do not have large partitions.

How Cassandra Stores an Index

When an index is created, Cassandra builds a hidden table for it in a background process. For a secondary index query to perform well, it should include both the partition key and the indexed column. With both in the query, Cassandra only needs to ask the replicas that own that partition instead of every node in the cluster.
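As a rough illustration of the idea, the hidden index table on a single node can be pictured as a map from each indexed value to the primary keys of the rows that hold it. The sketch below is a toy model in Python, not Cassandra's storage format; the table contents and the names `build_hidden_index` and `query` are made up for the example.

```python
from collections import defaultdict

# Hypothetical rows of a base table on ONE node, keyed by primary key
# (partition key, clustering key). All names and values are illustrative.
base_table = {
    ("user1", 1): {"country": "US", "name": "Ann"},
    ("user1", 2): {"country": "FR", "name": "Bob"},
    ("user2", 1): {"country": "US", "name": "Cho"},
}


def build_hidden_index(rows, column):
    """Map each value of `column` to the primary keys of the rows holding it."""
    index = defaultdict(list)
    for primary_key, row in rows.items():
        index[row[column]].append(primary_key)
    return dict(index)


def query(rows, index, partition_key, value):
    """Return rows that match the indexed value inside a single partition."""
    return [
        rows[pk]
        for pk in index.get(value, [])
        if pk[0] == partition_key  # pk[0] is the partition key
    ]


country_index = build_hidden_index(base_table, "country")
matches = query(base_table, country_index, "user1", "US")
```

Here `matches` holds only the row for Ann: the index narrows the search to rows with country "US", and the partition key narrows it further to the "user1" partition.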

Secondary Index table comparison to a regular Cassandra table.

Distributed Index vs. Local Index

Tables and materialized views are examples of distributed indexing. A table or view data structure is distributed across all nodes in a cluster based on a partition key. When retrieving data using a partition key, Cassandra knows exactly which replica nodes may contain the result. For example, given a 100-node cluster with a replication factor of 5, at most 5 replica nodes and 1 coordinator node need to participate in a query.
In contrast, secondary indexes are examples of local indexing. A secondary index is represented by many independent data structures that index data stored on each node. When retrieving data using only an indexed column, Cassandra has no way to determine which nodes may have necessary data and has to query all nodes in a cluster. For example, given a 100-node cluster with any replication factor, all 100 nodes have to search their local index data structures. This does not scale well.
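
The difference in fan-out can be sketched with a small Python model of the 100-node example. The placement scheme here (CRC32 of the key modulo the node count, with replicas on the following nodes) is an assumption made to keep the example short; it is not how Cassandra's partitioner or replication strategies actually place data, and the function names are hypothetical.

```python
import zlib

NODE_COUNT = 100  # cluster size from the example in the text


def owning_node(partition_key, node_count=NODE_COUNT):
    """Toy placement: CRC32 of the key modulo the node count (an assumption)."""
    return zlib.crc32(partition_key.encode("utf-8")) % node_count


def replicas_for(partition_key, replication_factor, node_count=NODE_COUNT):
    """The owning node plus the next nodes on a simple ring."""
    first = owning_node(partition_key, node_count)
    return [(first + i) % node_count for i in range(replication_factor)]


def nodes_to_search(node_count, replication_factor, partition_key=None):
    """Nodes that may hold matching data for a query."""
    if partition_key is not None:
        # Distributed index: the partition key pins down the replicas.
        return replicas_for(partition_key, replication_factor, node_count)
    # Local index only: every node has to check its own index structures.
    return list(range(node_count))


with_key = nodes_to_search(NODE_COUNT, 5, partition_key="movie-42")
index_only = nodes_to_search(NODE_COUNT, 5)
print(len(with_key), len(index_only))  # 5 100
```

The replication factor changes how many nodes a partition-key query can touch, but it has no effect on the index-only case, which always reaches all 100 nodes.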

Write and Read paths of Secondary Indexes

Whenever a mutation (a write to a table) is applied to a base table in memory (the memtable), it is dispatched as a notification to all indexes registered on that table so that each index implementation can apply the necessary processing. The index memtable and the base memtable will generally be flushed to SSTables at the same time, but there is no strong guarantee of this behavior. Once flushed to disk, index data has a different life cycle than base data; for example, the index table may be compacted independently of the base table.

The local read path for a native secondary index is quite straightforward. First, Cassandra reads the index table to retrieve the primary keys of all matching rows. Then, for each of them, it reads the original table to fetch the data.
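
The write and read paths described above can be sketched as a toy Python class for a single node. This is a simplified model, not Cassandra's implementation: the class name `NodeWithIndex` and its methods are hypothetical, there are no memtables or SSTables, and the sketch removes stale index entries eagerly on update purely for simplicity, which is an assumption of the example rather than a claim about how Cassandra cleans them up.

```python
from collections import defaultdict


class NodeWithIndex:
    """Toy model of one node holding a base table and one local index."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.base = {}                  # primary key -> row
        self.index = defaultdict(set)   # indexed value -> primary keys

    def write(self, primary_key, row):
        """Apply a mutation to the base table and notify the index."""
        old_row = self.base.get(primary_key)
        if old_row is not None:
            # Drop the index entry left behind by the previous value.
            self.index[old_row[self.indexed_column]].discard(primary_key)
        self.base[primary_key] = row
        self.index[row[self.indexed_column]].add(primary_key)

    def read_by_index(self, value):
        """Read the index first, then fetch each matching row from the base table."""
        primary_keys = sorted(self.index.get(value, set()))
        return [self.base[pk] for pk in primary_keys]


node = NodeWithIndex("genre")
node.write("m1", {"title": "Alien", "genre": "scifi"})
node.write("m2", {"title": "Heat", "genre": "crime"})
node.write("m1", {"title": "Alien", "genre": "horror"})  # update moves m1 in the index
```

After these writes, `node.read_by_index("horror")` returns the Alien row and `node.read_by_index("scifi")` returns an empty list. Every index read costs one index lookup plus one base-table lookup per match, which is why broad matches are expensive.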

Use Cases

Restricting the query to a single server.
All secondary index implementations work best when Cassandra can narrow down the number of nodes to query.
Secondary indexes can be very helpful in analytics workloads (Spark batch jobs) where you don’t have an SLA that’s measured in milliseconds.

Anti-Patterns

Secondary indexes should not be used on columns that have high cardinality, that is, a large number of unique values. Columns that have extremely low cardinality, such as a column storing booleans, are also not going to be particularly useful to index. Secondary indexes should not be used on tables that are frequently updated or deleted from, because those operations leave tombstones in the index. By default, Cassandra aborts a read once it has scanned more than 100 thousand tombstones (the tombstone_failure_threshold setting), so once that limit is reached, a query using the indexed value will fail. Secondary indexes should also be avoided when looking for values contained in a large partition unless the query is very narrow.
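
The tombstone limit can be pictured as a counter that aborts a scan once it passes a threshold. The Python sketch below is only a conceptual model: `scan` and `TombstoneLimitExceeded` are hypothetical names, `None` stands in for a tombstone, and the constant mirrors the default value of `tombstone_failure_threshold` in cassandra.yaml.

```python
TOMBSTONE_FAILURE_THRESHOLD = 100_000  # default tombstone_failure_threshold


class TombstoneLimitExceeded(Exception):
    """Hypothetical error type used only in this sketch."""


def scan(cells, threshold=TOMBSTONE_FAILURE_THRESHOLD):
    """Return live values, aborting once too many tombstones have been scanned.

    `cells` is an iterable of values in which None stands in for a tombstone.
    """
    live = []
    tombstones = 0
    for cell in cells:
        if cell is None:
            tombstones += 1
            if tombstones > threshold:
                raise TombstoneLimitExceeded(
                    f"scanned more than {threshold} tombstones"
                )
        else:
            live.append(cell)
    return live
```

With a small threshold for demonstration, `scan(["a", None, "b"], threshold=1)` returns the two live values, while `scan([None, None], threshold=1)` raises the error. A frequently updated indexed column pushes real queries toward the second case.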

Problems and Limitations

Secondary indexes do not support range queries (for example, WHERE Age > 18); they can only be used for equality queries. Also, because indexes are maintained in hidden tables, they go through a separate compaction process. Compacting SSTables and indexes independently means the location of the data and the index information are completely decoupled. If an index entry pointed at a position in an SSTable, it would become incorrect as soon as the data was compacted into a new SSTable. This means an index can’t simply point to a location on disk, because the location of the data can change.

SASI Indexes

There are two types of secondary indexes. The regular secondary index (2i) uses hash tables to index data and supports equality (=) predicates. The SSTable-attached secondary index (SASI) is an experimental and more efficient secondary index that uses B+ trees to index data and can support equality (=), inequality (<, <=, >, >=), and even text pattern matching (LIKE). However, SASI is still considered experimental and is not recommended for production use.
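
To see why the underlying data structure determines which predicates an index can serve, the Python sketch below contrasts a hash-style lookup with an ordered one. It is an analogy, not either index implementation: a plain dict stands in for the hash-based 2i index, a sorted list searched with `bisect` stands in for SASI's B+ tree, and the data and function names are made up.

```python
import bisect

ages = {"ann": 17, "bob": 18, "cho": 25, "dee": 25, "eli": 40}  # made-up rows

# Hash-style index: value -> keys. It can answer equality lookups only,
# because the values are not stored in any useful order.
hash_index = {}
for key, age in ages.items():
    hash_index.setdefault(age, []).append(key)

# Ordered index: (value, key) pairs kept sorted, standing in for a B+ tree.
sorted_index = sorted((age, key) for key, age in ages.items())
sorted_values = [age for age, _ in sorted_index]


def equal_to(value):
    """Equality lookup: a single hash probe."""
    return hash_index.get(value, [])


def greater_than(value):
    """Range lookup: binary search for the first entry above `value`."""
    start = bisect.bisect_right(sorted_values, value)
    return [key for _, key in sorted_index[start:]]
```

Here `equal_to(25)` returns `["cho", "dee"]` and `greater_than(18)` returns `["cho", "dee", "eli"]`. Answering the second query from `hash_index` alone would require checking every entry.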

Resources

Special thanks to Anil Mittana for putting together his presentation and speaking at Cassandra Lunch #57.

https://docs.datastax.com/

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
