DSE Search (Apache Cassandra + Solr) Deployment Guide
Let me guess: you deployed DSE with just Cassandra enabled, realized you needed additional search capabilities, and restarted your nodes as DSE Search nodes with Apache Solr enabled. Should work fine, right?
Enabling Solr does offer a whole new set of features, including partial text search, facet queries, geospatial search, etc., but DSE Search also means different hardware requirements, specific tuning, and TLC, also known as repairs.
Follow these 6 steps and save yourself some pain (based on DSE 4.8: Apache Cassandra 2.1 / Apache Solr 4.10).
Remember, all nodes within the DC need to be Solr-enabled. Also, make sure the connection object in your application uses DCAwareRoundRobinPolicy pointed at the DSE Search DC. You can only run solr_query on a Solr-enabled DC.
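For example, with the DataStax Java driver 2.1 (the driver generation contemporary with DSE 4.8), the policy is set when building the Cluster. A minimal sketch; the contact point, DC name, keyspace, table, and query are hypothetical:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

// Pin all queries to the Solr-enabled DC
Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")
        .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("SearchDC"))
        .build();
Session session = cluster.connect();

// solr_query only works when routed to a Solr-enabled DC
session.execute("SELECT * FROM ks.events WHERE solr_query = 'status:failed' LIMIT 10");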
Environment
1. Hardware: more memory, more cores
If you are running DSE with only Cassandra enabled, a 16-core CPU, 32GB of RAM, and SSDs are enough to get started.
If you plan to use DSE Search, a 16-core CPU is the bare minimum; 24+ cores are ideal. Indexing is CPU-intensive!
Memory is also key, since you want your search index to fit in memory to meet your SLAs (think performance). 128GB+ if you can. How can you know your index size? See 6.2 in the Monitoring section below.
Plan to leverage appropriately sized hardware, especially with memory, if you want to be successful with DSE Search.
2. DSE Search Configuration
2.1 vnodes
IMPORTANT: If you have vnodes enabled, make sure you set num_tokens to 32 in cassandra.yaml before starting DSE; otherwise, you could pay a performance hit of up to 30%. This is not easy to change after starting DSE, so hopefully you read this before deploying your nodes.
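In cassandra.yaml it is a one-line change, but it has to be in place before the node bootstraps:

# cassandra.yaml — must be set BEFORE the node first starts
num_tokens: 32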
2.2 dse.yaml
1. Increase concurrency, i.e., how many physical cores are allocated for indexing, by setting max_solr_concurrency_per_core to the number of available CPU cores divided by the number of Solr cores. More indexing threads per Solr core means higher indexing throughput (see the dse.yaml sketch after these two settings).
max_solr_concurrency_per_core = number of CPU cores / number of Solr cores
2. Adjust back-pressure: this keeps nodes responsive without falling over. Increase back_pressure_threshold_per_core to 1,000 x the number of Solr cores; this represents the number of indexing requests queued per Solr core.
back_pressure_threshold_per_core = 2000–10000
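As a sketch, on a hypothetical 24-core node hosting 2 Solr cores, dse.yaml would read:

# dse.yaml — values assume 24 CPU cores and 2 Solr cores
max_solr_concurrency_per_core: 12        # 24 CPU cores / 2 Solr cores
back_pressure_threshold_per_core: 2000   # 1,000 x 2 Solr cores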
As a general guide, if your indexing queue backs up:
1. increase autoSoftCommit maxTime to 10 minutes or more (solrconfig.xml; see the snippet after this list)
2. increase back_pressure_threshold_per_core
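The soft-commit interval lives in each core's solrconfig.xml. A sketch raising it to 10 minutes:

<!-- solrconfig.xml -->
<autoSoftCommit>
  <maxTime>600000</maxTime> <!-- 10 minutes, in milliseconds -->
</autoSoftCommit>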
2.3 HEAP
Increase your heap to at least 20GB in cassandra-env.sh if you are using G1GC (especially if you have lots of RAM):
MAX_HEAP_SIZE="20G"
If you are using CMS, cap it lower: MAX_HEAP_SIZE="14G".
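A minimal cassandra-env.sh sketch; the G1 flag is the standard JVM switch, not DSE-specific, and the CMS values are illustrative:

# cassandra-env.sh — G1GC
MAX_HEAP_SIZE="20G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"

# CMS alternative: keep the heap smaller, and set HEAP_NEWSIZE alongside it
# MAX_HEAP_SIZE="14G"
# HEAP_NEWSIZE="2G"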
3. Repairs
RUN REPAIRS! Solr queries run from CQL use Consistency Level (CL) LOCAL_ONE and kick off read repair; run from the Solr HTTP API, they use CL ONE and no read repair. You want the same result regardless of which node you run your query on, don't you? The simplest way to schedule repairs: OpsCenter's Repair Service.
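If you would rather script it than use OpsCenter, a staggered primary-range repair on each node does the job (keyspace name is hypothetical):

# run on every node, one at a time, within gc_grace_seconds
nodetool repair -pr my_keyspace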
Data Model
4. Partition Key
Who cares about your partition key? You now have the freedom to run any query you ever wanted, now that you have DSE Search. Well, OK, but don't throw away good data partitioning just because you might not have to think about denormalization, duplication, and access-pattern data modeling.
Remember: keep partitions below 100MB in size and 100,000 items.
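For example, a hypothetical events table can bound partition growth by adding a time bucket to the partition key:

-- the day bucket caps how large any one partition can grow
CREATE TABLE events (
    sensor_id  uuid,
    day        text,        -- e.g. '2016-01-15'
    event_time timestamp,
    status     text,
    PRIMARY KEY ((sensor_id, day), event_time)
);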
How can you tell you're in the clear?
1. The OpsCenter Best Practice Service has a "wide partition" validation
2. system.log — large partitions show up as "WARN … Compacting large partition …", so look for those entries
4.1 Collections: Map
All those “map” collection columns in your table will now be converted to dynamic fields in Solr. Why does it matter? Well: “Lucene allocates memory for each unique field (column) name, so if you have a row with columns A, B, C, and another row with B, D, E, Lucene allocates 5 chunks of memory. For millions of rows, the heap is unwieldy. Instead of using dynamic fields, use Copy fields instead and then perform queries against the combined field.” (DataStax docs)
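A schema.xml sketch of that advice; the attr_* pattern and field names are hypothetical:

<!-- the map keys still land in a dynamic field, but it is not indexed -->
<dynamicField name="attr_*" type="StrField" indexed="false"/>
<!-- one combined, indexed field to query against instead -->
<field name="all_attrs" type="StrField" indexed="true" multiValued="true"/>
<copyField source="attr_*" dest="all_attrs"/>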
4.2 Decimals
The CQL decimal type is automatically mapped to Solr's DecimalStrField, which is not very helpful since it compares values as strings rather than numerically. Instead, keep the field as decimal in CQL (as you did) and use a Solr copy field to map it to a TrieDouble. I know double != decimal, but it's the best workaround I've seen.
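A sketch with hypothetical field names (your generated schema.xml may already define a Trie double type):

<fieldType name="TrieDoubleField" class="solr.TrieDoubleField"/>
<field name="price" type="DecimalStrField" indexed="false" stored="true"/> <!-- the CQL decimal column -->
<field name="price_d" type="TrieDoubleField" indexed="true" stored="false"/>
<copyField source="price" dest="price_d"/>
<!-- range queries now compare numerically: price_d:[10.0 TO 99.99] -->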
Solr Tuning
This was already well covered in this post, but at a high level:
5.1 Schema.xml
Save space and keep your Solr index smaller by using StrField (vs. TextField) for fields you will not run partial text searches on. Also, you don't HAVE to index all your columns (really). Set indexed=false in your schema.xml.
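For instance, with hypothetical fields:

<field name="order_id" type="StrField" indexed="true"/> <!-- exact-match lookups only -->
<field name="comments" type="TextField" indexed="true"/> <!-- tokenized, partial text search -->
<field name="raw_payload" type="StrField" indexed="false"/> <!-- stored, never searched -->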
5.2 docValues
If you are executing facet queries (i.e., you want to "group by" a column, such as events by status) or sorting on a field, set docValues=true.
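For example, on a hypothetical status field, with the matching facet query issued through CQL (DSE Search accepts JSON-formatted solr_query strings; facet results come back as a single JSON row rather than regular rows):

<field name="status" type="StrField" indexed="true" docValues="true"/>

SELECT * FROM ks.events WHERE solr_query = '{"q":"*:*", "facet":{"field":"status"}}';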
There are other parameters you can tune, such as query warming, but these two are simple, common, and apply to most DSE Search projects.
Monitoring
6.1 OpsCenter
You can monitor DSE Search requests, latency, timeouts, errors, and core size from your OpsCenter dashboard. You can also enable the Best Practice Service for DSE Search.
6.2 SolrAdmin UI
You want your index to fit in memory, so how can you easily find out its size? Select your core in the Solr Admin UI:
> plugins/stats -> CORE -> index size (or via OpsCenter)
6.3 Dropped Mutations
One last recommendation: check dropped mutations, e.g., via OpsCenter by adding the graph to the dashboard, or by selecting a node so the node details come up and scrolling to the bottom.
This is also available via nodetool tpstats (sample below). Dropped mutations indicate that you are overloading your cluster and the nodes cannot write fast enough. Not a good thing if you want consistent Solr results.
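From the command line, the dropped-message counts sit at the bottom of the tpstats output; the count shown here is illustrative:

$ nodetool tpstats
...
Message type           Dropped
MUTATION                  1842    <- nonzero means the node is shedding writes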
Conclusion
This is not meant to be a comprehensive guide to everything you should configure and monitor with DSE Search; it is just to get you started. There is other configuration and tuning you can do, but these are the most common changes I have come across and the ones that deliver the most impact for the least effort.