Illustration Image

Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

1/25/2019

Reading time:8 min

Querying Map-like Data in DSE Search

by John Doe

DSE search doesn’t yet support indexing frozen maps. When people have map-like data and would like to use Solr Query, how could we make it possible? In this blog, we demonstrate two approaches.SetupLet’s create a single node cluster, workload as “Search”. The demo cluster uses DSE 5.1.1:Cluster$ dsetool ringAddress          DC             Rack         Workload      Graph  Status  State     Load       Owns                 VNodes           10.200.177.142   DC1           rack1        Search           no     Up      Normal    393.99 MiB        ?                    16                                Note: you must specify a keyspace to get ownership information.Table Schema With MapWe have an original table schema with a map collection, and we’d like to run a Solr query against it. This is a table of book metadata, the book id (an unique uuid), the book genre, type and the status of publication.CREATE TABLE demodb.books_orig (book_id uuid PRIMARY KEY, tags map);Let’s insert some sample data:cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Drama', 'type': 'fiction', 'is_published': 'true'});cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Guide', 'type': 'non_fiction', 'is_published': 'true'});cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Travel', 'type': 'non_fiction', 'is_published': 'true'});cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Mystery', 'type': 'fiction', 'is_published': 'false'});How do we proceed from here? According to the documentation, CQL map collections map to Search dynamic fields. What does it mean?  Let’s roughly describe what a dynamic field is in DSE Search. It is derived from Apache Solr dynamic field, which is named by using a prefix or suffix wildcard in schema.xml. In general, dynamic fields in Solr are used to index fields that are not defined explicitly by schema. DSE Search also uses it to index a CQL map collection. Using a dynamic field requires us to insert data into the map collection with keys that include its base name as a prefix or suffix, depending on the field. We have a demo below.Using dynamic fieldsLet’s modify the CQL schema slightly to align with the naming convention for dynamic fields:CREATE TABLE demodb.books_dynamic_fld (book_id uuid PRIMARY KEY, tags_ map );As described above, we must insert the data to include the map collection name as the base name, prefixing or suffixing in each map pair. We use prefix in this example:cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 7fc4aab8-b469-4f0b-b7dd-bc9c922b9ef8, {'tags_genre':'Guide', 'tags_is_published': 'true', 'tags_type': 'non_fiction' });cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 6f5cbaa7-eec2-41bb-9668-d7e1ee6b0620, {'tags_genre':'Travel', 'tags_is_published': 'true', 'tags_type': 'non_fiction' });cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 2b6fa659-6912-45b3-afbf-69b1b97d75d0, {'tags_genre':'Drama', 'tags_is_published': 'true', 'tags_type': 'fiction' });cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 866b493c-78da-4550-afb6-cd6f16a6044e, {'tags_genre':'Mystery', 'tags_is_published': 'false', 'tags_type': 'fiction' });Create solr core:There are three ways to create Solr cores:The legacy “curl” http request for  CREATE_CORE;“dsetool create_core”;Since DSE 5.1, there is a new syntax, “CREATE SEARCH INDEX”, that supports Solr core creation over CQL.Note: It’s recommended to use dsetool or CQL statements over the legacy way of “curl/http” for core management including create, reload and unload, etc.We use the new syntax of “CREATE SEARCH INDEX”:$ cqlsh> CREATE SEARCH INDEX ON demodb.books_dynamic_fld;When there are no existing resources, nor COLUMNS, PROFILES or CONFIG option provided in the CQL statement, new resources will be auto-generated with the default values. The above CQL statement has the same effect as “generateResources=true reindex=true” in dsetool create_core.Solr schema:<?xml version="1.0" encoding="UTF-8" standalone="no"?> <schema name="autoSolrSchema" version="1.5"> <types> <fieldType class="org.apache.solr.schema.TextField" name="TextField"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> <fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/> </types> <fields> <dynamicField indexed="true" multiValued="false" name="tags_*" stored="true" type="TextField"/> <field docValues="true" indexed="true" multiValued="false" name="book_id" stored="true" type="UUIDField"/> </fields> <uniqueKey>book_id</uniqueKey> </schema>Notice that the CQL text field is mapped to a Solr TextField in the schema.xml, which is both tokenized and filtered before indexing, making token-based searching possible. If this is not necessary, and we're indexing simple strings, then the lower-overhead Solr type StrField would be preferable.Note: In the future DSE 6.0, a CQL text field will be mapped to a Solr StrField by default in auto-generated schema.xml.In this example, the text contents are a few single units without tokenizing. Let’s change them to StrField.Add a new field type:cqlsh> ALTER SEARCH INDEX SCHEMA ON demodb.books_dynamic_fld ADD types.fieldType[@name='StrField',@class='solr.StrField'];Set the dynamic field to use StrField:cqlsh> ALTER SEARCH INDEX SCHEMA ON demodb.books_dynamic_fld SET dynamicField[@name='tags_*']@type = 'StrField';Since there is no any field referring to field type “TextField”, we drop it.cqlsh> ALTER SEARCH INDEX SCHEMA ON demodb.books_dynamic_fld DROP fieldType[@name='TextField'];Reload the search index:cqlsh> RELOAD SEARCH INDEX ON demodb.books_dynamic_fld;Rebuild the search index. Given that we change the field type, we use full reindex with “deleteAll:true” in this example:cqlsh> REBUILD SEARCH INDEX ON demodb.books_dynamic_fld WITH OPTIONS { deleteAll:true };New schema.xml:<?xml version="1.0" encoding="UTF-8" standalone="no"?> <schema name="autoSolrSchema" version="1.5"> <types> <fieldType class="org.apache.solr.schema.StrField" name="StrField" /> <fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/> </types> <fields> <dynamicField indexed="true" multiValued="false" name="tags_*" stored="true" type="StrField"/> <field docValues="true" indexed="true" multiValued="false" name="book_id" stored="true" type="UUIDField"/> </fields> <uniqueKey>book_id</uniqueKey> </schema>Let’s search …“Find all fiction books”:cqlsh:demodb> select * from books_dynamic_fld  where solr_query = '{"q": "tags_type:fiction"}';   book_id                           | solr_query | tags_ -----------------------------------+------------+---------------------2b6fa659-6912-45b3-afbf-69b1b97d75d0 |       null |    {'tags_genre': 'Drama', 'tags_is_published': 'true', 'tags_type': 'fiction'}  866b493c-78da-4550-afb6-cd6f16a6044e |       null | {'tags_genre': 'Mystery', 'tags_is_published': 'false', 'tags_type': 'fiction'}(2 rows)Using UDT and TuplesBeside using dynamic field, there is another way to index and query the map-like data. Let’s see the next example.We need to create an UDT for the tags we defined:CREATE TYPE demodb.tag (genre text, is_published boolean, type text);CQL table schema:CREATE TABLE demodb.books_udt (book_id uuid PRIMARY KEY, tags frozen );Note: Although the frozen type is used in this example, a non-frozen UDT is supported from DSE 5.1 onward.Inserting data as we did for the original table:cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 7fc4aab8-b469-4f0b-b7dd-bc9c922b9ef8, {"genre": 'Guide', "is_published": true, "type": 'non_fiction'});cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 6f5cbaa7-eec2-41bb-9668-d7e1ee6b0620, {"genre": 'Travel', "is_published": true, "type": 'non_fiction'});cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 2b6fa659-6912-45b3-afbf-69b1b97d75d0, {"genre": 'Drama', "is_published": true, "type": 'fiction'});cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 866b493c-78da-4550-afb6-cd6f16a6044e, {"genre": 'Mystery', "is_published": false, "type": 'fiction'});Similar to the previous demo for dynamic field, we create the search index with auto-generated resources, and are going to use Solr StrField for CQL text field. To achieve that, let’s include the PROFILES in the CQL statement and apply the space saving options, “spaceSavingNoTextfield”, to use StrField on initial creation.cqlsh:demodb> CREATE SEARCH INDEX ON demodb.books_udt WITH PROFILES spaceSavingNoTextfield;Let’s see the Solr schema.xml generated:<?xml version="1.0" encoding="UTF-8" standalone="no"?> <schema name="autoSolrSchema" version="1.5"> <types> <fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/> <fieldType class="com.datastax.bdp.search.solr.core.types.TupleField" name="TupleField"/> <fieldType class="org.apache.solr.schema.StrField" name="StrField" /> <fieldType class="org.apache.solr.schema.BoolField" name="BoolField"/> </types> <fields> <field docValues="true" indexed="true" multiValued="false" name="book_id" stored="true" type="UUIDField"/> <field indexed="true" multiValued="false" name="tags" stored="true" type="TupleField"/> <field indexed="true" multiValued="false" name="tags.genre" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="tags.is_published" stored="true" type="BoolField"/> <field indexed="true" multiValued="false" name="tags.type" stored="true" type="StrField"/> </fields> <uniqueKey>book_id</uniqueKey> </schema>We will search again:“I would like to get all published books”:cqlsh> use demodb;cqlsh:demodb> select * from books_udt where solr_query = '{"q":"{!tuple}tags.is_published:true"}';   book_id               | solr_query | tags --------------------------------------+------------+----------------------------7fc4aab8-b469-4f0b-b7dd-bc9c922b9ef8 |       null |  {genre: 'Guide', is_published: 'true', type: 'non_fiction'}  6f5cbaa7-eec2-41bb-9668-d7e1ee6b0620 |       null | {genre: 'Travel', is_published: 'true', type: 'non_fiction'}  2b6fa659-6912-45b3-afbf-69b1b97d75d0 |       null |      {genre: 'Drama', is_published: 'true', type: 'fiction'}  (3 rows)Which method should we choose?Between these two methods, which one should be chosen? That depends on the data in your map collection.If this is a map with static set of keys, both ways work.If the map contains large amount of elements or the size of the map is unknown, there could be schema overhead by using UDT. Please choose dynamic field instead.Please note, if the size of map grows widely, for example, millions of keys, the sheer size of the Solr schema may put pressure on the heap. Neither method will work. Please reconsider your data model in this case.Whatever you do, please test it, functionality-wise and performance-wise!Should you have any questions, please feel free to contact Datastax support. 

Illustration Image

DSE search doesn’t yet support indexing frozen maps. When people have map-like data and would like to use Solr Query, how could we make it possible? In this blog, we demonstrate two approaches.

Setup

Let’s create a single node cluster, workload as “Search”. The demo cluster uses DSE 5.1.1:

Cluster

$ dsetool ring
Address          DC             Rack         Workload      Graph  Status    State          Load            Owns                 VNodes           
10.200.177.142   DC1           rack1        Search           no     Up      Normal      393.99 MiB          ?                    16                                
Note: you must specify a keyspace to get ownership information.

Table Schema With Map

We have an original table schema with a map collection, and we’d like to run a Solr query against it. This is a table of book metadata, the book id (an unique uuid), the book genre, type and the status of publication.

CREATE TABLE demodb.books_orig (book_id uuid PRIMARY KEY, tags map);

Let’s insert some sample data:

cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Drama', 'type': 'fiction', 'is_published': 'true'});
cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Guide', 'type': 'non_fiction', 'is_published': 'true'});
cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Travel', 'type': 'non_fiction', 'is_published': 'true'});
cqlsh:demodb> INSERT INTO books_orig (book_id , tags ) VALUES ( uuid(), {'genre' :'Mystery', 'type': 'fiction', 'is_published': 'false'});

How do we proceed from here? According to the documentation, CQL map collections map to Search dynamic fields. What does it mean?  

Let’s roughly describe what a dynamic field is in DSE Search. It is derived from Apache Solr dynamic field, which is named by using a prefix or suffix wildcard in schema.xml. In general, dynamic fields in Solr are used to index fields that are not defined explicitly by schema. DSE Search also uses it to index a CQL map collection. Using a dynamic field requires us to insert data into the map collection with keys that include its base name as a prefix or suffix, depending on the field. We have a demo below.

Using dynamic fields

Let’s modify the CQL schema slightly to align with the naming convention for dynamic fields:

CREATE TABLE demodb.books_dynamic_fld (book_id uuid PRIMARY KEY, tags_ map );

As described above, we must insert the data to include the map collection name as the base name, prefixing or suffixing in each map pair. We use prefix in this example:

cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 7fc4aab8-b469-4f0b-b7dd-bc9c922b9ef8, {'tags_genre':'Guide', 'tags_is_published': 'true', 'tags_type': 'non_fiction' });
cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 6f5cbaa7-eec2-41bb-9668-d7e1ee6b0620, {'tags_genre':'Travel', 'tags_is_published': 'true', 'tags_type': 'non_fiction' });
cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 2b6fa659-6912-45b3-afbf-69b1b97d75d0, {'tags_genre':'Drama', 'tags_is_published': 'true', 'tags_type': 'fiction' });
cqlsh:demodb> INSERT INTO demodb.books_dynamic_fld (book_id , tags_) VALUES ( 866b493c-78da-4550-afb6-cd6f16a6044e, {'tags_genre':'Mystery', 'tags_is_published': 'false', 'tags_type': 'fiction' });

Create solr core:

There are three ways to create Solr cores:

  • The legacy “curl” http request for  CREATE_CORE;
  • “dsetool create_core”;
  • Since DSE 5.1, there is a new syntax, “CREATE SEARCH INDEX”, that supports Solr core creation over CQL.

Note: It’s recommended to use dsetool or CQL statements over the legacy way of “curl/http” for core management including create, reload and unload, etc.

We use the new syntax of “CREATE SEARCH INDEX”:

$ cqlsh> CREATE SEARCH INDEX ON demodb.books_dynamic_fld;

When there are no existing resources, nor COLUMNS, PROFILES or CONFIG option provided in the CQL statement, new resources will be auto-generated with the default values. The above CQL statement has the same effect as “generateResources=true reindex=true” in dsetool create_core.

Solr schema:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<schema name="autoSolrSchema" version="1.5">   
  <types>     
    <fieldType class="org.apache.solr.schema.TextField" name="TextField">       
     <analyzer>         
      <tokenizer class="solr.StandardTokenizerFactory"/>         
      <filter class="solr.LowerCaseFilterFactory"/>       
     </analyzer>     
    </fieldType>     
    <fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/>   
  </types>   
<fields>     
  <dynamicField indexed="true" multiValued="false" name="tags_*" stored="true" type="TextField"/>     
  <field docValues="true" indexed="true" multiValued="false" name="book_id" stored="true" type="UUIDField"/>   
</fields>   
<uniqueKey>book_id</uniqueKey> 
</schema>

Notice that the CQL text field is mapped to a Solr TextField in the schema.xml, which is both tokenized and filtered before indexing, making token-based searching possible. If this is not necessary, and we're indexing simple strings, then the lower-overhead Solr type StrField would be preferable.

Note: In the future DSE 6.0, a CQL text field will be mapped to a Solr StrField by default in auto-generated schema.xml.

In this example, the text contents are a few single units without tokenizing. Let’s change them to StrField.

  • Add a new field type:

cqlsh> ALTER SEARCH INDEX SCHEMA ON demodb.books_dynamic_fld ADD types.fieldType[@name='StrField',@class='solr.StrField'];
  • Set the dynamic field to use StrField:

cqlsh> ALTER SEARCH INDEX SCHEMA ON demodb.books_dynamic_fld SET dynamicField[@name='tags_*']@type = 'StrField';
  • Since there is no any field referring to field type “TextField”, we drop it.

cqlsh> ALTER SEARCH INDEX SCHEMA ON demodb.books_dynamic_fld DROP fieldType[@name='TextField'];
  • Reload the search index:

cqlsh> RELOAD SEARCH INDEX ON demodb.books_dynamic_fld;
  • Rebuild the search index. Given that we change the field type, we use full reindex with “deleteAll:true” in this example:

cqlsh> REBUILD SEARCH INDEX ON demodb.books_dynamic_fld WITH OPTIONS { deleteAll:true };

New schema.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<schema name="autoSolrSchema" version="1.5">   
  <types>     
    <fieldType class="org.apache.solr.schema.StrField" name="StrField" />       
    <fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/>   
  </types>   
<fields>     
  <dynamicField indexed="true" multiValued="false" name="tags_*" stored="true" type="StrField"/>     
  <field docValues="true" indexed="true" multiValued="false" name="book_id" stored="true" type="UUIDField"/>   
</fields>   
<uniqueKey>book_id</uniqueKey> 
</schema>

Let’s search …

“Find all fiction books”:

cqlsh:demodb> select * from books_dynamic_fld  where solr_query = '{"q": "tags_type:fiction"}';   
book_id                              | solr_query |       tags_ -----------------------------------+------------+---------------------
2b6fa659-6912-45b3-afbf-69b1b97d75d0 |       null |    {'tags_genre': 'Drama', 'tags_is_published': 'true', 'tags_type': 'fiction'}  
866b493c-78da-4550-afb6-cd6f16a6044e |       null | {'tags_genre': 'Mystery', 'tags_is_published': 'false', 'tags_type': 'fiction'}
(2 rows)

Using UDT and Tuples

Beside using dynamic field, there is another way to index and query the map-like data. Let’s see the next example.

We need to create an UDT for the tags we defined:

CREATE TYPE demodb.tag (genre text, is_published boolean, type text);

CQL table schema:

CREATE TABLE demodb.books_udt (book_id uuid PRIMARY KEY, tags frozen );

Note: Although the frozen type is used in this example, a non-frozen UDT is supported from DSE 5.1 onward.

Inserting data as we did for the original table:

cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 7fc4aab8-b469-4f0b-b7dd-bc9c922b9ef8, {"genre": 'Guide', "is_published": true, "type": 'non_fiction'});
cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 6f5cbaa7-eec2-41bb-9668-d7e1ee6b0620, {"genre": 'Travel', "is_published": true, "type": 'non_fiction'});
cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 2b6fa659-6912-45b3-afbf-69b1b97d75d0, {"genre": 'Drama', "is_published": true, "type": 'fiction'});
cqlsh:demodb> INSERT INTO books_udt (book_id , tags ) VALUES ( 866b493c-78da-4550-afb6-cd6f16a6044e, {"genre": 'Mystery', "is_published": false, "type": 'fiction'});

Similar to the previous demo for dynamic field, we create the search index with auto-generated resources, and are going to use Solr StrField for CQL text field. To achieve that, let’s include the PROFILES in the CQL statement and apply the space saving options, “spaceSavingNoTextfield”, to use StrField on initial creation.

cqlsh:demodb> CREATE SEARCH INDEX ON demodb.books_udt WITH PROFILES spaceSavingNoTextfield;

Let’s see the Solr schema.xml generated:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<schema name="autoSolrSchema" version="1.5">   
  <types>     
    <fieldType class="org.apache.solr.schema.UUIDField" name="UUIDField"/>     
    <fieldType class="com.datastax.bdp.search.solr.core.types.TupleField" name="TupleField"/>     
    <fieldType class="org.apache.solr.schema.StrField" name="StrField" />        
    <fieldType class="org.apache.solr.schema.BoolField" name="BoolField"/> 
</types>   
<fields>     
  <field docValues="true" indexed="true" multiValued="false" name="book_id" stored="true" type="UUIDField"/>     
  <field indexed="true" multiValued="false" name="tags" stored="true" type="TupleField"/>
  <field indexed="true" multiValued="false" name="tags.genre" stored="true" type="StrField"/>     
  <field indexed="true" multiValued="false" name="tags.is_published" stored="true" type="BoolField"/>     
  <field indexed="true" multiValued="false" name="tags.type" stored="true" type="StrField"/>   
</fields>   
<uniqueKey>book_id</uniqueKey> 
</schema>

We will search again:

“I would like to get all published books”:

cqlsh> use demodb;
cqlsh:demodb> select * from books_udt where solr_query = '{"q":"{!tuple}tags.is_published:true"}';   
book_id                              | solr_query | tags --------------------------------------+------------+----------------------------
7fc4aab8-b469-4f0b-b7dd-bc9c922b9ef8 |       null |  {genre: 'Guide', is_published: 'true', type: 'non_fiction'}  
6f5cbaa7-eec2-41bb-9668-d7e1ee6b0620 |       null | {genre: 'Travel', is_published: 'true', type: 'non_fiction'}  
2b6fa659-6912-45b3-afbf-69b1b97d75d0 |       null |      {genre: 'Drama', is_published: 'true', type: 'fiction'}  
(3 rows)

Which method should we choose?

Between these two methods, which one should be chosen? That depends on the data in your map collection.

  • If this is a map with static set of keys, both ways work.
  • If the map contains large amount of elements or the size of the map is unknown, there could be schema overhead by using UDT. Please choose dynamic field instead.
  • Please note, if the size of map grows widely, for example, millions of keys, the sheer size of the Solr schema may put pressure on the heap. Neither method will work. Please reconsider your data model in this case.
  • Whatever you do, please test it, functionality-wise and performance-wise!

Should you have any questions, please feel free to contact Datastax support.

 

Related Articles

kubernetes
datastax
cassandra

Apache Cassandra Lunch #78: Cass Operator - Business Platform Team

Stefan Nikolovski

7/1/2022

cassandra
dse

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt! 
We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

Join Our Newsletter!

Sign up below to receive email updates and see what's going on with our company

Explore Related Topics

AllKafkaSparkScyllaSStableKubernetesApiGithubGraphQl

Explore Further

cassandra