1/30/2019

Reading time:3 min

Cassandra row count tips – A developer day – Web Developer from Somewhere

by John Doe

When you use DevCenter, you define a connection with one or more server.Thus, when you do an CQL instruction, this one is done on the cluster with the default consistency level ( the consistency level defaults to ONE for all write and read operations),and so on the closest replica respond.When you use Cqlsh, in the same way your CQL instruction are done on the cluster (and not on the node on which you execute the cqlsh script).A simple example :Create a cluster of two nodes (update your cassandra.yaml), we call them nodeA and nodeB Start cqlsh on nodeA Create a keyspace (with a replication factor of 2), thus all your node have all datasCREATE KEYSPACE mybeautifulkeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2};CREATE a simple table on this keyspaceUSE mybeautifulkeyspace;CREATE TABLE mysimpletable (id int, label text, PRIMARY KEY(id));Insert a row in this tableINSERT INTO mysimpletable (id, label) values (1,'un libelle');you can start cqlsh on nodeB, and do a select :USE mybeautifulkeyspace;SELECT * FROM mysimpletable;your row is correctly returned the tricky things come here, stop cassandra on nodeB insert a new line on nodeA and stop cassandra on nodeA restart cassandra on nodeB, and connect via cqlsh to the keyspace mybeautifulkeyspace, the select return only one rows. id | label----+------------ 1 | un libellerestart cassandra on nodeA on nodeB, always with cqlsh do the select, it could take before the cassandra on nodeB could answer (after restarting), but after few select on nodeB the second line appears too id | label----+-------------- 1 | un libelle 2 | un libelle 2Ok, this is great, cqlsh get correctly datas from nodeA, but if I stop again my nodeA, what happen ?The select let appear the second line on nodeB, because the node do a read repair.On the cluster :select count(*) from mysimpletableHowever, there is a default limit of 10,000 applied to this statement which will truncate the result for larger tables. The limit can be increased :select count(*) from mysimpletable limit 1000000On a specific node :Or you could use nodetool, but it’s only estimated value (this information disappeared in Cassandra 2.1)>nodetool cfstats mybeautifulkeyspaceKeyspace: mybeautifulkeyspace Read Count: 0 Read Latency: NaN ms. Write Count: 0 Write Latency: NaN ms. Pending Tasks: 0 Table: mysimpletable SSTable count: 1 Space used (live), bytes: 4734 Space used (total), bytes: 4734 SSTable Compression Ratio: 0.5972222222222222 Number of keys (estimate): 128 Memtable cell count: 0 Memtable data size, bytes: 0 Memtable switch count: 0 Local read count: 0 Local read latency: 0,000 ms Local write count: 0 Local write latency: 0,000 ms Pending tasks: 0 Bloom filter false positives: 0 Bloom filter false ratio: 0,00000 Bloom filter space used, bytes: 16 Compacted partition minimum bytes: 61 Compacted partition maximum bytes: 86 Compacted partition mean bytes: 79 Average live cells per slice (last five minutes): 0.0 Average tombstones per slice (last five minutes): 0.0For my two rows the estimated value is 128 …Or other option is to use the sstablekeys that allows to get the keys from sstable file (generaly in /var/lib/cassandra/data butit could be specified in cassandra.yaml).An example of usage on this bash script :#!/bin/bash#prend en parametre le nom de la tableif [ -z "$1" ] then echo "fournir le nom d'une table en parametre" exit 1finodetool flush#recuperation de tout les fichiers data correspondantfor file in `ls /data/cassandra/data/mybeautifulkeyspace/$1/*-$1-*-Data.db`do echo "operating $file" sstablekeys $file >> output.txtdonenbLigne=$(sort output.txt | uniq | wc -l)rm output.txtecho "nombre de ligne $nbLigne"Deletes in Cassandra rely on Tombstones to support the Eventual Consistency model. More information hereTombstones are markers that can exist at different levels of the data model and let the cluster know that a delete was recored on a replica, and when it happened.Tombstones then play a role in keeping deleted data hidden and help with freeing space used by deleted columns on disk.The CQL count query doesn’t count them,Instead the compaction process reconciles the data in multiple SSTables on disk. The row fragments from each SSTable are collated and columns with the same name reconciled using the process we’ve already seen. The result of the compaction is a single SSTable that contains the same “truth” as the input files, but may be considerably smaller due to reconciling overwrites and deletions.So it was possible that the count by sstable count these deleted rows until the compaction is done.

Read this article if you want to know more about Cassandra row count tips – A developer day – Web Developer from Somewhere

When you use DevCenter, you define a connection with one or more server.
Thus, when you do an CQL instruction, this one is done on the cluster with the default consistency level ( the consistency level defaults to ONE for all write and read operations),
and so on the closest replica respond.

When you use Cqlsh, in the same way your CQL instruction are done on the cluster (and not on the node on which you execute the cqlsh script).

A simple example :

Create a cluster of two nodes (update your cassandra.yaml), we call them nodeA and nodeB
Start cqlsh on nodeA
Create a keyspace (with a replication factor of 2), thus all your node have all datas

CREATE KEYSPACE mybeautifulkeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2};

CREATE a simple table on this keyspace

USE mybeautifulkeyspace;
CREATE TABLE mysimpletable (id int, label text, PRIMARY KEY(id));

Insert a row in this table

INSERT INTO mysimpletable (id, label) values (1,'un libelle');

you can start cqlsh on nodeB, and do a select :

USE mybeautifulkeyspace;
SELECT * FROM mysimpletable;

your row is correctly returned
the tricky things come here, stop cassandra on nodeB
insert a new line on nodeA and stop cassandra on nodeA
restart cassandra on nodeB, and connect via cqlsh to the keyspace mybeautifulkeyspace, the select return only one rows.

 id | label
----+------------
  1 | un libelle

restart cassandra on nodeA
on nodeB, always with cqlsh do the select, it could take before the cassandra on nodeB could answer (after restarting), but after few select on nodeB the second line appears too

 id | label
----+--------------
  1 |   un libelle
  2 | un libelle 2

Ok, this is great, cqlsh get correctly datas from nodeA, but if I stop again my nodeA, what happen ?
The select let appear the second line on nodeB, because the node do a read repair.

On the cluster :

select count(*) from mysimpletable

However, there is a default limit of 10,000 applied to this statement which will truncate the result for larger tables. The limit can be increased :

select count(*) from mysimpletable limit 1000000

On a specific node :

Or you could use nodetool, but it’s only estimated value (this information disappeared in Cassandra 2.1)

>nodetool cfstats mybeautifulkeyspace
Keyspace: mybeautifulkeyspace
        Read Count: 0
        Read Latency: NaN ms.
        Write Count: 0
        Write Latency: NaN ms.
        Pending Tasks: 0
                Table: mysimpletable
                SSTable count: 1
                Space used (live), bytes: 4734
                Space used (total), bytes: 4734
                SSTable Compression Ratio: 0.5972222222222222
                Number of keys (estimate): 128
                Memtable cell count: 0
                Memtable data size, bytes: 0
                Memtable switch count: 0
                Local read count: 0
                Local read latency: 0,000 ms
                Local write count: 0
                Local write latency: 0,000 ms
                Pending tasks: 0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0,00000
                Bloom filter space used, bytes: 16
                Compacted partition minimum bytes: 61
                Compacted partition maximum bytes: 86
                Compacted partition mean bytes: 79
                Average live cells per slice (last five minutes): 0.0
                Average tombstones per slice (last five minutes): 0.0

For my two rows the estimated value is 128 …

Or other option is to use the sstablekeys that allows to get the keys from sstable file (generaly in /var/lib/cassandra/data but
it could be specified in cassandra.yaml).

An example of usage on this bash script :

#!/bin/bash
#prend en parametre le nom de la table
if [ -z "$1" ]
  then echo "fournir le nom d'une table en parametre"
  exit 1
fi
nodetool flush
#recuperation de tout les fichiers data correspondant
for file in `ls /data/cassandra/data/mybeautifulkeyspace/$1/*-$1-*-Data.db`
do
  echo "operating $file"
  sstablekeys $file >> output.txt
done
nbLigne=$(sort output.txt | uniq | wc -l)
rm output.txt
echo "nombre de ligne $nbLigne"

Deletes in Cassandra rely on Tombstones to support the Eventual Consistency model. More information here
Tombstones are markers that can exist at different levels of the data model and let the cluster know that a delete was recored on a replica, and when it happened.
Tombstones then play a role in keeping deleted data hidden and help with freeing space used by deleted columns on disk.
The CQL count query doesn’t count them,

Instead the compaction process reconciles the data in multiple SSTables on disk. The row fragments from each SSTable are collated and columns with the same name reconciled using the process we’ve already seen. The result of the compaction is a single SSTable that contains the same “truth” as the input files, but may be considerably smaller due to reconciling overwrites and deletions.
So it was possible that the count by sstable count these deleted rows until the compaction is done.

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Make your contribution and score a FREE Planet Cassandra Contributor T-Shirt!  We value our incredible Cassandra community, and we want to express our gratitude by sending an exclusive Planet Cassandra Contributor T-Shirt you can wear with pride.

On the cluster :

On a specific node :

Checkout Planet Cassandra

Claim Your Free Planet Cassandra Contributor T-shirt!

Contact Info

Resources

Properties

Follow Us