I have a table defined like:
Keyspace:
CREATE KEYSPACE messages WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
Table:
CREATE TABLE messages.textmessages (
categoryid int,
date timestamp,
messageid timeuuid,
message text,
userid int,
PRIMARY KEY ((categoryid, date), messageid)
) WITH CLUSTERING ORDER BY (messageid ASC);
The goal is to have a wide row time-series storage such that categoryid
and date
(beginning of day) constitutes my partition key and the messageid
provides the clustering. This enables me to do queries like:
SELECT * FROM messages.textmessages WHERE categoryid=2 AND date='2019-05-14 00:00:00.000+0300' AND messageId > maxTimeuuid('2019-05-14 00:00:00.000+0300') AND messageId < minTimeuuid('2019-05-15 00:00:00.000+0300')
to get messages in a given day; it works so well so fast!
Problem
I need to be able to count the messages in a given day by substituting SELECT *
above with SELECT COUNT(*)
. This takes very long even with a little less than 100K entries in the column family; it actually times out on cqlsh
.
I have read and understood quite a bit why COUNT
is an expensive operation for a distributed database like Cassandra in Counting keys? Might as well be counting stars
Question
Why would this query take so long even when:
SELECT COUNT(*) FROM messages.textmessages WHERE categoryid=2 AND date='2019-05-14 00:00:00.000+0300' AND messageId > maxTimeuuid('2019-05-14 00:00:00.000+0300') AND messageId < minTimeuuid('2019-05-15 00:00:00.000+0300')
- The count is on a specific partition with less than 100K records
- I have only one Cassandra node on a performant Macbook Pro
- No active writes/reads in the instance; less than 20 partitions on development laptop