I'm doing a student project involving building and querying a Cassandra data cluster.
When my cluster load was light (around 30 GB) my queries ran without a problem, but now that it's quite a bit bigger (about 0.5 TB) my queries are timing out.
I thought that this problem might arise, so before I began generating and loading test data I changed this value in my cassandra.yaml file:
request_timeout_in_ms (Default: 10000) The default timeout for other, miscellaneous operations.
However, when I changed that value to something like 1000000, Cassandra seemingly hung on startup, although that could just have been the large timeout at work.
My goal for data generation is 2 TB. How do I query a data set that large without running into timeouts?
Queries:
SELECT huntpilotdn
FROM project.t1
WHERE (currentroutingreason, orignodeid, origspan,
origvideocap_bandwidth, datetimeorigination)
> (1,1,1,1,1)
AND (currentroutingreason, orignodeid, origspan,
origvideocap_bandwidth, datetimeorigination)
< (1000,1000,1000,1000,1000)
LIMIT 10000
ALLOW FILTERING;
SELECT destcause_location, destipaddr
FROM project.t2
WHERE datetimeorigination = 110
AND num >= 11612484378506
AND num <= 45880092667983
LIMIT 10000;
SELECT origdevicename, duration
FROM project.t3
WHERE destdevicename IN ('a','f', 'g')
LIMIT 10000
ALLOW FILTERING;
I have a demo keyspace with the same schemas, but a far smaller data size (~10GB) and these queries run just fine in that keyspace.
All these tables that are queried have millions of rows and around 30 columns in each row.
Use --request-timeout (in seconds) as a command-line parameter for cqlsh, as follows:
If you are using DataStax cqlsh, you can specify the client timeout in seconds as a command-line argument. The default is 10.
$ cqlsh --request-timeout=3600
Source: DataStax documentation
I'm going to guess that you are also using secondary indexes. You are finding out firsthand why secondary index queries and ALLOW FILTERING queries are not recommended: those types of design patterns do not scale to large datasets. Rebuild your model with query tables that support primary key lookups, as that is how Cassandra is designed to work.
Edit
"The variables that are constrained are cluster keys."
Right...which means they are not partition keys. Without constraining your partition key(s) you are basically scanning your entire table, because clustering keys only order (cluster) data within their partition key.
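As a rough illustration of the "query table" idea above, here is a minimal sketch for the third query in the question. The table name t3_by_destdevicename, the column types, and the choice of clustering columns are all assumptions made for this example; the real schema is not shown in the question.

```cql
-- Hypothetical query table: one partition per destdevicename, so the lookup
-- below goes straight to the matching partitions instead of scanning the table.
-- Column types and clustering columns are assumed, not taken from the real schema.
CREATE TABLE IF NOT EXISTS project.t3_by_destdevicename (
    destdevicename      text,
    datetimeorigination bigint,
    origdevicename      text,
    duration            int,
    PRIMARY KEY ((destdevicename), datetimeorigination, origdevicename)
);

-- The partition key is constrained, so ALLOW FILTERING is no longer needed.
SELECT origdevicename, duration
FROM project.t3_by_destdevicename
WHERE destdevicename IN ('a', 'f', 'g')
LIMIT 10000;
```

The trade-off is that the same data is written to more than one table, once per query pattern. If only a few distinct destdevicename values exist, the partitions could grow very large, in which case an extra bucketing column in the partition key may be worth considering.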
To change the client timeout limit in Apache Cassandra, there are two techniques:
Technique 1: This is a good technique:
Technique 2: This is not a good technique, since you are changing the setting in the client program (cqlsh) itself. Note: If you have already changed the timeout using technique 1, it will override the time specified using technique 2, since profile settings have the highest priority.