-->

Cassandra CQL where clause with multiple collectio

2019-07-09 05:38发布

问题:

My data model:-

tid                                  | codes        | raw          | type
-------------------------------------+--------------+--------------+------
a64fdd60-1bc4-11e5-9b30-3dca08b6a366 | {12, 34, 53} | {sdafb=safd} |  cmd

CREATE TABLE MyTable (
tid       TIMEUUID,
type      TEXT,
codes     SET<INT>,
raw       TEXT,
PRIMARY KEY (tid)
);
CREATE INDEX ON myTable (codes);

How to query the table to return rows based on multiple set values.

This works:-

select * from logData where codes contains 34;

But i want to get row based on multiple set values and none of this works:-

select * from logData where codes contains 34, 12; or 
select * from logData where codes contains 34 and 12; or
select * from logData where codes contains {34, 12};

Kindly assit.

回答1:

If I create your table structure and insert a similar row to yours above, I can check for multiple values in the codes collection like this:

aploetz@cqlsh:stackoverflow2> SELECT * FROM mytable 
    WHERE codes CONTAINS 34 
      AND codes CONTAINS 12
      ALLOW FILTERING;

 tid                                  | codes        | raw          | type
--------------------------------------+--------------+--------------+------
 2569f270-1c06-11e5-92f0-21b264d4c94d | {12, 34, 53} | {sdafb=safd} |  cmd

(1 rows)

Now as others have mentioned, let me also tell you why this is a terrible idea...

With a secondary index on the collection (and with the cardinality appearing to be fairly high) every node will have to be checked for each query. The idea with Cassandra, is to query by partition key as often as possible, that way you only have to hit one node per query. Apple's Richard Low wrote a great article called The sweet spot for Cassandra secondary indexes. It should make you re-think the way you use secondary indexes.

Secondly, the only way I could get Cassandra to accept this query, was to use ALLOW FILTERING. What this means, is that the only way Cassandra can apply all of your fitlering criteria (WHERE clause) is to pull back every row and individually filter-out the rows that do not meet your criteria. Horribly inefficient. To be clear, the ALLOW FILTERING directive is something that you should never use.

In any case, if codes are something that you will need to query by, then you should design an additional query table with codes as a part of the PRIMARY KEY.



回答2:

The data model you are using is highly inefficient. Sets are meant to be used to get a set of data for a given primary key and not the other way round. If that is what is needed, you will have to rethink the model itself.

I would suggest creating different columns for each value you are using in a set and then using those columns as a composite primary key.



回答3:

Are you really looking to get ALL log entries based on just codes? That could be quite a large dataset. Realistically, wouldn't you be looking at specific dates / date ranges? I'd key on that, and then use codes for filtering, or even filter on codes entirely on the client side.

If you have many codes, and you index on the sets, it might result in very high cardinality of the index, which would cause you issues. Whether you have your own lookup table, or use an index, remember that you essentially have a "table" where the pk is the value, and there are rows for that value for every "row" that matches the value. If that looks unacceptably large, then that's exactly what it is.

I'd recommend revisiting the requirement - again...do you really need all log entries EVER that match a certain code combination?

If you really do need to analyse the whole lot, then I'd recommend using Spark to run the job. You could then run a Spark job, and each node would deal with data on the same node; this will significantly reduce the impact compared to doing full table processing entirely in the application.



回答4:

I know it's late. IMO the model with few minor changes would be sufficient to achieve what is expected. What one can do is to have as many rows as members of the power set of the set being queried.

CREATE TABLE data_points_ks.mytable (
    codes frozen<set<int>>,
    tid timeuuid,
    raw text,
    type text,
    PRIMARY KEY (codes, tid)
) WITH CLUSTERING ORDER BY (tid ASC)

INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {34, 53}, '{sdafb=safd}', 'cmd');
INSERT INTO mytable (tid, codes, raw, type) VALUES (now(), {12, 34, 53}, '{sdafb=safd}', 'cmd');

 tid                                  | codes        | raw          | type
--------------------------------------+--------------+--------------+------
 8ae81763-1142-11e8-846c-cd9226c29754 |     {34, 53} | {sdafb=safd} |  cmd
 8746adb3-1142-11e8-846c-cd9226c29754 |     {12, 53} | {sdafb=safd} |  cmd
 fea77062-1142-11e8-846c-cd9226c29754 |         {34} | {sdafb=safd} |  cmd
 70ebb790-1142-11e8-846c-cd9226c29754 |     {12, 34} | {sdafb=safd} |  cmd
 6c39c843-1142-11e8-846c-cd9226c29754 |         {12} | {sdafb=safd} |  cmd
 65a954f3-1142-11e8-846c-cd9226c29754 |         null | {sdafb=safd} |  cmd
 03c60433-1143-11e8-846c-cd9226c29754 |         {53} | {sdafb=safd} |  cmd
 82f68d70-1142-11e8-846c-cd9226c29754 | {12, 34, 53} | {sdafb=safd} |  cmd

Then the following queries are sufficient and do not need any filtering.

SELECT * FROM mytable 
WHERE codes = {12, 34};

OR

SELECT * FROM mytable 
WHERE codes = {34};