I have the following table with five map type collections. The max number of elements in the collection is 12 and the maximum size of the item is 50 Bytes.
#
CREATE TABLE persons (
treeid int,
personid bigint,
birthdate text,
birthplace text,
clientnote text,
clientnoteisprivate boolean,
confidence int,
connections map<int, bigint>,
createddate timestamp,
deathdate text,
deathplace text,
familyrelations map<text, text>,
flags int,
gender text,
givenname text,
identifiers map<int, text>,
issues int,
media map<uuid, int>,
mergedpersonas map<int, bigint>,
note text,
primaryphotoid uuid,
quality int,
suffix text,
surname text,
userid uuid,
vitalstatus int,
PRIMARY KEY (treed,personid)
)
Here my partition key is the treeid and the unique key is the personid. i am trying to insert records into this table from a .net application using the datastax .net driver. I have about 200K records to insert and the performance degrades (goes beyond 200ms/op) as the number of inserted records increases. From the Opscenter i see that the Par New garbage collection time increases (goes beyond 20 ms)as the records are being inserted.
#
Now if i change the table with the different primary key like PRIMARY KEY (personid, treeid) and do the same inserts, the performance is much better (below 1 ms/op) and the garbage collection time is well under 1 ms
#
Well, why is the partition key makes difference in the performance here? I have other tables with the treeid,personid as the primary key but with out any map data types and they perform very well. I want to perform the query based on the treeid and how should i be able to do it? should i take the maps into different tables? or create a secondary index on the treeid column? whats more efficient for reading?
I am using datastax enterprise 4.0.1. cassandra 2.0.5. I have a three node cluster on centos 6.4 and replication factor of 3