Performance degradation with Datastax Cassandra wh

Posted 2019-07-24 23:03

Question:

I have the following table with five map-type collections. Each collection holds at most 12 elements, and each element is at most 50 bytes.

CREATE TABLE persons (
  treeid int,
  personid bigint,
  birthdate text,
  birthplace text,
  clientnote text,
  clientnoteisprivate boolean,
  confidence int,
  connections map<int, bigint>,
  createddate timestamp,
  deathdate text,
  deathplace text,
  familyrelations map<text, text>,
  flags int,
  gender text,
  givenname text,
  identifiers map<int, text>,
  issues int,
  media map<uuid, int>,
  mergedpersonas map<int, bigint>,
  note text,
  primaryphotoid uuid,
  quality int,
  suffix text,
  surname text,
  userid uuid,
  vitalstatus int,
  PRIMARY KEY (treeid,personid)
)

Here my partition key is treeid and the unique key is personid. I am inserting records into this table from a .NET application using the DataStax .NET driver. I have about 200K records to insert, and performance degrades (latency goes beyond 200 ms/op) as the number of inserted records grows. In OpsCenter I see that the ParNew garbage collection time also increases (beyond 20 ms) as the records are inserted.

If I change the table to use a different primary key, PRIMARY KEY (personid, treeid), and run the same inserts, performance is much better (below 1 ms/op) and the garbage collection time stays well under 1 ms.

Why does the partition key make such a difference in performance here? I have other tables with (treeid, personid) as the primary key but without any map data types, and they perform very well. I want to query by treeid; how should I do that? Should I move the maps into separate tables, or create a secondary index on the treeid column? Which is more efficient for reading?

I am using DataStax Enterprise 4.0.1 (Cassandra 2.0.5) on a three-node CentOS 6.4 cluster with a replication factor of 3.

Answer 1:

The first column mentioned in the primary key is the partition key. Any additional columns in the primary key are the clustering columns. All rows that share a given partition key are stored as a single Cassandra partition (guaranteed to be together on a single node), which is what used to be known as a "wide row". So each treeid refers to a single partition, with each personid being a row within that partition.
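To make the partition/clustering split concrete, here is a minimal sketch. The table name `persons_by_tree` and the sample values are made up for illustration; only the key layout mirrors the table in the question.

```cql
-- Illustrative only: a trimmed-down table with the same key layout as the question.
CREATE TABLE persons_by_tree (
  treeid int,
  personid bigint,
  givenname text,
  PRIMARY KEY (treeid, personid)  -- treeid: partition key, personid: clustering column
);

-- Both rows share treeid = 1, so they are stored in the same partition.
INSERT INTO persons_by_tree (treeid, personid, givenname) VALUES (1, 1001, 'Ann');
INSERT INTO persons_by_tree (treeid, personid, givenname) VALUES (1, 1002, 'Bob');

-- Reads one partition; rows come back ordered by the clustering column personid.
SELECT personid, givenname FROM persons_by_tree WHERE treeid = 1;
```

With this layout, every person added to a tree makes that tree's partition larger, which is why the number of distinct treeid values matters.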

How many treeids do you have? If you have a small number of tree ids and a large number of persons, that results in a very small number of Cassandra partitions, each with a large number of rows (a traditional wide row).

Feel free to elaborate on what you are trying to do with this tree id, but superficially it sounds as if person id is a better choice for partition key.

OR... maybe you really want a "composite partition key":

PRIMARY KEY ((treeid,personid))

That way the combination of tree id and person id is used to distinguish partitions.
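A short sketch of what the composite partition key implies for queries. The table name `persons_by_tree_and_person` is hypothetical, and the behavior described in the comments is my understanding of Cassandra 2.0, so verify it against your version.

```cql
-- Illustrative only: each (treeid, personid) pair is its own partition.
CREATE TABLE persons_by_tree_and_person (
  treeid int,
  personid bigint,
  givenname text,
  PRIMARY KEY ((treeid, personid))  -- double parentheses: composite partition key
);

-- Works: the full partition key is supplied.
SELECT givenname FROM persons_by_tree_and_person
WHERE treeid = 1 AND personid = 1001;

-- Expected to be rejected: only part of the partition key is restricted,
-- so Cassandra cannot locate the partition.
-- SELECT givenname FROM persons_by_tree_and_person WHERE treeid = 1;
```

The trade-off is that writes spread evenly across the cluster, but you can no longer fetch all persons of a tree with a single-partition query on treeid alone.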

It depends on how you really want to organize your data.



Answer 2:

I ended up changing the data model, and performance is much better now. I created two tables instead:

create table personids (treeid int, personid bigint, PRIMARY KEY(treeid,personid))

create table persons (........ PRIMARY KEY(personid))

So I insert into both tables. When querying, I first fetch the personids for a tree from the personids table and then query the persons table with those personids, using the IN operator.
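The two-step pattern might look roughly like this. The column list for persons is elided above, so `givenname` and `surname` are assumed to be among the retained columns, and the literal values are placeholders.

```cql
-- Write path: one insert per table for each person.
INSERT INTO personids (treeid, personid) VALUES (1, 1001);
INSERT INTO persons (personid, givenname, surname) VALUES (1001, 'Ann', 'Lee');

-- Read path, step 1: list the person ids in a tree (single, narrow partition).
SELECT personid FROM personids WHERE treeid = 1;

-- Read path, step 2: fetch the person rows by id.
-- Each id is a separate partition, so the coordinator fans out to several nodes.
SELECT personid, givenname, surname FROM persons WHERE personid IN (1001, 1002);
```

Because every id in the IN list hits a different partition, a very long list puts a lot of work on one coordinator node. Splitting the ids into small groups, or issuing individual queries in parallel from the driver, may behave better for large trees.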