可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have configured three separate data directories in cassandra.yaml file as given below:

data_file_directories:
    - E:/Cassandra/data/var/lib/cassandra/data
    - K:/Cassandra/data/var/lib/cassandra/data

when I create keyspace and insert data my key space got created in both two directories and data got scattered. what I want to know is how cassandra splits the data between multiple directories?. And what is the rule behind this?

回答1:

You are using the JBOD feature of Cassandra when you add multiple entries under data_file_directories. Data is spread evenly over the configured drives proportionate to their available space.

This also let's you take advantage of the disk_failure_policy setting. You can read about the details here: http://www.datastax.com/dev/blog/handling-disk-failures-in-cassandra-1-2

In short, you can configure Cassandra to keep going, doing what it can if the disk becomes full or fails completely. This has advantages over RAID0 (where you would effectively have the same capacity as JBOD) in that you do not have to replace the whole data set from backup (or full repair) but just run a repair for the missing data. On the other hand, RAID0 provides higher throughput (depending how well you know how to tune RAID arrays to match filesystem and drive geometry).

If you have the resources for fault-tolerant/more performant RAID setup (like RAID10 for example), you may want to just use a single directory for simplicity. Most deployments are starting to lean towards the density route, using JBOD rather than systems-level tolerance though.

You can read about the thought process behind the development of this issue here: https://issues.apache.org/jira/browse/CASSANDRA-4292

回答2:

Some what I am able to guess how the keyspace is split between multiple data directories. Based on the maximum available space and load on directories, SSTables of same column family written to the different data directories..