I set up a Cassandra cluster on AWS. I expected I/O throughput (reads/writes per second) to increase as more nodes are added, as advertised. However, I got exactly the opposite: performance drops as new nodes are added.
Do you know of any typical issues that prevent it from scaling?
Here are some details:
I am inserting the contents of a text file (15 MB) into a column family. Each line is one record, and there are 150,000 records. With 1 node, the write takes about 90 seconds; with 2 nodes, it takes 120 seconds. I can see that the data is spread across the 2 nodes, yet throughput does not increase.
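To put those timings in terms of writes per second, here is a small sketch of the arithmetic. The class and method names are hypothetical and only for illustration; the record count and durations are the ones quoted above.

```java
class ThroughputCheck {
    // Records written per second for a run that took elapsedSeconds.
    static double writesPerSecond(long records, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return records / elapsedSeconds;
    }

    public static void main(String[] args) {
        long records = 150_000;                           // lines in the input file
        double oneNode = writesPerSecond(records, 90);    // roughly 1667 writes/s
        double twoNodes = writesPerSecond(records, 120);  // 1250 writes/s
        System.out.printf("1 node:  %.0f writes/s%n", oneNode);
        System.out.printf("2 nodes: %.0f writes/s%n", twoNodes);
        // A ratio below 1 means the 2-node run was slower than the 1-node run.
        System.out.printf("2-node / 1-node ratio: %.2f%n", twoNodes / oneNode);
    }
}
```

So the 2-node run reaches about three quarters of the single-node write rate.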
The source code is below:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class WordGenCAS {
    static final String KEYSPACE = "text_ks";
    static final String COLUMN_FAMILY = "text_table";
    static final String COLUMN_NAME = "text_col";

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Usage: WordGenCAS <input file> <host1,host2,...>");
            System.exit(-1);
        }

        // Connect through the comma-separated contact points given on the command line
        String[] contactPts = args[1].split(",");
        Cluster cluster = Cluster.builder()
                .addContactPoints(contactPts)
                .build();
        Session session = cluster.connect(KEYSPACE);

        int lineCount = 0;
        // try-with-resources closes the reader and the underlying file stream
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Replace single quotes so the text cannot break the CQL string literal
                line = line.replaceAll("'", " ");
                line = line.trim();
                if (line.isEmpty())
                    continue;
                System.out.println("[" + line + "]");
                // One synchronous INSERT per line; the running line count is the row key
                String cqlStatement2 = String.format("insert into %s (id, %s) values (%d, '%s');",
                        COLUMN_FAMILY,
                        COLUMN_NAME,
                        lineCount,
                        line);
                session.execute(cqlStatement2);
                lineCount++;
            }
        } finally {
            // Release driver resources (Cluster.close() as of driver 2.0)
            cluster.close();
        }
        System.out.println("Total lines written: " + lineCount);
    }
}
The DB schema is the following:
CREATE KEYSPACE text_ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
USE text_ks;
CREATE TABLE text_table (
    id int,
    text_col text,
    PRIMARY KEY (id)
) WITH COMPACT STORAGE;
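For reference, here is a rough sketch of how many row copies each node would be expected to store under this schema. It assumes that SimpleStrategy places min(replication_factor, nodes) copies of each row and that rows are spread evenly across the ring; both are assumptions on my part, and the class and method names are hypothetical.

```java
class ReplicaLoad {
    // Copies of each row that can actually be placed: capped by the node count.
    static int effectiveReplicas(int replicationFactor, int nodes) {
        return Math.min(replicationFactor, nodes);
    }

    // Row copies stored per node, assuming an even spread across the ring.
    static long rowsPerNode(long records, int replicationFactor, int nodes) {
        if (nodes <= 0) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        return records * effectiveReplicas(replicationFactor, nodes) / nodes;
    }

    public static void main(String[] args) {
        long records = 150_000;    // rows inserted by WordGenCAS
        int replicationFactor = 2; // from the CREATE KEYSPACE statement
        for (int nodes = 1; nodes <= 4; nodes++) {
            System.out.println(nodes + " node(s): "
                    + rowsPerNode(records, replicationFactor, nodes) + " row copies per node");
        }
    }
}
```

Under these assumptions, going from 1 node to 2 nodes with a replication factor of 2 leaves each node storing all 150,000 rows, so the per-node write load would only start to drop from the third node onward.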
Thanks!