We are using NoSQL (Cassandra) in our project. We have a Table A (5000 records) which is a master table. We have another Table B (2000 records). Table B have 4 columns and Table A have 25 columns. We exposed a REST service to get all records from B; like /service/getB. This service will return 6 columns in response as –
{
"result": [
{
"col1FromB": "1B",
"col2FromB": "2B",
"col3FromB": "3B",
"col4FromB": "4B",
"col1FromA": "1A",
"col2FromA": "2A"
},
{
"col1FromB": "11B",
"col2FromB": "12B",
"col3FromB": "13B",
"col4FromB": "14B",
"col1FromA": "11A",
"col2FromA": "12A"
}
]
}
So, there is a look up query to Table A for each item in Table B. This is how I am doing it –
//Get all from Table B (took 90 ms in Local and 30 ms in Test)
Select select = QueryBuilder.select().from("B");
List<B> bList = cassandraOperations.select(select, B.class);
//Loop through bList and do a lookup using id in Table A (took 46000 ms (46 sec) in Local (horrible) and 6000 ms (6 sec) in Test)
For(B b: bList) {
Select select = QueryBuilder.select(“col1FromA”, “col2FromA”).from("A");
select.where(QueryBuilder.eq(“id”, b.getId()));
A a = cassandraOperations.selectOne(select, A.class);
----
----
//Prepare final Pojo with a and b objects and add into a List<finalPjo> and return
}
So, the lookup time is very high in Local environment and also quite not good in Test environment. All I am using is Java collections only.
Is there any way to make it better so that we get records in lesser time.
For(B b: bList) {
Select select = QueryBuilder.select(“col1FromA”, “col2FromA”).from("A");
select.where(QueryBuilder.eq(“id”, b.getId()));
A a = cassandraOperations.selectOne(select, A.class);
This code performs blocking request cassandraOperations.selectOne
in each iteration, it means that each next iteration have to wait the previous one. All 2000 requests will be executed one by one and for a long time.
To avoid this, use asynchronous way to get records in the loop (as I see, you are using Spring and selectOne can be replaced by selectOneAsynchronously which returns ResultSetFuture, save these futures in some list, and use it to retrieve records when all requests are sent).
Cassandra distributes it's data across its nodes based on partition keys. It guarantees that all rows within a partition (a set of rows with the same partition key) are on the same node making SELECT statements for a full or partial partition blazingly fast.
If you have a query that pulls down multiple partitions then each partition may be located on different nodes causing network traffic during the select which will kill your performance. By adding a second table you are compounding the problem.
In Cassandra you should look at your queries and then if possible create one table per query. The Cassandra data model lends towards performance when you embrace data duplication and avoid joins.
So I would create a new table that pre-joins the data for the query into a table called C. When you write your data in A you would write it both to A and C and when you write data to B you will write it to both B and C. If possible you want to put data that will be queried together in the same partition. If you are pulling down the full data set every time that endpoint is called you may want to think of using a single partition key for all pieces of data in the table (since you have a relatively small amount of data) which will guarantee that when you read the table the entire read will be off of a single node.
I imagine you are seeing good performance on your local machine because your query isn't hitting the network.