How NoSQL databases perform on aggregate functions

Posted 2019-07-18 18:31

Question:

We periodically need to work with a fairly large dataset (30-40 GB). It holds many values ordered by time (plus some additional information), but we basically need to perform some mathematical operations by month.

Our first approach was to store the data in a MySQL database, since we have reasonable experience with the engine and with the relational approach. However, the process takes too long, and we are wondering whether a NoSQL approach could do better.

Basically, the data we need to represent is:

Value: { NumericalValue, Year, Month }
Entity: List of 'Value'
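
A minimal Python sketch of that shape, for illustration only. The class and field names are hypothetical, chosen to mirror the description above, and the sample numbers are made up:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Value:
    # One measurement tagged with the month it belongs to.
    numerical_value: float
    year: int
    month: int


@dataclass
class Entity:
    # An entity is simply a list of Value records.
    values: List[Value]


# Illustrative data, not taken from the real dataset.
entity = Entity(values=[
    Value(10.0, 2019, 1),
    Value(14.0, 2019, 1),
    Value(7.5, 2019, 2),
])
```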

We process this list three times, doing simple mathematical operations. By 'process' I mean iterating through the dataset and performing the calculations. When everything is done, we have the same structure (but with different data):

Value: { NumericalValue, Year, Month }
Entity: List of 'Value'

This is where we hit the biggest problems: we need to calculate some averages, and that takes a long time. Since we repeat this process several times, I think the most time-consuming tasks are:

1) Loading the dataset into MySQL, which means a lot of inserts from text files.

And when the data is transformed:

2) Running some queries that contain aggregate functions (AVG, SUM) with LIMIT.

3) Running some queries that contain aggregate functions over the whole dataset.
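
To make the two kinds of query concrete, here is a small self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL purely so that it runs without a server. The table name `value_rows`, its column names, and the sample rows are all assumptions, not the real schema:

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE value_rows (numerical_value REAL, year INTEGER, month INTEGER)"
)
rows = [(10.0, 2019, 1), (14.0, 2019, 1), (6.0, 2019, 2), (2.0, 2019, 2)]
conn.executemany("INSERT INTO value_rows VALUES (?, ?, ?)", rows)

# Aggregate over the whole dataset: one result row per (year, month).
monthly = conn.execute(
    "SELECT year, month, AVG(numerical_value), SUM(numerical_value) "
    "FROM value_rows GROUP BY year, month ORDER BY year, month"
).fetchall()

# Aggregate over a limited slice: the LIMIT sits in a subquery, because a
# LIMIT on the outer query would cap the result rows, not the input rows.
first_three_avg = conn.execute(
    "SELECT AVG(numerical_value) FROM "
    "(SELECT numerical_value FROM value_rows "
    " ORDER BY year, month, numerical_value LIMIT 3)"
).fetchone()[0]
```

Both statements use only standard SQL aggregates, so the same shape should carry over to MySQL, although how fast it runs there depends on the indexes and the storage engine.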

Even with some indexes added, things usually take too long (some queries take 20 minutes). Any tips or resolution strategies would be much appreciated. I suspect that NoSQL databases aren't designed specifically for this, but maybe someone's experience could help :).

Thanks for your time,

Answer 1:

Your task is a very good fit for columnar databases. Column-oriented NoSQL databases (Cassandra is the example usually named, although strictly it is a wide-column store) store data tables as sections of columns rather than as rows of data, so an aggregate over one field only has to read that field's values. This speeds up aggregations drastically. The benefit applies mainly to systems that rely on hard disks for storage. If that is not the case (in-memory databases, for example), there are many more options for squeezing out performance.
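
A toy Python sketch of the difference between the two layouts. It is not how any particular database is implemented; the variable names, the helper `monthly_average`, and the sample data are hypothetical:

```python
from array import array
from collections import defaultdict

# Row-oriented layout: each record keeps all of its fields together.
rows = [(10.0, 2019, 1), (14.0, 2019, 1), (6.0, 2019, 2), (2.0, 2019, 2)]

# Column-oriented layout: one contiguous array per field.
values = array("d", (r[0] for r in rows))
years = array("i", (r[1] for r in rows))
months = array("i", (r[2] for r in rows))

# A whole-dataset aggregate touches a single column and nothing else.
total = sum(values)
average = total / len(values)


def monthly_average(values, years, months):
    """Average per (year, month), reading only the three columns involved."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, y, m in zip(values, years, months):
        sums[(y, m)] += v
        counts[(y, m)] += 1
    return {key: sums[key] / counts[key] for key in sums}


per_month = monthly_average(values, years, months)
```

On disk the same idea means that a SUM or AVG scans one compact run of numbers instead of pulling every full record through the I/O path, which is where the speedup described above comes from.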



Tags: nosql redis