I have read a lot about different NoSQL databases, but I'm still undecided which solution would be the most appropriate in my situation - probably because I have no experience with NoSQL databases - except Redis, which I'm using for caching.
In my scenario I have to store time-based historic values (mixed float, integer, and boolean values, depending on the datapoint) for about 200,000 datapoints. Typically each datapoint will get a new value every 15 minutes or more often, totaling something in the range of a billion values per year. Some datapoints may even store a value every few seconds, up to a theoretical maximum of 31,536,000 values per year for a single datapoint (we don't need sub-second precision), but that will be rare.
The number of datapoints could easily double in the future, so it has to be a scalable system.
Reading
To visualize the data, it must be possible to quickly analyze the data of a single datapoint in any stored time range - be that a few hours or a whole year. In that range I need min/max/avg values in a specific resolution, typically the time range divided by about 1000.
In MySQL that would be done like this:
SELECT TRUNCATE(Timestamp / 3600, 0) AS slot, MIN(value), AVG(value), MAX(value)
FROM data
WHERE Timestamp >= [from]
AND Timestamp < [to]
AND DatapointID = [some ID]
GROUP BY 1
...which would return data for a chart with a precision of one hour.
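To make the bucketing concrete, here is a small Python sketch (the sample data is made up, purely illustrative) of what that query computes: timestamps are truncated into hour slots, and min/avg/max are taken per slot.

```python
from collections import defaultdict

# (timestamp, value) samples for one datapoint; made-up data
rows = [(3600, 1.0), (3700, 3.0), (3900, 2.0), (7300, 5.0), (7400, 7.0)]

slots = defaultdict(list)
for ts, value in rows:
    slots[ts // 3600].append(value)  # same idea as TRUNCATE(Timestamp / 3600, 0)

# one (min, avg, max) tuple per hour slot
result = {slot: (min(v), sum(v) / len(v), max(v)) for slot, v in slots.items()}
print(result)  # {1: (1.0, 2.0, 3.0), 2: (5.0, 6.0, 7.0)}
```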
How can this be done with NoSQL databases so that the information is available in just a few milliseconds?
This is my primary concern.
Writing
Data is added in near real-time, so writes arrive as a steady stream rather than in large batches. Updates of existing data must be possible, but will be extremely rare. Data will usually come in time-sorted, but I can't really guarantee that.
Other requisites
- It must be possible to quickly copy all (or alternatively portions) of the data to external development machines, which should not require hours. This must be possible without affecting the production system.
- It must be possible to backup all the data without stopping the database (daily backup).
- We have to import about half a billion existing values, which needs to complete in a reasonable time.
- The database must be accessible with PHP.
Which NoSQL database is best suited for these needs and how can the above data query be done with it?
Well, the first question that came to my mind: why does it have to be a NoSQL database? You explained how you would do it in MySQL, but did not say why you don't want to use it.
First off, you suggested the following query for MySQL:
SELECT TRUNCATE(Timestamp / 3600, 0) AS slot, MIN(value), AVG(value), MAX(value)
FROM data
WHERE Timestamp >= [from]
AND Timestamp < [to]
AND DatapointID = [some ID]
GROUP BY 1
I would recommend the following query:
SELECT TRUNCATE(Timestamp / 3600, 0) AS slot, MIN(value), AVG(value), MAX(value)
FROM data
WHERE DatapointID = [some ID]
AND Timestamp >= [from]
AND Timestamp < [to]
GROUP BY 1
This is because DatapointID should be indexed (ideally with a composite index on (DatapointID, Timestamp)). MySQL can then narrow the rows down by ID, scan only the matching time range, and do the math on that subset instead of searching all the records.
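A quick way to see the composite index at work is to look at the query plan. This sketch uses Python's built-in sqlite3 as a stand-in for MySQL (the syntax and planner differ, but the composite-index principle is the same); table and index names are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (DatapointID INTEGER, Timestamp INTEGER, value REAL)")
# Composite index: narrow by datapoint first, then scan only the time range
con.execute("CREATE INDEX idx_dp_ts ON data (DatapointID, Timestamp)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT MIN(value), AVG(value), MAX(value) FROM data "
    "WHERE DatapointID = ? AND Timestamp >= ? AND Timestamp < ?",
    (42, 0, 3600),
).fetchall()
# The plan detail should mention searching via idx_dp_ts rather than a full scan
print(plan[0][3])
```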
That said about MySQL, back to your question about NoSQL.
Last year I compared CouchDB and MongoDB against MySQL for my studies. Although my use case was totally different, I can share some of the results.
MySQL
- read: 8,500 records/s
- write: 400 records/s
MongoDB
- read: 17,500 records/s
- write: 13,000 records/s
CouchDB
- read & write: 300 records/s
So CouchDB was slow in my test. I just inserted simple records; I don't have the code any more.
A nice feature in MongoDB is that the auto-generated ids (ObjectIds) also contain the timestamp of when they were created. That is tricky to rely on, though, when you already have existing data to import.
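That embedded timestamp is simply the first four bytes of the ObjectId. A small Python sketch (no MongoDB driver needed; the sample ObjectId is just an illustrative hex string) of how to read it out:

```python
from datetime import datetime, timezone

def objectid_generation_time(oid_hex: str) -> datetime:
    """The first 4 bytes (8 hex chars) of a MongoDB ObjectId
    are the creation time in seconds since the Unix epoch."""
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Example ObjectId as hex string; 0x507f1f77 == 1350508407 seconds -> October 2012
print(objectid_generation_time("507f1f77bcf86cd799439011"))
```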
Also, as explained with the MySQL query, you need indexes to keep your speed high. MongoDB supports them.
- MongoDB has an equivalent of mysqldump (mongodump); I don't know how fast it is, so you will have to test that with your own data.
- Not sure about this one.
- Define "reasonable", but in my test MongoDB was 32.5 times faster at writing than MySQL.
- It has a PHP driver.
Hope the info helps. It also helps to just try some of the databases: often a day playing with them gives you more insight than a week of reading about them.
Edit after first comment
The SQL query would roughly look like the following in MongoDB.
I used the following docs:
- where docs
- find docs
- aggregation docs
- match docs
So with MongoDB the query has two parts: the where clause and the aggregation.
In SQL the where clause is Timestamp >= [from] AND Timestamp < [to] AND DatapointID = [some ID].
The where clause in MongoDB is actually easy to learn (I found it easy; see the where docs & find docs for more info).
The where part would look like this:
{
    DatapointID: ID,
    Timestamp: {$gte: from, $lt: till}
}
ID, from & till are values to be filled in! Note that both timestamp conditions must live in one object; repeating the Timestamp key would overwrite the first condition. The SQL upper bound is exclusive, hence $lt rather than $lte.
Then the hard part: the aggregation. But there are docs to make things a little easier; see the aggregation docs & match docs for where I got my info.
The group stage will look something like the following:
{
    _id: 1,
    minValue: {$min: "$value"},
    avgValue: {$avg: "$value"},
    maxValue: {$max: "$value"}
}
- note: _id: 1 puts everything into a single group; to get one result per hour slot (like the SQL GROUP BY) you would group on the truncated timestamp instead.
- "$value" refers to the value field of the record; field references inside the aggregation must be prefixed with $ and quoted.
Now we need to finish the query so you can execute it. In the end it will look like the following:
db.data.aggregate(
    { $match: {
        DatapointID: ID,
        Timestamp: {$gte: from, $lt: till}
    } },
    { $group: {
        _id: 1,
        minValue: {$min: "$value"},
        avgValue: {$avg: "$value"},
        maxValue: {$max: "$value"}
    } }
);
code is not tested
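Since the shell code is untested, here is a pure-Python emulation (no MongoDB needed; the sample documents and values are made up) of what $match followed by $group with _id: 1 computes:

```python
docs = [
    {"DatapointID": 1, "Timestamp": 100, "value": 4.0},
    {"DatapointID": 1, "Timestamp": 200, "value": 2.0},
    {"DatapointID": 1, "Timestamp": 900, "value": 6.0},  # outside [from, till)
    {"DatapointID": 2, "Timestamp": 150, "value": 9.0},  # different datapoint
]

ID, frm, till = 1, 0, 500

# $match: keep documents for one datapoint inside the half-open time range
matched = [d for d in docs if d["DatapointID"] == ID and frm <= d["Timestamp"] < till]

# $group with _id: 1 - a single bucket over all matched documents
values = [d["value"] for d in matched]
group = {
    "_id": 1,
    "minValue": min(values),
    "avgValue": sum(values) / len(values),
    "maxValue": max(values),
}
print(group)  # {'_id': 1, 'minValue': 2.0, 'avgValue': 3.0, 'maxValue': 4.0}
```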
Henriq has excellent points in his answer.
Your numbers are reasonable, IMHO, but they will require very careful design and tool selection. You did not describe some critical requirements, e.g. in the area of ACID; if you don't even need "the D" (durability), then you are pretty safe with almost anything available.
Considering what you told, I give some comments:
- Instead of dropping your existing work on MySQL, you could try some creative approaches, such as an in-memory database (IMDB) with page versioning, like IBM solidDB. It scales quite nicely, and backups can be very easy: with some design you can practically take a file copy to another system without breaking the DB (with full ACID).
- My experience with No/NewSQL is of course limited to the situations I have had hands on myself, but everything I have seen supports about the same conclusions:
- Cassandra is the most predictable in expert hands. MongoDB "out of the box" delivers a great promise, but getting it to scale up may be a pain. Aerospike is a commercial technology with strong functionality (it has an AlchemyDB background). VoltDB/HBase are promising but seem to have some scalability issues whose cause I'm not sure of (not investigated enough).
Maybe you should have a "table of requirements" in place to be able to decide which way to go? Trust me, there are huge differences between NoSQL implementations.
cheers, //Jari