I have been reading and learning about NoSQL and MongoDB, CouchDB, etc, for the last two days, but I still can't tell if this is the right kind of storage for me.
What worries me is eventual consistency. Does that type of consistency only kick in when using clusters? (I'm hosting my sites on a single dedicated server, so I don't know if I can benefit from NoSQL.) For which kinds of applications is it OK to have eventual consistency (instead of ACID), and for which ones isn't it? Can you give me some examples? What's the worst thing that can happen in an application for which it is OK to have eventual consistency?
Another thing I read is that MongoDB keeps a lot of things in memory. The docs say something about 32-bit systems having a 2 GB limit on data. Is that because of the RAM limitation of 32-bit systems?
- I have been reading and learning about NoSQL and MongoDB, CouchDB, etc, for the last two days, but I still can't tell if this is the right kind of storage for me.
NoSQL databases solve a set of problems that are hard(er) to solve with a traditional RDBMS. NoSQL can be the right storage for you if any of your problems are in that set.
- Does eventual consistency only kick in when using clusters?
Eventual consistency "kicks in" whenever a read can return a different (older) version of the data than the one that was just persisted. For example:
You persist the same piece of data to MORE THAN ONE location, say A and B. Depending on the configuration, a persist operation may return after persisting only to A (and not yet to B). If you then immediately read that data from B, it is not there yet. Eventually it will be, but unfortunately not at the moment you read it back. So it only matters once your data lives in more than one place.
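The A/B scenario above can be sketched as a toy model in Python. This is only an illustration and not how any particular database is implemented; the names `Replica`, `LazyStore`, `write` and `sync` are made up for the example.

```python
class Replica:
    """A single storage location holding key/value data."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class LazyStore:
    """Toy store: writes reach A immediately and reach B only when sync() runs."""
    def __init__(self):
        self.a = Replica("A")
        self.b = Replica("B")
        self.pending = []  # writes accepted by A but not yet applied to B

    def write(self, key, value):
        self.a.data[key] = value
        self.pending.append((key, value))
        # The call returns here, before B has seen the write.

    def sync(self):
        """Apply every pending write to B (the 'eventually' part)."""
        for key, value in self.pending:
            self.b.data[key] = value
        self.pending.clear()


store = LazyStore()
store.write("greeting", "hello")
stale = store.b.data.get("greeting")  # read from B before replication
store.sync()
fresh = store.b.data.get("greeting")  # read from B after replication
print(stale, fresh)
```

The first read from B misses the value and the second one finds it, which is the whole of "eventual" consistency in this model.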
- For which kinds of applications is it OK to have eventual consistency (instead of ACID), and for which ones isn't it?
NOT OK
=> You have a family bank account with $100 available. Now you and your spouse each try to buy something for $100 at the same time (at different stores). If the bank had implemented this with an "eventual consistency" model, across more than one node for example, your spouse could spend $100 a couple of milliseconds after you had already spent all of it. Not exactly a good day for the bank.
OK
=> You have 10000 followers on Twitter. You tweeted "Hey who wants to do some hacking tonight?". 100% consistency would mean that ALL those 10000 would receive your invitation at the same time. But nothing bad would really happen, if John saw your tweet 2 seconds after Mary did.
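To make the bank example concrete, here is a rough Python sketch of the double spend. It assumes each store talks to a different node and that the nodes have not exchanged updates yet; `Node` and `try_spend` are hypothetical names for the illustration, not a real banking or database API.

```python
class Node:
    """One copy of the account balance."""
    def __init__(self, balance):
        self.balance = balance

    def try_spend(self, amount):
        """Approve the purchase using only this node's local view."""
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False


# Eventually consistent: two nodes, each still believing $100 is available.
node_a, node_b = Node(100), Node(100)
you_ok = node_a.try_spend(100)
spouse_ok = node_b.try_spend(100)
total_spent_eventual = 100 * (you_ok + spouse_ok)

# Strongly consistent: both purchases go through one authoritative balance.
account = Node(100)
you_ok_strong = account.try_spend(100)
spouse_ok_strong = account.try_spend(100)
total_spent_strong = 100 * (you_ok_strong + spouse_ok_strong)

print(total_spent_eventual, total_spent_strong)
```

With two unsynced nodes both purchases are approved and $200 leaves a $100 account; with a single authoritative balance the second purchase is refused.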
- What's the worst thing that can happen in an application for which it is OK to have eventual consistency?
A huge latency between, e.g., the moment node A gets the data and the moment node B gets the same data [i.e., until they are in sync]. If the NoSQL solution is any good, that is the worst thing that can happen.
- Another thing that I read is that MongoDB keeps a lot of things in memory. In the docs it says something about 32-bit systems having a 2gb limit of data. Is that because of the ram limitation for 32-bit systems?
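It is about address space rather than installed RAM: MongoDB memory-maps its data files, and a 32-bit process can only address a few GB of virtual memory. From the MongoDB docs: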
"MongoDB is a server process that runs on Linux, Windows and OS X. It can be run both as a 32 or 64-bit application. We recommend running in 64-bit mode, since Mongo is limited to a total data size of about 2GB for all databases in 32-bit mode."
I can speak only for CouchDB, but there is no need to choose between eventual consistency and ACID; they are not in the same category.
CouchDB is fully ACID. A document update is atomic, consistent, isolated and durable (using CouchDB's recommended production setting of delayed_commits=false, your update is flushed to disk before the 201 success code is returned). What CouchDB does not provide is multi-item transactions (since these are very hard to scale when the items are stored on separate servers). The confusion between 'transaction' and 'ACID' is regrettable but excusable, given that typical RDBMSs usually support both.
Eventual consistency is about how database replicas converge on the same data set. Consider a master-slave setup in a traditional RDBMS. Some configurations of that relationship will use a distributed transaction mechanism, such that both master and slave are always in lock-step. However, it is common to relax this for performance reasons. The master can make transactions locally and then forward them lazily to the slave via a transaction journal. This is also 'eventual consistency': the two servers will converge on the same data set when the journal is fully drained. CouchDB goes further and removes the distinction between master and slaves. That is, CouchDB servers can be treated as equal peers, with changes made at any host being correctly replicated to the others.
The trick to eventual consistency is in how updates to the same item at different hosts are handled. In CouchDB, these separate updates are detected as 'conflicts' on the same item, and replication ensures that all of the conflicting updates are present at all hosts. CouchDB then chooses one of these to present as the current revision. This choice can be revised by deleting the conflicts one doesn't want to keep.
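The conflict handling described above can be modelled roughly like this in Python. It is a toy sketch, not CouchDB's API or its real winner-selection rule: `Revision`, `replicate`, `current_revision` and `resolve` are invented names, and "highest revision id wins" is just an assumption that gives every host the same deterministic answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    rev_id: str  # hypothetical revision label, e.g. "2-aaa"
    body: str


def replicate(host_x, host_y):
    """After replication both hosts hold every conflicting revision."""
    merged = host_x | host_y
    return set(merged), set(merged)


def current_revision(revisions):
    """Pick a winner deterministically so all hosts agree.
    Assumption for this sketch: the highest rev_id string wins."""
    return max(revisions, key=lambda r: r.rev_id)


def resolve(revisions, keep):
    """Application-level fix-up: delete every conflict except `keep`."""
    return {r for r in revisions if r == keep}


# The same document was updated independently on two hosts.
edit_x = Revision("2-aaa", "edited at X")
edit_y = Revision("2-bbb", "edited at Y")
host_x, host_y = replicate({edit_x}, {edit_y})

winner_x = current_revision(host_x)
winner_y = current_revision(host_y)

# The application decides it prefers the other edit and removes the rest.
host_x = resolve(host_x, keep=edit_x)
revised_winner = current_revision(host_x)
print(winner_x.body, revised_winner.body)
```

The point of the sketch is that no update is lost during replication: both hosts see both edits, agree on one to present, and the application can overrule that choice later.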
Brewer's CAP theorem is the best source for understanding which options are available to you. It all depends, but if we are talking about Mongo, it provides horizontal scalability out of the box, which is nice in some situations.
Now about consistency. Actually, you have three options for keeping your data up to date:
1) The first thing to consider is "safe" mode or "getLastError()", as indicated by Andreas. If you issue a "safe" write, you know that the database has received the insert and applied the write. However, MongoDB only flushes to disk every 60 seconds, so the server can fail without the data being on disk.
2) The second thing to consider is "journaling" (v1.8+). With journaling turned on, data is flushed to the journal every 100ms, so you have a smaller window of time before failure. The drivers have an "fsync" option (check that name) that goes one step further than "safe": it waits for acknowledgement that the data has been flushed to disk (i.e. the journal file). However, this only covers one server. What happens if the hard drive on the server just dies? Well, you need a second copy.
3) The third thing to consider is replication. The drivers support a "W" parameter that says "replicate this data to N nodes" before returning. If the write does not reach "N" nodes before a certain timeout, then the write fails (an exception is thrown). However, you have to configure "W" correctly based on the number of nodes in your replica set. Again, because a hard drive could fail, even with journaling, you'll want to look at replication. Then there's replication across data centers, which is too long to get into here.
The last thing to consider is your requirement to "roll back". From my understanding, MongoDB does not have this "roll back" capability. If you're doing a batch insert, the best you'll get is an indication of which elements failed.
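To illustrate the "W" idea from item 3, here is a small pure-Python model. It does not use a MongoDB driver; `ToyReplicaSet`, `WriteTimeout` and the `reachable` flags are assumptions made up for the sketch, and a node marked unreachable simply stands in for "did not acknowledge before the timeout".

```python
class WriteTimeout(Exception):
    """Raised when fewer than `w` nodes acknowledge the write."""


class ToyReplicaSet:
    def __init__(self, reachable):
        # One boolean per node; False means the node cannot acknowledge in time.
        self.reachable = reachable
        self.nodes = [dict() for _ in reachable]

    def write(self, key, value, w=1):
        """Apply the write to every reachable node and require `w` acknowledgements."""
        acks = 0
        for node, up in zip(self.nodes, self.reachable):
            if up:
                node[key] = value
                acks += 1
        if acks < w:
            # Nothing is rolled back in this toy model: nodes that took the write keep it.
            raise WriteTimeout(f"only {acks} of {w} required nodes acknowledged")
        return acks


rs = ToyReplicaSet([True, True, False])  # three nodes, one of them down
acks = rs.write("order", "paid", w=2)    # two reachable nodes are enough

try:
    rs.write("invoice", "sent", w=3)     # asks for more nodes than can answer
    failed = False
except WriteTimeout:
    failed = True

print(acks, failed)
```

Asking for more acknowledgements than the replica set can deliver makes every write fail, which is why "W" has to match the number of nodes you actually run.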
Anyhow, there are many scenarios where data consistency becomes the developer's responsibility, and it is up to you to be careful, cover all the scenarios, and adjust the DB schema, because there is no "this is the right way to do it" in Mongo like we are used to in RDBMSs.
About memory: this is purely a performance question. MongoDB keeps indexes and the "working set" in RAM. By limiting your RAM you limit your working set. You can actually use an SSD with a smaller amount of RAM rather than a huge amount of RAM with an HDD; at least those are the official recommendations. In any case, this is specific to each application, so you should run performance tests for your own use cases.