I have a web application running on Java stack (Struts 2 + Spring + Hibernate) and persisted in MySQL. I looked at NoSQL databases and they are certainly easy to reason about and work with than a RDBMS. It's a music streaming app which stores artist information and allows users to save playlists.
I am wondering whether there are any advantages (performance?, hardware cost?, simplified code?, scalability?) of switching to a NoSQL DB (CouchDB?, MongoDB?, Cassandra?). What would I lose/gain by switching to a NoSQL database?
Please advise.
The polite interpretation of "NoSQL" has become Not Only SQL
. If you have data that is indeed truly relational, or if your functionality depends on things like joins and ACIDity, then you should store that data in a relational way. In this post, I'll explain how I use MySQL alongside two NoSQL data stores. Modern, web-scale data storage is all about understanding how to pick the best tool(s) for the job(s).
That said, NoSQL is really a reaction to the fact that the relational method and way of thinking has been applied to problems where it's not actually a very good fit (typically huge tables with tens of millions of rows or more). Once tables get that large, the typical SQL "best practice" has been to manually shard the data -- that is, putting records 1 through 10,000,000 in table A, 10,000,001 through 20,000,001 in table B, and so on. Then, typically in the application model layer, the lookups are performed according to this scheme. This is what's called application-aware
scaling. It's time-intensive and error prone, but to scale something up while maintaining MySQL for the long table store, it's become a more or less standard MO. NoSQL represents, to me, the application-unaware
alternative.
Key-Value
When I had a MySQL prototype start getting too big for its own good, I personally moved as much data as possible to the lightning-fast Membase, which outperforms Memcached and adds persistence. Membase is a distributed key-value store that scales more or less linearly (Zynga uses it to handle a half-million ops per second, for instance) by adding more commodity servers into a cluster -- it's therefore a great fit for the cloud age of Amazon EC2, Joyent, etc.
It's well known that distributed key-value stores are the best way to get enormous, linear scale. The weakness of key-value is queryability and indexing. But even in the relational world, the best practice for scalability is to offload as much effort onto the application servers as possible, doing joins in memory on commodity app servers instead of asking the central RDB cluster to handle all of that logic. Since simple select
plus application logic
are really the best way to achieve massive scale even on MySQL, the transition to something like Membase (or its competitors like Riak) isn't really too bad.
Document Stores
Sometimes -- though I would argue less often than many think -- an application's design inherently requires secondary indices, range queryability, etc. The NoSQL approach to this is through a document store
like MongoDB. Like Membase, Mongo is very good in some areas where relational databases are particularly weak, like application-unaware
scaling, auto-sharding
, and maintaining flat response times even as dataset size balloons
. It's significantly slower than Membase and a bit trickier to do pure horizontal scale, but the benefit is that it's highly queryable. You can query on parameters and ranges in real time, or you can use Map/Reduce to perform complex batch operations on truly enormous data sets.
On the same project I mentioned above, which uses Membase to serve tons of live player data, we use MongoDB to store analytics/metrics data, which is really where MongoDB shines.
Why to keep things in SQL
I touched briefly on the fact that 'truly relational' information should stay in relational databases. As commenter Dan K. points out, I missed the part where I discuss the disadvantages of leaving RDBMS, or at least of leaving it entirely.
First, there's SQL itself. SQL is well-known and has been an industry standard for a long time. Some "NoSQL" databases like Google's App Engine Datastore (built on Big Table) implement their own SQL-like language (Google's is called, cutely, GQL for Google Query Language
). MongoDB takes a fresh approach to the querying problem with its delightful JSON query objects. Still, SQL itself is a powerful tool for getting information out of data, which is often the whole point of databases to begin with.
The most important reason to stay with RDBMS is ACID, or Atomicity, Consistency, Isolation, Durability
. I won't re-hash the state of Acid-NoSQL, as it's well-addressed in this post on SO. Suffice it to say, there's a rational reason Oracle's RDBMS has such a huge market that isn't going anywhere: some data needs pure ACID compliance. If your data does (and if it does, you're probably well aware of that fact), then so does your database. Keep that pH low!
Edit: Check out Aaronaught's post here. He represents the business-to-business perspective far better than I could, in part because I've spent my entire career in the consumer space.
I think that it very much depends on what you want to store in the database. I don't have experience with CouchDB or Cassandra so I will let someone else speak for them but I frequently use MongoDB and MySQL.
If you were developing application that required transactions e.g. a billing application you would definitely want to use MySQL because of its support for transactions. MySQL is ACIDic that is it is Atomic, Consistent, Isolated and Durable. What that essentially means is that when you update a row in MySQL - it is GUARANTEED to have happened. However the issue with MySQL is that it does not scale horizontally (by adding more and more servers) very easily. MySQL servers tend to be scaled vertically by adding more memory, HDD space etc but they eventually hit a ceiling and it can reach an enormous cost.
MongoDB is a document database. It stores JSON-like documents inside collections and is schema-less - so each document can be different. This is great for flexibility in your application. A lot of developers say that noSql solutions are developed more for programmers and they tend to be much easier to build with (in my experience). In addition MongoDB scales horizontally by sharding the database into chunks. In fact this can even be automated now.
But there are drawbacks to using MongoDB. If you are using it in production you really MUST put in a replication slave with it. This is because MongoDB does not have full single server durability. So if you suffer a power failure you will probably have to repair the entire MongoDB database which can take hours. This is probably not a big deal cost-wise if you are well-funded but if you are a new organisation with little money it can be difficult (use cloud computing? ). In addition MongoDB does not support transactions which is necessary to guarantee Atomicity and Isolation. Finally MongoDB is only eventually consistent (though I have seen a few sides to this argument) - which means that when a write happens all other processes are not GUARANTEED to see the information straight away - only eventually.
In my opinion if you are storing artist information and meta-data about tracks then MongoDB would be a good solution. If you were storing user data, billing data etc then store it in MySQL.
There is only one correct answer to the question as it is: change your current solution only if you are having performance troubles or you are expecting vast increase in traffic and have measured (Via stress tests) that your architecture doesn't fit.
Otherwise - there is no need to even evaluate alternatives.
For what it's worth, I like Aaronaught's answer to a very similar question asked here.
I've found that NoSQL databases are poor for prototyping, because you have to structure your data with the knowledge of how you'll get it out. With NoSQL the schema matches the needs of your queries. But in a prototype you don't yet know how you'll get the data out, and you'll find yourself either doing way too many queries or refactoring your schema every time you want to add a new feature to your prototype.
With a relational database, you just normalize your data and you can ask any question you want. You only need to refactor the schema if your model didn't match the real-world entities properly.
I've had to refactor my MongoDB database several times, once each time I added a new way to look at the data in the web app. Not surprisingly, I'm converging on a relational schema that takes little advantage of the nested arrays and objects possible with a document database.
If you look around you'll see that the most successful uses of NoSQL are for people who developed their app with a relational database, and now that they understand their features, can switch to NoSQL knowing exactly what to put into it to satisfy their queries. If you're still exploring your app and the kinds of questions you'll want to ask of your database, I recommend sticking to relational.
As several people liked the answer of Aaronaught but the corresponding question was deleted in the meantime, I copied his answer from a Stackoverflow archive:
The original name of this technology before people started calling it
"NoSQL" was a distributed key/value store. This is a far more
descriptive name, and I originally remember looking at it and going
"hey, cool, I'll bet that will end up being very useful to a lot of
people." The term has since expanded to essentially include "anything
that isn't a relational database", but usually, when most people talk
about NoSQL, they are talking about key/value stores.
Ever since the term NoSQL was coined, it's been getting touted as a
silver bullet. I'm interested in products like Cassandra and follow
their progress, but they are still immature technologies, and to claim
that they are "replacing" SQL or RDBMSes in general (or that they will
in the near future) is specious reasoning at best, if not an
out-and-out lie.
Products and technologies fitting under the NoSQL umbrella are geared
toward the following problem domain:
- You plan to deploy a large-scale, high-concurrency database (hundreds of GB, thousands of users);
- Which doesn't need ACID guarantees;
- Or relationships or constraints;
- Stores a fairly narrow set of data (the equivalent of 5-10 tables in SQL);
- Will be running on commodity hardware (i.e. Amazon EC2);
- Needs to be implemented on a very low budget and "scaled out."
This actually describes a lot of web sites today. Google and Twitter
fit very neatly into these requirements. Does it really matter if a
few tweets are lost or delayed? On the other hand, these specs apply
to nearly 0% of business systems, which is what a very high number of
us work on developing. Most businesses have very different
requirements:
- Medium-to-large-scale databases (10-100 GB) with fairly low concurrency (hundreds of users at most);
- ACID (especially the A and C - Atomicity and Consistency) is a hard requirement;
- Data is highly correlated (hierarchies, master-detail, histories);
- Has to store a wide assortment of data - hundreds or thousands of tables are not uncommon in a normalized schema (more for
denormalization tables, data warehouses, etc.);
- Run on high-end hardware;
- Lots of capital available (if your business has millions of customers then you can probably find $25k or so lying behind the
couch).
High-end SQL databases (SQL Server, Oracle, Teradata, Vertica, etc.)
are designed for vertical scaling, they like being on machines with
lots and lots of memory, fast I/O through SANs and SSDs, and the
occasional horizontal scaling through clustering (HA) and partitioning
(HC).
"NoSQL" is often compared favourably to "SQL" in performance terms.
But fully maxed-out, a high-end SQL database server or cluster will
scale almost infinitely. That is how they were intended to be
deployed. Beware of dubious benchmarks comparing poorly-normalized,
poorly-indexed SQL databases running mysql on entry-level servers (or
worse, cloud servers like Amazon EC2) to similarly-deployed NoSQL
databases. Apples and oranges. If you work with SQL, don't be scared
by that hype.
SQL isn't going anywhere. DBAs are no more likely to vanish as a
result of NoSQL than PHP programmers were as a result of Java and XML.
NoSQL isn't going anywhere either, because the development community
has correctly recognized that RDBMSes aren't always the optimal
solution to every problem.
So, as a developer you owe it to yourself to at least learn what NoSQL
is, what products it refers to (Cassandra, BigTable, Voldemort, db4o,
etc.), and how to build and code against a simple database created
with one of these. But don't start throwing away all your SQL
databases yet or thinking that your career is going to be made
obsolete - that's hype, not reality.