NoSQL refers to non-relational data stores that break with the history of relational databases and ACID guarantees. Popular open source NoSQL data stores include:
- Cassandra (tabular, written in Java, used by Cisco, WebEx, Digg, Facebook, IBM, Mahalo, Rackspace, Reddit and Twitter)
- CouchDB (document, written in Erlang, used by BBC and Engine Yard)
- Dynomite (key-value, written in Erlang, used by Powerset)
- HBase (key-value, written in Java, used by Bing)
- Hypertable (tabular, written in C++, used by Baidu)
- Kai (key-value, written in Erlang)
- MemcacheDB (key-value, written in C, used by Reddit)
- MongoDB (document, written in C++, used by Electronic Arts, Github, NY Times and Sourceforge)
- Neo4j (graph, written in Java, used by some Swedish universities)
- Project Voldemort (key-value, written in Java, used by LinkedIn)
- Redis (key-value, written in C, used by Craigslist, Engine Yard and Github)
- Riak (key-value, written in Erlang, used by Comcast and Mochi Media)
- Ringo (key-value, written in Erlang, used by Nokia)
- Scalaris (key-value, written in Erlang, used by OnScale)
- Terrastore (document, written in Java)
- ThruDB (document, written in C++, used by JunkDepot.com)
- Tokyo Cabinet/Tokyo Tyrant (key-value, written in C, used by Mixi.jp, a Japanese social networking site)
I'd like to know about specific problems you - the SO reader - have solved using data stores and what NoSQL data store you used.
Questions:
- What scalability problems have you used NoSQL data stores to solve?
- What NoSQL data store did you use?
- What database did you use prior to switching to a NoSQL data store?
I'm looking for first-hand experiences, so please do not answer unless you have that.
We moved part of our data from MySQL to MongoDB, not so much for scalability as because it is a better fit for files and non-tabular data.
In production we currently store:
with a daily turnover of around 10GB.
The database is deployed in a "paired" configuration on two nodes (6x450GB SAS, RAID 10) with apache/wsgi/python clients using the MongoDB Python API (pymongo). The disk setup is probably overkill, but that's what we use for MySQL.
Apart from some issues with pymongo thread pools and the blocking nature of the MongoDB server, it has been a good experience.
I switched from MySQL (InnoDB) to Cassandra for an M2M system, which basically stores sensor time series for each device. Each data point is indexed by (device_id,date) and (device_id,type_of_sensor,date). The MySQL version contained 20 million rows.
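To make the two access paths concrete, here is a minimal sketch in plain Python. It is not the schema I actually ran, and it does not use any Cassandra or MySQL API: `SensorStore`, `insert`, `range_by_device` and `range_by_device_and_type` are hypothetical names chosen for illustration. It only shows the idea of grouping rows under a key prefix and keeping them sorted by date so that a date-range read is a contiguous slice.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime


class SensorStore:
    """In-memory stand-in for the two indexes (hypothetical, illustration only)."""

    def __init__(self):
        # device_id -> list of (date, sensor_type, value), kept sorted by date
        self.by_device = defaultdict(list)
        # (device_id, sensor_type) -> list of (date, value), kept sorted by date
        self.by_device_and_type = defaultdict(list)

    def insert(self, device_id, sensor_type, date, value):
        # Every reading is written once per index, as with a denormalized layout.
        insort(self.by_device[device_id], (date, sensor_type, value))
        insort(self.by_device_and_type[(device_id, sensor_type)], (date, value))

    def range_by_device(self, device_id, start, end):
        # Readings for one device with start <= date < end.
        rows = self.by_device.get(device_id, [])
        return rows[bisect_left(rows, (start,)):bisect_left(rows, (end,))]

    def range_by_device_and_type(self, device_id, sensor_type, start, end):
        # Readings for one device and one sensor type with start <= date < end.
        rows = self.by_device_and_type.get((device_id, sensor_type), [])
        return rows[bisect_left(rows, (start,)):bisect_left(rows, (end,))]
```

A call such as `store.range_by_device("dev1", datetime(2012, 1, 1), datetime(2012, 2, 1))` then returns that device's readings for January in date order. Writing each reading twice trades storage for cheap range reads on both keys.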
MySQL:
Cassandra:
Note: I have also used Elasticsearch (document-oriented, based on Lucene) and I think it should be considered a NoSQL database. It is distributed, reliable and often fast (though some complex queries can perform quite badly).