I'm looking for a key value store that can be used from an EC2 instance.
- item is just an unstructured string, no indexing required
- item size up to ~5MB but usually below 10kB
- lots of writes
- reads don't need to be fast; memcached can be put in front to cache frequently needed reads (see the read-through sketch after this list)
- data is too big to fit into memory
- Eventual Consistency is fine
- daemon that can be accessed from multiple machines is required
Ideally something AWS hosted would be perfect but:
- S3 doesn't fit because of too many writes
- SimpleDB/DynamoDB don't fit because of their item size limits, and indexing is not required anyway
As there are a lot of key-value stores on the market, it's hard to choose the best one. Which one would you recommend?
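For context, the read path I have in mind looks roughly like this (a minimal sketch using pymemcache; `backing_store` is a stand-in for whatever key-value store gets chosen, and its `get`/`put` methods are hypothetical):

```python
# Cache-aside reads: check memcached first, fall back to the backing store.
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))  # local memcached instance (assumed)

def read(key, backing_store):
    value = cache.get(key)
    if value is None:
        value = backing_store.get(key)          # hypothetical store API
        if value is not None:
            cache.set(key, value, expire=300)   # cache for 5 minutes
    return value

def write(key, value, backing_store):
    backing_store.put(key, value)               # hypothetical store API
    cache.delete(key)                           # invalidate stale cache entry
```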
I found the perfect solution for my use case: memcachedb
It doesn't do fancy documents or indexing; it's just a simple key-value store.
I haven't done any performance testing yet, though.
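For illustration, here's a rough sketch of talking to memcachedb with an ordinary memcached client (pymemcache here; 21201 is memcachedb's usual default port and the hostname is a placeholder, so adjust both for your setup):

```python
# memcachedb speaks the memcached protocol, so any memcached client works.
from pymemcache.client.base import Client

db = Client(("memcachedb.internal", 21201))  # placeholder host, default port

db.set("user:42", b"arbitrary unstructured payload")
print(db.get("user:42"))
```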
Edit:
We dropped memcachedb due to problems with replication. Instead we now run MongoDB. MongoDB requires much more disk space, and more resources in general, but the replica sets work very reliably and are easy to set up.
Maybe you should try mongodb:
http://www.mongodb.org/display/DOCS/Amazon+EC2
Quickstart:
http://www.mongodb.org/display/DOCS/Amazon+EC2+Quickstart
Free courses at 10gen and video presentations:
http://www.10gen.com/presentations/nyc-meetup-group/mongodb-and-ec2-a-love-story
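If it helps, here is a minimal sketch of treating MongoDB as a plain key/value store with the pymongo driver (hostnames, replica set name, and database/collection names are placeholders; MongoDB's 16 MB document limit comfortably covers ~5 MB items):

```python
# One collection, _id as the key, the blob in a single "value" field.
from pymongo import MongoClient
from bson.binary import Binary

client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
kv = client["mydb"]["kv"]

def put(key, data):
    kv.replace_one({"_id": key}, {"_id": key, "value": Binary(data)}, upsert=True)

def get(key):
    doc = kv.find_one({"_id": key})
    return doc["value"] if doc else None

put("user:42", b"up to ~5 MB of unstructured bytes")
print(get("user:42"))
```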
Other key-value stores:
http://google-opensource.blogspot.com/2011/07/leveldb-fast-persistent-key-value-store.html
Comments about Riak and its storage backends, especially Bitcask and Innostore:
http://basho.com/blog/technical/2011/07/01/Leveling-the-Field/
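Note that LevelDB is an embedded library rather than a networked daemon; a minimal sketch of its key/value API via the plyvel Python binding (path is arbitrary) looks like this:

```python
# LevelDB as an in-process, on-disk key/value store.
import plyvel

db = plyvel.DB("/tmp/leveldb-kv", create_if_missing=True)
db.put(b"user:42", b"unstructured payload")
print(db.get(b"user:42"))
db.close()
```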
- RaptorDB: An extremely small and fast embedded, NoSQL, persisted dictionary database using B+tree or MurMur hash indexing. It was primarily designed to store JSON data (see my fastJSON implementation), but can store any type of data that you give it.
- HamsterDB: A delightful engine written in C++, which impressed me a lot for its speed while I was using Aaron Watters' code for indexing. (RaptorDB eats it alive now... ahem!) It's quite large at 600KB for the 64-bit edition.
- Esent PersistentDictionary: A project on CodePlex which is part of another project that implements a managed wrapper over the built-in Windows ESENT data storage engine. The dictionary performance goes down exponentially after 40,000 items indexed, and the index file just grows on GUID keys. Apparently, after talks with the project owners, it's a known issue at the moment.
- Tokyo/Kyoto Cabinet: A C++ implementation of a key store which is very fast. Tokyo Cabinet is a B+tree indexer, while Kyoto Cabinet is a MurMur2 hash indexer.
- 4aTech Dictionary: This is another article on CodeProject which does the same thing; the commercial version at the web site is huge (450KB) and fails dismally performance-wise on GUID keys after 50,000 items indexed.
- BerkeleyDB: The granddaddy of all databases, which is owned by Oracle and comes in 3 flavours: C++ key store, Java key store and XML database.
(Quotation source: http://www.codeproject.com/Articles/190504/RaptorDB)
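The stores quoted above are mostly embedded libraries rather than daemons. As one example, a minimal BerkeleyDB sketch via the bsddb3 Python binding (file name is arbitrary) might look like this:

```python
# BerkeleyDB used as an on-disk dictionary; a library, not a network daemon.
from bsddb3 import db

d = db.DB()
d.open("kv.bdb", None, db.DB_HASH, db.DB_CREATE)  # hash-indexed database file
d.put(b"user:42", b"unstructured payload")
print(d.get(b"user:42"))
d.close()
```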
Seems like a perfect use case for HBase. It gives great write throughput, especially if your insert keys are somewhat random. HBase is not usually advertised as a K/V store, but it should work just fine.
The AWS documentation presents some use cases you might want to have a closer look at. The downside is that HBase can do a lot more than just K/V, so it might be more complex (and complicated) than what you need.
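If you go the HBase route, a minimal sketch through the happybase Thrift client could look like the following (it assumes the HBase Thrift server is running and a table "kv" with a column family "v" has already been created; the hostname is a placeholder):

```python
# HBase as a key/value store: one table, one column family, one column.
import happybase

conn = happybase.Connection("hbase-thrift.internal")  # placeholder host
table = conn.table("kv")

table.put(b"user:42", {b"v:data": b"unstructured payload"})
row = table.get(b"user:42")
print(row.get(b"v:data"))
```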
Couchbase sounds like a good match for your needs. It's a lot like having memcached with disk storage.
Pros:
- It's a key/value database. You can store whatever binary blob you want. As of version 2.0 it has support for storing your data as JSON and running some queries and map/reduce on it. But if you don't need that, using it as key/value works great.
- Of all the NoSQL databases I've tried, it's the fastest. This may be because your writes are not immediately committed to disk. Instead, you get an acknowledgment once a write is replicated in the cluster; data is written to disk asynchronously. So one potential downside is that if all your nodes crash simultaneously (e.g. your data center loses power), you may lose data. Depending on the application this may or may not be an issue (and if your whole cluster goes down, you probably have bigger problems).
- In my experience it has been reliable. If a node goes down, the cluster keeps working and it's very easy to do a failover. Adding new nodes is pretty easy too.
- Data doesn't have to fit in memory. It gets stored on disk and paged in and out as necessary.
- The admin interface is very, very nice. It has nifty live graphs to monitor the cluster.
- It's backwards compatible with the memcached protocol. If you already have code that uses memcached, it'd be pretty straightforward to have it use Couchbase instead (see the sketch after the cons list).
Cons:
- The product is still fairly young, so documentation and support tools are somewhat lacking. This can be a bit annoying at times.
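To illustrate the memcached compatibility mentioned in the pros: existing memcached client code can usually just be repointed at a Couchbase node. The host, port, and bucket setup below are assumptions for the sketch:

```python
# Reusing a plain memcached client (pymemcache) against Couchbase's
# memcached-compatible interface.
from pymemcache.client.base import Client

kv = Client(("couchbase-node.internal", 11211))  # placeholder host/port

kv.set("user:42", b"unstructured payload")
print(kv.get("user:42"))
```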