I'm looking for a key value store that can be used from an EC2 instance.
- item is just an unstructured string, no indexing required
- item size up to ~5MB but usually below 10kB
- lots of writes
- reads don't need to be fast; memcache can be put in front to cache frequently read items
- data is too big to fit into memory
- Eventual Consistency is fine
- daemon that can be accessed from multiple machines is required
Ideally it would be something AWS-hosted, but:
- S3 doesn't fit because of too many writes
- SimpleDB/DynamoDB don't fit because of item size limits, and indexing is not required
As there are a lot of key value stores on the market it's hard to choose the best one. Which one would you recommend?
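To make the read path concrete, here is a rough cache-aside sketch of what I have in mind. The plain dicts are stand-ins for memcache and for whatever persistent store gets chosen; `ReadThroughCache`, `max_cached_len`, and `store_reads` are made-up names for illustration, and the size threshold is an arbitrary assumption that keeps the rare multi-megabyte items out of the cache.

```python
class ReadThroughCache:
    """Cache-aside reads in front of a slower key-value store (illustrative only)."""

    def __init__(self, store, max_cached_len=10_000):
        self.store = store                    # stand-in for the persistent store
        self.cache = {}                       # stand-in for memcache
        self.max_cached_len = max_cached_len  # hypothetical cut-off for cacheable items
        self.store_reads = 0                  # counts reads that reached the store

    def put(self, key, value):
        # Writes always go to the store; drop any cached copy so it cannot go stale.
        self.store[key] = value
        self.cache.pop(key, None)

    def get(self, key):
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Otherwise fall back to the store and cache small items for next time.
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None and len(value) <= self.max_cached_len:
            self.cache[key] = value
        return value
```

With this layout the store only has to be good at absorbing writes; repeated reads of small, popular items never reach it.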
Seems like a perfect use case for HBase. It gives great write throughput, especially if your insert keys are somewhat random. HBase is not usually advertised as a K/V store, but it should work just fine. The AWS documentation presents some use cases you might want to have a closer look at. The downside is that HBase can do a lot more than just K/V, so it might be more complex (and complicated) than what you need.
Maybe you should try MongoDB:
http://www.mongodb.org/display/DOCS/Amazon+EC2
Quickstart:
http://www.mongodb.org/display/DOCS/Amazon+EC2+Quickstart
Free courses at 10gen and video presentations:
http://www.10gen.com/presentations/nyc-meetup-group/mongodb-and-ec2-a-love-story
Other key-value stores:
http://google-opensource.blogspot.com/2011/07/leveldb-fast-persistent-key-value-store.html
Comments about Riak and its storage backends, especially Bitcask and Innostore:
http://basho.com/blog/technical/2011/07/01/Leveling-the-Field/
(Quotation source: http://www.codeproject.com/Articles/190504/RaptorDB)
Couchbase sounds like a good match for your needs. It's a lot like having memcached with disk storage.
Pros:
It's a key/value database. You can store whatever binary blob you want. As of version 2.0 it supports storing your data as JSON and running some queries and map/reduce on it. But if you don't need that, using it as a plain key/value store works great.
Of all the NoSQL databases I've tried, it's the fastest. This may be because your writes are not immediately committed to disk. Instead, you get an acknowledgment once a write is replicated in the cluster, and data is written to disk asynchronously. So one potential downside is that if all your nodes crash simultaneously (e.g. your data center loses power), you may lose data. Depending on the application this may or may not be an issue (and if your whole cluster goes down, you probably have bigger problems).
In my experience it has been reliable. If a node goes down, the cluster keeps working and it's very easy to do a failover. Adding new nodes is pretty easy too.
Data doesn't have to fit in memory. It gets stored on disk and paged in and out as necessary.
The admin interface is very, very nice. It has nifty live graphs to monitor the cluster.
It's backwards compatible with the memcached protocol. If you already have code that uses memcached, it'd be pretty straightforward to have it use Couchbase instead.
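To illustrate the durability trade-off mentioned above (acknowledge after in-memory replication, write to disk later), here is a toy model. It is not how Couchbase is actually implemented; `Node`, `Cluster`, and their methods are hypothetical names, and the "disk" is just another dict.

```python
class Node:
    """One cluster member with volatile memory and a simulated disk."""

    def __init__(self):
        self.memory = {}
        self.disk = {}

    def flush(self):
        # Simulates the asynchronous write of in-memory data to disk.
        self.disk.update(self.memory)

    def crash_and_restart(self):
        # Anything that was only in memory is gone after a restart.
        self.memory = dict(self.disk)


class Cluster:
    """Acknowledges a write once every node holds it in memory."""

    def __init__(self, size=2):
        self.nodes = [Node() for _ in range(size)]

    def set(self, key, value):
        for node in self.nodes:
            node.memory[key] = value
        return True  # acknowledged before anything has reached disk

    def get(self, key):
        # Return the value from the first node that still has it.
        for node in self.nodes:
            if key in node.memory:
                return node.memory[key]
        return None
```

In this model a single node crash loses nothing, because another node still holds the value in memory. Only when every node crashes before `flush()` has run does an acknowledged write disappear.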
Cons:
I found the perfect solution for my use case: memcachedb
It doesn't do fancy document/indexing, it's just a simple key value store.
I haven't done any performance testing yet, though.
Edit:
We dropped memcachedb due to problems with replication. We now run MongoDB instead. MongoDB requires much more disk space, and more resources in general, but the replica sets work very reliably and are easy to set up.