Problem
Following up on this question, it seems that a file- or disk-based Map
implementation may be the right solution to the problems I mentioned there. Short version:
- Right now, I have a `Map` implemented as a `ConcurrentHashMap`.
- Entries are added to it continually, at a fairly fixed rate. Details on this later.
- Eventually, no matter what, this means the JVM runs out of heap space.
At work, it was (strongly) suggested that I solve this problem using SQLite, but after asking that previous question, I don't think that a database is the right tool for this job. So - let me know if this sounds crazy - I think a better solution would be a Map
stored on disk.
Bad idea: implement this myself. Better idea: use someone else's library! Which one?
Requirements
Must-haves:
- Free.
- Persistent. The data needs to stick around between JVM restarts.
- Some sort of searchability. Yes, I need the ability to retrieve this darn data as well as put it away. Basic result set filtering is a plus.
- Platform-independent. Needs to be production-deployable on Windows or Linux machines.
- Purgeable. Disk space is finite, just like heap space. I need to get rid of entries that are `n` days old. It's not a big deal if I have to do this manually.
Nice-to-haves:
- Easy to use. It would be great if I could get this working by the end of the week. Better still: the end of the day. It would be really, really great if I could add one JAR to my classpath, change `new ConcurrentHashMap<Foo, Bar>();` to `new SomeDiskStoredMap<Foo, Bar>();` and be done.
- Decent scalability and performance. Worst case: new entries are added (on average) 3 times per second, every second, all day long, every day. However, inserts won't always happen that smoothly. It might be `(no inserts for an hour)` then `(insert 10,000 objects at once)`.
Possible Solutions
- Ehcache? I've never used it before. It was a suggested solution to my previous question.
- Berkeley DB? Again, I've never used it, and I really don't know anything about it.
- Hadoop (and which subproject)? Haven't used it. Based on these docs, its cross-platform-readiness is ambiguous to me. I don't need distributed operation in the foreseeable future.
- A SQLite JDBC driver after all?
- ???
Ehcache and Berkeley DB both look reasonable right now. Any particular recommendations in either direction?
The google-collections library, part of http://code.google.com/p/guava-libraries/, has some really useful Map tools. MapMaker in particular lets you make concurrent HashMaps with timed evictions, soft values that will be swept up by the garbage collector if you're running out of heap, and computing functions.
That will give you a Map cache that will clean up after itself and can work out its values. If you're able to compute values like that then great, otherwise it would map perfectly onto http://redis.io/ which you'd be writing into (to be fair, redis would probably be fast enough on its own!).
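For illustration, a rough MapMaker sketch (not from the original answer; `Foo`, `Bar` and `loadFromBackingStore` are placeholders, and older google-collections releases name the timed-eviction method `expiration(...)` rather than `expireAfterWrite(...)`):

```java
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

import com.google.common.base.Function;
import com.google.common.collect.MapMaker;

ConcurrentMap<Foo, Bar> cache = new MapMaker()
        .softValues()                        // values may be reclaimed under memory pressure
        .expireAfterWrite(7, TimeUnit.DAYS)  // timed eviction; older versions: expiration(7, TimeUnit.DAYS)
        .makeComputingMap(new Function<Foo, Bar>() {
            public Bar apply(Foo key) {
                // Compute or fetch a missing value, e.g. from redis or another backing store.
                return loadFromBackingStore(key);
            }
        });
```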
UPDATE (some 4 years after first post...): beware that in newer versions of ehcache, persistence of cache items is available only in the pay product. Thanks @boday for pointing this out.
ehcache is great. It will give you the flexibility you need to implement the map in memory, on disk, or in memory with spillover to disk. If you use this very simple wrapper for java.util.Map then using it is blindingly simple:
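That wrapper isn't reproduced here, but a minimal sketch of such an adapter against the classic Ehcache 2.x API might look like the following (class and cache names are invented; the cache itself would be configured in ehcache.xml with disk overflow/persistence enabled):

```java
import java.io.Serializable;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

// Hypothetical, cut-down wrapper: a real one would implement java.util.Map in full.
public class DiskBackedMap<K, V extends Serializable> {
    private final Cache cache;

    public DiskBackedMap(String cacheName) {
        // Assumes an ehcache.xml on the classpath that defines "cacheName"
        // with overflowToDisk / diskPersistent enabled.
        this.cache = CacheManager.create().getCache(cacheName);
    }

    public void put(K key, V value) {
        cache.put(new Element(key, value));
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        Element element = cache.get(key);
        return element == null ? null : (V) element.getObjectValue();
    }

    public void remove(K key) {
        cache.remove(key);
    }
}
```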
Have you never heard about prevalence frameworks?
EDIT: some clarifications on the term.
As James Gosling now says, no SQL DB is as efficient as in-memory storage. Prevalence frameworks (the best known being Prevayler and Space4J) are built on this idea of an in-memory, optionally disk-backed, store. How do they work? It's deceptively simple: a storage object contains all persistent entities. This storage can only be changed by serializable operations. As a consequence, putting an object into storage is a put operation performed in an isolated context. Because the operation is serializable, it may (depending on configuration) also be written to disk for long-term persistence. The main data repository, however, is memory, which provides undoubtedly fast access times, at the cost of high memory usage.
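As a concrete sketch of that idea (assuming the Prevayler 2.x API; the store and transaction classes here are invented for illustration):

```java
import java.io.Serializable;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.prevayler.Prevayler;
import org.prevayler.PrevaylerFactory;
import org.prevayler.Transaction;

// The prevalent system: a plain serializable object holding all the data in memory.
class FooStore implements Serializable {
    final Map<String, String> entries = new HashMap<String, String>();
}

// Every change is a serializable Transaction, so it can be journaled to disk
// and replayed after a JVM restart.
class PutEntry implements Transaction {
    private final String key;
    private final String value;

    PutEntry(String key, String value) {
        this.key = key;
        this.value = value;
    }

    public void executeOn(Object prevalentSystem, Date executionTime) {
        ((FooStore) prevalentSystem).entries.put(key, value);
    }
}

// Usage:
// Prevayler prevayler = PrevaylerFactory.createPrevayler(new FooStore(), "/path/to/journal");
// prevayler.execute(new PutEntry("someKey", "someValue"));
```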
Another advantage is that, because of their obvious simplicity, these frameworks hardly contain more than a dozen classes.
Considering your question, the use of Space4J immediately came to my mind (as it provides support for "passivation" of rarely used objects, that is to say their index key stays in memory, but the objects are kept on disk as long as they're not used).
Note that you can also find some info at the c2 wiki.
I came across jdbm2 a few weeks ago. The usage is very simple. You should be able to get it to work in half an hour. One drawback is that the object which is put into the map must be serializable, i.e. implement `Serializable`. Other cons are given on their website.

However, object persistence databases are not a permanent solution for storing objects of your own Java class. If you decide to make a change to the fields of the class, you will no longer be able to retrieve the object from the map collection. It is ideal for storing standard serializable classes like `String`, `Integer`, etc.
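For a rough idea of what that looks like, here is a sketch based on jdbm2's RecordManager API (the store and map names are made up; check the project's docs for the exact calls):

```java
import java.util.Map;

import jdbm.RecordManager;
import jdbm.RecordManagerFactory;

// Open (or create) the on-disk store and a named persistent map inside it.
RecordManager recMan = RecordManagerFactory.createRecordManager("fooBarStore");
Map<String, Integer> map = recMan.hashMap("entries");

map.put("someKey", 42);
recMan.commit();   // flush the change to disk
recMan.close();
```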
Berkeley DB Java Edition has a Collections API. Within that API, StoredMap in particular is a drop-in replacement for a ConcurrentHashMap. You'll need to create the Environment and Database before creating the StoredMap, but the Collections tutorial should make that pretty easy.
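A rough sketch of that setup (not from the original answer; the database names and the `Foo`/`Bar` classes are placeholders, and both classes must be `Serializable` for `SerialBinding` to work):

```java
import java.io.File;
import java.util.Map;

import com.sleepycat.bind.serial.SerialBinding;
import com.sleepycat.bind.serial.StoredClassCatalog;
import com.sleepycat.collections.StoredMap;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

// Open (or create) the on-disk environment.
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
Environment env = new Environment(new File("data"), envConfig);

DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setAllowCreate(true);

// A class catalog database is needed for serial (Java serialization) bindings.
StoredClassCatalog catalog =
        new StoredClassCatalog(env.openDatabase(null, "classCatalog", dbConfig));

// The actual data database, exposed as a java.util.Map.
Database db = env.openDatabase(null, "fooBarStore", dbConfig);
Map<Foo, Bar> map = new StoredMap<Foo, Bar>(
        db,
        new SerialBinding<Foo>(catalog, Foo.class),
        new SerialBinding<Bar>(catalog, Bar.class),
        true /* writeAllowed */);
```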
Per your requirements, Berkeley DB is designed to be easy to use and I think that you'll find that it has exceptional scalability and performance. Berkeley DB is available under an open source license, it's persistent, platform-independent and allows you to search for data. The data can certainly be purged/deleted as needed. Berkeley DB has a long list of other features which you may find highly useful to your application, especially as your requirements change and grow with the success of the application.
If you decide to use Berkeley DB Java Edition, please be sure to ask questions on the BDB JE Forum. There's an active developer community that's happy to help answer questions and resolve problems.
We have a similar solution implemented using Xapian. It's fast, it's scalable, it provides almost all the search functionality you requested, it's free, cross-platform, and of course purgeable.