I'm developing a Java application that deals with a large number (~1,000,000) of key-value pairs. The keys have a fixed size, while the values vary from 20 bytes up to ~1 KB, so in the worst case I have around 1 GB of data.
The key-value pairs are read and written concurrently by multiple threads (~3-5). In most cases a read or write operation affects only a single key-value pair.
On top of the key-value pairs sits a relational data model that is used to select the desired keys. I'm currently using H2 for the relational data, but I'm unsure which technology to use for storing the key-value pairs.
- Which database should I use for storing the key-value pairs?
- Is Berkeley DB appropriate in my case?
Furthermore, and more importantly, at which level should I implement synchronization?
- The database layer seems sensible to me, but how should I handle connection management within the application?
- My first approach: use a connection pool and a separate connection for each read/write operation. While this is easy to implement, the synchronization of the connection pool could become a bottleneck.
- My second approach: use a single connection for all read operations and another one for all write operations, but I'm not sure whether a single connection allows concurrent access, or whether it serializes requests in a scalable way.
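To make the first approach concrete, here is a minimal, hypothetical sketch of a bounded connection pool built on a `BlockingQueue`. It uses `String` stand-ins instead of real `java.sql.Connection` objects (in production you'd use a library such as a JDBC pool); it only illustrates the borrow/return pattern and where the single synchronization point sits:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical minimal pool; real code would hold java.sql.Connection
// objects created from a DataSource.
class SimplePool<C> {
    private final BlockingQueue<C> idle;

    SimplePool(List<C> connections, int capacity) {
        idle = new ArrayBlockingQueue<>(capacity);
        idle.addAll(connections);
    }

    // Blocks until a connection is free; the queue is the only
    // synchronization point, which is where contention could build up.
    C borrow() throws InterruptedException {
        return idle.take();
    }

    void release(C c) {
        idle.offer(c);
    }
}

public class PoolDemo {
    public static void main(String[] args) throws InterruptedException {
        SimplePool<String> pool =
            new SimplePool<>(List.of("conn-1", "conn-2"), 2);
        String c = pool.borrow();   // one connection per operation
        System.out.println("using " + c);
        pool.release(c);            // return it when the operation ends
    }
}
```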
Since the key-value pairs are regularly queried by JTables, access time is critical. Should I rely on the database's caching system, or implement/use an additional cache at the application layer, e.g. EHCache?
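For reference, the core idea behind an application-layer cache such as EHCache is a bounded, least-recently-used map. A stdlib-only sketch (the size limit of 3 is arbitrary, and a real cache would add statistics, expiry, and finer-grained locking):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCacheDemo {
    // Access-ordered LinkedHashMap that evicts the eldest entry
    // once the cache grows past maxSize.
    static <K, V> Map<K, V> lruCache(int maxSize) {
        return Collections.synchronizedMap(
            new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxSize;
                }
            });
    }

    public static void main(String[] args) {
        Map<String, byte[]> cache = lruCache(3);
        cache.put("k1", new byte[20]);
        cache.put("k2", new byte[40]);
        cache.put("k3", new byte[60]);
        cache.get("k1");                    // touch k1: now most recently used
        cache.put("k4", new byte[80]);      // evicts k2, the least recently used
        System.out.println(cache.keySet()); // prints [k3, k1, k4]
    }
}
```

The `Collections.synchronizedMap` wrapper keeps it safe for the 3-5 threads mentioned above, at the cost of a single lock; that is usually fine for a small thread count.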
Edit:
The application has to run on older PCs, so I can't keep all of the data in memory.
Storing the key-value pairs in the existing H2 database would be possible, but wouldn't it make more sense, for performance, to put them into a database specialized in key-value pairs and their synchronization?
I'm also not concerned about ACID properties.
1) A NoSQL database seems suited to your requirements, since you access values only by key. Berkeley DB is an ordered key-value store. Do you need ordering on the keys? If not, look at other options such as MongoDB or Couchbase.
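The ordered-vs-unordered distinction matters mainly for range scans. As an in-process analogy (this is not Berkeley DB's API, just an illustration of the two access patterns), compare `TreeMap` and `HashMap`:

```java
import java.util.HashMap;
import java.util.TreeMap;

public class OrderDemo {
    public static void main(String[] args) {
        // Ordered store: supports range scans over keys,
        // like a B-tree-based store such as Berkeley DB.
        TreeMap<String, String> ordered = new TreeMap<>();
        ordered.put("key-003", "c");
        ordered.put("key-001", "a");
        ordered.put("key-002", "b");
        // Range query: keys from key-001 (inclusive) to key-003 (exclusive).
        System.out.println(ordered.subMap("key-001", "key-003").keySet());
        // prints [key-001, key-002]

        // Unordered store: only point lookups by key,
        // like a plain hash-based key-value store.
        HashMap<String, String> unordered = new HashMap<>(ordered);
        System.out.println(unordered.get("key-002")); // prints b
    }
}
```

If your relational model in H2 already selects the exact keys you need, point lookups are enough and ordering buys you little.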
2) Synchronizing at the database level is the sanest choice, so I would go with your first approach. Your second approach will certainly cause contention and will be harder for developers to manage.
3) Cache only if you must. Do you have frequently accessed records? EHCache is fine; you could also use a system like memcached. Decide on the cache layer together with the database you choose.
A million records, or 1 GB of data, is not a huge amount for today's databases. You can go with a "traditional" RDBMS (PostgreSQL, MySQL, Oracle, ...) or a "hot" technology (H2, MongoDB, ...). Oracle Berkeley DB can keep data in memory, so reads and writes will be very fast. Make sure to add indexes.
Check out MongoDB if you haven't already. It's not transactional, but it is mostly memory-resident and looks like it could simplify your stack if you don't need full ACID guarantees.