key-value store suggestion

I need a very basic key-value store for Java. I started with a HashMap, but HashMap seems somewhat space-inefficient: I'm storing ~20 million records, and it seems to require ~6 GB of RAM.

The map is Map<Integer,String>, so I'm considering using GNU Trove TIntObjectHashMap<byte[]> and storing each map value as an ASCII byte array rather than a String.

As an alternative to that, is there a key-value store that only requires adding jar files, does not hold the entire map in RAM at once, and is still reasonably fast?

Tags: java nosql

6 answers

Rolldiameter

#2 · 2020-02-09 04:08

BabuDB

BabuDB is an embedded non-relational database system. Its lean and simple design allows it to persistently store large amounts of key-value pairs without the overhead and complexity of similar approaches such as BerkeleyDB.

License: New BSD license, Language: Java

JDBM2

JDBM2 provides HashMap and TreeMap which are backed by disk storage.

License: Apache License 2.0, Language: Java

Banana DB

Banana DB is a self-contained key/value pair database implemented in Java.

License: Apache License 2.0, Language: Java

I've tried BabuDB and JDBM2 and they work fine. BabuDB is a little bit more difficult to set up, but potentially delivers higher performance than JDBM2.

These are all databases that persist data on disk. There are also solutions for holding a large map in memory (ehcache, hazelcast, ...).

做自己的国王

#3 · 2020-02-09 04:11

Just wanted to reference some more open source options that became available over time since this question was first asked.

Apache 2, BTree, Apache Directory Project JDBM replacement effort:

http://directory.apache.org/mavibot/

MPL2/EPL1, RTree, MVStore, H2 Storage Engine:

http://www.h2database.com/html/mvstore.html

Apache 2, Xodus Environments, JetBrains YouTrack and Hub storage engine:

https://github.com/JetBrains/xodus

何必那么认真

#4 · 2020-02-09 04:18

Use Berkeley DB.

Berkeley DB stores object graphs, objects in collections, or simple binary key/value data directly in a B-tree on disk. This simple, highly efficient approach removes all the unnecessary overhead of ORM solutions. Using the Direct Persistence Layer (DPL), Java developers annotate classes with storage information, much like JPA. This approach is familiar, efficient, and fast. The DPL reduces the complexity of data storage without sacrificing speed.

This should definitely give you huge gains in memory and speed, while not increasing the complexity of your application. Enjoy!

Juvenile、少年°

#5 · 2020-02-09 04:23

Consider Koloboke Collections, which according to various tests is up to 2 times faster than Trove when configured to consume the same amount of memory. Alternatively, you can think of it as consuming considerably less memory when configured to be as fast as Trove.

If you want to persist the map between JVM runs with a very quick bootstrap, you might also be interested in Chronicle-Map, which stores Strings in UTF-8 by default (so you don't need to bother with String <-> byte[] conversions as with Koloboke/Trove). Chronicle-Map is very fast for a persisted key-value store, but a bit slower than Koloboke and even Trove.

不美不萌又怎样

#6 · 2020-02-09 04:32

The map is Map<Integer,String>, and so I'm considering using GNU Trove TIntObjectHashMap<byte[]>, and storing the map value as an ascii byte array rather than String.

This doesn't entirely make sense because a TIntObjectHashMap is not a Map. However, the approach is sound.
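A minimal sketch of the value-encoding half of that approach, assuming every value really is plain ASCII. AsciiValueStore is a hypothetical name, and a java.util.HashMap stands in for Trove's TIntObjectHashMap so the sketch runs without extra jars; with Trove on the classpath you would swap the map type to avoid boxing the keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not from the question): keeps values as ASCII bytes.
// HashMap is only a stand-in here for a primitive-keyed map such as Trove's.
class AsciiValueStore {
    private final Map<Integer, byte[]> map = new HashMap<>();

    public void put(int key, String value) {
        // US_ASCII uses one byte per character. Characters outside ASCII
        // are replaced with '?', so this only suits pure-ASCII data.
        map.put(key, value.getBytes(StandardCharsets.US_ASCII));
    }

    public String get(int key) {
        byte[] bytes = map.get(key);
        // Decode back to a String only when the value is actually read.
        return bytes == null ? null : new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        AsciiValueStore store = new AsciiValueStore();
        store.put(42, "hello");
        System.out.println(store.get(42));
    }
}
```

The trade-off is a fresh String allocation on every read, which is usually acceptable when reads are much rarer than the number of stored records.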

Do you know what kind of space savings I can expect over HashMap for Trove?

The best answer is to try it out.
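One rough way to try it out, sketched with the standard library only. HeapProbe and its methods are hypothetical names, the "value-" strings are made-up test data, and System.gc() is only a hint to the JVM, so treat the printed figure as an approximation rather than a precise measurement.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical probe: compares used heap before and after filling a map.
class HeapProbe {
    // Approximate used heap in bytes; the gc() call is a request, not a guarantee.
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Fills a map with made-up values; replace with a sample of real data.
    public static Map<Integer, String> fill(int count) {
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < count; i++) {
            map.put(i, "value-" + i);
        }
        return map;
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        Map<Integer, String> map = fill(200_000);
        long after = usedHeapBytes();
        // The map is still referenced below, so it cannot be collected early.
        System.out.printf("approx. %d bytes per entry over %d entries%n",
                (after - before) / map.size(), map.size());
    }
}
```

Running the same probe against the alternative map implementation with the same data gives a like-for-like comparison for your own value lengths.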

But here are some rough estimates (assuming a 32bit JVM):

HashMap keys would need to be Integer instances. They will occupy ~18 bytes per instance + 4 bytes per reference, roughly 24 bytes in total.
Trove keys would be 4 byte int values.
String values would be 20 bytes + 12 bytes + 2 * number of "characters".
Byte array values would be 12 bytes + 1 * number of "characters".
I haven't examined the details of the respective hash table internal data structures.

That probably amounts to around 50% memory saving, though it depends critically on the average length of the value "strings". (The longer they are, the more they will dominate the space usage.)
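To make the arithmetic concrete, here is a small sketch that plugs the per-entry figures above into a calculation. MemoryEstimate is a hypothetical name, the constants are the rough 32-bit estimates from this answer, and hash table overhead is deliberately left out because it was not examined.

```java
// Hypothetical calculator using the rough per-entry estimates from this answer.
// Hash table internals are ignored, so real savings will differ.
class MemoryEstimate {
    // HashMap<Integer,String>: ~24 bytes for the key, plus 20 + 12 + 2 bytes per character.
    public static long hashMapEntryBytes(int chars) {
        return 24 + (20 + 12 + 2L * chars);
    }

    // Trove int -> byte[]: 4 bytes for the key, plus 12 + 1 byte per character.
    public static long troveEntryBytes(int chars) {
        return 4 + (12 + (long) chars);
    }

    // Fraction of memory saved by the Trove layout for a given value length.
    public static double saving(int chars) {
        return 1.0 - (double) troveEntryBytes(chars) / hashMapEntryBytes(chars);
    }

    public static void main(String[] args) {
        for (int chars : new int[] {10, 20, 100, 1000}) {
            System.out.printf("%d chars: %d vs %d bytes, saving %.1f%%%n",
                    chars, hashMapEntryBytes(chars), troveEntryBytes(chars),
                    100 * saving(chars));
        }
    }
}
```

For 20-character values this works out to 96 versus 36 bytes per entry. As the values get longer the saving approaches 50%, because the 2-bytes-versus-1-byte character storage dominates everything else.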

FWIW, Trove publish their own benchmarks here. They don't look very convincing, but you should be able to dig out their benchmark code and modify it to better match your use-case.

淡お忘

#7 · 2020-02-09 04:34

http://www.mapdb.org/ is what you are looking for. It's a rocking fast persistent implementation of java.util.Map.

Features

Concurrent

MapDB has record-level locking and a state-of-the-art concurrent engine. Its performance scales nearly linearly with the number of cores. Data can be written by multiple parallel threads.

Fast

MapDB has outstanding performance rivaled only by native DBs. It is the result of more than a decade of optimizations and rewrites.

ACID

MapDB optionally supports ACID transactions with full MVCC isolation. MapDB uses a write-ahead log or an append-only store for great write durability.

Flexible

MapDB can be used everywhere from an in-memory cache to a multi-terabyte database. It also has a number of options to trade durability for write performance. This makes it very easy to configure MapDB to exactly fit your needs.

Hackable

MapDB is component-based; most features (instance cache, async writes, compression) are just class wrappers. It is very easy to introduce new functionality or components into MapDB.

SQL Like

MapDB was developed as a faster alternative to an SQL engine. It has a number of features that make the transition from a relational database easier: secondary indexes/collections, autoincremental sequential IDs, joins, triggers, composite keys…

Low disk-space usage

MapDB has a number of features (serialization, delta key packing…) to minimize the disk space used by its store. It also has very fast compression and custom serializers. We take disk usage seriously and do not waste a single byte.
