Question:
I have to store 111 million key-value pairs (one key can have multiple values - at most 2 or 3) whose keys are 50-bit integers and values are 32-bit (maximum) integers. Now, my requirements are:
- Fast Insertion of (Key, Value) pair [allowing duplicates]
- Fast retrieving of value/values based on key.
A nice solution based on MultiMap is given here. However, I want to store more key-value pairs in main memory with no or little performance penalty. I studied from web articles that a B+ tree, R+ tree, B tree, compact multimap, etc. could be a nice solution for that. Can anybody help me:
Is there any Java library which satisfies all those needs properly
(the data structures mentioned above or others are also acceptable, no issue with that)?
Actually, I want an efficient Java library data structure to store/retrieve
key-value/values pairs which has a small memory footprint and must be
built in-memory.
NB: I have tried HashMultimap (Guava with some modification with Trove) as mentioned by Louis Wasserman, Kyoto/Tokyo Cabinet, etc. My experience with disk-backed solutions is not good, so please avoid those :). Another point: one important consideration for choosing a library/data structure is that keys are 50-bit (so if we assign 64 bits, 14 bits will be wasted) and values are 32-bit ints (maximum) - mostly they are 10-14 bits. So we can save space there as well.
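To make that space-saving point concrete, here is a small illustrative sketch (my own, not from any answer below; the class name and constants are made up) of packing a 50-bit key together with a value into a single 64-bit long. This only works when the value fits in the remaining 14 bits; wider values would still need a separate int array:
public class PackedEntry {
    private static final long KEY_MASK = (1L << 50) - 1;   // low 50 bits hold the key
    private static final long VALUE_MASK = (1L << 14) - 1; // high 14 bits hold the value
    // Packs a 50-bit key and a value of at most 14 bits into one long.
    static long pack(long key, int value) {
        if ((key & ~KEY_MASK) != 0) throw new IllegalArgumentException("key wider than 50 bits");
        if ((value & ~VALUE_MASK) != 0) throw new IllegalArgumentException("value wider than 14 bits");
        return (((long) value) << 50) | key;
    }
    static long unpackKey(long packed) { return packed & KEY_MASK; }
    static int unpackValue(long packed) { return (int) (packed >>> 50); }
}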
Answer 1:
I don't think there's anything in the JDK which will do this.
However, implementing such a thing is a simple matter of programming. Here is an open-addressed hashtable with linear probing, with keys and values stored in parallel arrays:
public class LongIntParallelHashMultimap {
private static final long NULL = 0L;
private final long[] keys;
private final int[] values;
private int size;
public LongIntParallelHashMultimap(int capacity) {
keys = new long[capacity];
values = new int[capacity];
}
public void put(long key, int value) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
if (size == keys.length) throw new IllegalStateException("map is full");
int index = indexFor(key);
while (keys[index] != NULL) {
index = successor(index);
}
keys[index] = key;
values[index] = value;
++size;
}
public int[] get(long key) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
int index = indexFor(key);
int count = countHits(key, index);
int[] hits = new int[count];
int hitIndex = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
hits[hitIndex] = values[index];
++hitIndex;
}
index = successor(index);
}
return hits;
}
private int countHits(long key, int index) {
int numHits = 0;
while (keys[index] != NULL) {
if (keys[index] == key) ++numHits;
index = successor(index);
}
return numHits;
}
private int indexFor(long key) {
// the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
// see The Art of Computer Programming, section 6.4
// the constant has two important properties:
// (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
// (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
long hash = key * 5700357409661598721L;
return Math.abs((int) (hash % keys.length));
}
private int successor(int index) {
return (index + 1) % keys.length;
}
public int size() {
return size;
}
}
Note that this is a fixed-size structure. You will need to create it big enough to hold all your data - 110 million entries for me takes up 1.32 GB. The bigger you make it beyond what is needed to store the data, the faster insertions and lookups will be. I found that for 110 million entries, with a load factor of 0.5 (2.64 GB, twice as much space as needed), it took on average 403 nanoseconds to look up a key, but with a load factor of 0.75 (1.76 GB, a third more space than is needed), it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference, and indeed, with a load factor of 0.33 (4.00 GB, three times more space than needed), I get an average time of 394 nanoseconds. So, even though you have 5 GB available, don't use it all.
Note also that zero is not allowed as a key. If this is a problem, change the null value to be something else, and pre-fill the keys array with that on creation.
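For concreteness, a short usage sketch of the class above (my own example; the capacity gives roughly the 0.5 load factor discussed, and the key/value numbers are made up):
LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(220000000); // ~2x slots for 110 million entries
map.put(123456789L, 42); // duplicate keys are allowed
map.put(123456789L, 7);
int[] hits = map.get(123456789L); // {42, 7}, in probe order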
Answer 2:
Is there any Java library which satisfies all those needs properly?
AFAIK no. Or at least, not one that minimizes the memory footprint.
However, it should be easy to write a custom map class that is specialized to these requirements.
Answer 3:
It's a good idea to look at databases, because problems like these are what they are designed for. In recent years key-value databases have become very popular, e.g. for web services (keyword "NoSQL"), so you should find something.
The choice of a custom data structure also depends on whether you want to use a hard drive to store your data (and how safe that has to be) or whether it is completely lost on program exit.
If implementing it manually and the whole DB fits into memory somewhat easily, I'd just implement a hashmap in C. Create a hash function that gives a (well-spread) memory address from a value. Insert there, or next to it if that slot is already assigned. Assignment and retrieval are then O(1). If you implement it in Java, you'll have the 4-byte overhead for each (primitive) object.
Answer 4:
Based on @Tom Anderson's solution I removed the need to allocate objects, and added a performance test.
import java.util.Arrays;
import java.util.Random;
public class LongIntParallelHashMultimap {
private static final long NULL = Long.MIN_VALUE;
private final long[] keys;
private final int[] values;
private int size;
public LongIntParallelHashMultimap(int capacity) {
keys = new long[capacity];
values = new int[capacity];
Arrays.fill(keys, NULL);
}
public void put(long key, int value) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
if (size == keys.length) throw new IllegalStateException("map is full");
int index = indexFor(key);
while (keys[index] != NULL) {
index = successor(index);
}
keys[index] = key;
values[index] = value;
++size;
}
public int get(long key, int[] hits) {
if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
int index = indexFor(key);
int hitIndex = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
hits[hitIndex] = values[index];
++hitIndex;
if (hitIndex == hits.length)
break;
}
index = successor(index);
}
return hitIndex;
}
private int indexFor(long key) {
return Math.abs((int) (key % keys.length));
}
private int successor(int index) {
index++;
return index >= keys.length ? index - keys.length : index;
}
public int size() {
return size;
}
public static class PerfTest {
public static void main(String... args) {
int values = 110* 1000 * 1000;
long start0 = System.nanoTime();
long[] keysValues = generateKeys(values);
LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(222222227);
long start = System.nanoTime();
addKeyValues(values, keysValues, map);
long mid = System.nanoTime();
int sum = lookUpKeyValues(values, keysValues, map);
long time = System.nanoTime();
System.out.printf("Generated %.1f M keys/s, Added %.1f M/s and looked up %.1f M/s%n",
values * 1e3 / (start - start0), values * 1e3 / (mid - start), values * 1e3 / (time - mid));
System.out.println("Expected " + values + " got " + sum);
}
private static long[] generateKeys(int values) {
Random rand = new Random();
long[] keysValues = new long[values];
for (int i = 0; i < values; i++)
keysValues[i] = rand.nextLong();
return keysValues;
}
private static void addKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
for (int i = 0; i < values; i++) {
map.put(keysValues[i], i);
}
assert map.size() == values;
}
private static int lookUpKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
int[] found = new int[8];
int sum = 0;
for (int i = 0; i < values; i++) {
sum += map.get(keysValues[i], found);
}
return sum;
}
}
}
prints
Generated 34.8 M keys/s, Added 11.1 M/s and looked up 7.6 M/s
Run on a 3.8 GHz i7 with Java 7 update 3.
This is much slower than the previous test because you are accessing main memory at random, rather than the cache. This is really a test of the speed of your memory. The writes are faster because they can be performed asynchronously to main memory.
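As a usage note (my own sketch, not part of the original answer): the get(long, int[]) variant above avoids allocating a result array on every lookup by letting the caller pass in a reusable buffer, which matters when you perform hundreds of millions of lookups:
LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(222222227);
map.put(42L, 1);
map.put(42L, 2);
int[] buffer = new int[8];        // reused across lookups; 8 easily covers the 2-3 values per key
int count = map.get(42L, buffer); // returns 2; buffer[0] and buffer[1] hold the values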
Using this collection (imports assume Guava and Trove 3 on the classpath):
import java.util.Collection;
import java.util.Set;
import com.google.common.base.Supplier;
import com.google.common.collect.Multimaps;
import com.google.common.collect.SetMultimap;
import gnu.trove.TDecorators;
import gnu.trove.map.hash.TLongObjectHashMap;
import gnu.trove.set.hash.TIntHashSet;
final SetMultimap<Long, Integer> map = Multimaps.newSetMultimap(
TDecorators.wrap(new TLongObjectHashMap<Collection<Integer>>()),
new Supplier<Set<Integer>>() {
public Set<Integer> get() {
return TDecorators.wrap(new TIntHashSet());
}
});
Running the same test with 50 million entries (which used about 16 GB) and -mx20g
I got the following result.
Generated 47.2 M keys/s, Added 0.5 M/s and looked up 0.7 M/s
For 110 M entries you would need about 35 GB of memory (scaling up from the 16 GB used for 50 M entries) and a machine 10x faster than mine (3.8 GHz) to perform 5 million adds per second.
Answer 5:
If you must use Java, then implement your own hashtable/hashmap. An important property of your table is to use a linked list to handle collisions. Hence when you do a lookup, you can return all the elements on the list.
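A minimal sketch of that idea (my own illustration, not a library class; the class name and the spreading multiplier are arbitrary): each bucket holds a linked chain of nodes, and a lookup walks the chain collecting every value stored under the key.
import java.util.ArrayList;
import java.util.List;
class ChainedLongIntMultimap {
    private static final class Node {
        final long key;
        final int value;
        Node next;
        Node(long key, int value, Node next) { this.key = key; this.value = value; this.next = next; }
    }
    private final Node[] buckets;
    ChainedLongIntMultimap(int capacity) {
        buckets = new Node[capacity];
    }
    private int indexFor(long key) {
        long hash = key * 0x9E3779B97F4A7C15L; // multiply by a large odd constant to spread the bits
        return (int) Math.abs(hash % buckets.length);
    }
    void put(long key, int value) {
        int i = indexFor(key);
        buckets[i] = new Node(key, value, buckets[i]); // prepend to the bucket's chain
    }
    List<Integer> get(long key) {
        List<Integer> hits = new ArrayList<Integer>();
        for (Node n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key == key) hits.add(n.value);
        }
        return hits;
    }
}
Note that every Node is a separate object, so this uses noticeably more memory per entry than the parallel-array approaches in the answers above; it trades footprint for simpler, unbounded chaining.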
Answer 6:
I might be late in answering this question, but Elasticsearch will solve your problem.