Hi I have the following problem:
I'm storing strings and a corresponding list of integer values in an MultiValueMap<String, Integer>
I'm storing about 13 000 000 million strings and one string can have up to 500 or more values.
For every single value i will have random access on the Map. So worst case are 13 000 000* 500 put calls. Now the speed of the map is good but the memory overhead gets quite high. A MultiValueMap<String, Integer>
is nothing else then a HashMap/TreeMap<String, <ArrayList<Integer>>
. Both HashMap and TreeMap have quite a lot of memory Overhead. I wont be modifying the map once it is done, but I need it to be fast and as small as possible for random access in a program. (I'm storing it on disk and loading it on start, the serialized map file takes up about 600mb but in memory its about 3gb?)
the most memory efficient thing would be, to store the String in sorted string array and have a corresponding two dimensional int array for values. So access would be a binary search on the string array and getting the corresponding values.
Now I have three ways to get there:
I use a sorted MultivalueMap (TreeMap) for the creation phase to store everything.After I'm finished with getting all values, I get the string array by calling map.keyset().toArray(new String[0]);
Make a two dimensional int array and get all the values from the multivaluemap.
Pro: It's easy to implement, It is still fast during creation.
Con: It takes up even more memory during the copying from Map to Arrays.
I use Arrays or maybe ArrayLists from the start and store everything in there
Pro: least memory overhead.
Con: this would be enormously slow because i would have to sort/copy the Array every time a add a new Key, Also i would need to implement my own (propably even slower) sorting to keep the corresponding int array in the same order like the strings. Hard to implement
I use Arrays and a MultivalueMap as buffer. After the program finished 10% or 20% of the creation phase, I will add the values to the Arrays and keep them in order, then start a new Map.
Pro: Propably still fast enough and memory efficient enough.
Con: Hard to implement.
None of these solutions really feel right to me. Do you know any other solutions to this problem, maybe a memory efficient (MultiValue)Map implementation?
I know I could be using a database so don't bother posting it as an answer. I want to know how i could do this without using a database.
If you switched to Guava's Multimap -- I have no idea if that's possible for your application -- you might be able to use Trove and get
ListMultimap<String, Integer> multimap = Multimaps.newListMultimap(
new HashMap<String, Collection<Integer>>(),
new Supplier<List<Integer>>() {
public List<Integer> get() {
return new TIntListDecorator();
}
});
which will make a ListMultimap
that uses a HashMap
to map to List
values backed by int[]
arrays, which should be memory-efficient, though you'll pay a small speed penalty because of boxing. You might be able to do something similar for MultiValueMap
, though I have no idea what library that's from.
You can use compressed strings to reduce drastically the memory usage.
- Parameters to configure your JVM
- Comparison of its usage between various java versions
Furthermore, there are other more drastic solutions (it would require some reimplementation):
- Memory-disk based list implementation or suggestions about NoSQL database.
Depending on which Integer values you store in your map, a large amount of your heap memory overhead may be caused by having distinct Integer instances, which take up much more RAM than a primitive int value.
Consider using a Map
from String
to one of the many IntArrayList
implementations floating around (e.g. in Colt or in Primitive Collections for Java), which basically implement a List backed by an int array, instead of a being backed by an array of Integer instances.
First, consider the memory taken by the integers. You said that the range will be about 0-4000000. 24 bits is enough to represent 16777216 distinct values. If that is acceptable, you could use byte arrays for the integers, with 3 bytes per integer, and save 25%. You would have to index into the array something like this:
int getPackedInt(byte[] array, int index) {
int i = index*3;
return ((array[i] & 0xFF)<<16) + ((array[i+1] & 0xFF) <<8) + (array[i+2] & 0xFF);
}
int storePackedInt(byte[] array, int index, int value) {
assert value >= 0 && value <= 0xFFFFFF;
int i = index*3;
array[i] = (byte)((value>>16) & 0xFF);
array[i+1] = (byte)((value>>8) & 0xFF);
array[i+2] = (byte)(value & 0xFF);
}
Can you say anything about the distribution of the integers? If many of them will fit in 16 bits, you could use an encoding with a variable number of bytes per number (something like UTF-8 does for representing characters).
Next, consider whether you can save memory on the Strings. What are the characteristics of the Strings? How long will they typically be? Will many strings share prefixes? A compression scheme tailored to the characteristics of your application could save a lot of space (as falsarella pointed out). OR, if many strings will share prefixes, storing them in some type of search trie could be more efficient. (There is a type of trie called "patricia" which might be suitable for this application.) As a bonus, note that searching for Strings in a trie can be faster than searching a hash map (though you'd have to benchmark to see if that is true in your application).
Will the Strings all be ASCII? If so, 50% of the memory used for Strings will be wasted, as a Java char
is 16 bits. Again, in this case, you could consider using byte arrays.
If you only need to look Strings up, not iterate over the stored Strings, you could also consider something rather unconventional: hash the Strings, and keep only the hash. Since different String can hash to the same value, there is a chance that a String which was never stored, may still be "found" by a search. But if you use enough bits for the hash value (and a good hash function), you can make that chance so infinitesimally small that it will almost certainly never happen in the estimated lifespan of the universe.
Finally, there is the memory for the structure itself, which holds the Strings and integers. I already suggested using a trie, but if you decide not to do that, nothing will use less memory than parallel arrays -- one sorted array of Strings (which you can do binary search on, as you said), and a parallel array of arrays of integers. After you do a binary search to find an index into the String array, you can use the same index to access the array-of-integer array.
While you are building the structure, if you do decide that a search trie is a good choice, I would just use that directly. Otherwise, you could do 2 passes: one to build up a set of strings (then put them into an array and sort them), and a second pass to add the arrays of integers.
If there are patterns to your key strings, especially common roots, then a a Trie could be an effective method of storing significantly less data.
Here's the code for a working TrieMap.
NB: The usual advice on using EntrySet
to iterate across Map
s does not apply to Trie
s. They are exceptionally inefficient in a Trie
so please avoid requesting one if at all possible.
/**
* Implementation of a Trie structure.
*
* A Trie is a compact form of tree that takes advantage of common prefixes
* to the keys.
*
* A normal HashSet will take the key and compute a hash from it, this hash will
* be used to locate the value through various methods but usually some kind
* of bucket system is used. The memory footprint resulting becomes something
* like O(n).
*
* A Trie structure essentuially combines all common prefixes into a single key.
* For example, holding the strings A, AB, ABC and ABCD will only take enough
* space to record the presence of ABCD. The presence of the others will be
* recorded as flags within the record of ABCD structure at zero cost.
*
* This structure is useful for holding similar strings such as product IDs or
* credit card numbers.
*
*/
public class TrieMap<V> extends AbstractMap<String, V> implements Map<String, V> {
/**
* Map each character to a sub-trie.
*
* Could replace this with a 256 entry array of Tries but this will handle
* multibyte character sets and I can discard empty maps.
*
* Maintained at null until needed (for better memory footprint).
*
*/
protected Map<Character, TrieMap<V>> children = null;
/**
* Here we store the map contents.
*/
protected V leaf = null;
/**
* Set the leaf value to a new setting and return the old one.
*
* @param newValue
* @return old value of leaf.
*/
protected V setLeaf(V newValue) {
V old = leaf;
leaf = newValue;
return old;
}
/**
* I've always wanted to name a method something like this.
*/
protected void makeChildren () {
if ( children == null ) {
// Use a TreeMap to ensure sorted iteration.
children = new TreeMap<Character, TrieMap<V>>();
}
}
/**
* Finds the TrieMap that "should" contain the key.
*
* @param key
*
* The key to find.
*
* @param grow
*
* Set to true to grow the Trie to fit the key.
*
* @return
*
* The sub Trie that "should" contain the key or null if key was not found and
* grow was false.
*/
protected TrieMap<V> find(String key, boolean grow) {
if (key.length() == 0) {
// Found it!
return this;
} else {
// Not at end of string.
if (grow) {
// Grow the tree.
makeChildren();
}
if (children != null) {
// Ask the kids.
char ch = key.charAt(0);
TrieMap<V> child = children.get(ch);
if (child == null && grow) {
// Make the child.
child = new TrieMap<V>();
// Store the child.
children.put(ch, child);
}
if (child != null) {
// Find it in the child.
return child.find(tail(key), grow);
}
}
}
return null;
}
/**
* Remove the head (first character) from the string.
*
* @param s
*
* The string.
*
* @return
*
* The same string without the first (head) character.
*
*/
// Suppress warnings over taking a subsequence
private String tail(String s) {
return s.substring(1, s.length());
}
/**
*
* Add a new value to the map.
*
* Time footprint = O(s.length).
*
* @param s
*
* The key defining the place to add.
*
* @param value
*
* The value to add there.
*
* @return
*
* The value that was there, or null if it wasn't.
*
*/
@Override
public V put(String key, V value) {
V old = null;
// If empty string.
if (key.length() == 0) {
old = setLeaf(value);
} else {
// Find it.
old = find(key, true).put("", value);
}
return old;
}
/**
* Gets the value at the specified key position.
*
* @param o
*
* The key to the location.
*
* @return
*
* The value at that location, or null if there is no value at that location.
*/
@Override
public V get(Object o) {
V got = null;
if (o != null) {
String key = (String) o;
TrieMap<V> it = find(key, false);
if (it != null) {
got = it.leaf;
}
} else {
throw new NullPointerException("Nulls not allowed.");
}
return got;
}
/**
* Remove the value at the specified location.
*
* @param o
*
* The key to the location.
*
* @return
*
* The value that was removed, or null if there was no value at that location.
*/
@Override
public V remove(Object o) {
V old = null;
if (o != null) {
String key = (String) o;
if (key.length() == 0) {
// Its me!
old = leaf;
leaf = null;
} else {
TrieMap<V> it = find(key, false);
if (it != null) {
old = it.remove("");
}
}
} else {
throw new NullPointerException("Nulls not allowed.");
}
return old;
}
/**
* Count the number of values in the structure.
*
* @return
*
* The number of values in the structure.
*/
@Override
public int size() {
// If I am a leaf then size increases by 1.
int size = leaf != null ? 1 : 0;
if (children != null) {
// Add sizes of all my children.
for (Character c : children.keySet()) {
size += children.get(c).size();
}
}
return size;
}
/**
* Is the tree empty?
*
* @return
*
* true if the tree is empty.
* false if there is still at least one value in the tree.
*/
@Override
public boolean isEmpty() {
// I am empty if I am not a leaf and I have no children
// (slightly quicker than the AbstaractCollection implementation).
return leaf == null && (children == null || children.isEmpty());
}
/**
* Returns all keys as a Set.
*
* @return
*
* A HashSet of all keys.
*
* Note: Although it returns Set<S> it is actually a Set<String> that has been
* home-grown because the original keys are not stored in the structure
* anywhere.
*/
@Override
public Set<String> keySet() {
// Roll them a temporary list and give them a Set from it.
return new HashSet<String>(keyList());
}
/**
* List all my keys.
*
* @return
*
* An ArrayList of all keys in the tree.
*
* Note: Although it returns List<S> it is actually a List<String> that has been
* home-grown because the original keys are not stored in the structure
* anywhere.
*
*/
protected List<String> keyList() {
List<String> contents = new ArrayList<String>();
if (leaf != null) {
// If I am a leaf, a null string is in the set.
contents.add((String) "");
}
// Add all sub-tries.
if (children != null) {
for (Character c : children.keySet()) {
TrieMap<V> child = children.get(c);
List<String> childContents = child.keyList();
for (String subString : childContents) {
// All possible substrings can be prepended with this character.
contents.add((String) (c + subString.toString()));
}
}
}
return contents;
}
/**
* Does the map contain the specified key.
*
* @param key
*
* The key to look for.
*
* @return
*
* true if the key is in the Map.
* false if not.
*/
public boolean containsKey(String key) {
TrieMap<V> it = find(key, false);
if (it != null) {
return it.leaf != null;
}
return false;
}
/**
* Represent me as a list.
*
* @return
*
* A String representation of the tree.
*/
@Override
public String toString() {
List<String> list = keyList();
//Collections.sort((List<String>)list);
StringBuilder sb = new StringBuilder();
Separator comma = new Separator(",");
sb.append("{");
for (String s : list) {
sb.append(comma.sep()).append(s).append("=").append(get(s));
}
sb.append("}");
return sb.toString();
}
/**
* Clear down completely.
*/
@Override
public void clear() {
children = null;
leaf = null;
}
/**
* Return a list of key/value pairs.
*
* @return
*
* The entry set.
*/
public Set<Map.Entry<String, V>> entrySet() {
Set<Map.Entry<String, V>> entries = new HashSet<Map.Entry<String, V>>();
List<String> keys = keyList();
for (String key : keys) {
entries.add(new Entry<String,V>(key, get(key)));
}
return entries;
}
/**
* An entry.
*
* @param <S>
*
* The type of the key.
*
* @param <V>
*
* The type of the value.
*/
private static class Entry<S, V> implements Map.Entry<S, V> {
protected S key;
protected V value;
public Entry(S key, V value) {
this.key = key;
this.value = value;
}
public S getKey() {
return key;
}
public V getValue() {
return value;
}
public V setValue(V newValue) {
V oldValue = value;
value = newValue;
return oldValue;
}
@Override
public boolean equals(Object o) {
if (!(o instanceof TrieMap.Entry)) {
return false;
}
Entry e = (Entry) o;
return (key == null ? e.getKey() == null : key.equals(e.getKey()))
&& (value == null ? e.getValue() == null : value.equals(e.getValue()));
}
@Override
public int hashCode() {
int keyHash = (key == null ? 0 : key.hashCode());
int valueHash = (value == null ? 0 : value.hashCode());
return keyHash ^ valueHash;
}
@Override
public String toString() {
return key + "=" + value;
}
}
}