Custom HashMap Code Issue

Posted 2019-01-20 14:52

Question:

I have the following code, in which I use a hash map built on two parallel arrays to store key-value pairs (a key can have multiple values). I need to save it and load it again for future use, which I do with a FileChannel. The problem is that I can store only about 120 million key-value pairs on my 8 GB server (I can allocate roughly 5 GB of the 8 GB to the JVM; the two parallel arrays take about 2.5 GB, and the rest is used by other processing in my code). However, I need to store about 600 to 700 million key-value pairs. Can anybody help me modify this code so that it can hold that many? Any other comments would also be welcome.

Another point: I have to save the hash map to disk and load it back into memory, and this takes quite a long time with a FileChannel. I tried other suggestions from Stack Overflow but did not find anything faster; ObjectOutputStream and a zipped output stream were both slower than the code below. Is there a way to store the two parallel arrays so that loading is much faster? I have included a test case in the code below. Any comments on that would also be helpful.

import java.io.*;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Arrays;
import java.util.Random;
import java.nio.*;
import java.nio.channels.FileChannel;
import java.io.RandomAccessFile;

public class Test {

    public static void main(String args[]) {


        try {

            Random randomGenerator = new Random();

            LongIntParallelHashMultimap lph = new LongIntParallelHashMultimap(220000000, "xx.dat", "yy.dat");

            for (int i = 0; i < 110000000; i++) {
                lph.put(i, randomGenerator.nextInt(200000000));
            }

            lph.save();

            LongIntParallelHashMultimap lphN = new LongIntParallelHashMultimap(220000000, "xx.dat", "yy.dat");
            lphN.load();

            int tt[] = lphN.get(1);

            System.out.println(tt[0]);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

class LongIntParallelHashMultimap {

    private static final long NULL = -1L;
    private final long[] keys;
    private final int[] values;
    private int size;
    private int savenum = 0;
    private String str1 = "";
    private String str2 = "";

    public LongIntParallelHashMultimap(int capacity, String st1, String st2) {
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, NULL);
        savenum = capacity;
        str1 = st1;
        str2 = st2;
    }

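    public void put(long key, int value) {
        if (key == NULL) {
            throw new IllegalArgumentException("key -1 is reserved as the empty-slot marker");
        }
        if (size >= keys.length - 1) {
            // keep one slot free so the probe loops in put/get always terminate
            throw new IllegalStateException("table is full");
        }
        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }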

    public int[] get(long key) {
        int index = indexFor(key);
        int count = countHits(key, index);
        int[] hits = new int[count];
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
            }
            index = successor(index);
        }

        return hits;
    }

    private int countHits(long key, int index) {
        int numHits = 0;
        while (keys[index] != NULL) {
            if (keys[index] == key) {
                ++numHits;
            }
            index = successor(index);
        }
        return numHits;
    }

    private int indexFor(long key) {
        return Math.abs((int) ((key * 5700357409661598721L) % keys.length));
    }

    private int successor(int index) {
        return (index + 1) % keys.length;
    }

    public int size() {
        return size;
    }

    public void load() {
        // try-with-resources closes both channels even if mapping or reading fails
        try (FileChannel channel2 = new RandomAccessFile(str1, "r").getChannel();
             FileChannel channel3 = new RandomAccessFile(str2, "r").getChannel()) {
            MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
            mbb2.order(ByteOrder.nativeOrder());
            // 8L forces long arithmetic so the product cannot overflow an int
            assert mbb2.remaining() == savenum * 8L;
            size = 0;
            for (int i = 0; i < savenum; i++) {
                long l = mbb2.getLong();
                keys[i] = l;
                if (l != NULL) {
                    ++size; // the entry count is not stored in the file, so recount it
                }
            }

            MappedByteBuffer mbb3 = channel3.map(FileChannel.MapMode.READ_ONLY, 0, channel3.size());
            mbb3.order(ByteOrder.nativeOrder());
            assert mbb3.remaining() == savenum * 4L;
            for (int i = 0; i < savenum; i++) {
                int l1 = mbb3.getInt();
                values[i] = l1;
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }

    public void save() {
        // try-with-resources closes both channels even if mapping or writing fails
        try (FileChannel channel = new RandomAccessFile(str1, "rw").getChannel();
             FileChannel channel1 = new RandomAccessFile(str2, "rw").getChannel()) {
            // 8L forces long arithmetic; a single mapping is still limited to Integer.MAX_VALUE bytes
            MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, savenum * 8L);
            mbb.order(ByteOrder.nativeOrder());
            for (int i = 0; i < savenum; i++) {
                mbb.putLong(keys[i]);
            }

            MappedByteBuffer mbb1 = channel1.map(FileChannel.MapMode.READ_WRITE, 0, savenum * 4L);
            mbb1.order(ByteOrder.nativeOrder());
            for (int i = 0; i < savenum; i++) {
                mbb1.putInt(values[i]);
            }
        } catch (Exception e) {
            System.out.println("Exception : " + e);
        }
    }
}

Answer 1:

I doubt this is possible, given the data types you have declared. Just multiply the sizes of the primitive types.

Each row requires 4 bytes to store an int and 8 bytes to store a long. 600 million rows × 12 bytes per row = 7.2 billion bytes, which is about 7.2 GB (6.7 GiB). You say you can allocate 5 GB to the JVM, so even if all of it were heap and held nothing but this custom HashMap, the data would not fit. Consider shrinking the data types involved or storing the data somewhere other than RAM.
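
The multiplication above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the half-full case is an assumption taken from the question's test case, which allocates 220 million slots for 110 million entries.

```java
class FootprintEstimate {
    /** Bytes needed for the given number of slots, each one long key plus one int value. */
    static long bytesFor(long slots) {
        return slots * (Long.BYTES + Integer.BYTES); // 8 + 4 = 12 bytes per slot
    }

    public static void main(String[] args) {
        long pairs = 600_000_000L;
        long packed = bytesFor(pairs);        // 7,200,000,000 bytes with no spare slots
        long halfFull = bytesFor(2 * pairs);  // 14,400,000,000 bytes if the table is kept half empty
        double gib = 1L << 30;
        System.out.printf("packed:    %d bytes (%.2f GiB)%n", packed, packed / gib);
        System.out.printf("half full: %d bytes (%.2f GiB)%n", halfFull, halfFull / gib);
    }
}
```

Either figure is above the 5 GB available to the JVM, and an open-addressing table needs some spare slots to keep probe sequences short, so the real requirement is closer to the second number than to the first.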



Answer 2:

Keep the database on disk rather than in memory. Rewrite your operations so that they work on buffers instead of arrays. You can then open a sufficiently large file and have each operation access the portion it needs through a mapped buffer. It is worth testing whether your application performs better with a cache of the few most recently mapped regions, so that commonly used regions stay mapped instead of being mapped and unmapped repeatedly.

This should give you the best of both worlds, disk and RAM:

  • Random access to any portion of the data structure is easy to implement
  • Access to often used portions of the table will be cached
  • Seldom used portions of the table will not occupy any memory

As you can see, this depends a lot on locality: if some keys are more common than others, things will perform well, whereas nicely distributed keys will cause a new disk operation for each access. So while nice distributions are desirable for most in-memory hash maps, other structures which map often-used keys to similar locations will perform better here. Those will interfere with collision handling, though.
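
A minimal sketch of this idea follows, assuming a 64-bit JVM and a single file of 12-byte slots (8-byte key followed by 4-byte value). The class name, the chunking scheme and the constructor parameters are hypothetical and not from the question. For brevity it maps every chunk up front and leaves it to the operating system's page cache to decide which parts stay in RAM, instead of implementing the cache of recently mapped regions described above. The file is split into chunks because a single mapping cannot exceed Integer.MAX_VALUE bytes.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.util.Arrays;

/** Sketch: a linear-probing long-to-int multimap whose slots live in a mapped file. */
class MappedLongIntMultimap implements AutoCloseable {
    private static final long EMPTY = -1L;  // reserved key, like NULL in the question
    private static final int SLOT_BYTES = Long.BYTES + Integer.BYTES;  // 8-byte key + 4-byte value

    private final RandomAccessFile file;
    private final MappedByteBuffer[] chunks;
    private final long capacity;      // total number of slots
    private final int slotsPerChunk;  // slotsPerChunk * 12 must stay below 2 GiB

    MappedLongIntMultimap(String path, long capacity, int slotsPerChunk) throws IOException {
        this.capacity = capacity;
        this.slotsPerChunk = slotsPerChunk;
        this.file = new RandomAccessFile(path, "rw");
        long fileBytes = capacity * SLOT_BYTES;
        boolean fresh = file.length() != fileBytes;  // unexpected size: treat as a new table
        if (fresh) file.setLength(fileBytes);
        FileChannel channel = file.getChannel();
        chunks = new MappedByteBuffer[(int) ((capacity + slotsPerChunk - 1) / slotsPerChunk)];
        for (int c = 0; c < chunks.length; c++) {
            long firstSlot = (long) c * slotsPerChunk;
            long slots = Math.min(slotsPerChunk, capacity - firstSlot);
            chunks[c] = channel.map(FileChannel.MapMode.READ_WRITE,
                    firstSlot * SLOT_BYTES, slots * SLOT_BYTES);
        }
        if (fresh) {
            for (long s = 0; s < capacity; s++) setKey(s, EMPTY);  // mark every slot empty
        }
    }

    private MappedByteBuffer chunkOf(long slot) { return chunks[(int) (slot / slotsPerChunk)]; }
    private int offsetOf(long slot) { return (int) (slot % slotsPerChunk) * SLOT_BYTES; }
    private long keyAt(long slot) { return chunkOf(slot).getLong(offsetOf(slot)); }
    private int valueAt(long slot) { return chunkOf(slot).getInt(offsetOf(slot) + Long.BYTES); }
    private void setKey(long slot, long key) { chunkOf(slot).putLong(offsetOf(slot), key); }

    private long indexFor(long key) {
        // same multiplier as the question; floorMod keeps the result in [0, capacity)
        return Math.floorMod(key * 5700357409661598721L, capacity);
    }

    void put(long key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("key -1 is reserved");
        long slot = indexFor(key);
        long probes = 0;
        while (keyAt(slot) != EMPTY) {
            if (++probes == capacity) throw new IllegalStateException("table is full");
            slot = (slot + 1) % capacity;
        }
        chunkOf(slot).putInt(offsetOf(slot) + Long.BYTES, value);
        setKey(slot, key);
    }

    int[] get(long key) {
        int[] hits = new int[4];
        int count = 0;
        long slot = indexFor(key);
        for (long probes = 0; probes < capacity && keyAt(slot) != EMPTY; probes++) {
            if (keyAt(slot) == key) {
                if (count == hits.length) hits = Arrays.copyOf(hits, count * 2);
                hits[count++] = valueAt(slot);
            }
            slot = (slot + 1) % capacity;
        }
        return Arrays.copyOf(hits, count);
    }

    @Override
    public void close() throws IOException {
        for (MappedByteBuffer chunk : chunks) chunk.force();  // flush dirty pages to disk
        file.close();
    }

    public static void main(String[] args) throws IOException {
        String path = Files.createTempFile("multimap", ".dat").toString();
        try (MappedLongIntMultimap map = new MappedLongIntMultimap(path, 1000, 64)) {
            map.put(42L, 7);
            map.put(42L, 9);
        }
        // Reopening the file is the "load" step: nothing is copied into heap arrays.
        try (MappedLongIntMultimap map = new MappedLongIntMultimap(path, 1000, 64)) {
            System.out.println(Arrays.toString(map.get(42L)));  // prints [7, 9]
        }
    }
}
```

Because the file itself is the table, there is no separate save or load pass, and the mapped pages live outside the Java heap. The locality caveat above still applies: a hash that spreads keys evenly will touch a different page on almost every lookup once the file is larger than the available RAM.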



Answer 3:

It would be better to use an embedded database such as SQLite, which can also run in memory and should give good results.