可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I try to build a map with the content of a file and my code is as below:

    System.out.println("begin to build the sns map....");
    String basePath = PropertyReader.getProp("oldbasepath");
    String pathname = basePath + "\\user_sns.txt";
    FileReader fr;
    Map<Integer, List<Integer>> snsMap = 
            new HashMap<Integer, List<Integer>>(2000000);
    try {
        fr = new FileReader(pathname);
        BufferedReader br = new BufferedReader(fr);
        String line; 
        int i = 1;
        while ((line = br.readLine()) != null) {
            System.out.println("line number: " + i);
            i++;

            String[] strs = line.split("\t");
            int key = Integer.parseInt(strs[0]);
            int value = Integer.parseInt(strs[1]);
            List<Integer> list = snsMap.get(key);
            //if the follower is not in the map
            if(snsMap.get(key) == null) 
                list = new LinkedList<Integer>();
            list.add(value);
            snsMap.put(key, list);
            System.out.println("map size: " + snsMap.size());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println("finish building the sns map....");
    return snsMap;

The program is very fast at first but gets much slowly when the information printed is :

 map size: 1138338
 line number: 30923602
 map size: 1138338
 line number: 30923603 
 ....

I try to find to reason with two System.out.println() clauses to judge the preformance of BufferedReader and HashMap instead of a Java profiler. Sometimes it takes a while to get the information of the map size after getting the line number information, and sometimes, it takes a while to get the information of the line number information after get the map size. My question is: which makes my program slow? the BufferedReader for a big file or HashMap for a big map?

回答1:

If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse's capturing that ouptut in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.

But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.

回答2:

You will have to check you program with some profiling tools to understand why it is slow. In general file access is much more slower than in memory operations (unless you are constrained in memory and doing excess GC) so the guess would be that reading file could be the slower here.

回答3:

Before you profiled, you will not know what is slow and what isn't.

Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.

An obivous optimization for your code is to move the line

snsMap.put(key, list);

into the if statement. You only need to put this when you created a new list. Otherwise, the put will just replace the current value with itself.

Java cost associated with Integer objects (and in particular the use of Integers in the Java Collections API) is largely a memory (and thus Garbage Collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU trove, depending how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU trove. I'd avoid linked lists, too, in particular for primitive types.

Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU trove TIntObjectHash<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.

The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.

A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.

If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.

回答4:

The best way is to run your program with profiler (for example, JProfile) and see what parts are slow. Also debug output can slow your program, for example.