Java OutOfMemoryError: GC overhead limit exceeded

Posted 2019-09-01 03:06

Question:

Note: I have browsed the existing topics on this problem and I understand that it often comes down to JVM settings and efficient coding, but I don't know how to improve my code any further.

I am processing a large text file (1GB) of CAIDA network topologies; this is basically a dump of the entire Internet IPv4 topology. Each line has the format "node continent region country city latitude longitude", and I need to filter out all duplicate nodes (i.e. nodes with the same latitude/longitude).

I assign a unique name to all nodes with the same geo location and maintain a hashmap of geo location->unique name for every location already encountered. I also maintain a hashmap of oldname->unique name, because in a later step I must process another file in which these old names have to be mapped to the new unique name per location.

I wrote this in Java because this is where all my other processing happens but I'm getting the "GC overhead limit exceeded" error. Below is my code which is being executed and the error log:

        Scanner sc = new Scanner(new File(geo));
        String line = null;

        HashMap<String, String> nodeGeoMapper = new HashMap<String, String>(); // maps each coordinate to a unique node name
        HashMap<String, String> nodeMapper = new HashMap<String, String>(); // maps each original node name to a filtered node name (1 name per geo coordinate)

        PrintWriter output = new PrintWriter(geoFiltered);
        output.println("#node.geo Name\tcontinent\tCountry\tregion\tcity\tlatitude\tlongitude");
        int frenchCounter = 0;

        // declare all variables used in loop to avoid creating thousands of tiny objects
        String[] fields = null;
        String name = null;
        String continent = null;
        String country = null;
        String region = null;
        String city = null;
        double latitude = 0.0;
        double longitude = 0.0;
        String key = null;
        boolean seenBefore = true;
        String newname = null;
        String nodename = null;

        while (sc.hasNextLine()) {
            line = sc.nextLine();
            if (line.startsWith("node.geo")) {

                // process a line and retrieve the fields
                fields = line.split("\t"); // split all fields using the tab character as separator
                name = fields[0];
                name = name.trim().split(" ")[1]; // fields[0] is "node.geo N..."; keep only the node name
                continent = ""; // is empty and gets skipped
                country = fields[2];
                region = fields[3];
                city = fields[4];
                latitude = Double.parseDouble(fields[5]);
                longitude = Double.parseDouble(fields[6]);

                // we only want one node for each coordinate pair so we map to a unique name
                key = makeGeoKey(latitude, longitude);

                // check if we have seen a node with these coordinates before
                seenBefore = true;
                if (!nodeGeoMapper.containsKey(key)) {
                    newname = "N"+nodeCounter;
                    nodeCounter++;
                    nodeGeoMapper.put(key, newname);
                    seenBefore = false;
                    if (country.equals("FR"))
                        frenchCounter++;
                }
                nodename = nodeGeoMapper.get(key); // retrieve the unique name assigned to these geo coordinates
                nodeMapper.put(name, nodename); // keep a reference from old name to new name so we can map later


                if (!seenBefore) {
                //  System.out.println("node.geo "+nodename+"\t"+continent+"\t"+country+"\t"+region+"\t"+city+"\t"+latitude+"\t"+longitude);
                    output.println("node.geo "+nodename+"\t"+continent+"\t"+country+"\t"+region+"\t"+city+"\t"+latitude+"\t"+longitude);
                }

            }
        }
        sc.close();
        output.close();
        nodeGeoMapper = null;

Error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Matcher.<init>(Unknown Source)
at java.util.regex.Matcher.toMatchResult(Unknown Source)
at java.util.Scanner.match(Unknown Source)
at java.util.Scanner.hasNextLine(Unknown Source)
at DataProcessing.filterGeoNodes(DataProcessing.java:236)
at DataProcessing.main(DataProcessing.java:114)

During execution my Java process was constantly running at 80% CPU with roughly 1,000,000K of memory in total (the laptop has 4GB). The output file reached 59987 unique nodes, so this is the number of keys in the GeoLocation->Name hashmap. I don't know the size of the oldName->NewName hashmap, but it should be less than Integer.MAX_VALUE because there are not that many lines in my text file.

My two questions are:

  • How can I improve my code to use less memory or avoid so much GC? (Edit: please keep it Java 7 compatible)

  • (solved) I've read threads on JVM settings like -Xmx1024m, but I don't know where in the Eclipse IDE I can change these settings. Can someone please show me where I need to set them and which settings I may want to try?

Thank you

SOLVED: for people with a similar problem, the issue was the nodeMapper hashmap, which had to store 34 million String objects and therefore needed over 4GB of memory. I was able to run my program by first disabling the GC overhead limit check with -XX:-UseGCOverheadLimit and then allocating 4GB of RAM to my Java process using -Xmx4g. It took a long time but it did work; it was slow because once Java reaches 3-4GB of RAM it spends a lot of time collecting garbage rather than processing the file. A stronger system would not have had any problems. Thanks for all the help!
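
For readers who cannot give the JVM 4GB, one alternative is to shrink the oldName->newName map itself. The following is only a sketch under assumptions: it assumes every old name has the form "N" followed by a non-negative number (as the "N..." comment in the loop suggests) and that the numbers are reasonably dense. The class CompactNodeMapper and its methods are hypothetical, not part of the original program. Since the new names are "N"+nodeCounter, only the counter needs to be stored, as an int in an array indexed by the old number:

```java
import java.util.Arrays;

/** Hypothetical replacement for the HashMap<String, String> nodeMapper. */
class CompactNodeMapper {
    private static final int UNMAPPED = -1;

    // newIds[oldNumber] holds the new node counter, or UNMAPPED
    private int[] newIds = new int[1024];

    CompactNodeMapper() {
        Arrays.fill(newIds, UNMAPPED);
    }

    /** Assumption: names look like "N123"; strips the leading 'N'. */
    static int parseId(String name) {
        return Integer.parseInt(name.substring(1));
    }

    /** Records that oldName maps to the new node "N" + newId. */
    void put(String oldName, int newId) {
        int oldId = parseId(oldName);
        if (oldId >= newIds.length) {
            // grow the array and mark the new slots as unmapped
            int oldLength = newIds.length;
            int newLength = Math.max(oldId + 1, oldLength * 2);
            newIds = Arrays.copyOf(newIds, newLength);
            Arrays.fill(newIds, oldLength, newLength, UNMAPPED);
        }
        newIds[oldId] = newId;
    }

    /** Returns the new name for oldName, or null if it was never mapped. */
    String get(String oldName) {
        int oldId = parseId(oldName);
        if (oldId >= newIds.length || newIds[oldId] == UNMAPPED) {
            return null;
        }
        return "N" + newIds[oldId];
    }
}
```

In the loop this would be used as mapper.put(name, counterForThisLocation) instead of nodeMapper.put(name, nodename). If the old numbers really are dense, 34 million entries cost 4 bytes each, roughly 136MB, with no per-entry String or map node objects for the garbage collector to trace. If the names carry extra characters or the numbers are sparse, parseId and the array layout would need adapting.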

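Answer 1:

For the JVM arguments in Eclipse, edit the run configuration: open Run > Run Configurations..., select your launch configuration, and enter the options in the "VM arguments" box on the Arguments tab.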
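
To check that VM arguments such as -Xmx1024m actually reached the JVM, a small sketch like the one below can be run with the same run configuration. The class name HeapCheck and the helper toMiB are made up for illustration; the Runtime methods are standard.

```java
/** Hypothetical helper that prints the heap limits the running JVM sees. */
class HeapCheck {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit (it may report slightly less)
        System.out.println("max heap (MiB):       " + toMiB(rt.maxMemory()));
        // totalMemory() is what the JVM has reserved so far
        System.out.println("allocated heap (MiB): " + toMiB(rt.totalMemory()));
        // freeMemory() is the unused part of the allocated heap
        System.out.println("free in heap (MiB):   " + toMiB(rt.freeMemory()));
    }
}
```

If the first number does not change after editing the run configuration, the options were probably entered under "Program arguments" rather than "VM arguments".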

Also you can try adding this option when running: -XX:-UseGCOverheadLimit

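An explanation of this flag and your error message: the JVM throws "GC overhead limit exceeded" when it spends more than 98% of its time in garbage collection while recovering less than 2% of the heap. The flag above turns that check off, so the program keeps running (slowly) until the heap is truly exhausted.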
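
On the first question (less garbage in the read loop): the stack trace ends inside Scanner, which runs a regular expression match for every line. BufferedReader.readLine() only scans for line terminators, so it allocates less per line. Below is a minimal Java 7 compatible sketch; the class LineFilter, the method countMatching and the sample input are made up for illustration. It lowers the allocation rate but would not by itself fix a heap that is full of live map entries.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical skeleton of the read loop using BufferedReader instead of Scanner. */
class LineFilter {
    /** Counts the lines that start with prefix, reading one line at a time. */
    static int countMatching(Reader source, String prefix) throws IOException {
        int count = 0;
        // try-with-resources (Java 7) closes the reader even on error
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(prefix)) {
                    count++; // the real loop would split and process the fields here
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // made-up sample input; for the real file pass new java.io.FileReader(geo)
        String sample = "# header\nnode.geo N1\tEU\tFR\nnode.geo N2\tEU\tDE\n";
        System.out.println(countMatching(new StringReader(sample), "node.geo"));
    }
}
```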