Note: I have browsed all the topics on this problem and I understand that it usually comes down to JVM settings and efficient coding, but I don't know how to improve any further.
I am processing a large (1 GB) text file of CAIDA network topologies, basically a dump of the entire Internet IPv4 topology. Each line has the format "node continent region country city latitude longitude", and I need to filter out all the duplicate nodes (i.e., nodes with the same latitude/longitude).
I assign a unique name to all nodes with the same geo location and maintain a hashmap of each geo location -> unique name already encountered. I also maintain a hashmap of each old name -> unique name, because in a subsequent step I must process another file where these old names have to be mapped to the new unique name per location.
I wrote this in Java because that is where all my other processing happens, but I'm getting the "GC overhead limit exceeded" error. Below are the code being executed and the error log:
Scanner sc = new Scanner(new File(geo));
String line = null;
HashMap<String, String> nodeGeoMapper = new HashMap<String, String>(); // maps each coordinate pair to a unique node name
HashMap<String, String> nodeMapper = new HashMap<String, String>();    // maps each original node name to a filtered node name (one name per coordinate pair)
PrintWriter output = new PrintWriter(geoFiltered);
output.println("#node.geo Name\tcontinent\tCountry\tregion\tcity\tlatitude\tlongitude");

int nodeCounter = 0;   // counter for generating unique names (declared outside this excerpt in the original)
int frenchCounter = 0;

// declare all variables used in the loop up front to avoid creating thousands of tiny objects
String[] fields = null;
String name = null;
String continent = null;
String country = null;
String region = null;
String city = null;
double latitude = 0.0;
double longitude = 0.0;
String key = null;
boolean seenBefore = true;
String newname = null;
String nodename = null;

while (sc.hasNextLine()) {
    line = sc.nextLine();
    if (line.startsWith("node.geo")) {
        // process a line and retrieve the fields
        fields = line.split("\t"); // split all fields using the tab as separator
        name = fields[0];
        name = name.trim().split(" ")[1]; // fields[0] looks like "node.geo N..."; keep the token after the space
        continent = "";  // this field is empty in the input and gets skipped
        country = fields[2];
        region = fields[3];
        city = fields[4];
        latitude = Double.parseDouble(fields[5]);
        longitude = Double.parseDouble(fields[6]);

        // we only want one node per coordinate pair, so we map each pair to a unique name
        key = makeGeoKey(latitude, longitude);

        // check whether we have seen a node with these coordinates before
        seenBefore = true;
        if (!nodeGeoMapper.containsKey(key)) {
            newname = "N" + nodeCounter;
            nodeCounter++;
            nodeGeoMapper.put(key, newname);
            seenBefore = false;
            if (country.equals("FR"))
                frenchCounter++;
        }
        nodename = nodeGeoMapper.get(key); // retrieve the unique name assigned to these coordinates
        nodeMapper.put(name, nodename);    // keep a reference from old name to new name so we can map later
        if (!seenBefore) {
            output.println("node.geo " + nodename + "\t" + continent + "\t" + country + "\t" + region + "\t" + city + "\t" + latitude + "\t" + longitude);
        }
    }
}
sc.close();
output.close();
nodeGeoMapper = null;
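(makeGeoKey is not shown above; purely for illustration, a minimal helper consistent with how it is called could look like the following. The actual implementation may differ.)

// Hypothetical sketch, not the original code: builds a String map key from a coordinate pair.
private static String makeGeoKey(double latitude, double longitude) {
    return latitude + ":" + longitude;
}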
Error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Matcher.<init>(Unknown Source)
    at java.util.regex.Matcher.toMatchResult(Unknown Source)
    at java.util.Scanner.match(Unknown Source)
    at java.util.Scanner.hasNextLine(Unknown Source)
    at DataProcessing.filterGeoNodes(DataProcessing.java:236)
    at DataProcessing.main(DataProcessing.java:114)
During execution my Java process was constantly running at 80% CPU and using roughly 1,000,000 K of memory (the laptop has 4 GB total). The output file reached 59,987 unique nodes, so that is the number of entries in the geo location -> name hashmap. I don't know the size of the old name -> new name hashmap, but it should be less than Integer.MAX_VALUE because there are not that many lines in my text file.
My two questions are:
how can I improve my code to use less memory or avoid having so much GC? (Edit: please keep it Java 7 compatible.) One possible direction is sketched below this list.
(solved) I've read threads on JVM settings like -Xmx1024m, but I don't know where in the Eclipse IDE I can change these settings. Can someone please show me where to set them and which settings I may want to try?
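On question 1, one direction (a sketch, not a drop-in fix): the stack trace shows the allocations happening inside Scanner's regex machinery (a Matcher is created on every hasNextLine() call). Reading with BufferedReader.readLine() avoids that churn, although the dominant memory cost here is still the nodeMapper keys. Java 7 compatible:

import java.io.BufferedReader;
import java.io.FileReader;

// Sketch: same loop structure as above, but without Scanner's per-call Matcher allocations.
try (BufferedReader reader = new BufferedReader(new FileReader(geo))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (line.startsWith("node.geo")) {
            // ... same per-line processing as in the original loop ...
        }
    }
}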
Thank you
SOLVED: for people with a similar problem, the issue was the nodeMapper hashmap, which had to store 34 million String objects and required over 4 GB of memory. I was able to run my program by first disabling the GC overhead limit check with -XX:-UseGCOverheadLimit and then allocating 4 GB of RAM to my Java process with -Xmx4g. It took a long time to process, but it did work; it was slow because once Java reaches 3-4 GB of RAM it spends a lot of time collecting garbage rather than processing the file. A stronger system would not have had any problems. Thanks for all the help!
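For anyone landing here with the same Eclipse question: the JVM flags go under Run > Run Configurations... > (select your launch configuration) > Arguments tab > "VM arguments" box, for example:

-Xmx4g -XX:-UseGCOverheadLimit

As a rough back-of-envelope check, the 4 GB figure is plausible: each of the ~34 million nodeMapper entries costs an entry object plus a key String (on the order of 100+ bytes combined on a 64-bit JVM; the value Strings are shared references), which already lands in the multi-gigabyte range.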