I want to compute a multiway join in the Hadoop framework. When the number of records per relation grows beyond a certain threshold, I face two memory problems,
1) Error: GC overhead limit exceeded,
2) Error: Java heap space.
The threshold is about 1,000,000 records per relation, for both a chain join and a star join.
In the join computation I use some hash tables, e.g.
Hashtable<V, LinkedList<K>> ht = new Hashtable<V, LinkedList<K>>(someSize, 0.75F);
These errors occur while I hash the input records, and for the moment only then. During the hashing I have quite a few for loops which produce a lot of temporary objects, and this is what causes problem 1). I solved problem 1) by setting K = StringBuilder: since a StringBuilder is mutable, I can keep just a few objects whose content changes while the objects themselves are reused, instead of creating many temporary objects.
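To make the reuse pattern concrete, the hashing loop looks roughly like the sketch below. This is only an illustration: the record layout (join attribute and payload separated by a tab), the field positions, and the method name are assumptions, not my actual code.

import java.util.Hashtable;
import java.util.LinkedList;
import java.util.List;

public class HashingSketch {
    // Builds the hash table for one relation, reusing a single StringBuilder
    // buffer instead of allocating a new String per record.
    static Hashtable<String, LinkedList<StringBuilder>> buildTable(List<String> records, int someSize) {
        Hashtable<String, LinkedList<StringBuilder>> ht =
                new Hashtable<String, LinkedList<StringBuilder>>(someSize, 0.75F);
        StringBuilder buffer = new StringBuilder();      // reused across iterations

        for (String record : records) {
            int tab = record.indexOf('\t');
            if (tab < 0) continue;                       // skip malformed records
            String joinKey = record.substring(0, tab);   // join attribute (assumed position)

            buffer.setLength(0);                         // clear and reuse the same object
            buffer.append(record, tab + 1, record.length());

            LinkedList<StringBuilder> bucket = ht.get(joinKey);
            if (bucket == null) {
                bucket = new LinkedList<StringBuilder>();
                ht.put(joinKey, bucket);
            }
            // A copy has to be stored here; keeping the shared buffer itself
            // would make every list entry point to the same (last) content.
            bucket.add(new StringBuilder(buffer));
        }
        return ht;
    }
}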
Now I am dealing with problem 2). I increased the heap space on each node of my cluster by setting the appropriate variable in the file $HADOOP_HOME/hadoop/conf/hadoop-env.sh, but the problem persisted. I also did some very basic monitoring of the heap with VisualVM. I monitored only the master node, and in particular the JobTracker and the local TaskTracker daemons, and I didn't notice any heap overflow during this monitoring. The PermGen space didn't overflow either.
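For reference, the variable I changed in hadoop-env.sh is the daemon heap size; the value below is just a placeholder, not my actual setting.

# hadoop-env.sh: maximum heap (in MB) for the Hadoop daemons started on this node
export HADOOP_HEAPSIZE=2000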
So for the moment, in the declaration,
Hashtable<V, LinkedList<K>> ht = new Hashtable<V, LinkedList<K>>(someSize, 0.75F);
I am thinking of setting V = SomeFinalClass. This SomeFinalClass would help me keep the number of objects low and consequently the memory usage. Of course, by default a SomeFinalClass object will have the same hash code regardless of its content, because the default Object.hashCode() does not depend on the content, so I would not be able to use it as a key in the hash table above. To solve this, I am thinking of overriding the default hashCode() with a method similar to String.hashCode(), which would produce a hash code based on the content of a SomeFinalClass object.
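A rough sketch of what I have in mind is below. The class name and its field are placeholders; the hashCode() follows the 31-based polynomial that String.hashCode() uses, and equals() is overridden as well, since Hashtable compares keys with equals() after the hash codes match.

// Rough sketch of SomeFinalClass (name and field are placeholders).
// One reusable object whose content can change, but whose hashCode()
// and equals() depend on the current content, like String.
public final class SomeFinalClass {

    private final StringBuilder content = new StringBuilder();

    // Reuse the same object by replacing its content instead of
    // creating a new instance per record.
    public void setContent(CharSequence newContent) {
        content.setLength(0);
        content.append(newContent);
    }

    // Content-based hash code, in the style of String.hashCode():
    // h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    @Override
    public int hashCode() {
        int h = 0;
        for (int i = 0; i < content.length(); i++) {
            h = 31 * h + content.charAt(i);
        }
        return h;
    }

    // Hash table keys also need a content-based equals().
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SomeFinalClass)) return false;
        SomeFinalClass other = (SomeFinalClass) o;
        return content.toString().contentEquals(other.content);
    }
}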
What is your opinion on the problems and the solutions above? What would you do?
Should I also monitor the DataNode daemon? Are the two errors above TaskTracker errors, DataNode errors, or both?
Finally, will the solutions above solve the problem for an arbitrary number of records per relation? Or will I sooner or later run into the same problem again?