Hive map join: out of memory exception

Published 2019-04-25 06:36

Question:

I am trying to perform a map-side join with one big table (10 GB) and one small table (230 MB). After joining on the key columns, I will use all the columns of the small table to produce the output records.

I have used the settings below:

set hive.auto.convert.join=true;

set hive.mapjoin.smalltable.filesize=262144000;
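
For reference, this is roughly what the session looks like; the table and column names (small_t, big_t, key_col) are placeholders for my actual tables:

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=262144000;  -- 262144000 bytes = 250 MB threshold

-- all columns of the small table, joined on the key column
SELECT s.*, b.key_col
FROM small_t s
JOIN big_t b ON s.key_col = b.key_col;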

Logs:

2013-09-20 02:43:50     Starting to launch local task to process map join;      maximum       memory = 1065484288

2013-09-20 02:44:05     Processing rows:        200000  Hashtable size: 199999  Memory usage:   430269904       rate:0.404

2013-09-20 02:44:14     Processing rows:        300000  Hashtable size: 299999  Memory usage:   643070664       rate:0.604

Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:313)
        at java.util.jar.Manifest$FastInputStream.<init>(Manifest.java:308)
        at java.util.jar.Manifest.read(Manifest.java:176)
        at java.util.jar.Manifest.<init>(Manifest.java:50)
        at java.util.jar.JarFile.getManifestFromReference(JarFile.java:168)
        at java.util.jar.JarFile.getManifest(JarFile.java:149)
        at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:696)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:228)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at org.apache.hadoop.util.RunJar$1.run(RunJar.java:126)
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID:
  Stage-7
Logs:
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.MapredLocalTask
ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask

But I am still facing the OOM exception. The heap size set in my cluster is 1 GB. Please advise which properties I need to consider and tune to make this map-side join work.

Answer 1:

Processing rows: 300000 Hashtable size: 299999 Memory usage: 643070664 rate:0.604

At 300k rows the hash table already uses 60% of your heap. The first question to ask: are you sure you got the table order right? Is the small table in the join really the smaller table in your data? When writing the query, the large table should be the last one in the JOIN clause. Which Hive version are you on, 0.9 or 0.11?

If you are on Hive 0.11 and you are specifying the join correctly, then the first thing to try is increasing the heap size. From the data above (300k rows ≈ 650 MB of heap), you can estimate how much heap you need.
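
To make that concrete, here is a rough estimate from your own log, plus a sketch of the join order (small_t, big_t and the key column k are placeholder names):

-- From your log: 643070664 bytes / 300000 rows ~ 2.1 KB of heap per row,
-- so a hash table over N small-table rows needs roughly N x 2.1 KB of heap.

-- Large table last in the JOIN clause; the MAPJOIN hint names the small table:
SELECT /*+ MAPJOIN(s) */ s.*, b.k
FROM small_t s
JOIN big_t b ON s.k = b.k;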



Answer 2:

I faced this problem and was only able to get past it by using set hive.auto.convert.join=false.



Answer 3:

With set hive.auto.convert.join=false; it will not give you a memory exception.



Answer 4:

You should take this into account, especially when tables are stored with compression: the table size may not look too large, but once decompressed it can grow 10x or more, and on top of that, representing the data in a hash table takes even more space. So your table might well be smaller than ~260 MB, the value you set for hive.mapjoin.smalltable.filesize, but the hash-table representation of its decompressed version might not fit in memory. That is why Hive still tries to load the table into memory and eventually hits the OutOfMemoryError. According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization: "There is no check to see if the table is a compressed one or not and what the potential size of the table can be."
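
One way to check this, assuming statistics have been gathered for the table (small_t is a placeholder name):

ANALYZE TABLE small_t COMPUTE STATISTICS;
DESCRIBE FORMATTED small_t;
-- In the output, totalSize is the on-disk (possibly compressed) size that
-- hive.mapjoin.smalltable.filesize is compared against, while rawDataSize
-- is closer to what the decompressed data, and hence the hash table, will occupy.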



Answer 5:

set hive.auto.convert.join = false;

It won't give you a memory exception because it is not using a map-side join; it uses the normal MapReduce task instead.
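
A minimal sketch of that fallback, with placeholder names again; the join then runs as a regular reduce-side join on the cluster instead of building a local hash table:

set hive.auto.convert.join=false;

-- runs as a common (shuffle) join, trading speed for stability
SELECT s.*, b.k
FROM small_t s
JOIN big_t b ON s.k = b.k;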