Performance tuning a Hive query

I have a Hive query which is selecting about 30 columns and around 400,000 records and inserting them into another table. I have one join in my SQL clause, which is just an inner join.

The query fails because of a Java GC overhead limit exceeded.

What's strange is that if I remove the join clause and just select the data from the table (slightly higher volume) then the query works fine.

I'm pretty new to Hive. I can't understand why this join is causing memory exceptions.

Is there something that I should be aware of with regards to how I write Hive queries so that they don't cause these issues? Could anyone explain why the join might cause this issue but selecting a higher volume of data and the same number of columns does not.

Appreciate your thoughts on this. Thanks

回答1:

Depending on the version of Hive and your configuration, the answer to your question may vary. It would be easier if you could share your exact query along with the create statements of the two tables and an estimate of their sizes.

To get a better understanding of the problem, let's go through how a "regular" inner join works in Hive.

Hive join in MapReduce:

Here is a simplified description of how an inner join in Hive gets compiled to MapReduce. In general, if you have two tables t1 and t2 with a join query like:

SELECT
   t1.key, t1.value, t2.value
FROM
   t1
   JOIN
   t2 (ON t1.key = t2.key);

Where, t1 has the following contents:

k_1    v1_1
k_2    v1_2
k_3    v1_3

Where, t2 has the following contents:

k_2    v2_2
k_3    v2_3
k_4    v2_4

We would expect the join result to be

k_2    v1_2    v2_2
k_3    v1_3    v2_3

Assuming the tables are stored on HDFS, their contents will be split up into File Splits. A mapper will take a file split as input and emit out the key as the key column of the table and the value as the composite of the value column of the table and a flag (representing which table the record is from i.e. t1 or t2).

For t1:

k_1, <v1_1, t1>
k_2, <v1_2, t1>
k_3, <v1_3, t1>

For t2:

k_2, <v2_2, t2>
k_3, <v2_3, t2>
k_4, <v2_4, t2>

Now, these emitted out records go through the shuffle phase where all the records with the same keys are grouped together and sent to a reducer. The context of each reduce operation is one key and a list containing all the values corresponding to that key. In practice, one reducer will perform several reduce operations.

In the above example, we would get the following groupings:

k_1, <<v1_1, t1>>
k_2, <<v1_2, t1>, <v2_2, t2>>
k_3, <<v1_3, t1>, <v2_3, t2>>
k_4, <<v2_4, t2>>

Here is what happens in the reducer. For each of the values in the list of values, the reducer will perform a multiplication if the values correspond to different tables.

For k_1, there is no value from t2 and nothing is emitted.

For k_2, a multiplication of values is emitted - k_2, v1_2, v2_2 (since there is one value from each table, 1x1 = 1)

For k_3, a multiplication of values is emitted - k_3, v1_3, v2_3 (since there is one value from each table, 1x1 = 1)

For k_4, there is no value from t1 and nothing is emitted. Hence you obtain the result that you expected from your inner join.

Ok, so what do I do?

It's possible that there is skew in your data. In other words, when the reducer gets the data, the list of values corresponding to some key is very long which causes an error. To alleviate the problem, you may try bumping up the memory available to your JVM. You can do so by setting mapred.child.java.opts to a value like -Xmx512M in your hive-site.xml. You can query the present value of this parameter by doing set mapred.child.java.opts; in your Hive shell.
You can try using alternatives to "regular" join, e.g. map join. The above explanation of joins applies to regular joins where the joining happens in reducers. Depending on the version of Hive you are using, Hive may automatically be able to convert a regular join to map join which is faster (because the join happens in map phase). To enable the optimization, set hive.auto.convert.join to true. This property was introduced in Hive 0.7
In addition to setting hive.auto.convert.join to true, you may also set hive.optimize.skewjoin to true. This will work around the skew in your data problem described in 1.

回答2:

Many thanks for the response Mark. Much appreciated.

After many hours I eventually found out that the order of tables in the the join statement makes a difference. For optimum performance and memory management the last join should be the largest table.

Changing the order of my tables in the join statement fixed the issue.

See Largest Table Last at http://hive.apache.org/docs/r0.9.0/language_manual/joins.html

Your explanation above is very useful as well. Many Thanks