Hive overwrite directory: can the move process use distcp?

Posted 2019-04-13 07:15

Question:

When I run an INSERT OVERWRITE DIRECTORY query in Hive, it seems to store the results in a .hivexxxx staging folder and then move the files from there to the directory...

At the end of the map reduce process, it shows this:

Moving data to: hdfs://nameservice1/user/events/Click2/.hive-staging_hive_2015-11-21_08-32-49_909_6034680686432863037-1/-ext-10000
Moving data to: /user/events/Click2

This move process runs really slowly and doesn't seem to be using distcp.

Is there a way to make Hive use distcp during that process, or a way to set it so it doesn't put data into that staging folder? I don't see the point of that staging folder...

Answer 1:

Unless you're using HDFS federation and you've configured Hive to put a job's .staging* dir on a different FS/namespace than the destination dir (which is very unlikely to happen with the default settings), you probably don't want Hive to use distcp. What Hive is doing now is copying all the output files from the .staging dir to the final destination dir. Using distcp would do the same thing (copying), plus add the overhead of spawning a whole MapReduce job for every file (that's the behavior I've seen in Hive 1.1), so performance would likely be much worse. The only possible exception is if your output files are extremely large...

But why copy if you don't have to? That means reading and re-writing all the files. An HDFS move/rename simply changes the metadata of the files and is nearly instant.

To get that behavior, I recommend adding the following (unfortunately undocumented) property to your hive-site.xml:

<property>
    <name>hive.exec.stagingdir</name>
    <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
    <description>
      In Hive &gt;= 0.14, set to ${hive.exec.scratchdir}/${user.name}/.staging
      In Hive &lt; 0.14, set to ${hive.exec.scratchdir}/.staging

      You may need to manually create and/or set appropriate permissions on
      the parent dirs ahead of time.
    </description>
</property>

If ${hive.exec.scratchdir} does not get substituted automatically in your version of Hive, look up its value and substitute it manually in the value above. For example, with the default value of hive.exec.scratchdir, you would set this property to /tmp/hive/${user.name}/.staging in Hive >= 0.14 and to /tmp/hive-${user.name}/.staging in Hive < 0.14. (You shouldn't have to substitute ${user.name} manually, and it's not a good idea to do so, for reasons that are off-topic for this answer.)
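
If you'd rather test the effect before editing hive-site.xml, the same property can typically be set for a single session instead. This is only a sketch under assumptions: the staging path assumes the default scratch dir /tmp/hive of Hive >= 0.14, `etl_user` is a placeholder user name, and the `clicks_raw` table is hypothetical (the destination directory is the one from the question).

```sql
-- Session-level override; assumes your deployment allows changing
-- this property at runtime.
-- 'etl_user' is a placeholder: use the account that runs the query.
SET hive.exec.stagingdir=/tmp/hive/etl_user/.staging;

-- Hypothetical query: clicks_raw stands in for your real source table.
INSERT OVERWRITE DIRECTORY '/user/events/Click2'
SELECT * FROM clicks_raw;
```

If the setting took effect, the first "Moving data to:" line in the job output should point under /tmp/hive/etl_user/.staging rather than under the destination directory.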



Tags: hadoop hive