How to tell Hadoop not to delete temporary directory

Published 2019-04-17 08:20

Question:

By default, Hadoop map tasks write processed records to files in a temporary directory at ${mapred.output.dir}/_temporary/_${taskid}. These files stay there until the FileOutputCommitter moves them to ${mapred.output.dir} after the task finishes successfully. In my case, the setup() of a map task needs to create files under this temporary directory, where I write some process-related data that is used later elsewhere. However, when Hadoop tasks are killed, the temporary directory is removed from HDFS.
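The per-attempt path layout described above can be sketched with a small helper. The helper name and the example attempt ID are illustrative assumptions; in a real job the per-attempt work directory should be obtained via FileOutputFormat.getWorkOutputPath(context) rather than assembled by hand:

```java
// Sketch of the per-attempt temporary path layout described above.
// The helper and the example attempt ID are illustrative only; real jobs
// should use FileOutputFormat.getWorkOutputPath(context) instead.
public class TaskTempPath {

    // ${mapred.output.dir}/_temporary/_${taskid}
    static String taskTempPath(String outputDir, String taskAttemptId) {
        return outputDir + "/_temporary/_" + taskAttemptId;
    }

    public static void main(String[] args) {
        String dir = taskTempPath("/user/jobs/out",
                "attempt_201904170820_0001_m_000000_0");
        System.out.println(dir);
        // /user/jobs/out/_temporary/_attempt_201904170820_0001_m_000000_0
    }
}
```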

Does anyone know whether it is possible to tell Hadoop not to delete this directory after a task is killed, and how to achieve that? I assume there is some property I can configure.

Regards

Answer 1:

It's not good practice to depend on temporary files, whose location and format can change between releases.

That said, setting mapreduce.task.files.preserve.failedtasks to true keeps the temporary files of all failed tasks, and setting mapreduce.task.files.preserve.filepattern to a regular expression over task IDs keeps the temporary files of every matching task, regardless of success or failure.
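A minimal sketch of how these properties might be set in a Hadoop configuration file (the property names are from the answer above; the pattern value is illustrative):

```xml
<configuration>
  <!-- Keep temporary files for all failed tasks -->
  <property>
    <name>mapreduce.task.files.preserve.failedtasks</name>
    <value>true</value>
  </property>
  <!-- Keep temporary files for any task attempt whose ID matches this
       regex, whether it succeeds or fails (pattern is illustrative) -->
  <property>
    <name>mapreduce.task.files.preserve.filepattern</name>
    <value>.*_m_000123_.*</value>
  </property>
</configuration>
```

The same settings can also be passed per job on the command line with -D, e.g. `-D mapreduce.task.files.preserve.failedtasks=true`, instead of editing the cluster-wide configuration.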