Hadoop DistCp using wildcards?

Published 2020-05-23 03:41

Question:

Is it possible to use DistCp to copy only files that match a certain pattern? For example, for /foo I only want the *.log files.

Answer 1:

I realize this is an old thread. But I was interested in the answer to this question myself - and dk89 also asked again in 2013. So here we go:

distcp does not support wildcards. The closest you can get is to:

Find the files you want to copy (the sources), filter them using grep, format them as full HDFS URIs using awk, and write the result to an "input-files" list:

hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ \
  | grep -e 'webapp\.log\.3\.' \
  | awk '{print "hdfs://localhost:9000" $8}' > input-files.txt
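The awk step simply prepends the namenode URI to the path column of the listing. A minimal sketch of just that step, run on a sample listing line (assuming the listing's eighth whitespace-separated field is the path and that it starts with a slash; the file name is invented for illustration):

```shell
# Sample line in the old `hadoop dfs -lsr` layout:
# permissions, replication, owner, group, size, date, time, path
sample='-rw-r--r--   3 hdfs hdfs   1048576 2020-05-23 03:41 /path/to/source/dir/webapp.log.3.1'

# Prepend the namenode URI to field 8 (the path), as the pipeline above does.
echo "$sample" | awk '{print "hdfs://localhost:9000" $8}'
# prints: hdfs://localhost:9000/path/to/source/dir/webapp.log.3.1
```

Note there is no slash between the URI and `$8` here, because the path field already begins with one; if your listing prints full URIs in that column, skip the prefix entirely.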

Put the input-files list into HDFS:

hadoop dfs -put input-files.txt  .

Create the target directory:

hadoop dfs -mkdir hdfs://localhost:9000/path/to/target/

Run distcp with the input-files list, specifying the target HDFS directory:

hadoop distcp -i -f input-files.txt hdfs://localhost:9000/path/to/target/  
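The four steps above can be collected into one sketch of a script. This is an assumption-laden consolidation, not a drop-in tool: the namenode URI and file pattern are the thread's examples, and it uses the Hadoop 2+ spellings (`hadoop fs -ls -R`, `-put -f`, `-mkdir -p`) in place of the deprecated `hadoop dfs -lsr`:

```shell
#!/bin/sh
# Sketch: copy only files matching a pattern with distcp -f.
# NN and the paths below are the thread's example values; adjust for your cluster.
NN='hdfs://localhost:9000'

# 1. Build the source list: recursive listing, filter by pattern,
#    prepend the namenode URI to the path column (field 8).
hadoop fs -ls -R "$NN/path/to/source/dir/" \
  | grep -e 'webapp\.log\.3\.' \
  | awk -v nn="$NN" '{print nn $8}' > input-files.txt

# 2. Stage the list in HDFS, where `distcp -f` reads it from.
hadoop fs -put -f input-files.txt "$NN/input-files.txt"

# 3. Ensure the target directory exists.
hadoop fs -mkdir -p "$NN/path/to/target/"

# 4. Copy every file named in the list, ignoring individual failures (-i).
hadoop distcp -i -f "$NN/input-files.txt" "$NN/path/to/target/"
```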


Answer 2:

DistCp is in fact just a regular map-reduce job, so you can use the same globbing syntax as you would for the input of any map-reduce job. Generally, foo/*.log should suffice. You can experiment with hadoop fs -ls first: if a glob works with fs -ls, it will work with DistCp (well, almost; the differences are too subtle to matter here).
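Concretely, the experiment this answer suggests looks something like the following (paths are hypothetical; the glob is quoted so your local shell does not try to expand it, leaving it for Hadoop to expand against HDFS):

```shell
# Preview which files the glob matches on HDFS.
hadoop fs -ls 'hdfs://localhost:9000/foo/*.log'

# If the listing looks right, hand the same glob to distcp.
hadoop distcp 'hdfs://localhost:9000/foo/*.log' \
              'hdfs://localhost:9000/target/'
```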



Tags: hadoop