Question:
Is it possible to use DistCp to copy only files that match a certain pattern? For example, for /foo I only want *.log files.
Answer 1:
I realize this is an old thread, but I was interested in the answer to this question myself, and dk89 also asked again in 2013. So here we go:
distcp does not support wildcards. The closest you can get is the following:
Find the files you want to copy (the sources), filter them using grep, format them as HDFS URIs using awk, and write the result to an "input-files" list:
hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ \
| grep -e webapp.log.3. | awk '{print "hdfs://localhost:9000/" $8}' > input-files.txt
Put the input-files list into HDFS:
hadoop dfs -put input-files.txt .
Create the target directory:
hadoop dfs -mkdir hdfs://localhost:9000/path/to/target/
Run distcp using the input-files list and specifying the target hdfs dir:
hadoop distcp -i -f input-files.txt hdfs://localhost:9000/path/to/target/
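The filtering step above can be sketched as a small shell function, which makes it easy to try against a saved listing before touching the cluster. The function name build_source_list and its two arguments are hypothetical, not part of Hadoop. The sketch assumes the listing has the path in column 8 (as in the awk command above), that paths contain no spaces, and that the path column is not already a full URI. Unlike the one-liner, it also skips directory entries (lines whose permissions start with "d").

```shell
#!/bin/sh
# build_source_list PATTERN URI_PREFIX
# Reads "hadoop dfs -lsr"-style lines on stdin and prints one
# fully qualified URI per matching file on stdout.
# Hypothetical helper; assumes the path is field 8 and has no spaces.
build_source_list() {
    grep -e "$1" | awk -v prefix="$2" '$1 !~ /^d/ { print prefix $8 }'
}
```

It would be used in place of the grep/awk part of the pipeline, for example: hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ | build_source_list 'webapp.log.3.' 'hdfs://localhost:9000' > input-files.txt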
Answer 2:
DistCp is in fact just a regular map-reduce job: you can use the same globbing syntax as you would use for the input of a regular map-reduce job. Generally, you can just use foo/*.log and that should suffice. You can experiment with the hadoop fs -ls command here: if globbing works with fs -ls, then it will work with DistCp (well, almost, but the differences are too subtle to go into here).
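As a rough sketch of that workflow: preview the glob with fs -ls first, and run distcp only if the listing succeeds. The function name copy_matching and the HADOOP variable are hypothetical conveniences, not part of Hadoop. Setting HADOOP="echo hadoop" gives a dry run that only prints the commands. The glob is passed in quotes so that the local shell does not try to expand it before Hadoop sees it.

```shell
#!/bin/sh
# Command used to invoke Hadoop; override with HADOOP="echo hadoop"
# to print the commands instead of running them (dry run).
HADOOP=${HADOOP:-hadoop}

# copy_matching SRC_GLOB DEST
# Lists the files matching SRC_GLOB, then copies them to DEST with distcp.
# distcp is skipped if the listing fails (for example, nothing matches).
copy_matching() {
    $HADOOP fs -ls "$1" && $HADOOP distcp "$1" "$2"
}
```

A call might look like: copy_matching 'hdfs://localhost:9000/foo/*.log' 'hdfs://localhost:9000/path/to/target/'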