Is it possible to use DistCp to copy only files that match a certain pattern? For example. For /foo I only want *.log files.
相关问题
- Spark on Yarn Container Failure
- enableHiveSupport throws error in java spark code
- spark select and add columns with alias
- Unable to generate jar file for Hadoop
-
hive: cast array
> into map
相关文章
- Java写文件至HDFS失败
- mapreduce count example
- Could you give me any clue Why 'Cannot call me
- Hive error: parseexception missing EOF
- Exception in thread “main” java.lang.NoClassDefFou
- ClassNotFoundException: org.apache.spark.SparkConf
- How can I configure the maven shade plugin to incl
- How was the container created and how does it work
DistCp is in fact just a regular map-reduce job: you can use the same globbing syntax as you would use for input of a regular map-reduce job. Generally, you can just use
foo/*.log
and that should suffice. You can experiment withhadoop fs -ls
statement here - if globbing works withfs -ls
, then if will work with DistCp (well, almost, but differences are fairly subtle to mention).I realize this is an old thread. But I was interested in the answer to this question myself - and dk89 also asked again in 2013. So here we go:
distcp does not support wildcards. The closest you can do is to:
Find the files you want to copy (sources), filter then using grep, format for hdfs using awk, and output the result to an "input-files" list:
Put the input-files list into hdfs
Create the target dir
Run distcp using the input-files list and specifying the target hdfs dir: