hadoop getmerge to another machine

Posted 2019-02-19 09:32

Question:

Is it possible to store the output of the hadoop dfs -getmerge command on another machine?

The reason is that there is not enough space on my local machine. The job output is 100GB and my local storage is 60GB.

Another possible reason is that I want to process the output with another program on a different machine, and I don't want to transfer it twice (HDFS -> local FS -> remote machine). I just want (HDFS -> remote machine).

I am looking for something similar to how scp works, like:

hadoop dfs -getmerge /user/hduser/Job-output user@someIP:/home/user/

Alternatively, I would also like to get the HDFS data from a remote host to my local machine.

Could Unix pipelines be used in this case?

For those who are not familiar with Hadoop, I am just looking for a way to replace the local destination directory parameter of this command with a directory on a remote machine.

Answer 1:

This will do exactly what you need:

hadoop fs -cat /user/hduser/Job-output/* | ssh user@remotehost.com "cat >mergedOutput.txt"

fs -cat reads all the files in sequence and writes them to stdout.

ssh then writes that stream to a file on the remote machine (note that scp does not accept stdin as input).
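The same pipe can be wrapped in two small shell functions, one for each direction mentioned in the question. This is only a sketch: the function names are made up for illustration, and the pull variant assumes the remote host has a working hadoop client on its PATH. The glob is quoted so that hadoop, not the local shell, expands it.

```shell
# Sketch only: function names are hypothetical, not part of hadoop or ssh.

# Push: concatenate every file under an HDFS directory and write the
# stream to a file on a remote host, without staging it on local disk.
#   $1 = HDFS directory, $2 = user@host, $3 = file path on the remote host
hdfs_merge_to_remote() {
    # Quoting "$1/*" leaves the glob for hadoop to expand against HDFS.
    hadoop fs -cat "$1/*" | ssh "$2" "cat > '$3'"
}

# Pull: run the cat on a remote host (assumed to have the hadoop client)
# and write the stream to a local file.
#   $1 = HDFS directory, $2 = user@host, $3 = local file path
hdfs_merge_from_remote() {
    ssh "$2" "hadoop fs -cat '$1/*'" > "$3"
}
```

For example, hdfs_merge_to_remote /user/hduser/Job-output user@remotehost.com mergedOutput.txt reproduces the one-liner above. In both directions the data crosses the network once and never lands on the disk of the machine running the hadoop command.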
