Error when running python map reduce job using Had

Posted 2020-06-27 04:43

I want to run a Python MapReduce job on Google Cloud Dataproc using Hadoop streaming. My Python mapper and reducer scripts, the input file, and the job output are all located in Google Cloud Storage.

I tried to run this command:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py -reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py -input gs://bucket-name/intro_to_mapreduce/purchases.txt -output gs://bucket-name/intro_to_mapreduce/output_prod_cat

But I got this error:

File: /home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py does not exist, or is not readable.

Try -help for more information
Streaming Command Failed!

Does the Cloud Storage connector not work with Hadoop streaming? Is there another way to run a Python MapReduce job with Hadoop streaming when the Python scripts and the input file are located in Google Cloud Storage?

Thank you

1 answer
我命由我不由天
Reply #2 · 2020-06-27 05:11

The -file option of hadoop-streaming only works for local files, which is why the gs:// path was resolved relative to your home directory. Note, however, that its help text says the -file flag is deprecated in favor of the generic -files option. The generic -files option lets you specify remote (hdfs / gs) files to stage, and the staged files are then referenced by their base name in -mapper and -reducer. Note also that generic options must precede application-specific flags.

Your invocation would become:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -mapper mapper_prod_cat.py \
    -reducer reducer_prod_cat.py \
    -input gs://bucket-name/intro_to_mapreduce/purchases.txt \
    -output gs://bucket-name/intro_to_mapreduce/output_prod_cat
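
If you assemble this command from a script, the ordering rule above can be sketched in Python. This is only a minimal sketch: the helper name build_streaming_command is hypothetical and not part of Hadoop, and the paths are the placeholder ones from the question. It simply puts the generic -files option ahead of the streaming-specific flags and refers to the staged scripts by base name.

```python
import posixpath
import shlex


def build_streaming_command(jar, scripts, mapper, reducer, input_path, output_path):
    """Return an argv list for a Hadoop streaming job (hypothetical helper).

    The generic option (-files) is emitted before the streaming-specific
    options (-mapper, -reducer, -input, -output).
    """
    # Generic options: a comma-separated list of files to stage.
    generic = ["-files", ",".join(scripts)]
    # Streaming options: the staged scripts are referenced by base name,
    # without the gs:// prefix.
    streaming = [
        "-mapper", posixpath.basename(mapper),
        "-reducer", posixpath.basename(reducer),
        "-input", input_path,
        "-output", output_path,
    ]
    return ["hadoop", "jar", jar] + generic + streaming


# Placeholder paths taken from the question.
base = "gs://bucket-name/intro_to_mapreduce"
mapper = base + "/mapper_prod_cat.py"
reducer = base + "/reducer_prod_cat.py"

command = build_streaming_command(
    jar="/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
    scripts=[mapper, reducer],
    mapper=mapper,
    reducer=reducer,
    input_path=base + "/purchases.txt",
    output_path=base + "/output_prod_cat",
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed directly to subprocess.run(command, check=True) on a cluster node, which avoids shell quoting issues.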