I want to run a Python MapReduce job on Google Cloud Dataproc using the Hadoop streaming method. My MapReduce Python scripts, the input file, and the job output are all located in Google Cloud Storage.
I tried to run this command:
```
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
    -mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
    -file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -input gs://bucket-name/intro_to_mapreduce/purchases.txt \
    -output gs://bucket-name/intro_to_mapreduce/output_prod_cat
```
But I got this error output:
```
File: /home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py does not exist, or is not readable.
Try -help for more information
Streaming Command Failed!
```
Is the Cloud Storage connector not working with Hadoop streaming? Is there another way to run a Python MapReduce job using Hadoop streaming when the Python scripts and input file are located in Google Cloud Storage?
Thank you!
The `-file` option from hadoop-streaming only works for local files. Note, however, that its help text mentions that the `-file` flag is deprecated in favor of the generic `-files` option. Using the generic `-files` option lets you specify a remote (HDFS / GCS) file to stage. Note also that generic options must precede application-specific flags. Your invocation would become something like this (a sketch reusing the gs:// paths from your question):
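```
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -mapper mapper_prod_cat.py \
    -reducer reducer_prod_cat.py \
    -input gs://bucket-name/intro_to_mapreduce/purchases.txt \
    -output gs://bucket-name/intro_to_mapreduce/output_prod_cat
```

Note that `-mapper` and `-reducer` now refer to the scripts by their base names: `-files` stages the remote files into each task's working directory. Also make sure both scripts are executable and begin with a shebang line (e.g. `#!/usr/bin/env python`), or invoke them through the interpreter instead, as in `-mapper "python mapper_prod_cat.py"`.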