I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. As a result of the main script, I get item-item similarities. In a post-processing step, I want to split this output into a separate S3 bucket for each item, so that each item bucket contains a list of items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of the post-processing step; a rough sketch of what I have in mind follows after my two questions.
- How do I import external (Python) libraries into Hadoop so that they can be used in a reduce step written in Python?
- Is it possible to access S3 in that way from inside the Hadoop environment?
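Roughly, this is the kind of reducer I have in mind. It is only an untested sketch: the bucket naming scheme and the key name are placeholders, and it assumes boto can pick up my AWS credentials from the environment.

import sys
from boto.s3.connection import S3Connection
from boto.s3.key import Key

def flush(conn, item, similar):
    # One bucket per item (placeholder naming scheme), holding one key
    # with the newline-separated list of similar items.
    if item is None:
        return
    bucket = conn.create_bucket('items-%s' % item.lower())
    key = Key(bucket)
    key.key = 'similar_items'
    key.set_contents_from_string('\n'.join(similar))

def main():
    conn = S3Connection()  # credentials from the environment / boto config
    current_item, similar = None, []
    # Streaming reducer input: sorted "item <TAB> similar_item" lines on stdin.
    for line in sys.stdin:
        item, similar_item = line.rstrip('\n').split('\t', 1)
        if item != current_item:
            flush(conn, current_item, similar)
            current_item, similar = item, []
        similar.append(similar_item)
    flush(conn, current_item, similar)

if __name__ == '__main__':
    main()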
Thanks in advance, Thomas
When launching a Hadoop job you can specify external files that should be made available. This is done using the -files argument:

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
I don't know if the files HAVE to be on the HDFS, but if it's a job that will be running often, it wouldn't be a bad idea to put them there.
From the code you can then read the file by its plain file name (GeoIPCity.dat in the example above), because Hadoop makes files passed with -files available in each task's local working directory; that is exactly what the working code inside multiple of our Mappers does.
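For a Python streaming job, the same mechanism should let you ship a Python library as well. A minimal, untested sketch, assuming you zip boto up as boto.zip and pass it along with -files (the file names and paths here are just placeholders):

# streaming_reducer.py
# Assumes the job was started with something along the lines of:
#   hadoop jar hadoop-streaming.jar -files boto.zip \
#       -mapper mapper.py -reducer streaming_reducer.py -input ... -output ...
# so that boto.zip ends up in the task's working directory.
import os
import sys

# Make the shipped zip importable; Python's zipimport can load packages from it.
sys.path.insert(0, os.path.join(os.getcwd(), 'boto.zip'))

import boto  # now resolved from boto.zip

for line in sys.stdin:
    # ... normal reducer logic, with boto available ...
    pass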
I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)
In addition to -files there is -libjars for including additional jars; I have a little information about that here: If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?