I am running a Python MapReduce script on top of Amazon's EMR Hadoop implementation. As a result of the main script, I get item-item similarities. In a post-processing step, I want to split this output into a separate S3 bucket for each item, so that each item bucket contains a list of items similar to it. To achieve this, I want to use Amazon's boto Python library in the reduce function of the post-processing step; a rough sketch of what I have in mind follows after my two questions.
- How do I import external (Python) libraries into Hadoop so that they can be used in a reduce step written in Python?
- Is it possible to access S3 in that way from inside the Hadoop environment?
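Roughly, this is the kind of reducer I have in mind. It is only an untested sketch: the bucket naming scheme and the key name are placeholders, and it assumes boto can pick up my AWS credentials from the environment.

import sys
from boto.s3.connection import S3Connection
from boto.s3.key import Key

def flush(conn, item, similar):
    # One bucket per item (placeholder naming scheme), holding one key
    # with the newline-separated list of similar items.
    if item is None:
        return
    bucket = conn.create_bucket('items-%s' % item.lower())
    key = Key(bucket)
    key.key = 'similar_items'
    key.set_contents_from_string('\n'.join(similar))

def main():
    conn = S3Connection()  # credentials from the environment / boto config
    current_item, similar = None, []
    # Streaming reducer input: sorted "item <TAB> similar_item" lines on stdin.
    for line in sys.stdin:
        item, similar_item = line.rstrip('\n').split('\t', 1)
        if item != current_item:
            flush(conn, current_item, similar)
            current_item, similar = item, []
        similar.append(similar_item)
    flush(conn, current_item, similar)

if __name__ == '__main__':
    main()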
Thanks in advance, Thomas
When launching a Hadoop job you can specify external files that should be made available. This is done using the -files argument:

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat
I don't know if the files HAVE to be on the HDFS, but if it's a job that will be running often, it wouldn't be a bad idea to put them there.
From the code you can then read the file by its plain file name (GeoIPCity.dat in the example above), because Hadoop makes files passed with -files available in each task's local working directory; that is exactly what the working code inside multiple of our Mappers does.
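For a Python streaming job, the same mechanism should let you ship a Python library as well. A minimal, untested sketch, assuming you zip boto up as boto.zip and pass it along with -files (the file names and paths here are just placeholders):

# streaming_reducer.py
# Assumes the job was started with something along the lines of:
#   hadoop jar hadoop-streaming.jar -files boto.zip \
#       -mapper mapper.py -reducer streaming_reducer.py -input ... -output ...
# so that boto.zip ends up in the task's working directory.
import os
import sys

# Make the shipped zip importable; Python's zipimport can load packages from it.
sys.path.insert(0, os.path.join(os.getcwd(), 'boto.zip'))

import boto  # now resolved from boto.zip

for line in sys.stdin:
    # ... normal reducer logic, with boto available ...
    pass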
I don't know about the second part of your question. Hopefully the answer to the first part will get you started. :)
In addition to -files there is -libjars for including additional jars; I have a little information about that here: If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?