Reading a file in JavaScript via Apache Pig UDF

Posted 2019-08-21 11:21

Question:

I have some (very simplified) nodejs code here:

var fs = require('fs');

var derpfile = String(fs.readFileSync( './derp.txt', 'utf-8' ));
var derps    = derpfile.split( '\n' );
for (var i = 0; i < derps.length; ++i) {
    // do something with my derps here
}

The problem is, I cannot use Node in Pig UDFs (as far as I am aware; if I can do this, please let me know!). When I look up 'file IO' in JavaScript, all the tutorials I find are about the browser sandbox. I need to read a file off the filesystem, like hdfs:///foo/bar/baz/jane/derps.txt, which I cannot guarantee will be in the CWD, but which I will have permission to read. Those tutorials also all seem to involve asynchronous reads, whereas I really need a blocking call here, since the Pig job cannot begin until this file is read. There are also lots of explanations of how to pull down a URL from another site, which is not what I want either.

This is incredibly frustrating, as using Java for this task is horrific overkill, and JavaScript really is The Right Tool For The Job (well, okay, Perl is, but I don't get to choose that…), yet I'm hamstrung by something as simple as basic file IO. :(

Answer 1:

I can't speak to your use of JavaScript, since I've never written a UDF with it, but in general file access is not done inside of a UDF, especially if you are trying to access something on HDFS. Files on HDFS are accessed via the NameNode, so once you are executing on a DataNode, you are out of luck. You need to place the files in the distributed cache.

Pig can do this for you with a JOIN. If the file fits in memory, you can do a replicated join, which will leverage the distributed cache. I would use Pig to load the file into a relation, use GROUP relation ALL to get it into a single bag, and then CROSS that bag with every record in your relation of interest. You can then pass the bag to any UDFs you like. Something like:

a = LOAD 'a' AS ...;
f = LOAD '/the/file/you/want' AS ...;

/* Put everything into a single bag */
f_bag = FOREACH (GROUP f ALL) GENERATE f;
/* Now you have a relation with one record;
   that record has one field: the bag, f */
a2 = CROSS a, f_bag;
/* Now you have duplicated a and appended
   the bag f to each record */

b = FOREACH a2 GENERATE yourUDF(field1, field2, f);
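
If you do want to write the UDF itself in JavaScript, Pig can execute JavaScript UDFs (registered in the script with something like register 'udfs.js' using javascript as myfuncs;), and the bag f built above simply arrives as an ordinary argument to the function. Below is a minimal, hypothetical sketch of what such a UDF could look like; the field names (field1, field2, line) and the assumption that Pig hands the bag to JavaScript as an array of objects keyed by the file's field names are mine, not guaranteed behavior:

/* udfs.js - hypothetical sketch of a JavaScript Pig UDF that receives
   the bag f (one element per record of the loaded file) as its last argument. */

// Declare the UDF's output schema for Pig.
yourUDF.outputSchema = "result:chararray";

function yourUDF(field1, field2, derps) {
    // Assumption: the bag is exposed here as an array of objects,
    // each keyed by the field names of the loaded file's schema.
    var result = String(field1);
    for (var i = 0; i < derps.length; ++i) {
        var line = derps[i].line;  // one record from the file
        // do something with each "derp" here, e.g. compare it to field2
        if (line === field2) {
            result += ' matched ' + line;
        }
    }
    return result;
}

With that registration, the last line of the Pig script would call myfuncs.yourUDF(field1, field2, f), and the file never has to be opened from inside the UDF at all.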