Apache Pig: Load a file that shows fine using hado

2019-02-09 16:39发布

问题:

I have files that are named part-r-000[0-9][0-9] and that contain tab separated fields. I can view them using hadoop fs -text part-r-00000 but can't get them loaded using pig.

What I've tried:

x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;

but that only gives me garbage. How can I view the file using pig?

What might be of relevance is that my hdfs is still using CDH-2 at the moment. Furthermore, if I download the file to local and run file part-r-00000 it says part-r-00000: data, I don't know how to unzip it locally.

回答1:

According to HDFS Documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.

If the file was compressed, normally Hadoop would add the extension when outputting to HDFS, but if this was missing, you could try testing by unzipping/ungzipping/unbzip2ing/etc locally. It appears Pig should do this decompressing automatically, but may require the file extension be present (e.g. part-r-00000.zip) -- more info.

I'm not too sure on the TextRecordInputStream.. it sounds like it would just be the default method of Pig, but I could be wrong. I didn't see any mention of LOAD'ing this data via Pig when I did a quick Google.

Update: Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:

-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();


-- Sample job: grab counts of tweets by day
A = LOAD 'mydir/part-r-000{00..99}' # not sure if pig likes the {00..99} syntax, but worth a shot 
    USING SequenceFileLoader AS (key:long, val:long, etc.);


回答2:

If you want to manipulate (read/write) sequence files with Pig then you can give a try to Twitter's Elephant-Bird as well.

You can find here examples how to read/write them.

If you use custom Writables in you sequence file then you can implement a custom converter by extending AbstractWritableConverter .

Note, that Elephant-Bird needs to have an installed Thrift in your machine. Before building it, make sure that it is using the correct Thrift version you have and also provide the correct path of the Thrift executable in its pom.xml:

<plugin>
  <groupId>org.apache.thrift.tools</groupId>
  <artifactId>maven-thrift-plugin</artifactId>
  <version>0.1.10</version>
  <configuration>
    <thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
  </configuration>
</plugin>