Encoding in Pig

Posted 2019-07-15 13:18

Question:

When I load data containing certain characters (for example À, ° and others) with Pig Latin and store it in a .txt file, those symbols show up in the file as � and ï characters. This happens because of the UTF-8 substitution (replacement) character. Is it possible to avoid this somehow, perhaps with some Pig commands, so that the resulting text file contains, for example, À instead of �?

Answer 1:

Pig has built-in dynamic invokers that allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs. So you can load the data as URL-encoded strings (percent-escaped, with UTF-8 as the character encoding), decode them, perform all your operations on the result, and then store it back as UTF-8. I guess this should work for the first part:

    DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
    encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
    decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');

Written as a custom Pig UDF in Java, the same decoding looks like this:

    import java.io.IOException;
    import java.net.URLDecoder;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UrlDecode extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            String encoded = (String) input.get(0);
            String encoding = (String) input.get(1);
            return URLDecoder.decode(encoded, encoding);
        }
    }

Now modify this code to do the reverse, returning encoded strings from normal strings, and store the result in your text file. Hope it works.
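As a rough illustration of that reverse direction, here is a minimal sketch that uses only JDK classes and has no Pig dependency. The class name UrlCodecSketch and its method names are made up for this example. In a Pig script, java.net.URLEncoder.encode could presumably be bound with InvokeForString in the same way as URLDecoder.decode above, but that is an assumption and has not been tested here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class (not part of Pig): shows the encode/decode round trip.
class UrlCodecSketch {

    // Percent-encodes a plain string, using UTF-8 bytes for non-ASCII characters.
    static String encode(String plain) throws UnsupportedEncodingException {
        return URLEncoder.encode(plain, "UTF-8");
    }

    // Reverses encode(): turns percent-escapes back into characters.
    static String decode(String encoded) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "À 25°" written with Unicode escapes so the source file encoding does not matter.
        String plain = "\u00C0 25\u00B0";
        String encoded = encode(plain);
        System.out.println(encoded);                        // %C3%80+25%C2%B0
        System.out.println(decode(encoded).equals(plain));  // true
    }
}
```

Note that this is URL (percent) encoding: it only helps if the data really is stored in that form, and it does not by itself recover bytes that were already replaced by � during loading.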



Answer 2:

You are correct: this happens because Text (http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/io/Text.html) converts incoming data (bytes) to UTF-8 automatically. To avoid this, you should not work with Text.
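To make the effect concrete, the sketch below uses plain JDK string decoding as a stand-in for that conversion; it does not call the Hadoop Text class itself. It assumes the input file is ISO-8859-1, which the question does not actually state, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: mimics decoding non-UTF-8 bytes as if they were UTF-8.
class ReplacementCharDemo {

    // Decodes raw bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the file is ISO-8859-1, where 'À' is the single byte 0xC0.
        byte[] raw = "\u00C0".getBytes(StandardCharsets.ISO_8859_1);

        // 0xC0 is not valid UTF-8, so the decoder substitutes U+FFFD.
        String decoded = decodeAsUtf8(raw);
        System.out.println(decoded.equals("\uFFFD"));              // true

        // Storing that string as UTF-8 writes the bytes EF BF BD.
        byte[] stored = decoded.getBytes(StandardCharsets.UTF_8);

        // A viewer that assumes ISO-8859-1 renders those three bytes as "ï¿½".
        String misread = new String(stored, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("\u00EF\u00BF\u00BD"));  // true
    }
}
```

If the assumption holds, this would also explain the ï in the question: it is the first byte of the stored replacement character, shown by a viewer that is not reading the file as UTF-8.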

That said, you should use the bytearray type instead of chararray (bytearray does not use Text, so no conversion is done). Since you didn't post any code, I'll provide an example for illustration:

  1. This is what you (likely) did:

    converted_to_utf = LOAD 'strangeEncodingdata' USING TextLoader() AS (line:chararray);

  2. This is what you wanted to do:

    no_conversion = LOAD 'strangeEncodingdata' USING TextLoader() AS (line:bytearray);
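Once the bytes arrive untouched, one possible follow-up is to decode them with the file's real charset and only then write UTF-8. The sketch below shows that step with JDK classes only. ISO-8859-1 is used purely as an assumed source encoding, and TranscodeSketch and toUtf8 are hypothetical names; to use this from Pig it would still have to be wrapped in a UDF that takes a bytearray.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class: converts raw bytes from a known charset to UTF-8.
class TranscodeSketch {

    // Decodes the bytes with their actual charset, then re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset sourceCharset) {
        return new String(raw, sourceCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-1 input, where 'À' is 0xC0 and '°' is 0xB0.
        byte[] raw = {(byte) 0xC0, (byte) 0xB0};
        byte[] utf8 = toUtf8(raw, StandardCharsets.ISO_8859_1);

        // In UTF-8, 'À' is C3 80 and '°' is C2 B0, so no replacement character appears.
        byte[] expected = {(byte) 0xC3, (byte) 0x80, (byte) 0xC2, (byte) 0xB0};
        System.out.println(Arrays.equals(utf8, expected));  // true
    }
}
```

The important design point is the order of operations: the charset has to be known and applied before any UTF-8 decoding takes place, because a byte that has already been replaced by � cannot be recovered afterwards.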