When I load data containing certain characters (for example À, ° and others) with Pig Latin and store it in a .txt file, those symbols show up in the file as � and ï characters. This happens because of the UTF-8 substitution character. Is there a way to avoid this, perhaps with some Pig commands, so that the resulting txt file contains, for example, À instead of �?
Pig has built-in dynamic invokers that allow a Pig programmer to call Java functions without having to wrap them in custom Pig UDFs. So you can load the data as UTF-8 encoded strings, decode it, perform all your operations on it, and then store it back as UTF-8. I guess this should work for the first part:
The Java code responsible for doing this is:
Now modify this code to return UTF-8 encoded strings from normal strings and store the result in your text file. Hope it works.
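The code originally posted with this answer is not reproduced here. As a rough sketch of the decode-then-encode idea, the Java below assumes the "encoded strings" are percent-encoded UTF-8 and uses the standard `java.net.URLDecoder` and `java.net.URLEncoder`. The class name `Utf8UrlCodec` and its methods are hypothetical; static `String`-to-`String` methods are the kind of function a dynamic invoker could call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class; the name and methods are illustrative only.
public class Utf8UrlCodec {

    // Percent-decodes a string, interpreting the escaped bytes as UTF-8.
    public static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Percent-encodes a string, writing non-ASCII characters as UTF-8 bytes.
    public static String encode(String plain) {
        try {
            return URLEncoder.encode(plain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Whether this matches your data depends on how the input was actually encoded, so treat it as a starting point rather than a drop-in fix.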
You are correct: this happens because Text (http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/io/Text.html) converts incoming data (bytes) to UTF-8 automatically. To avoid this, you should not work with Text.
That said, you should use the bytearray type instead of chararray (bytearray does not use Text, so no conversion is done). Since you didn't post any code, I'll provide an example for illustration:
This is what you (likely) did:
This is what you wanted to do:
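To see how such a conversion can produce both � and ï, here is a small plain-Java sketch (not Hadoop code). It assumes the input file is ISO-8859-1, which the question does not state. The class and method names are made up for the illustration. A lone byte 0xC0 (À in ISO-8859-1) is not valid UTF-8, so decoding replaces it with U+FFFD; that character is then written as the three bytes EF BF BD, which a Latin-1 viewer shows as "ï¿½".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
public class ReplacementCharDemo {

    // Decodes bytes as UTF-8; malformed input is replaced with U+FFFD.
    public static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Shows how UTF-8 output looks when a viewer assumes ISO-8859-1.
    public static String viewAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00C0" is À; in ISO-8859-1 it is the single byte 0xC0.
        byte[] latin1 = "\u00C0".getBytes(StandardCharsets.ISO_8859_1);
        String decoded = decodeAsUtf8(latin1);
        System.out.println(Integer.toHexString(decoded.charAt(0)));
        System.out.println(viewAsLatin1(decoded));
    }
}
```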
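The Pig scripts from the original answer are not reproduced here. Independently of them, the difference between the two field types can be modelled in plain Java, under the assumption that chararray decodes the bytes to a UTF-8 string and re-encodes on output, while bytearray passes the bytes through untouched. `BytesVsString`, `viaString` and `viaBytes` are hypothetical names used only for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical model of the two paths; not Pig or Hadoop code.
public class BytesVsString {

    // Models the chararray path: decode as UTF-8, then re-encode as UTF-8.
    // Bytes that are not valid UTF-8 are lost and come back as U+FFFD.
    public static byte[] viaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Models the bytearray path: the bytes are copied without interpretation.
    public static byte[] viaBytes(byte[] raw) {
        return Arrays.copyOf(raw, raw.length);
    }

    public static void main(String[] args) {
        // À and ° as ISO-8859-1 bytes (0xC0, 0xB0); the encoding is an assumption.
        byte[] raw = {(byte) 0xC0, (byte) 0xB0};
        System.out.println(Arrays.equals(raw, viaBytes(raw)));
        System.out.println(Arrays.equals(raw, viaString(raw)));
    }
}
```

The trade-off is that a bytearray field preserves the original bytes, but string functions cannot be applied to it directly without an explicit conversion.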