So how does one tell Spark which UTF encoding to use with saveAsTextFile(path)? Of course, if it's known that all the Strings can be written as UTF-8, that would cut disk usage roughly in half (assuming the default encoding is UTF-16, as in Java's internal String representation).
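For example, an ASCII-only string roughly doubles in size when encoded with Java's "UTF-16" charset instead of UTF-8 (the two extra bytes are the byte-order mark the encoder emits); a quick Scala check:

  "hello, world".getBytes("UTF-8").length   // 12 bytes
  "hello, world".getBytes("UTF-16").length  // 26 bytes: 2-byte BOM + 2 bytes per char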
Answer 1:
saveAsTextFile actually uses Hadoop's Text, which is encoded as UTF-8.
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
}
From Text.java:
public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {

  static final int SHORT_STRING_MAX = 1024 * 1024;

  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
      new ThreadLocal<CharsetEncoder>() {
        protected CharsetEncoder initialValue() {
          return Charset.forName("UTF-8").newEncoder().
              onMalformedInput(CodingErrorAction.REPORT).
              onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };

  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
      new ThreadLocal<CharsetDecoder>() {
        protected CharsetDecoder initialValue() {
          return Charset.forName("UTF-8").newDecoder().
              onMalformedInput(CodingErrorAction.REPORT).
              onUnmappableCharacter(CodingErrorAction.REPORT);
        }
      };
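As a quick way to confirm this behavior (a hypothetical REPL snippet, not from the Spark or Hadoop sources), you can construct a Text and compare its byte length to the character count:

  import org.apache.hadoop.io.Text
  val t = new Text("héllo")   // 5 characters
  t.getLength                 // 6 -- "é" takes two bytes in UTF-8
  t.toString.length           // 5 -- decoded back to a Java String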
If you wanted to save as UTF-16, I think you could use saveAsHadoopFile with org.apache.hadoop.io.BytesWritable and get the bytes of a Java String (which, as you said, will be UTF-16). Something like this:

saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]](path)

You can get the bytes from "...".getBytes("UTF-16").
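Putting it together, a minimal sketch (assuming an existing RDD[String] called lines and a hypothetical output path; note this writes a SequenceFile whose values are the raw UTF-16 bytes, not a plain text file):

  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.hadoop.mapred.SequenceFileOutputFormat
  import org.apache.spark.SparkContext._  // pair-RDD functions (needed on older Spark versions)

  lines
    .map(s => (NullWritable.get(), new BytesWritable(s.getBytes("UTF-16"))))
    .saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]]("/tmp/utf16-out")

When reading it back, keep in mind that BytesWritable.getBytes can return a padded buffer, so use copyBytes() or only read up to getLength().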