Control/configure/set the UTF encoding Apache Spark uses when writing with saveAsTextFile


Question:

So how does one tell Spark which UTF encoding to use with saveAsTextFile(path)? Of course, if it's known that all the Strings are UTF-8, then it would save space on disk by 2x! (assuming the default UTF is 16, as in Java)

Answer 1:

saveAsTextFile actually uses Text from Hadoop, which is encoded as UTF-8.

def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path, codec)
}

From Text.java:

public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {

  static final int SHORT_STRING_MAX = 1024 * 1024;

  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
    new ThreadLocal<CharsetEncoder>() {
      protected CharsetEncoder initialValue() {
        return Charset.forName("UTF-8").newEncoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
      }
    };

  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
    new ThreadLocal<CharsetDecoder>() {
      protected CharsetDecoder initialValue() {
        return Charset.forName("UTF-8").newDecoder().
               onMalformedInput(CodingErrorAction.REPORT).
               onUnmappableCharacter(CodingErrorAction.REPORT);
      }
    };
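
So there is nothing to configure for the UTF-8 case: every line written by saveAsTextFile goes through Text's UTF-8 encoder, regardless of the JVM's in-memory UTF-16 representation of String. A minimal sketch that writes a file and decodes the raw bytes back as UTF-8 (the local master, the /tmp/utf8-out path, the object name and the sample strings are just placeholders for illustration):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.{SparkConf, SparkContext}

object Utf8TextFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("utf8-sketch").setMaster("local[1]"))

    // Java Strings are UTF-16 in memory, but Text re-encodes them to UTF-8 on write.
    sc.parallelize(Seq("plain ascii", "non-ascii: héllo wörld"), 1)
      .saveAsTextFile("/tmp/utf8-out")

    // Decoding the single part file as UTF-8 round-trips the original strings.
    val bytes = Files.readAllBytes(Paths.get("/tmp/utf8-out/part-00000"))
    println(new String(bytes, StandardCharsets.UTF_8))

    sc.stop()
  }
}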

If you wanted to save as UTF-16, I think you could use saveAsHadoopFile with org.apache.hadoop.io.BytesWritable and take the bytes of a Java String (which, as you said, will be UTF-16). Something like this:

saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]](path)

You can get the bytes with "...".getBytes("UTF-16").
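
Putting that together, a rough sketch of the UTF-16 route (the object name, output path and sample data below are only illustrative; also note that getBytes("UTF-16") prepends a byte-order mark to each value):

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.SequenceFileOutputFormat

import org.apache.spark.{SparkConf, SparkContext}

object Utf16SequenceFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("utf16-sketch").setMaster("local[1]"))

    // Encode each String to UTF-16 ourselves and wrap the bytes in BytesWritable,
    // so the bytes stored on disk are exactly the encoding we chose.
    sc.parallelize(Seq("héllo", "wörld"))
      .map(s => (NullWritable.get(), new BytesWritable(s.getBytes("UTF-16"))))
      .saveAsHadoopFile[SequenceFileOutputFormat[NullWritable, BytesWritable]]("/tmp/utf16-out")

    sc.stop()
  }
}

This writes a SequenceFile rather than a plain text file, so you would read it back with sc.sequenceFile (e.g. sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable])) instead of sc.textFile.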