Spark: importing text file in UTF-8 encoding

Published 2019-08-17 08:47

Question:

I am trying to process a file which contains a lot of special characters such as German umlauts (ä, ü, ö), as follows:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
sc.textFile("/file/path/samele_file.txt")

But upon reading the contents, these special characters are not recognized.

I think the default encoding is not UTF-8 or a similar format. I would like to know if there is a way to set the encoding on this textFile method, something like:

sc.textFile("/file/path/samele_file.txt", mode="utf-8")

Answer 1:

No. If you read a non-UTF-8 file in UTF-8 mode, the non-ASCII characters will not be decoded properly. Please convert the file to UTF-8 encoding and then read it. You can refer to Reading file in different formats.
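If converting the file up front is not an option, a commonly used workaround is to read the raw Hadoop Text records and decode the bytes yourself. The sketch below is only an illustration: it assumes the file is ISO-8859-1 encoded (substitute the actual charset) and reuses the path and delimiter from the question.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Keep the custom record delimiter from the question, if you still need it.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")

// Read raw Text records and decode them with the file's actual charset
// (ISO-8859-1 here is an assumption; replace it with the real encoding).
val records = sc
  .hadoopFile[LongWritable, Text, TextInputFormat]("/file/path/samele_file.txt")
  .map { case (_, text) =>
    new String(text.getBytes, 0, text.getLength, "ISO-8859-1")
  }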



Answer 2:

The default mode is UTF-8, so you don't need to specify the format explicitly for UTF-8 files. If the file is in a non-UTF-8 encoding, then it depends on whether you need to read those unsupported characters or not.
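If you do need those characters and prefer to keep using sc.textFile, one option is to re-encode the file to UTF-8 before Spark reads it. A minimal sketch, assuming a local file encoded in ISO-8859-1 and hypothetical paths chosen only for illustration:

import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Paths}

// Hypothetical local paths, for illustration only.
val src = Paths.get("/local/path/sample_latin1.txt")
val dst = Paths.get("/local/path/sample_utf8.txt")

// Decode the bytes with the file's actual encoding (ISO-8859-1 assumed here),
// then write them back out as UTF-8.
val content = new String(Files.readAllBytes(src), Charset.forName("ISO-8859-1"))
Files.write(dst, content.getBytes(StandardCharsets.UTF_8))

// sc.textFile decodes input as UTF-8, so the converted copy reads correctly.
val lines = sc.textFile(dst.toString)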