I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no encoding in which the whole file is valid, yet I need to parse it anyway. Apart from using Java libraries like Weka, I am mainly working in Scala. I am not even able to read the file using scala.io.Source. For example,
    Source.fromFile(filename)("UTF-8").foreach(print)
throws:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:153)
at java.io.BufferedReader.read(BufferedReader.java:174)
at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
at scala.io.Codec.wrap(Codec.scala:64)
at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
at scala.io.Source.hasNext(Source.scala:238)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.io.Source.foreach(Source.scala:181)
I am perfectly happy to throw all the invalid characters away or replace them with some dummy. I am going to have lots of text like this to process in various ways and may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that would cause all the low-level Java libraries to ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.
SOLUTION:
    import java.nio.charset.CodingErrorAction
    import scala.io.{Codec, Source}

    // Replace malformed input instead of throwing MalformedInputException.
    implicit val codec: Codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    Source.fromFile(filename).foreach(print)
Thanks to @Esailija for pointing me in the right direction. This led me to How to detect illegal UTF-8 byte sequences to replace them in java inputstream?, which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for the entire package by putting the implicit codec definition in the package object.
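For example, a sketch assuming a hypothetical package com.example.csv:

    package com.example

    package object csv {
      import java.nio.charset.CodingErrorAction
      import scala.io.Codec

      // Everything inside com.example.csv now picks up this codec implicitly.
      implicit val codec: Codec = Codec("UTF-8")
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE)
    }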
I'm switching to a different codec if one fails. To implement the pattern, I took inspiration from this other Stack Overflow question. I use a default List of codecs and go through them recursively; if they all fail, I print out the scary bits (see the sketch below). I'm just learning Scala, so the code may not be optimal.
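A minimal sketch of that pattern (the helper name readLines and the default codec list are illustrative):

    import java.nio.charset.CharacterCodingException
    import scala.io.{Codec, Source}

    // ISO-8859-1 accepts any byte sequence, so it acts as a last resort.
    val defaultCodecs: List[Codec] = List(Codec("UTF-8"), Codec("ISO-8859-1"))

    def readLines(filename: String, codecs: List[Codec] = defaultCodecs): List[String] =
      codecs match {
        case Nil =>
          // Every codec failed: dump the raw bytes in hex so the scary bits can be inspected.
          val bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(filename))
          sys.error("No codec could decode " + filename + ": " +
            bytes.map("%02X".format(_)).mkString(" "))
        case codec :: rest =>
          // A codec that has not been told to REPLACE or IGNORE throws on bad
          // input, which is exactly what lets us fall through to the next one.
          try Source.fromFile(filename)(codec).getLines().toList
          catch { case _: CharacterCodingException => readLines(filename, rest) }
      }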
Use ISO-8859-1 as the encoding; this will just give you byte values packed into a string, which is enough to parse CSV for most encodings. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble: you can still read the lines in ISO-8859-1, but you may not be able to parse the line as a block.) Once you have the individual fields as separate strings, you can try to re-decode each one with the proper encoding (use the appropriate encoding name per field, if you know it).
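A sketch of that per-field re-decoding (the helper name reDecode and the per-column encodings are illustrative):

    import java.nio.charset.StandardCharsets
    import scala.io.{Codec, Source}

    // ISO-8859-1 maps every byte to exactly one char, so reading cannot fail.
    val lines = Source.fromFile(filename)(Codec.ISO8859).getLines()

    // Recover a field's original bytes and decode them with the right charset.
    def reDecode(field: String, charsetName: String): String =
      new String(field.getBytes(StandardCharsets.ISO_8859_1), charsetName)

    for (line <- lines) {
      val fields = line.split(',')           // naive CSV split, for illustration only
      println(reDecode(fields(0), "UTF-8"))  // e.g. a column known to be UTF-8
    }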
Edit: you will have to use java.nio.charset.CharsetDecoder if you want to detect errors; mapping to UTF-8 this way (e.g. via new String) will just give you U+FFFD in your string when there's an error. Scala's Codec has a decoder field which returns a java.nio.charset.CharsetDecoder.
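For example (a sketch; tryDecode is an illustrative helper):

    import java.nio.ByteBuffer
    import java.nio.charset.CharacterCodingException
    import scala.io.Codec

    // A freshly created CharsetDecoder REPORTs malformed input by default,
    // so decode() throws instead of silently replacing.
    def tryDecode(bytes: Array[Byte], codec: Codec): Option[String] =
      try Some(codec.decoder.decode(ByteBuffer.wrap(bytes)).toString)
      catch { case _: CharacterCodingException => None }

    tryDecode("hellö".getBytes("UTF-8"), Codec("UTF-8"))        // Some(hellö)
    tryDecode(Array(0x68, 0xFF).map(_.toByte), Codec("UTF-8"))  // None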
The solution for Scala's Source (based on @Esailija's answer) is to wrap such a lenient decoder in a Codec.
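A sketch, with the illustrative helper name lenientSource:

    import java.io.{File, FileInputStream}
    import java.nio.charset.{Charset, CodingErrorAction}
    import scala.io.{Codec, Source}

    // Wrap a decoder that IGNOREs bad bytes in a Codec, then hand it to Source.
    def lenientSource(file: File): Source = {
      val decoder = Charset.forName("UTF-8").newDecoder()
      decoder.onMalformedInput(CodingErrorAction.IGNORE)
      Source.fromInputStream(new FileInputStream(file))(Codec(decoder))
    }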
The problem with ignoring invalid bytes is then deciding when they're valid again. Note that UTF-8 allows variable-length byte encodings for characters, so if a byte is invalid, you need to work out which byte to start reading from to get a valid stream of characters again (for example, after a lone lead byte such as 0xC3, you have to decide whether the next byte continues the broken character or starts a new one).
In short, I don't think you'll find a library which can 'correct' as it reads. I think a much more productive approach is to try and clean that data up first.
This is how I managed to do it with Java.
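A sketch of the approach (the class and file names are assumptions):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class LenientRead {
        public static void main(String[] args) throws IOException {
            // Configure a UTF-8 decoder that drops malformed input instead of
            // throwing; use CodingErrorAction.REPLACE to get U+FFFD instead.
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
            decoder.onMalformedInput(CodingErrorAction.IGNORE);

            // Any Reader built on this decoder inherits the lenient behaviour.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("invalid.txt"), decoder));

            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
            reader.close();
            System.out.println(sb.toString());
        }
    }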
The invalid file is created with bytes like these:
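    0x68, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFF, 0x20,
    0xFF, 0x77, 0xC3, 0xB6, 0xFF, 0x72, 0x6C, 0x64, 0xFF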
which is hellö wörld in UTF-8 with 4 invalid bytes (0xFF, never valid in UTF-8) mixed in.

With .REPLACE you see the standard Unicode replacement character being used:
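    hellö� �wö�rld�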
With .IGNORE, you see the invalid bytes ignored:
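    hellö wörld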
Without specifying .onMalformedInput, you get:
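    java.nio.charset.MalformedInputException: Input length = 1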