Repairing wrong encoding in XML files

Posted 2019-04-10 11:39

One of our providers sometimes sends XML feeds that are declared as UTF-8 but contain byte sequences that are not valid UTF-8. This causes the parser to throw an exception and stop building the DOM object when these bytes are encountered:

DocumentBuilder.parse(ByteArrayInputStream bais) 

throws the following exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.

Is there a way to "capture" these problems early and avoid the exception (i.e. finding and removing those characters from the stream)? What I'm looking for is a "best effort" type of fallback for wrongly encoded documents. The correct solution would obviously be to attack the problem at the source and make sure that only correct documents are delivered, but what is a good approach when that is not possible?

3 answers
迷人小祖宗
#2 · 2019-04-10 12:15

If the problem truly is the wrong encoding (as opposed to a mixed encoding), you don't need to re-encode the document to parse it. Just hand the parser a Reader instead of an InputStream, and the DOM parser will ignore the encoding named in the XML declaration:

DocumentBuilder.parse(new InputSource(new InputStreamReader(inputStream, "<real encoding>")));
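A minimal, self-contained sketch of that call, assuming the real encoding turns out to be windows-1252. The class name `ReaderParseSketch`, the helper `parseAs`, and the sample document are made up for illustration; only the JDK's own XML and I/O classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ReaderParseSketch {

    // Hypothetical helper: decodes the bytes with the caller-supplied charset,
    // so the parser receives characters and does not act on the encoding
    // named in the XML declaration.
    static Document parseAs(InputStream in, Charset realEncoding) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(in, realEncoding);
        return builder.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Assumption for the demo: the feed is really windows-1252.
        Charset real = Charset.forName("windows-1252");
        // Declared as UTF-8, but the accented letter is stored as the single byte 0xE9.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Ren\u00e9</name>"
                .getBytes(real);
        Document doc = parseAs(new ByteArrayInputStream(xml), real);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This only helps when the whole document is in one known encoding; with mixed encodings the Reader would still decode part of the content incorrectly.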
聊天终结者
#3 · 2019-04-10 12:18

You should manually inspect the invalid documents and see what problem they have in common. It's quite probable that they are in fact in another encoding (most probably windows-1252), and the best solution then would be to take every document from the broken system and recode it to UTF-8 before parsing.
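One way the recoding step could look, as a sketch: check strictly whether the bytes are well-formed UTF-8, and only if they are not, reinterpret them in the assumed legacy charset and re-encode. `RecodeSketch`, `isValidUtf8`, and `toUtf8` are illustrative names, and windows-1252 is the guess from above, not something confirmed for any particular feed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class RecodeSketch {

    // Returns true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Leaves valid UTF-8 untouched; otherwise reinterprets the bytes in the
    // assumed legacy charset and re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset assumedLegacy) {
        if (isValidUtf8(raw)) {
            return raw;
        }
        return new String(raw, assumedLegacy).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Ren" followed by 0xE9, which is "é" in windows-1252 but a truncated sequence in UTF-8.
        byte[] broken = {'R', 'e', 'n', (byte) 0xE9};
        byte[] fixed = toUtf8(broken, Charset.forName("windows-1252"));
        System.out.println(isValidUtf8(broken) + " -> " + isValidUtf8(fixed));
    }
}
```

Because valid input passes through unchanged, logging how often `isValidUtf8` returns false would also show when the provider has fixed the feed and the workaround can be retired.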

Another possible cause is mixed encodings (the content of some elements is in one encoding and the content of other elements is in another encoding). That would be harder to fix.

You would also need a way to know when the broken system gets fixed so you can stop using your workaround.

做自己的国王
#4 · 2019-04-10 12:24

You should tell them to send you correct UTF-8. Failing that, any solution should re-encode the bad characters as valid UTF-8 and then pass the result to the parser. The reason is that if the bad bytes are preserved, different programs might interpret any output in different ways, which can lead to security holes.
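As a best-effort sketch of that idea, the JDK's `CharsetDecoder` can be told to replace malformed byte sequences instead of failing. By default it substitutes U+FFFD (the Unicode replacement character), and the result can then be re-encoded so that the parser only ever sees well-formed UTF-8. `SanitizeUtf8Sketch` and `sanitize` are illustrative names; whether lossy replacement is acceptable depends on the feed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SanitizeUtf8Sketch {

    // Decodes as UTF-8, replacing each malformed sequence with the decoder's
    // default replacement (U+FFFD), then re-encodes as well-formed UTF-8.
    static byte[] sanitize(byte[] raw) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        String text = decoder.decode(ByteBuffer.wrap(raw)).toString();
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // 0xC3 starts a 2-byte sequence, but 'b' is not a continuation byte.
        byte[] broken = {'<', 'a', '>', (byte) 0xC3, 'b', '<', '/', 'a', '>'};
        String cleaned = new String(sanitize(broken), StandardCharsets.UTF_8);
        System.out.println(cleaned.contains("\uFFFD"));
    }
}
```

The original bytes are lost wherever a replacement happens, so this suits a "parse what we can" fallback rather than a faithful repair.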
