Repairing wrong encoding in XML files

One of our providers are sometimes sending XML feeds that are tagged as UTF-8 encoded documents but includes characters that are not included in the UTF-8 charset. This causes the parser to throw an exception and stop building the DOM object when these characters are encountered:

DocumentBuilder.parse(ByteArrayInputStream bais)

throws the following exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.

Is there a way to "capture" these problems early and avoid the exception (i.e. finding and removing those characters from the stream)? What I'm looking for is a "best effort" type of fallback for wrongly encoded documents. The correct solution would obviously be to attack the problem at the source and make sure that only correct documents are delivered, but what is a good approach when that is not possible?

标签： java xml parsing encoding xerces

3条回答

迷人小祖宗

2楼-- · 2019-04-10 12:15

if the problem truly is the wrong encoding (as opposed to a mixed encoding), you don't need to re-encode the document to parse it. just parse it as a Reader instead of an InputStream and the dom parser will ignore the header:

DocumentBuilder.parse(new InpputSource(new InputStreamReader(inputStream, "<real encoding>")));

0人赞添加讨论(0) 举报

聊天终结者

3楼-- · 2019-04-10 12:18

You should manually take a look at the invalid documents and see what is the common problem to them. It's quite probable they are in fact in another encoding (most probably windows-1252), and the best solution then would be to take every document from the broken system and recode it to UTF-8 before parsing.

Another possible cause is mixed encodings (the content of some elements is in one encoding and the content of other elements is in another encoding). That would be harder to fix.

You would also need a way to know when the broken system gets fixed so you can stop using your workaround.

0人赞添加讨论(0) 举报

做自己的国王

4楼-- · 2019-04-10 12:24

You should tell them to send you correct UTF-8. Failing that any solution should reencode the bad characters as valid UTF-8 then pass it to the parser. The reason for this is that if the bad characters are preserved then different programs might interpret any output different ways, which can lead to security holes.

0人赞添加讨论(0) 举报

Repairing wrong encoding in XML files

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间