XMLReader encoding error

Posted 2019-07-26 08:36

I have a PHP script which is trying to parse a huge XML file. To do this I'm using the XMLReader library. During the parsing, I have this encoding error:

Input is not proper UTF-8, indicate encoding ! Bytes: 0xA0 0x32 0x36 0x30

I would like to know if there is a way to skip records with bad characters.

Thanks!

4 answers
冷血范
#2 · 2019-07-26 08:54

If your XML file has a really simple structure, you can "prefilter" it to get rid of (or, even better, correct) the bad records.

Read it record by record and write out a filtered XML file, then process the filtered file.
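
A minimal sketch of that prefilter pass, under two assumptions that are not from the question: each record sits on its own line, and any stray bytes are ISO-8859-1. The function name prefilterRecords() is made up for illustration, and the mbstring extension must be loaded.

```php
<?php
// Hypothetical helper: copies $in to $out one line at a time, assuming each
// record occupies exactly one line. Returns the number of bad lines found.
function prefilterRecords($in, $out, bool $repair = true): int
{
    $bad = 0;
    while (($line = fgets($in)) !== false) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $bad++;
            if (!$repair) {
                continue; // skip the bad record entirely
            }
            // Assumption: the invalid bytes are ISO-8859-1 characters.
            // A line mixing real UTF-8 with such bytes would be double-encoded.
            $line = mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
        }
        fwrite($out, $line);
    }
    return $bad;
}
```

To use it, open the source with fopen('myxml.xml', 'rb') and a destination file with mode 'wb', call prefilterRecords($in, $out), then point XMLReader at the filtered file. Pass false as the third argument to drop bad records instead of repairing them.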

Anthone
#3 · 2019-07-26 08:56

I would listen to what XMLReader is telling you. Remember that many encodings are supersets of ASCII, so (for example) UTF-8 and ISO-8859-1 are identical to ASCII for the first 128 code points. It may well be that your file is really encoded as ISO-8859-1, but almost all of the characters in it are from the lower, ASCII half of that character set. In that case, the error would be yours for letting the parser use the default encoding for XML, UTF-8.

In ISO-8859-1 the byte sequence 0xA0 0x32 0x36 0x30 is perfectly valid: a non-breaking space followed by '2', '6', '0'.
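
A quick way to check this for the four bytes in the error message, assuming the mbstring extension is available (the variable names are only for illustration):

```php
<?php
$bytes = "\xA0\x32\x36\x30"; // the four bytes from the error message

// 0xA0 is a UTF-8 continuation byte with no lead byte before it,
// so the sequence is not valid UTF-8.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Read as ISO-8859-1, 0xA0 is U+00A0 (no-break space),
// which UTF-8 encodes as the two bytes 0xC2 0xA0.
$converted = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8);              // bool(false)
echo bin2hex($converted), "\n"; // c2a0323630
```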

爷的心禁止访问
#4 · 2019-07-26 09:00
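$xml = file_get_contents('myxml.xml');
// With the /u modifier, preg_replace() returns null on invalid UTF-8, so scrub first.
// mb_scrub() (PHP >= 7.2, mbstring) replaces each invalid byte sequence with '?'.
$xml = mb_scrub($xml, 'UTF-8');
// Replace control characters with a space, but keep tab, LF and CR.
$xml = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', ' ', $xml);
// parse $xml below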

仙女界的扛把子
#5 · 2019-07-26 09:01

First of all, make sure that your XML file is indeed UTF-8 encoded. If not, specify the encoding as the second parameter to XMLReader::open().
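
For example, if the file turns out to be ISO-8859-1, something along these lines should work. This is only a sketch: the sample data and temp file are invented so the snippet runs on its own, and the real call would simply be $reader->open('myxml.xml', 'ISO-8859-1').

```php
<?php
// Build a small ISO-8859-1 sample with no encoding declaration (made-up data).
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, "<items><item>\xA0260</item></items>");

$reader = new XMLReader();
// The second argument tells libxml2 which encoding to assume for the input.
$reader->open($path, 'ISO-8859-1');

$values = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        $values[] = $reader->value; // strings come back from libxml2 as UTF-8
    }
}
$reader->close();
unlink($path);
```

Here $values should end up holding one string, with the no-break space re-encoded as UTF-8.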

If the encoding error is due to a genuinely malformed byte sequence in a UTF-8 document, and you're using PHP > 5.2.0, you could pass LIBXML_NOERROR and/or (depending on the error level) LIBXML_NOWARNING as a bitmask in the third parameter of XMLReader::open():

$xml = new XMLReader(); 
$xml->open('myxml.xml', null, LIBXML_NOERROR | LIBXML_NOWARNING); 

If you're using PHP > 5.1.0, you can tweak the libXML error handling:

// enable user error handling
libxml_use_internal_errors(true);
/* ... do your XML processing ... */
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // handle errors here
}
libxml_clear_errors();

I don't know whether the preceding two workarounds actually allow XMLReader to continue reading after an error or whether they only suppress the error output. But it's worth a try.
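
One way to find out on your own setup is a small experiment like the following. It is a rough sketch with made-up sample data: it feeds XMLReader three records, the middle one containing the stray 0xA0 byte, and reports how many text nodes were read and which errors libxml collected. The outcome may differ between libxml2 versions, so no result is claimed here.

```php
<?php
libxml_use_internal_errors(true); // collect libxml errors instead of printing them

// Made-up sample: a valid record, a record with a stray 0xA0 byte, a valid record.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    "<items><item>ok</item><item>\xA0260</item><item>after</item></items>"
);

$reader = new XMLReader();
$reader->open($path);

$texts = [];
// @ hides the PHP warning that read() may raise when libxml2 reports a failure.
while (@$reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        $texts[] = $reader->value;
    }
}
$reader->close();
unlink($path);

$errors = libxml_get_errors();
libxml_clear_errors();

printf("text nodes read: %d, libxml errors: %d\n", count($texts), count($errors));
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

If "after" never shows up in $texts, the reader stopped at the bad record and the error handling only changed how the failure is reported.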


Responding to comment:

libXML defines XML_PARSE_RECOVER (1) but ext/libxml does not expose this constant as a PHP constant. Perhaps it's possible to pass the integer value 1 to the $options parameter.

$xml = new XMLReader(); 
$xml->open('myxml.xml', null, LIBXML_NOERROR | LIBXML_NOWARNING | 1); 