I have a PHP script that parses a huge XML file using the XMLReader library. During parsing, I get this encoding error:
Input is not proper UTF-8, indicate encoding ! Bytes: 0xA0 0x32 0x36 0x30
I would like to know if there is a way to skip records with bad characters.
Thanks!
First of all, make sure that your XML file really is UTF-8 encoded. If it is not, specify the actual encoding as the second parameter to XMLReader::open().
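As a minimal sketch of passing the encoding explicitly: the sample document, the temporary file, and the choice of ISO-8859-1 are assumptions for illustration only; use whatever encoding your file is really in.

```php
<?php
// Hypothetical sample: a tiny document with a raw 0xA0 byte and no encoding declaration.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, "<items><item>\xA0260</item></items>");

$reader = new XMLReader();
// Second argument: the encoding libxml should assume for the input.
$reader->open($path, 'ISO-8859-1');

$values = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        // Node values come back as UTF-8, whatever the input encoding was.
        $values[] = $reader->value;
    }
}
$reader->close();
unlink($path);
```

If the encoding guess is right, the loop should finish without the "Input is not proper UTF-8" error, and the text nodes end up in $values as UTF-8 strings.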
If the encoding error is due to a genuinely malformed byte sequence in a UTF-8 document, and if you're using PHP > 5.2.0, you could pass LIBXML_NOERROR and/or (depending on the error level) LIBXML_NOWARNING as a bitmask in the third parameter of XMLReader::open():
$xml = new XMLReader();
$xml->open('myxml.xml', null, LIBXML_NOERROR | LIBXML_NOWARNING);
If you're using PHP > 5.1.0 you can also tweak the libxml error handling:
// enable user error handling
libxml_use_internal_errors(true);
/* ... do your XML processing ... */
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // handle errors here
}
libxml_clear_errors();
I don't actually know whether the two workarounds above let XMLReader continue reading after an error, or whether they only suppress the error output. But it's worth a try.
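One way to find out is to combine libxml_use_internal_errors() with an XMLReader loop and look at what was collected afterwards. This is only a sketch: the in-memory sample document and the $nodesRead counter are assumptions for illustration, and how many nodes are read before the parser gives up may depend on the libxml version.

```php
<?php
// Collect libxml errors internally instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);
libxml_clear_errors();

$reader = new XMLReader();
// Hypothetical sample: the second <item> contains a raw 0xA0 byte.
$reader->XML("<items><item>ok</item><item>\xA0260</item></items>");

$nodesRead = 0;
while ($reader->read()) {
    $nodesRead++;   // counts how far the reader got before stopping
}

$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each entry is a LibXMLError object with level, code, line and message.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();
$reader->close();

printf("nodes read before the loop ended: %d\n", $nodesRead);
```

Comparing $nodesRead with the number of nodes you expect shows whether the reader carried on past the bad bytes or stopped at them.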
Responding to comment:
libXML defines XML_PARSE_RECOVER (1), but ext/libxml does not expose this constant as a PHP constant. Perhaps it's possible to pass the integer value 1 to the $options parameter:
$xml = new XMLReader();
$xml->open('myxml.xml', null, LIBXML_NOERROR | LIBXML_NOWARNING | 1);
I would listen to what XMLReader is telling you. Remember that many encodings are supersets of ASCII, so (for example) UTF-8 and ISO-8859-1 are identical to ASCII for the first 128 code points. It may well be that your file is really encoded as ISO-8859-1, but almost all of the characters in it come from the lower, ASCII half of that character set. In that case, the error would be yours for letting the parser fall back to the XML default encoding, UTF-8.
In ISO-8859-1 the byte sequence 0xA0 0x32 0x36 0x30 is perfectly valid: a non-breaking space followed by '2', '6', '0'.
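The point about those four bytes can be checked directly. This sketch assumes the mbstring extension is available; the variable names are illustrative.

```php
<?php
// The four bytes from the error message: 0xA0 0x32 0x36 0x30.
$bytes = "\xA0\x32\x36\x30";

// 0xA0 is a UTF-8 continuation byte and cannot start a sequence,
// so the string is not valid UTF-8.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Read the same bytes as ISO-8859-1 and transcode them to UTF-8.
$converted = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8);               // bool(false)
echo bin2hex($converted), "\n";  // c2a0323630
```

The non-breaking space becomes the two-byte UTF-8 sequence 0xC2 0xA0, while the ASCII digits are unchanged.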
If your XML file has a really simple structure, you can "prefilter" it to get rid of (or, even better, correct) the bad records.
Read it record by record and write out a filtered XML file, then process the filtered file.
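$xml = file_get_contents('myxml.xml');
// Replace ASCII control bytes with spaces, keeping tab, LF and CR.
// No /u modifier: with it, preg_replace() returns null when the input is not valid UTF-8.
// Note that this does not touch high bytes such as 0xA0.
$xml = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', ' ', $xml);
// parse $xml below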
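Since the file in question is huge, loading it whole with file_get_contents() may not be practical. The record-by-record filtering suggested above can be sketched as a streaming pass instead. The helper name filterInvalidUtf8Lines is hypothetical, and the sketch assumes each record sits on its own line and that the mbstring extension is available (PHP 7+ for the type declarations).

```php
<?php
/**
 * Copy $inPath to $outPath, dropping every line that is not valid UTF-8.
 * Hypothetical helper: assumes one record per line.
 * Returns the number of skipped lines.
 */
function filterInvalidUtf8Lines(string $inPath, string $outPath): int
{
    $in = fopen($inPath, 'rb');
    $out = fopen($outPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file');
    }

    $skipped = 0;
    while (($line = fgets($in)) !== false) {
        if (mb_check_encoding($line, 'UTF-8')) {
            fwrite($out, $line);   // record is clean, keep it
        } else {
            $skipped++;            // bad record: skip it (or log/repair it here)
        }
    }

    fclose($in);
    fclose($out);
    return $skipped;
}
```

A call such as filterInvalidUtf8Lines('myxml.xml', 'myxml.filtered.xml') would then produce a file that can be handed to XMLReader::open(). If records span several lines, dropping single lines would break the markup, so the filter would need to buffer a whole record before deciding.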