I have a PHP script that parses a huge XML file using XMLReader. During parsing, I get this encoding error:
Input is not proper UTF-8, indicate encoding ! Bytes: 0xA0 0x32 0x36 0x30
I would like to know if there is a way to skip records with bad characters.
Thanks!
If your XML file has a really simple structure, you can "prefilter" it to get rid of (or, even better, correct) the bad records.
Read it record by record and write out a filtered xml file, then process the filtered file.
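As a rough sketch of that prefiltering step: the helper below is hypothetical (the name prefilterXml and its parameters are mine), it assumes each record sits on its own line, that the mbstring extension is available, and that stray bytes are really ISO-8859-1. It corrects bad lines rather than dropping them.

```php
<?php
// Hypothetical helper, for illustration only.
// Assumes one record per line and that invalid bytes are ISO-8859-1.
function prefilterXml(string $inPath, string $outPath, string $fallbackEncoding = 'ISO-8859-1'): int
{
    $in = fopen($inPath, 'rb');
    $out = fopen($outPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open input or output file');
    }

    $fixed = 0;
    // Stream line by line so a huge file never has to fit in memory.
    while (($line = fgets($in)) !== false) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            // Reinterpret the raw bytes in the fallback encoding and re-encode as UTF-8.
            $line = mb_convert_encoding($line, 'UTF-8', $fallbackEncoding);
            $fixed++;
        }
        fwrite($out, $line);
    }

    fclose($in);
    fclose($out);

    // Number of lines that had to be converted.
    return $fixed;
}
```

To skip bad records instead of correcting them, replace the conversion with `continue;`. If records span several lines, a line-based filter like this is not enough.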
I would listen to what XMLReader is telling you. Remember that many encodings are supersets of ASCII, so (for example) UTF-8 and ISO-8859-1 are identical to ASCII for the first 128 code points. It may well be that your file is really encoded as ISO-8859-1, but almost all of the characters in it are from the lower, ASCII half of that character set. In that case, the error would be yours for letting it use the default encoding for XML, UTF-8.
In ISO-8859-1 the byte sequence 0xA0 0x32 0x36 0x30 is perfectly valid: a non-breaking space followed by '2', '6', '0'.
First of all, make sure that your XML file is indeed UTF-8 encoded. If not, specify the encoding as the second parameter to XMLReader::open().
If the encoding error is due to a genuinely malformed byte sequence in a UTF-8 document and you're using PHP > 5.2.0, you could pass LIBXML_NOERROR and/or (depending on the error level) LIBXML_NOWARNING as a bitmask to the third parameter of XMLReader::open().
If you're using PHP > 5.1.0 you can tweak the libXML error handling.
I don't know whether the preceding two workarounds actually allow XMLReader to continue reading after an error or whether they only suppress the error output. But it's worth a try.
Responding to comment: libXML defines XML_PARSE_RECOVER (1), but ext/libxml does not expose this constant as a PHP constant. Perhaps it's possible to pass the integer value 1 to the $options parameter.
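The options above could be combined roughly like this. It is only a sketch: readWithEncoding is a hypothetical helper of mine, and I am assuming that libxml_use_internal_errors() is the kind of error-handling tweak meant for PHP > 5.1.0. As noted, I have not verified that reading continues past a malformed byte sequence, only that the messages can be silenced or collected.

```php
<?php
// Hypothetical helper, for illustration only. Counts element nodes in a file.
// $encoding: e.g. 'ISO-8859-1' if the file is not really UTF-8 (null lets libxml decide).
// $flags:    e.g. LIBXML_NOERROR | LIBXML_NOWARNING to silence libxml messages.
function readWithEncoding(string $path, ?string $encoding = null, int $flags = 0): array
{
    // Buffer libxml errors internally instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    if (!$reader->open($path, $encoding, $flags)) {
        libxml_use_internal_errors($previous);
        throw new RuntimeException("Could not open $path");
    }

    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $count++;
        }
    }
    $reader->close();

    // Gather whatever libxml reported, then restore the previous error mode.
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$count, $messages];
}
```

For a file that is really ISO-8859-1, call it as `readWithEncoding($path, 'ISO-8859-1')`. For a genuinely broken UTF-8 file, try `readWithEncoding($path, null, LIBXML_NOERROR | LIBXML_NOWARNING)` and check how far the count gets.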