Illegal character in XML feed?

Posted 2019-03-01 07:38

Question:

I have created a WordPress/WooCommerce plugin which creates an XML file from our products.

But in some rows there are illegal characters.

error on line 15622 at column 22: Input is not proper UTF-8, indicate encoding !
Bytes: 0x03 0xC3 0xB6 0x73

How can I solve this, so the XML is parsed correctly?

XML FEED FILE

The code that generates it is something like:

$dom = new DOMDocument('1.0', 'UTF-8');

// create root element
$root = $dom->createElement("termeklista");
$dom->appendChild($root);
$dom->formatOutput=true;

Then a while loop fills in the data. The issue is in the description tag.

// DESCRIPTION

$description = $dom->createElement("leiras");
$producta->appendChild($description);
// create CDATA section
$cdata = $dom->createCDATASection("\n".$loop->post->post_excerpt."\n");
$description->appendChild($cdata);

I have tried iconv, utf8_encode, and a custom function to replace the wrong characters, but I cannot figure out what the issue is.

The WooCommerce product post excerpt does not have any illegal characters in it.

Answer 1:

0x03 (aka ^C, aka ETX, aka end of text) is not an allowed character in XML:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Therefore your data is not XML, and any conformant XML processor must report an error such as the one you received.
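
To see where such characters sit in a string before it reaches the XML library, the Char production above can be turned into a negated character class. The following is only a sketch: the function name find_invalid_xml_chars is hypothetical (it does not come from the question or from any library), and it assumes the input is otherwise valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): report every character
// that falls outside the XML 1.0 Char production quoted above.
function find_invalid_xml_chars(string $text): array
{
    // Negation of: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $found = [];
    // PREG_OFFSET_CAPTURE yields [matched text, byte offset] pairs.
    if (preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$char, $offset]) {
            $found[] = ['offset' => $offset, 'bytes' => bin2hex($char)];
        }
    }
    return $found;
}

// Sample input built around the byte sequence from the error message
// (0x03 0xC3 0xB6 0x73).
$hits = find_invalid_xml_chars("k\x03\xC3\xB6s");
```

Run against the real excerpt, the reported byte offsets should show whether the control character is actually stored in the post content, even if it is invisible in the editor.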

You must repair the data by treating it as text, not XML, and removing any illegal characters, manually or automatically, before passing it to any XML library.
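
One way to do that automatically is a regular expression built from the Char production. This is a minimal sketch, not the function referenced elsewhere in this thread: strip_invalid_xml_chars is a hypothetical name, the sample excerpt is made up, and the input is assumed to be valid UTF-8 apart from the stray control characters.

```php
<?php
// Hypothetical helper (an assumption, not the poster's code): drop every
// character that the XML 1.0 Char production does not allow.
function strip_invalid_xml_chars(string $text): string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $text);
    if ($clean === null) {
        // With /u, preg_replace returns null when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// Usage mirroring the description element from the question.
$dom = new DOMDocument('1.0', 'UTF-8');
$description = $dom->createElement('leiras');
$dom->appendChild($description);

// Made-up excerpt containing the offending 0x03 byte.
$excerpt = "k\x03\xC3\xB6sz";
$cdata = $dom->createCDATASection("\n" . strip_invalid_xml_chars($excerpt) . "\n");
$description->appendChild($cdata);

$xml = $dom->saveXML();
```

Cleaning the text before it is handed to createCDATASection keeps the sanitizing step outside the XML layer, which is the point made above: the data is treated as text first and only then as XML.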



Answer 2:

So, I was able to solve the issue with the stripInvalidXML() function from this question. Thanks to the author. The XML is now valid.

stripInvalidXML from file