XML parser error: entity not defined

Posted 2019-01-09 02:33

冷血范

Question:

I have searched Stack Overflow for this problem and found a few topics, but I feel like there isn't really a solid answer for me on this.

I have a form that users submit, and the field's value is stored in an XML file. The XML is set to be encoded with UTF-8.

Every now and then a user will copy/paste text from somewhere, and that's when I get the "entity not defined" error.

I realize XML only supports a select few entities and anything beyond that is not recognized, hence the parser error.

From what I gather, there are a few options:

1. I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
2. I can place the code in question within a CDATA section.
3. I can include these entities within the XML file.
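
As a rough illustration of the first option, here is a minimal sketch assuming PHP with the SimpleXML extension. The sample value and the `<field>` wrapper element are made up for the example; only `&nbsp;` is handled.

```php
<?php
$pasted = 'Price:&nbsp;10&nbsp;EUR'; // sample pasted value (assumption)

// Swap the named reference for its numeric form, which XML accepts without a DTD.
$safe = str_replace('&nbsp;', '&#160;', $pasted);

libxml_use_internal_errors(true); // report parse failures via return value, not warnings

var_dump(simplexml_load_string('<field>' . $pasted . '</field>') !== false); // bool(false)
var_dump(simplexml_load_string('<field>' . $safe . '</field>') !== false);   // bool(true)
```

A plain string replacement like this only covers the entities you list explicitly, so it fits best when a single entity causes nearly all of the errors.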

What I'm doing with the XML file is this: the user enters content into a form, it gets stored in an XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).

Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?

Thanks, Ryan

UPDATE

I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!

Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. On closer inspection, it turned out that the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed that on a PC all the characters were taken out (because it couldn't read them), but on a Mac you could see little square boxes referencing the Unicode number of that character. The reason they showed up as squares on a Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8 to prevent other parsing errors (which is somehow also related to TinyMCE).

The solution to all this was quite simple:

I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.

I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.

Answer 1:

I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:

Before passing the HTML fragment to the SimpleXMLElement constructor, I decoded it using html_entity_decode().
Then I further encoded it using utf8_encode().

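// Decode named entities (e.g. &nbsp;), then convert the result from ISO-8859-1 to UTF-8.
// Assumes html_entity_decode() returns ISO-8859-1, its default before PHP 5.4.
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);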

Now the above code does not throw any undefined entity errors.
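
On current PHP versions, html_entity_decode() defaults to UTF-8 and utf8_encode() is deprecated (since PHP 8.2), so the same idea might look like the sketch below. The helper name `decodeHtmlOnlyEntities` and the sample fragment are assumptions for illustration. The sketch also leaves the five XML-predefined entities untouched, so that escaped `&` and `<` stay escaped.

```php
<?php
// Hypothetical helper: replace HTML 4.01 named entities with UTF-8 characters,
// but keep the five entities XML already defines.
function decodeHtmlOnlyEntities(string $fragment): string
{
    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m): string {
            if (in_array($m[1], ['amp', 'lt', 'gt', 'quot', 'apos'], true)) {
                return $m[0]; // predefined in XML, leave as is
            }
            // Names html_entity_decode() does not know come back unchanged.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
        },
        $fragment
    );
    return $result ?? $fragment; // preg_replace_callback() returns null on a regex error
}

$headerFragment = '<h1>Caf&eacute;&nbsp;&amp;&nbsp;Bar</h1>'; // sample input (assumption)
$xmlHeader = new SimpleXMLElement('<temp>' . decodeHtmlOnlyEntities($headerFragment) . '</temp>');
echo $xmlHeader->h1, PHP_EOL;
```

Entities outside the HTML 4.01 table stay as they are and would still raise the parser error, which at least makes them visible instead of silently mangling them.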

Answer 2:

You could HTML-parse the text and have it re-escaped with the respective numeric entities only (like: &nbsp; → &#160;). In any case, simply using unsanitized user input is a bad idea.

All of the numeric entities are allowed in XML; only the named ones known from HTML do not work (with the exception of &amp;, &quot;, &lt;, &gt;, &apos;).

Most of the time, though, you can just write the actual character (&ouml; → ö) to the XML file, so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!), this is your safest bet.
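
A minimal sketch of the DOM approach, assuming PHP's DOM extension. The element name `entry` and the sample value are made up for the example. The text goes in as a plain string and the serializer takes care of escaping.

```php
<?php
// Build the document with the DOM API and let it do the escaping.
$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('entry'));

$userInput = "Sch\u{F6}n & <gut>"; // sample form value, assumed to be UTF-8 already
$root->appendChild($doc->createTextNode($userInput));

$xml = $doc->saveXML();
echo $xml;
```

Because the value is added as a text node, `&` and `<` are escaped on output and the ö is written as the character itself, so no named entity ever reaches the file.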

Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.

Answer 3:

1. I can find and replace all &nbsp; and swap them out with &#160; or an actual space.

This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.

2. I can place the code in question within a CDATA section.

In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
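
A sketch of the CDATA option, assuming PHP with SimpleXML. The helper name `wrapInCdata` is made up. The one detail to watch is that the text itself may contain `]]>`, which would end the section early, so the helper splits it across two sections.

```php
<?php
// Hypothetical helper: wrap text in CDATA, splitting any "]]>" so it cannot close the section.
function wrapInCdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$pasted = 'a&nbsp;b <b>bold</b> ]]> end'; // sample pasted content (assumption)
$doc = simplexml_load_string('<entry>' . wrapInCdata($pasted) . '</entry>');

// The parser hands the content back verbatim, entity references and tags included.
echo (string) $doc, PHP_EOL;
```

The stored value still contains the literal text `&nbsp;` and `<b>`, so whatever displays it later has to interpret that markup itself.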

3. I can include these entities within the XML file.

You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.

One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
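
For comparison, a simpler variant than an external entity file is to declare the needed entities in the document's internal DTD subset. The sketch below assumes PHP with SimpleXML and declares only two entities; the element names are made up.

```php
<?php
// Declare the entities the content may use in the internal DTD subset.
// Only two are declared here for illustration; a full HTML entity set is much longer.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE entries [
  <!ENTITY nbsp "&#160;">
  <!ENTITY eacute "&#233;">
]>
<entries><entry>Caf&eacute;&nbsp;Bar</entry></entries>
XML;

// LIBXML_NOENT asks libxml to substitute entity references while parsing.
$doc = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOENT);
echo $doc->entry, PHP_EOL;
```

LIBXML_NOENT also applies to external entities, so it is probably best reserved for files whose DOCTYPE you write yourself rather than documents received from users.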

4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.
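
One way to enforce this on the server is to try parsing the submitted value and reject it if libxml complains. This is a sketch only; the function name `xmlFragmentErrors` and the `<root>` wrapper are assumptions.

```php
<?php
// Hypothetical validator: returns libxml's error messages for a pasted fragment,
// or an empty array if it is well-formed once wrapped in a root element.
function xmlFragmentErrors(string $fragment): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    libxml_clear_errors();
    simplexml_load_string('<root>' . $fragment . '</root>');
    $messages = array_map(
        fn (LibXMLError $e): string => trim($e->message),
        libxml_get_errors()
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

print_r(xmlFragmentErrors('<p>fine &amp; well-formed</p>')); // empty array
print_r(xmlFragmentErrors('<p>pasted&nbsp;text</p>'));       // at least one message
```

The returned messages could be shown next to the form field so the user knows what to fix before resubmitting.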

5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent. In other words, create a div somewhere (probably with display: none, except for debugging). Say you then have a JavaScript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in JavaScript you do

myDiv.innerHTML = myField.value;

which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.

Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.

Whether you want to do this fix in the browser or on the server (as @Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.
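
If the fix is done on the server, the same parse-as-HTML, serialize-as-XML round trip could look roughly like this. It is a sketch assuming PHP's DOM extension; the helper name `htmlFragmentToXml` and the wrapper `<div>` are made up, and the exact serialized form may vary between libxml versions.

```php
<?php
// Hypothetical helper: parse an HTML fragment leniently, then serialize it as XML.
function htmlFragmentToXml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // the HTML parser warns on sloppy markup
    // The leading XML declaration is a common trick to make loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $xml = '';
    foreach ($wrapper->childNodes as $child) {
        $xml .= $doc->saveXML($child); // node-level serialization follows XML rules
    }
    return $xml;
}

$xml = htmlFragmentToXml('one&nbsp;two<br>three &amp; four');
$parsed = new SimpleXMLElement('<temp>' . $xml . '</temp>'); // parses without entity errors
echo $xml, PHP_EOL;
```

Doing this on the server means malformed input sent on purpose is normalized too, at the cost of one extra parse per submission.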

Answer 4:

If you want to convert all characters, this may help you (I wrote it a while back):

http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml

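// Convert named HTML entities (e.g. &nbsp;) to numeric references (e.g. &#160;).
// Needs the mbstring extension (mb_ord, PHP 7.2+); $content is assumed to be UTF-8.
function _convertAlphaEntitysToNumericEntitys($entity) {
  $char = html_entity_decode($entity[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
  // Unknown entities come back unchanged; leave them alone rather than emit a wrong code.
  // mb_ord() is used because ord() only looks at the first byte of a multibyte character.
  return $char === $entity[0] ? $entity[0] : '&#'.mb_ord($char, 'UTF-8').';';
}

$content = preg_replace_callback(
  '/&([\w\d]+);/i',
  '_convertAlphaEntitysToNumericEntitys',
  $content);

// Convert every remaining non-ASCII character to a numeric reference.
function _convertAsciOver127toNumericEntitys($entity) {
  return '&#'.mb_ord($entity[0], 'UTF-8').';';
}

$content = preg_replace_callback(
  '/[^\x00-\x7F]/u',
  '_convertAsciOver127toNumericEntitys', $content);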

Answer 5:

This question is a general problem for any language that parses XML or JSON (so, basically, every language).

The above answers are for PHP, but a Perl solution would be as easy as...

use HTML::Entities;

# Characters to leave alone; the leading ^ negates the set, so everything else is encoded.
my $excluderegex =
    '^\n\x20-\x20' .   # don't encode newlines or spaces
    '\x30-\x39' .      # don't encode digits
    '\x41-\x5a' .      # don't encode uppercase letters
    '\x61-\x7a' ;      # don't encode lowercase letters

# in case anything is already encoded
$value = HTML::Entities::decode_entities($value);

# encode everything else as numeric references
$value = HTML::Entities::encode_numeric($value, $excluderegex);