THE PROBLEM: I need a XML file "full encoded" by UTF8; that is, with no entity representing symbols, all symbols enconded by UTF8, except the only 3 ones that are XML-reserved, "&" (amp), "<" (lt) and ">" (gt). And, I need a build-in function that do it fast: to transform entities into real UTF8 characters (without corrupting my XML).
PS: it is a "real world problem" (!); at PMC/journals, for example, have 2.8 MILLION of scientific articles enconded with a special XML DTD (knowed also as JATS format)... To process as "usual XML-UTF8-text" we need to change from numeric entity to UTF8 char.
THE ATTEMPTED SOLUTION: the natural function to this task is html_entity_decode, but it destroys the XML code (!), transforming the reserved 3 XML-reserved symbols.
Illustrating the problem
Suppose
$xmlFrag ='<p>Hello world!    Let A<B and A=∬dxdy</p>';
Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF8, and the XML-reserved lt
not. The XML text will be (after transformed),
$xmlFrag = '<p>
Hello world! Let A<
B and A=∬dxdy</p>
';
The text "A<B" needs an XML-reserved character, so MUST stay as A<B
.
Frustrated solutions
I try to use html_entity_decode
for solve (directly!) the problem... So, I updated my PHP to v5.5 to try to use the ENT_XML1
option,
$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
// as I expected
Perhaps another question is, "WHY there are no other option to do what I expected?" -- it is important for many other XML applications (!), not only for me.
I not need a workaround as answer... Ok, I show my ugly function, perhaps it helps you to understand the problem,
function xml_entity_decode($s) {
// here an illustration (by user-defined function)
// about how the hypothetical PHP-build-in-function MUST work
static $XENTITIES = array('&','>','<');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
//$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any php version
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
} // you see? not need a benchmark:
// it is not so fast as direct use of html_entity_decode; if there
// was an XML-safe option was ideal.
PS: corrected after this answer. Must be ENT_HTML5
flag, for convert really all named entities.
I had the same problem because someone used HTML templates to create XML, instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert
#_x_amp#;
to$amp;
, however unlikely its presence in the source XML.Note: I'm assuming default encoding is UTF-8
Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:
In your case you want it the other way around. Encode numbered entities as UTF-8:
Benchmark
I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.
Your method
My method
My method (with unicode to numbered entity):
My method (with numbered entity to unicode):
Try this function:
example usage:
also: https://stackoverflow.com/a/9446666/2312709
this code used in production seem that no problems happened with UTF-8
This question is creating, time-by-time, a "false answer" (see answers). This is perhaps because people not pay attention, and because there are NO ANSWER: there are a lack of PHP build-in solution.
... So, lets repeat my workaround (that is NOT an answer!) to not create more confusion:
The best workaround
Pay attention:
xml_entity_decode()
below is the best (over any other) workaround.To test and to demonstrate that you have a better solution, please test first with this simple benckmark:
Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving: