I'm trying to parse a specific HTML document, a kind of dictionary with about 10,000 words and their descriptions. It went well until I noticed that entries in one particular format don't get parsed correctly.
Here is an example:
<?php
$html = '
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
';
$xml = simplexml_load_string($html);
var_dump($xml);
?>
The result of var_dump() is:
object(SimpleXMLElement)#1 (2) {
["b"]=>
object(SimpleXMLElement)#2 (1) {
["span"]=>
string(10) "zot; zotz "
}
["span"]=>
string(39) "Nista; nula. Isto
"
}
As you can see, SimpleXML kept the text node inside the <span> tag but left out the child <b> node and the text inside it.
I've also tried:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
with the same result.
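To narrow this down, here is a minimal sketch that checks whether the nested element is really missing from the tree or only missing from the var_dump() output. It assumes the same markup as in the example above, collapsed onto one line, and uses only property access, asXML() and dom_import_simplexml(). The variable names $nested, $markup and $text are just for illustration.

```php
<?php
// Same structure as the example above, collapsed onto one line.
$html = '<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto <b>zilch; zip.</b></span></p>';
$xml = simplexml_load_string($html);

// Property access: <p> is the root, so ->span is its direct <span> child,
// and ->b is the <b> nested inside that <span>.
$nested = (string) $xml->span->b;

// asXML() serializes the whole subtree, including mixed content.
$markup = $xml->span->asXML();

// The DOM view of the same node exposes the concatenated text of all descendants.
$text = dom_import_simplexml($xml->span)->textContent;

echo $nested, PHP_EOL;
echo $markup, PHP_EOL;
echo $text, PHP_EOL;
```

If this prints the nested text, the element is still in the tree and the var_dump() view is what drops it for elements that mix text and child nodes.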
Since this looked like a common problem when parsing HTML, I tried searching for it, but the only place I found that acknowledges it is this blog post: https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/ and it does not offer a solution.
The posts and answers about parsing HTML on SO are too general to help here.
Is there a simple way of dealing with this? Or should I change my strategy?