How to “refresh” DOMDocument instances of LibXML2?

Posted 2019-06-12 12:41

Question:

Using PHP to illustrate: there is either a bug in the normalizeDocument() method or a missing "refresh" method, because DOM consistency is lost after changes (even attribute-only changes). So any algorithm "with DOM changes" that you implement with LibXML2 sometimes works and sometimes does not; it is unpredictable (?).

The "refresh" via $doc->loadXML($doc->saveXML()); is a workaround, and it costs performance in a workflow of DOM operations. A sub-question: do I need to refresh the DOM every time?

  $XML = '
  <html>
    <h1>Hello</h1>
    <ol>
        <li>test (no id)</li>
        <li xml:id="i2">test i2</li>
    </ol>
  </html>
  ';
  $doc = new DOMDocument;
  $doc->loadXML($XML);
  doSomeChange($doc);    // here DOM is modified
  print $doc->saveXML(); // show new DOM state

  $doc->normalizeDocument(); // does NOT refresh!?
  var_dump($doc->getElementById('i2'));  // NULL!? This is the bug!
  // no further doSomeChange($doc)-style changes are possible here

  $doc->loadXML($doc->saveXML());        // only way to refresh?
  print $doc->getElementById('i2')->tagName;  // OK, it is there

  // illustrating attribute modification:
  function doSomeChange(&$dom) {
    $max = 0;
    $xp  = new DOMXpath($dom);
    foreach(iterator_to_array($xp->query('/html/* | //li')) as $e) {
        $max++;
        $e->setAttribute('xml:id',"i$max");
    }
    print "\ncmpDOM='".($xp->document === $dom)."'\n"; // after @ThomasWeinert
  }

So the input is $XML and the output is:

  <html>
            <h1 xml:id="i1">Hello</h1>
            <ol xml:id="i2">
                <li xml:id="i3">test (no id)</li>
                <li xml:id="i4">test i2</li>
            </ol>
        </html>
  NULL
  ol

The NULL is the bug (see code comments).

PS: if I change the input line <li xml:id="i2">test i2</li> to <li>test i2</li>, the algorithm works as expected (!), so the behavior is unpredictable.


Related questions: "In DomDocument, reuse of DOMXpath, it is stable?" and "PHP DomDocument, reuse of XSLTProcessor, it is stable/secure?"

Answer 1:

Changes are applied to the DOM the moment you make them. In your example this creates a state in which two elements have the same xml:id, and this seems to break the ID index. Remove the xml:id attributes before setting them and it works:

  $XML = '
  <html>
    <h1>Hello</h1>
    <ol>
        <li>test (no id)</li>
        <li xml:id="i2">test i2</li>
    </ol>
  </html>
  ';
  $doc = new DOMDocument;
  $doc->loadXML($XML);
  var_dump($doc->getElementById('i2'), $doc->getElementById('i2')->tagName);
  /*
    object(DOMElement)#2 (0) { }
    string(2) "li"
  */

  doSomeChange($doc);    // here DOM is modified

  var_dump($doc->getElementById('i2'), $doc->getElementById('i2')->tagName);
  /*
    object(DOMElement)#6 (0) { }
    string(2) "ol"
  */

  print $doc->saveXML(); // show new DOM state
  /*
  <?xml version="1.0"?>
  <html>
    <h1 xml:id="i1">Hello</h1>
    <ol xml:id="i2">
      <li xml:id="i3">test (no id)</li>
      <li xml:id="i4">test i2</li>
    </ol>
  </html>
  */

  // illustrating xml:id attribute modification:
  function doSomeChange($dom) {
    $xp  = new DOMXpath($dom);
    foreach($xp->evaluate('//*') as $e) {
      $e->removeAttribute('xml:id');
    }
    $max = 0;
    foreach($xp->evaluate('/html/*|//li') as $e) {
      $max++;
      $e->setAttribute('xml:id',"i$max");
    }
  }

Your specific DOM modification is what breaks the getElementById() calls.
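To check whether a document is in the problematic state described above (two elements sharing one xml:id), a small diagnostic can help. The following is only a sketch: findDuplicateXmlIds() is a hypothetical helper name, not part of the DOM extension, and it assumes the attributes live in the standard XML namespace.

```php
<?php
// Hypothetical helper (not part of ext/dom): lists xml:id values that
// occur on more than one element of the document.
function findDuplicateXmlIds(DOMDocument $doc): array
{
    $xmlNs = 'http://www.w3.org/XML/1998/namespace';
    $counts = [];
    // '*' matches every element in the document, in document order.
    foreach ($doc->getElementsByTagName('*') as $element) {
        if (!$element->hasAttributeNS($xmlNs, 'id')) {
            continue;
        }
        $id = $element->getAttributeNS($xmlNs, 'id');
        $counts[$id] = ($counts[$id] ?? 0) + 1;
    }
    $duplicates = [];
    foreach ($counts as $id => $count) {
        if ($count > 1) {
            $duplicates[] = (string) $id;
        }
    }
    return $duplicates;
}
```

Calling it inside the modification loop, right after each setAttribute(), would show the moment at which the ol and the old li both carry "i2". Removing all xml:id attributes first, as in the code above, avoids that intermediate state.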

Regarding the "stability" question: the connection between DOMXpath and DOMDocument is not completely "stable". If you call a load*() method on the DOMDocument, the connection is lost. You can verify that the DOMXpath uses the correct DOMDocument by comparing its document property:

var_dump($xpath->document === $doc);

This does not happen in your case because you always create a new DOMXpath instance in the function. But it does mean you should avoid reloading the document, because that breaks any DOMXpath instances created for it.
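If a reload cannot be avoided, the document comparison can be wrapped in a small guard. This is a minimal sketch, assuming that recreating the DOMXpath is acceptable in your code; xpathFor() is a hypothetical helper name, not a library function.

```php
<?php
// Hypothetical helper: returns the given DOMXPath if it is still bound
// to $doc, otherwise creates a fresh instance for $doc.
function xpathFor(DOMDocument $doc, ?DOMXPath $xpath = null): DOMXPath
{
    if ($xpath !== null && $xpath->document === $doc) {
        return $xpath; // still attached to this document object
    }
    return new DOMXPath($doc); // stale or missing: build a new one
}

$doc = new DOMDocument();
$doc->loadXML('<html><h1>Hello</h1></html>');
$xp = xpathFor($doc);

// Reloading replaces the underlying document, so re-check before reuse.
$doc->loadXML($doc->saveXML());
$xp = xpathFor($doc, $xp);
print $xp->query('//h1')->length . "\n";
```

Note that any namespaces or PHP functions registered on the old DOMXpath would have to be registered again on the new instance.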