How to saveHTML of DOMDocument without HTML wrapper?

Posted 2019-01-01 00:36

In the function below, I'm struggling to output the DOMDocument without it adding the XML, HTML, body and p tag wrappers before the content. The suggested fix:

$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));

This only works when the content has no block-level elements inside it. When it does, as in the example below with the h1 element, the output from saveXML is truncated to...

<p>If you like</p>

I've been pointed to this post as a possible workaround, but I can't work out how to apply it to this solution (see the commented-out attempts below).

Any suggestions?

function rseo_decorate_keyword($postarray) {
    global $post;
    $keyword = "Jasmine Tea";
    $content = "If you like <h1>jasmine tea</h1> you will really like it with Jasmine Tea flavors. This is the last occurrence of the phrase jasmine tea within the content. If there are other instances of the keyword jasmine tea within the text what happens to jasmine tea.";
    $d = new DOMDocument();
    @$d->loadHTML($content);
    $x = new DOMXpath($d);
    $count = $x->evaluate("count(//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and (ancestor::b or ancestor::strong)])");
    if ($count > 0) return $postarray;
    $nodes = $x->query("//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$keyword') and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6) and not(ancestor::b) and not(ancestor::strong)]");
    if ($nodes && $nodes->length) {
        $node = $nodes->item(0);
        // Split just before the keyword
        $keynode = $node->splitText(strpos($node->textContent, $keyword));
        // Split after the keyword
        $node->nextSibling->splitText(strlen($keyword));
        // Replace keyword with <b>keyword</b>
        $replacement = $d->createElement('strong', $keynode->textContent);
        $keynode->parentNode->replaceChild($replacement, $keynode);
    }
    $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
    //  $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->item(1));
    //  $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->childNodes);
    return $postarray;
}

24 answers
牵手、夕阳
Reply #2 · 2019-01-01 01:12

Just remove the nodes directly after loading the document with loadHTML():

# remove <!DOCTYPE 
$doc->removeChild($doc->doctype);           

# remove <html><body></body></html> 
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
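
As a rough illustration of what these two calls do, here is a minimal sketch. The helper name stripWrappers() and the sample markup are assumptions made for this example. It also shows the limitation reported further down the thread: when the fragment has several top-level elements, only the first one survives.

```php
<?php
// Hypothetical helper wrapping the two calls above; not part of the original answer.
function stripWrappers(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Remove the <!DOCTYPE ...> node that loadHTML() adds.
    $doc->removeChild($doc->doctype);

    // Replace <html> with the first child of <body>.
    $doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

    return trim($doc->saveHTML());
}

// One top-level element: the whole fragment is kept.
$single = stripWrappers('<div><p>One</p><p>Two</p></div>');

// Two top-level elements: only the first <p> is kept, the second is lost.
$multi = stripWrappers('<p>One</p><p>Two</p>');
```

So this approach seems safest when the fragment is known to have a single root element.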
余生请多指教
Reply #3 · 2019-01-01 01:13

I'm a bit late to the party, but I wanted to share a method I found. First of all, I have the right versions for loadHTML() to accept these options, but LIBXML_HTML_NOIMPLIED didn't work on my system. Users also report problems with the parser (for example here and here).

The solution I created is actually pretty simple.

The HTML to be loaded is wrapped in a <div> element, so that a single container holds all the nodes to be loaded.

Then this container element is removed from the document (but its DOMElement still exists).

Then all direct children of the document are removed. This includes any added <html>, <head> and <body> tags (effectively the LIBXML_HTML_NOIMPLIED option) as well as the <!DOCTYPE html ... loose.dtd"> declaration (effectively LIBXML_HTML_NODEFDTD).

Then all direct children of the container are added to the document again, and it can be output.

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();

$doc->loadHTML("<div>$str</div>");

$container = $doc->getElementsByTagName('div')->item(0);

$container = $container->parentNode->removeChild($container);

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

while ($container->firstChild ) {
    $doc->appendChild($container->firstChild);
}

$htmlFragment = $doc->saveHTML();

XPath works as usual; just keep in mind that there are now multiple document elements, so there is no single root node:

$xpath = new DOMXPath($doc);
foreach ($xpath->query('/p') as $element)
{   #                   ^- note the single slash "/"
    # ... each of the two <p> elements
}

Tested with:

  • PHP 5.4.36-1+deb.sury.org~precise+2 (cli) (built: Dec 21 2014 20:28:53)
无与为乐者.
Reply #4 · 2019-01-01 01:13

I have PHP 5.3 and the answers here did not work for me.

$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild); replaced the whole document with only the first child: I had many paragraphs and only the first was being saved. Still, that solution gave me a good starting point to write something without regex. I left some comments, and I am pretty sure this can be improved, but if someone has the same problem as me it can be a good starting point.

function extractDOMContent($doc){
    # remove <!DOCTYPE
    $doc->removeChild($doc->doctype);

    // lets get all children inside the body tag
    foreach ($doc->firstChild->firstChild->childNodes as $k => $v) {
        if($k !== 0){ // don't store the first element since that one will be used to replace the html tag
            $doc->appendChild( clone($v) ); // appending element to the root so we can remove the first element and still have all the others
        }
    }
    // replace the body tag with the first children
    $doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
    return $doc;
}

Then we could use it like this:

$doc = new DOMDocument();
$doc->encoding = 'UTF-8';
$doc->loadHTML('<p>Some html here</p><p>And more html</p><p>and some html</p>');
$doc = extractDOMContent($doc);

Note that appendChild accepts a DOMNode, so we do not need to create new elements; we can just reuse existing ones that extend DOMNode, such as DOMElement. This can be important for keeping code "sane" when manipulating multiple HTML/XML documents.
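
One caveat when several documents are involved: a node that belongs to another DOMDocument has to be copied over with importNode() before appendChild() will accept it. The following is a minimal sketch of that step; the sample markup and variable names are assumptions made for this illustration.

```php
<?php
// Illustrative documents; the markup is made up for this example.
$source = new DOMDocument();
$source->loadHTML('<p>From another document</p>');

$target = new DOMDocument();
$target->loadHTML('<div id="box"></div>');

$paragraph = $source->getElementsByTagName('p')->item(0);
$box       = $target->getElementsByTagName('div')->item(0);

// importNode() returns a copy owned by $target; "true" requests a deep copy.
$imported = $target->importNode($paragraph, true);
$box->appendChild($imported);

// Serialize only the <div>, so no <html>/<body> wrappers are printed.
$html = $target->saveHTML($box);
```

Appending $paragraph directly, without importNode(), would fail with a "Wrong Document Error".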

裙下三千臣
Reply #5 · 2019-01-01 01:16

Much like other members, I first revelled in the simplicity and awesome power of @Alessandro Vendruscolo's answer. The ability to simply pass some flag constants to loadHTML() seemed too good to be true. For me it was. I have the correct versions of both libxml and PHP; however, no matter what I tried, it would still add the HTML tag to the node structure of the document object.

My solution worked much better than using the flags:

$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

or node removal:

# remove <!DOCTYPE
$doc->removeChild($doc->firstChild);

# remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

Node removal gets messy without a structured order in the DOM. Again, with code fragments there is no way to predetermine the DOM structure.
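
For comparison, here is a minimal sketch of the flag-based variant on builds where the flags are honored (several posters in this thread report that they were not). The wrapper <div> and the sample content are assumptions made for this example. The wrapper is looked up by tag name instead of being assumed to be the root, so the sketch still yields the inner fragment if <html> and <body> get added anyway.

```php
<?php
// Illustrative input; any HTML fragment could be used here.
$content = 'If you like <h1>jasmine tea</h1> you will like it.';

$doc = new DOMDocument();
$doc->loadHTML(
    '<div>' . $content . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Find the wrapper by tag name; it is the first <div> in document order.
$wrapper = $doc->getElementsByTagName('div')->item(0);

// Serialize each child of the wrapper, leaving out the wrapper itself.
$fragment = '';
foreach ($wrapper->childNodes as $child) {
    $fragment .= $doc->saveHTML($child);
}
```

Passing a node to saveHTML() needs PHP 5.3.6 or later, and the two flags need PHP 5.4 with a recent enough libxml.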

I started this journey wanting a simple way to do DOM traversal the way jQuery does it, or at least in some fashion that had a structured data set: singly linked, doubly linked, or tree-based node traversal. I didn't care how, as long as I could parse a string the way HTML does and also have the power of the node entity class properties to use along the way.

So far the DOMDocument object has left me wanting, as it seems to have done for many other programmers. I have seen a lot of frustration in this question, so now that I have FINALLY (after roughly 30 hours of trial-and-error testing) found a way to get it all, I hope this helps someone.

First off, I am cynical of EVERYTHING... lol...

I would have gone a lifetime before agreeing with anyone that a third-party class is in any way needed in this use case. I very much was, and am, NOT a fan of using any third-party class structure; however, I stumbled onto a great parser (about 30 times in Google before I gave in, so don't feel alone if you avoided it because it looked lame or unofficial in any way).

If you are using code fragments and need the code clean and unaffected by the parser in any way, without extra tags being added, then use simplePHPParser.

It's amazing and acts a lot like jQuery. I am not often impressed, but this class makes use of a lot of good tools and I have had no parsing errors as of yet. I am a huge fan of being able to do what this class does.

You can find its files to download here, its startup instructions here, and its API here. I highly recommend using this class, with its simple methods that can do a .find(".className") the same way a jQuery find method would be used, or even familiar methods such as getElementByTagName() or getElementById().

When you save out a node tree in this class, it doesn't add anything at all. You can simply say $doc->save(); and it outputs the entire tree to a string without any fuss.

I will now be using this parser for all non-capped-bandwidth projects in the future.

永恒的永恒
Reply #6 · 2019-01-01 01:16

My server runs PHP 5.3 and can't be upgraded, so the options

LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD

are not available to me.

To solve this, I tell the saveXML function to print the body element and then just replace "body" with "div".

Here is my code; I hope it helps someone:

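<?php
$html = "your html here";
$tabContentDomDoc = new DOMDocument();
// The XML declaration prefix makes loadHTML() read the input as UTF-8
$tabContentDomDoc->loadHTML('<?xml encoding="UTF-8">'.$html);
$tabContentDomDoc->encoding = 'UTF-8';
$tabContentDomDocBody = $tabContentDomDoc->getElementsByTagName('body')->item(0);
if(is_object($tabContentDomDocBody)){
    // Serialize only <body>, then rename the tag to <div>
    // (note: str_replace also rewrites any "body" inside the content itself)
    echo (str_replace("body","div",$tabContentDomDoc->saveXML($tabContentDomDocBody)));
}
?>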

The UTF-8 handling is for Hebrew support.
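
A variation on the same idea, sketched here under assumptions, serializes each child of <body> separately instead of renaming the tag. This avoids the str_replace() call, which would also rewrite the word "body" wherever it appears in the content (for example inside "Everybody"). The helper name bodyInnerHtml() and the sample input are made up for this illustration, and saveHTML() with a node argument needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper; not part of the original answer.
function bodyInnerHtml(string $html): string
{
    $doc = new DOMDocument();
    // Same encoding hint as in the answer above; @ hides parser warnings.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Concatenate the serialized children, leaving out <body> itself.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = bodyInnerHtml('<p>Everybody likes tea</p><h1>שלום</h1>');
```

Because no tag renaming is involved, block-level elements such as the h1 are kept alongside the paragraph and the text content is left untouched.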

不再属于我。
Reply #7 · 2019-01-01 01:18

I may be too late, but maybe somebody (like me) still has this issue.
None of the above worked for me, because $dom->loadHTML also closes open tags, in addition to adding html and body tags.
So adding a <div> element does not work for me, because I sometimes have 3-4 unclosed divs in the HTML piece.
My solution:

1.) Add a marker to cut at, then load the HTML piece

$html_piece = "[MARK]".$html_piece."[/MARK]";
$dom->loadHTML($html_piece);

2.) Do whatever you want with the document
3.) Save the HTML

$new_html_piece = $dom->saveHTML();

4.) Before returning it, remove the <p></p> tags around the marker; strangely, they only appear around [MARK] and not around [/MARK]

$new_html_piece = preg_replace( "/<p[^>]*?>(\[MARK\]|\s)*?<\/p>/", "[MARK]" , $new_html_piece );

5.) Remove everything before and after the markers

$pattern_contents = '{\[MARK\](.*?)\[\/MARK\]}is';
if (preg_match($pattern_contents, $new_html_piece, $matches)) {
    $new_html_piece = $matches[1];
}

6.) Return it

return $new_html_piece;

It would be a lot easier if LIBXML_HTML_NOIMPLIED worked for me. It should, but it does not (PHP 5.4.17, libxml version 2.7.8).
I find it really strange: I use the HTML DOM parser and then, to fix this "thing", I have to use regex... The whole point was not to use regex ;)
