Get text outside of elements

Posted 2019-08-14 02:46

Question:

I am using Simple HTML DOM to scrape a website. The problem I have run into is that some of the text I need is not wrapped in an element of its own. The only element that encloses it is <div id="content">.

<div id="content">
    <div class="image-wrap"></div>
    <div class="gallery-container"></div>
    <h3 class="name">Here is the Heading</h3>

    All the text I want is located here !!!

    <p> </p>
    <div class="snapshot"></div>
</div>

I guess the webmaster made a mistake and the text should actually be inside the <p> tags.

I've tried the code below, but it doesn't retrieve the text:

    // Ask for the first text node under div#content (index 0)
    $t = $scrape->find("div#content text", 0);
    if ($t !== null) {
        $text = trim($t->plaintext);
    }

I'm still a newbie and still learning. Can anyone help?

Answer 1:

You're almost there. Use a test loop to print the content of each text node and find the index of the text you want. For example:

// Find all text nodes under div#content ($scrape is the parsed document from the question)
$texts = $scrape->find('div#content text');

foreach ($texts as $key => $txt) {
    // Display each text and its parent's tag name
    echo "<br/>TEXT $key is ", $txt->plaintext, " -- in TAG ", $txt->parent()->tag;
}

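You'll find that the text you want is at index 4 rather than 0, since the whitespace between the earlier tags and the heading's own text show up as text nodes too: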

$scrape->find("div#content text",4);

If your text doesn't always have the same index, but you know, for example, that it follows the h3 heading, you could use something like:

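foreach ($texts as $key => $txt) {
    // Locate the text node that belongs to the h3 heading
    if ($txt->parent()->tag == 'h3') {
        // Grab the next entry from $texts, if there is one
        if (isset($texts[$key + 1])) {
            echo trim($texts[$key + 1]->plaintext);
        }
        // Stop after the first heading
        break;
    }
}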
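
For comparison, here is a rough sketch of the same "text right after the h3" idea using PHP's built-in DOM extension and XPath instead of Simple HTML DOM. This is a different technique from the one above, not part of the library being discussed. The helper name textAfterHeading and the $source variable are made up for illustration, and the markup is simply the snippet from the question.

```php
<?php
// Sample markup copied from the question.
$source = <<<'HTML'
<div id="content">
    <div class="image-wrap"></div>
    <div class="gallery-container"></div>
    <h3 class="name">Here is the Heading</h3>

    All the text I want is located here !!!

    <p> </p>
    <div class="snapshot"></div>
</div>
HTML;

// Hypothetical helper: returns the first non-blank text node that follows
// the h3 directly inside div#content, or null when there is none.
function textAfterHeading(string $markup): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally so imperfect markup stays quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // following-sibling::text() walks the siblings after the h3;
    // [normalize-space()] skips whitespace-only nodes; [1] keeps the first.
    $nodes = $xpath->query(
        '//div[@id="content"]/h3/following-sibling::text()[normalize-space()][1]'
    );

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

echo textAfterHeading($source), PHP_EOL;
```

Because the XPath filters out whitespace-only nodes, this version does not depend on a fixed index such as 4. It still assumes the wanted text is a direct child of div#content that comes after the h3.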