Extract all the text and img tags from HTML in PHP

Posted 2019-02-21 08:23

Possible Duplicate:
Best methods to parse HTML with PHP

For a project I need to take an HTML page and extract all of its text and img tags, keeping them in the same order they appear in the web page.

So for example, if the web page is:

<p>Hi</p>
<a href ="test.com" alt="a link"> text link</a>
<img src="test.png" />
<a href ="test.com"><img src="test2.png" /></a>

I would like to retrieve that information with this format:

text - Hi
Link1 - <a href ="test.com">text link</a>  (note: no alt or other attributes)
Img1 - test.png
Link2 - <a href ="test.com"><img src="test2.png" /></a>  (again, no extra attributes)

Is there a way to make that in PHP?

Tags: php html parsing
2 answers
We Are One
Reply #2 · 2019-02-21 08:57

I would use an HTML Parser to pull the information out of the website. Get reading.
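
To make the parser suggestion concrete, here is a minimal sketch using PHP's bundled DOMDocument and DOMXPath. The function name listTextAndImages() is made up for illustration and is not from this thread. It only covers text nodes and img tags (links are not kept as units), and it assumes the input is ASCII or declares its own charset, since loadHTML() otherwise guesses the encoding.

```php
<?php
// Hypothetical helper (not from the thread): returns the non-empty text
// nodes and <img> sources of an HTML string, in document order.
function listTextAndImages(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // One location path, so the nodes come back in document order:
    // every <img>, plus every text node that is not pure whitespace.
    $query = '//body//node()[self::img or self::text()[normalize-space()]]';

    $items = [];
    foreach ($xpath->query($query) as $node) {
        if ($node instanceof DOMElement) {
            $items[] = 'img - ' . $node->getAttribute('src');
        } else {
            $items[] = 'text - ' . trim($node->nodeValue);
        }
    }
    return $items;
}
```

Called on the sample page from the question, this should give four entries: the two text runs ("Hi" and "text link") followed by the two image sources.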

虎瘦雄心在
Reply #3 · 2019-02-21 09:00

Is there a way to make that in PHP?

Yes. First strip all the tags you're not interested in, then use DOMDocument to remove the unwanted attributes. Finally, run strip_tags again to remove the wrapper tags (such as <html> and <body>) that DOMDocument adds:

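// $html is assumed to already hold the page source as a string.
$allowed_tags = '<a><img>';
$allowed_attributes = array('href', 'src');

// Step 1: drop every tag except <a> and <img> (their text content stays).
$html = strip_tags($html, $allowed_tags);
$dom = new DOMDocument();

$dom->loadHTML($html);

// Step 2: remove every attribute that is not on the whitelist.
foreach ($dom->getElementsByTagName('*') as $node)
{
    // Iterate over a copy: removing attributes from the live
    // DOMNamedNodeMap while looping over it can skip entries.
    foreach (iterator_to_array($node->attributes) as $attribute)
    {
        if (in_array($attribute->name, $allowed_attributes)) continue;
        $node->removeAttributeNode($attribute);
    }
}

// Step 3: serialize the <body>, then strip the wrapper tags DOMDocument added.
$html = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
$html = strip_tags($html, $allowed_tags);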
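
The snippet above leaves you with cleaned HTML, not yet the labelled list from the question. One way to get that list is to walk the DOM tree in document order, roughly as sketched here. The names formatInOrder(), collectLines() and stripAttributes() are hypothetical, the numbering scheme (Link1, Img1, ...) is my reading of the requested format, and images inside a link are reported as part of the link rather than separately.

```php
<?php
// Sketch only: all three function names are hypothetical.
function formatInOrder(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $lines = [];
    $counters = ['Link' => 0, 'Img' => 0];
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        collectLines($body, $lines, $counters);
    }
    return $lines;
}

// Depth-first walk, so entries are appended in document order.
function collectLines(DOMNode $parent, array &$lines, array &$counters): void
{
    foreach ($parent->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $lines[] = 'text - ' . $text;
            }
        } elseif (!$child instanceof DOMElement) {
            continue; // comments and other non-element nodes
        } elseif ($child->tagName === 'a') {
            // Keep the link as one unit, with only href/src attributes left.
            stripAttributes($child, ['href', 'src']);
            foreach ($child->getElementsByTagName('*') as $inner) {
                stripAttributes($inner, ['href', 'src']);
            }
            $lines[] = 'Link' . ++$counters['Link'] . ' - '
                . $child->ownerDocument->saveHTML($child);
        } elseif ($child->tagName === 'img') {
            $lines[] = 'Img' . ++$counters['Img'] . ' - ' . $child->getAttribute('src');
        } else {
            collectLines($child, $lines, $counters); // e.g. <p>, <div>
        }
    }
}

// Removes every attribute whose name is not in $allowed.
function stripAttributes(DOMElement $element, array $allowed): void
{
    // Copy first: the attribute map is live and shrinks during removal.
    foreach (iterator_to_array($element->attributes) as $attribute) {
        if (!in_array($attribute->name, $allowed, true)) {
            $element->removeAttributeNode($attribute);
        }
    }
}
```

For the sample page in the question, formatInOrder() should return four lines: one text entry, Link1, Img1 and Link2. Note that saveHTML() serializes img as <img src="test2.png"> without the trailing slash.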

