How to extract img src, title and alt from html us

Posted 2018-12-31 03:24

I would like to create a page that lists all the images on my website along with their title and alt text.

I already wrote a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML:

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

I guess this should be done with a regex, but since the order of the attributes may vary and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).

21 answers
听够珍惜
#2 · 2018-12-31 04:26

How about using a regular expression to find the img tags (something like "<img[^>]*>"), and then, for each img tag, using another regular expression to find each attribute?

Maybe something like " ([a-zA-Z]+)=\"([^"]*)\"" to find the attributes, though you might want to allow for missing quotes if you're dealing with tag soup. If you go with that, you can read the attribute name and value from the capture groups of each match.
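The two-step idea above could be sketched in PHP roughly as follows. The function name extractImgAttributes and the sample markup are assumptions made for illustration, and the sketch only handles double-quoted attribute values, so it will miss unquoted or single-quoted ones:

```php
<?php
// Hypothetical helper (not from the answer): two-step regex extraction.
function extractImgAttributes(string $html): array
{
    $images = [];
    // Step 1: find every <img ...> tag in the document.
    preg_match_all('/<img[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        $attrs = [];
        // Step 2: find name="value" pairs inside this one tag only.
        preg_match_all('/\s([a-zA-Z]+)="([^"]*)"/', $tag, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            // Group 1 is the attribute name, group 2 its value.
            $attrs[strtolower($pair[1])] = $pair[2];
        }
        $images[] = $attrs;
    }
    return $images;
}

// Sample input reusing the tag from the question.
$html = '<p>Hi</p><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />';
print_r(extractImgAttributes($html));
```

Because the second pattern runs on each tag separately, the order of the attributes inside the tag does not matter.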

临风纵饮
#3 · 2018-12-31 04:27

Just to give a small example of using PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using XPath and working with the XPath results simpler.
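As a variation on the same approach, here is a minimal sketch that skips the SimpleXML conversion and queries the DOM with DOMXPath directly. The function name listImages is hypothetical, and silencing parser warnings with libxml_use_internal_errors() is an added assumption about how one might want to handle sloppy markup, not part of the answer above:

```php
<?php
// Hypothetical helper: collect src, title and alt for every <img> in an HTML string.
function listImages(string $html): array
{
    // Collect libxml warnings internally so malformed HTML does not emit PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $result[] = [
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),
        ];
    }
    return $result;
}

$html = '<html><body><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" /></body></html>';
print_r(listImages($html));
```

Since the parser does the work, attribute order and quoting style make no difference here.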

人气声优
#4 · 2018-12-31 04:27

If you want to use a regex, why not something as simple as this:

preg_match_all('% (.*)=\"(.*)\"%Uis', $code, $matches, PREG_SET_ORDER);

This will return something like:

array(2) {
    [0]=>
    array(3) {
        [0]=>
        string(10) " src="abc""
        [1]=>
        string(3) "src"
        [2]=>
        string(3) "abc"
    }
    [1]=>
    array(3) {
        [0]=>
        string(10) " bla="123""
        [1]=>
        string(3) "bla"
        [2]=>
        string(3) "123"
    }
}
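To turn that match list into a name => value map, something like the following might work. The value of $code is an assumed sample (the answer does not show its input), chosen to be consistent with the dump above:

```php
<?php
// Assumed sample input; the original answer does not show what $code contained.
$code = '<img src="abc" bla="123">';

// Same pattern as above: U makes the quantifiers ungreedy, i ignores case, s lets . match newlines.
preg_match_all('% (.*)=\"(.*)\"%Uis', $code, $matches, PREG_SET_ORDER);

// Use group 1 (attribute name) as the key and group 2 (value) as the value.
$attributes = array_column($matches, 2, 1);
print_r($attributes); // ['src' => 'abc', 'bla' => '123']
```

Note that the pattern is not tied to img tags, so it is probably best applied to a single tag at a time; run on a whole page, it would also pick up attributes of every other element.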