I would like to create a page where all images which reside on my website are listed with title and alternative representation.
I have already written a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML:
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
I guess this should be done with some regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).
How about using a regular expression to find the img tags (something like "<img[^>]*>"), and then, for each img tag, you could use another regular expression to find each attribute. Maybe something like " ([a-zA-Z]+)=\"([^"]*)\"" to find the attributes, though you might want to allow for quotes not being there if you're dealing with tag soup... If you went with that, you could get the parameter name and value from the groups within each match.

Just to give a small example of using PHP's XML functionality for the task:
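A minimal sketch of how such a DOM-based version might look. The sample markup, the second image (/image/other.jpg) and the variable names are assumptions for illustration only:

```php
<?php
// Sample markup; in practice this would be the contents of each loaded HTML file.
// The second <img> is a made-up example with a different attribute order and no title.
$html = '<html><body>'
      . '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />'
      . '<img alt="no title here" src="/image/other.jpg">'
      . '</body></html>';

// Collect libxml warnings internally instead of emitting them for sloppy markup.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Optional step: SimpleXML gives array-style attribute access on XPath results.
$xml = simplexml_import_dom($doc);

$images = [];
foreach ($xml->xpath('//img') as $img) {
    // A missing attribute casts to an empty string.
    $images[] = [
        'src'   => (string) $img['src'],
        'title' => (string) $img['title'],
        'alt'   => (string) $img['alt'],
    ];
}

print_r($images);
```

Because the attributes are looked up by name, their order inside the tag does not matter.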
I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary - it just makes using XPath and the XPath results simpler.

If you want to use a regex, why not something as easy as this:
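A short sketch of what a regex version could look like. It assumes double-quoted attribute values and no > character inside them, and the second image in the sample markup is a made-up example:

```php
<?php
// Sample markup; the second <img> is hypothetical and shows a different attribute order.
$html = '<p><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" /></p>'
      . '<p><img alt="second" src="/image/other.jpg"></p>';

$images = [];

// Step 1: find every img tag.
preg_match_all('/<img\b[^>]*>/i', $html, $tags);

foreach ($tags[0] as $tag) {
    // Step 2: pick src/title/alt out of this tag, in whatever order they appear.
    preg_match_all('/\s(src|title|alt)\s*=\s*"([^"]*)"/i', $tag, $attrs, PREG_SET_ORDER);

    // Attributes that are absent stay null.
    $image = ['src' => null, 'title' => null, 'alt' => null];
    foreach ($attrs as $attr) {
        $image[strtolower($attr[1])] = $attr[2];
    }
    $images[] = $image;
}

print_r($images);
```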
This will return something like: