I'm learning RegEx and site crawling, and have the following question which, if answered, should speed my learning process up significantly.
I have fetched the form element from a web site in htmlencoded format. That is to say, I have the $content string with all the tags intact, like so:
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
...
</select>
</form>';
I would like to fetch all the options on the site, in this manner:
array("One Town" => "one", "Another Town" => "two", "Yet Another Town" => "three" ...);
Now, I know this can easily be done by manipulating the string, slicing and dicing it, searching for substrings within each string, and so on, until I have everything I need. But I'm certain there must be a simpler way of doing it with regex, which should fetch all the results from a given string in one go. Can anyone help me find a shortcut for this? I have searched the web's finest regex sites, but to no avail.
Many thanks
See Best methods to parse HTML. Find the DOM solution below:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://example.com');
$options = array();
foreach ($dom->getElementsByTagName('option') as $option) {
    $options[$option->nodeValue] = $option->getAttribute('value');
}
This can be done with regex too, but I don't find it practical to write a reliable HTML parser with regex when there are plenty of native and third-party parsers readily available for PHP.
I think it would be easier to use DOMXPath rather than regular expressions for this.
You could try something like this (not tested so might need some tweaks)...
<?php
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';
$doc = new DOMDocument;
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$options = $xpath->evaluate("/html/body//option");
$values = array(); // collects the value attribute of every <option>
for ($i = 0; $i < $options->length; $i++) {
    $option = $options->item($i);
    $values[] = $option->getAttribute('value');
}
var_dump($values);
?>
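The snippet above collects only the value attributes. A minimal sketch of how the same DOMXPath approach could build the label => value array asked for in the question follows. It assumes the select is named "city" as in the sample markup; the trim() call and the libxml_use_internal_errors() call are additions for messy real-world pages, not part of the original answer.

```php
<?php
// Sample markup from the question; in practice $content comes from the crawler.
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

// Assumption: silence libxml warnings, since crawled HTML is rarely clean.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($content);

$xpath = new DOMXPath($doc);
$options = array();
// Restrict the query to the "city" select so other dropdowns are ignored.
foreach ($xpath->query('//select[@name="city"]/option') as $option) {
    // Key: visible label, value: the option's value attribute.
    $options[trim($option->nodeValue)] = $option->getAttribute('value');
}
print_r($options);
```

Note that two options with the same visible label would overwrite each other, because the label is used as the array key.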
<?php
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';
preg_match_all('@<option value="([^"]*)">(.*?)</option>@', $content, $matches);
echo "<pre>";
print_r($matches);
?>
Now $matches contains the arrays you are looking for ($matches[1] holds the values and $matches[2] the labels), and you can easily process them into the result you want.
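One possible way to do that processing step, sketched here with array_combine(), is to pair the captured labels with the captured values. The pattern below is a slightly stricter variant (no greedy .*), which is my assumption about what works best on this sample and not a general HTML matcher.

```php
<?php
// Sample markup from the question.
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

// Group 1 captures the value attribute, group 2 the visible label.
preg_match_all('@<option value="([^"]*)">(.*?)</option>@', $content, $matches);

// Labels become the keys, value attributes become the values.
$options = array_combine($matches[2], $matches[1]);
print_r($options);
```

As with any array keyed by label, duplicate labels collapse into a single entry.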
Using SimpleXML:
libxml_use_internal_errors(true);
$load = simplexml_load_string($content);
foreach ($load->xpath('//select/option') as $path) {
    var_dump((string)$path[0]);
}
If the HTML is really consistent, then a simple regex will do (use preg_match_all, since preg_match stops at the first option):
preg_match_all('/<option\s+value="([^">]+)">([^<]+)/i', ...
However it's often simpler and more reliable to use phpQuery or QueryPath.
$options = qp($html)->find("select[name=city]")->find("option");
foreach ($options as $o) {
$result[ $o->attr("value") ] = $o->text();
}
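For comparison, the regex route mentioned at the start of this answer could be completed roughly as follows. This is only a sketch: the elided arguments of the original call are not known, so $content, $matches, and the PREG_SET_ORDER flag are my assumptions, and the result is keyed by label as the question requested.

```php
<?php
// Sample markup from the question.
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

$result = array();
// PREG_SET_ORDER yields one entry per <option>: [full match, value, label].
if (preg_match_all('/<option\s+value="([^">]+)">([^<]+)/i', $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $result[trim($m[2])] = $m[1];
    }
}
print_r($result);
```

This only holds up while every option is written exactly as `<option value="...">`; extra attributes, single quotes, or a `selected` flag would need a parser or a looser pattern.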