I am trying to extract the phone number and the address from this site using Nokogiri. Both of them sit between <br> tags. How can I do this?
In case the site is down, here is an excerpt of some of the HTML from which I wish to extract the phone number and address:
<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Alana's Cafe</strong><br>
<em>Cafe/Desserts </em>
<br>
650 348-0417
<br>
1408 Burlingame Ave
<br>
<a href="http://www.alanascafe.com/burlingame.html" target="_blank">http://www.alanascafe.com/burlingame.html</a>
</td><td align="right">
<a href="index.cfm?vid=44885" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>
<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Amber Moon Indian Restaurant and Bar</strong><br>
<em>Indian </em>
<br>
1425 Burlingame Ave
</td><td align="right">
<a href="index.cfm?vid=44872" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>
Simplest would be something like:
That means: for each <em>, map the text after each following sibling <br> element.
Update
To sort that into phone / address you could do:
Code
How It Works
Since the HTML is not semantically marked up, the first challenge is finding just the entries with the addresses. Viewing the source, we know that they are in <td> on the page, so we start with that:
    //td
- Find <td> anywhere in the document...
However, this page is chock full of bad markup, and so we need to limit our search to just the correct table cells. In this case, the <strong> and <em> tags are used consistently in every entry, and do not appear in any other cell that isn't desired:
    //td[strong][em]
- ...but ensure that the <td> has at least one <strong> and at least one <em> child element...
Now, we want the text after the third <br> element, so first we select just the third <br> child of each matching <td>:
    //td[strong][em]/br[3]
- ...then find the child <br> elements, and pick only the third...
And then we get the first text node that follows this <br>:
    //td[strong][em]/br[3]/following-sibling::text()[1]
- ...find all later sibling text nodes of the <br>, and pick only the first.
This leaves us with an array of Nokogiri::XML::Text instances, and so we map this array to be the string text of each, and finally we map that array to one that has stripped off any leading and trailing whitespace. This is not the fastest way to do it, but it is both terse and clear, and speedy enough.
Doing something similar for the phone number is left as an exercise for the reader.
Edit: Here's a variation that is slightly more robust, enough so to handle the entries that have no phone number: