Extracting between <br> tags with Nokogiri?

Posted 2020-05-10 00:13

Question:

I am trying to extract the phone number and the address from this site using Nokogiri. Both of them are between <br> tags. How can I do this?


In case the site is down, here is an excerpt of some of the HTML from which I wish to extract the phone number and address:

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Alana's Cafe</strong><br>
<em>Cafe/Desserts </em>
<br>
650 348-0417
<br>
1408 Burlingame Ave
<br>
<a href="http://www.alanascafe.com/burlingame.html" target="_blank">http://www.alanascafe.com/burlingame.html</a>

</td><td align="right">
<a href="index.cfm?vid=44885" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Amber Moon Indian Restaurant and Bar</strong><br>
<em>Indian </em>

<br>
1425 Burlingame Ave


</td><td align="right">
<a href="index.cfm?vid=44872" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

Answer 1:

Simplest would be something like:

data = doc.search('em').map{|em| em.search('~ br').map{|br| br.next.text.strip}}
#=> [["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"], etc...

That is: for each <em>, collect the text of the node that immediately follows each of its later sibling <br> elements.

Update

To sort that into phone / address you could do:

data.map{|row| {:phone => row[0][/^[\d \(\)-]+$/] ? row.shift : nil, :address => row.shift}}
#=> [{:phone=>"650 348-0417", :address=>"1408 Burlingame Ave"}, etc...
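The same sorting step can be tried without Nokogiri at all. The sketch below uses hand-written sample rows (an assumption standing in for `data`) and a named `PHONE` pattern, which is a hypothetical constant introduced here for readability. It also works on a copy of each row, so the input is not mutated by `shift`.

```ruby
# Hypothetical constant: a line counts as a phone number when it contains
# only digits, spaces, parentheses, and hyphens.
PHONE = /\A[\d ()-]+\z/

# Sample rows shaped like the output of the extraction step (assumed data).
rows = [
  ["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"],
  ["1425 Burlingame Ave"]
]

records = rows.map do |row|
  row = row.dup # leave the caller's arrays untouched
  # Take the first item as the phone only if it looks like one...
  phone = row.first.to_s.match?(PHONE) ? row.shift : nil
  # ...and whatever is now first is treated as the address.
  { phone: phone, address: row.shift }
end

p records
```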


Answer 2:

Code

require 'nokogiri'
require 'open-uri'

# Kernel#open no longer fetches URLs as of Ruby 3.0; open-uri provides URI.open instead.
doc = Nokogiri.HTML(URI.open('http://map.burlingamedowntown.org/textdir.cfm?p=1213'))
addresses = doc.xpath('//td[strong][em]/br[3]/following-sibling::text()[1]')
p addresses.map(&:text).map(&:strip)
#=> ["1408 Burlingame Ave", "347 Primrose Rd", "305 California Dr", "1409 Burlingame Avenue", "260 Lorton Ave", "1219 Burlingame Avenue", "1108 Burlingame Avenue", "1212 Donnelly Ave", "1243 Howard Ave", "283 Lorton Avenue", "245 California Drive", "1107 Howard Ave", "1300 Howard Ave", "1216 Burlingame Avenue", "1310 Burlingame Ave", "322 Lorton Avenue", "203 Primrose Dr", "1125 Burlingame Avenue", "327 Lorton Avenue", "1451 Burlingame Ave", "221 Primrose Rd", "1101 Burlingame Ave", "", "1123 Burlingame Avenue", "1407 Burlingame Ave", "1318 Burlingame Avenue", "1213 Burlingame Avenue", "231 Park Road", "246 Lorton Ave", "1453 Burlingame Ave", "1309 Burlingame Avenue", "321 Primrose Road", "", "209 Park Road", "1207 Burlingame Avenue", "1090 Burlingame Avenue", "1223 Donnelly Ave", "243 California Dr", "1080 Howard Ave", "270 Lorton Ave", "1447 Burlingame Ave", "361 California Drive", "1160 Burlingame Avenue", "333 California Drive", "401 Primrose Road", "1100 Burlingame Avenue", "1100Howard Ave #D", "1309 Burlingame Avenue", "220 Lorton Ave", "", "1101 Howard Avenue", "266 Lorton Avenue", "240 Park Rd", "1118 Burlingame Ave", "221 Park Road", "1400 Howard Ave", "225 Primrose Road", "248 Lorton Avenue"]

How It Works

Since the HTML is not semantically marked up, the first challenge is finding just the entries with the addresses. Viewing the source, we know that they are in <td> elements on the page, so we start with that:

  • //td - Find <td> anywhere in the document...

However, this page is chock full of bad markup, and so we need to limit our search to just the correct table cells. In this case, the <strong> and <em> tags are used consistently in every entry, and do not appear in any of the cells we want to skip:

  • //td[strong][em] - ...but ensure that the <td> has at least one <strong> and at least one <em> child element...

Now, we want the text after the third <br> element, so first we select just the third <br> child of each matching <td>:

  • //td[strong][em]/br[3] - ... then find the child <br> elements, and pick only the third...

And then we get the first text node that follows this <br>:

  • //td[strong][em]/br[3]/following-sibling::text()[1] - ...find all later sibling text nodes of the <br>, and pick only the first.

This leaves us with an array of Nokogiri::XML::Text instances, and so we map this array to be the string text of each, and finally we map that array to one that has stripped off any leading and trailing whitespace. This is not the fastest way to do it, but it is both terse and clear, and speedy enough.

Doing something similar for the phone number is left as an exercise for the reader.


Edit: Here's a variation that is slightly more robust, enough so to handle the entries that have no phone number:

# Make all the `<br>` be real "\r\n".
doc.xpath('//td[strong][em]/br').each{ |br| br.replace("\r\n") }

# Get the text inside each entry
entries = doc.xpath('//td[strong][em]').map(&:text)

# Change the multi-line string into an array of lines
entries = entries.map{ |entry| entry.strip.split(/(?:\r\n)+/).map(&:strip) }

# Find the first line in each that has no letters in it
phones = entries.map{ |entry_lines| entry_lines.grep(/^[^a-z]+$/i).first }

# Find the first line in each that has a string of digits followed by a space and a letter
addresses = entries.map{ |entry_lines| entry_lines.grep(/\d+ [a-z]/i).first }

# Zip and iterate them together
phones.zip(addresses).each do |phone,address|
  puts "For %s call %s" % [address,phone || "-"]
end

#=> For 1408 Burlingame Ave call 650 348-0417
#=> For 1425 Burlingame Ave call -
#=> For 347 Primrose Rd call 650-548-0300
#=> For 305 California Dr call 650 340-8642
#=> For 1409 Burlingame Avenue call 650 348-1204
#=> ...
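The line-classification part of this variation can be exercised on its own, with no parsing involved. In the sketch below, the `entries` strings are hand-written sample data that imitate cell text after each <br> has become "\r\n"; they are an assumption, not output captured from the site.

```ruby
# Assumed sample data: cell text with "\r\n" where each <br> used to be,
# plus the stray newlines that the original markup leaves behind.
entries = [
  "\nAlana's Cafe\r\n\nCafe/Desserts \n\r\n\n650 348-0417\n\r\n\n1408 Burlingame Ave\n",
  "\nAmber Moon Indian Restaurant and Bar\r\n\nIndian \n\n\r\n\n1425 Burlingame Ave\n\n"
]

records = entries.map do |entry|
  # Split on runs of "\r\n" and trim each resulting line.
  lines = entry.strip.split(/(?:\r\n)+/).map(&:strip)
  {
    phone:   lines.grep(/^[^a-z]+$/i).first, # first line with no letters at all
    address: lines.grep(/\d+ [a-z]/i).first  # first line with digits, a space, then a letter
  }
end

records.each do |record|
  puts "For %s call %s" % [record[:address], record[:phone] || "-"]
end
```

Building one hash per entry keeps each phone number next to its address, which avoids the separate `zip` step.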