Extract data from HTML Table with mechanize

First of all, here is a sample row from the HTML table:

 <tr>
   <td><strong>Kangchenjunga </strong></td>
   <td>8,586m<br /></td>
   <td>28,169ft</td>
   <td><div align="center">Nepal/India </div></td>
   <td>1955; G. Band, J. Brown </td>
 </tr>

ARGV[0] will hold the name of a mountain (the first column), and the return value should be the last column: the people who climbed the mountain for the first time.

So I need to check whether a row's first column matches ARGV[0], and if it does, return the last column without the date.

require 'mechanize'
p=Mechanize.new.get('www.alpineascents.com/8000m-peaks.asp').body
if p.include?('<strong>'+ARGV[0])
   puts 'ok'
end

The code above prints "ok" if ARGV[0] appears in the body of the HTML document. How can I get the last column of the same row where ARGV[0] is found?

EXAMPLE :

<tr>
 <td><strong>GIVE THIS AS A PARAMETER </strong></td>
 <td>SKIP THIS<br /></td>
 <td>SKIP THIS</td>
 <td><div align="center">SKIP THIS</div></td>
 <td>I WANT IT TO RETURN THIS</td>
</tr>

I'm really new to Ruby.

Tags: html ruby-on-rails ruby parsing mechanize

3 answers

迷人小祖宗

#2 · 2019-02-20 02:14

I believe this is what you want (you will need to gem install nokogiri):

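require 'nokogiri'
require 'open-uri'

# URI.open is the open-uri entry point; plain open no longer fetches URLs on Ruby 3+
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# Take the rows of the seventh table on the page
rows = doc.search('//table')[6].search('tr')
# Drop the first two rows (presumably headings)
rows.shift
rows.shift

rows.each do |row|
  if row.text.include? ARGV[0]
    # The fifth cell holds "year; climbers"; remove everything up to the semicolon
    puts row.search('td')[4].text.gsub(/.*?;/, '').strip
  end
end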
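
As a side note, the gsub(/.*?;/, '') step above is what drops the year. Here is a minimal sketch of that step on its own, using the sample cell from the question. strip_date is a hypothetical helper name, and it swaps in an anchored sub so that only the leading "year;" part is removed even if a later semicolon appears:

```ruby
# Hypothetical helper (the name is not from the answer above).
# Removes everything up to and including the first semicolon,
# then trims the surrounding whitespace.
def strip_date(cell_text)
  cell_text.sub(/\A[^;]*;/, '').strip
end

puts strip_date('1955; G. Band, J. Brown ')
```

For a cell with no semicolon, the text comes back unchanged apart from the trimmed whitespace.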


爷、活的狠高调

#3 · 2019-02-20 02:25

A more succinct version, relying more on the black magic of XPath :)

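require 'nokogiri'
require 'open-uri'

# URI.open is the open-uri entry point; plain open no longer fetches URLs on Ruby 3+
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# Select the row whose <strong> text equals ARGV[0], then take its fifth cell
last_td = doc.search("//tr[td[strong[text()='#{ARGV[0]}']]]/td[5]")

puts last_td.text.gsub(/.*?;/, '').strip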


看我几分像从前

#4 · 2019-02-20 02:37

The first mistake that I see is that you are calling the following:

p=Mechanize.new.get('www.alpineascents.com/8000m-peaks.asp').body

Unfortunately, calling body on the Mechanize page just returns the raw HTML source of the page as a single string.

A raw string is quite annoying to parse, so I would recommend keeping the page object instead (note the http:// scheme in the URL):

p=Mechanize.new.get('http://www.alpineascents.com/8000m-peaks.asp')

This returns a Mechanize::Page object which you can play with (http://mechanize.rubyforge.org/Mechanize/Page.html).

With that object we can simply call search, which is Nokogiri's search, like this:

elems = p.search('tr')

This returns all the tr elements as a Nokogiri::XML::NodeSet of Nokogiri::XML::Element objects, which we can use pretty cleanly to get the information we want. You may want to play around with all of this in IRB to figure out exactly what you need, but the idea should be clear from the following:

elems.first.search('td').last.text

This returns the text of the final td element in the first tr element we found.

If you have any questions / want me to clarify feel free to ask away.

I have been hacking on things with mechanize for a long while now.

EDIT:

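If you want to look up the values by some argument, this is how I imagine you would solve the problem:

values = {}
elems.each do |e|
  td = e.search('td')
  # Skip rows without any td cells (e.g. header rows)
  next if td.empty?
  # Strip whitespace so keys match plain names such as "Everest"
  values[td.first.text.strip] = td.last.text.strip
end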

When you have the values hash filled you can do the following:

If ARGV[0] == "Everest", then

values[ARGV[0]] # => "1953; Sir E. Hillary, T. Norgay"

