Extract single string from HTML using Ruby/Mechanize

Posted 2019-01-12 04:41

I am extracting data from a forum. My Mechanize-based script is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, and I cannot get it to work. I used FireXPath to determine the XPath.

Sample code:

    require 'rubygems'
    require 'mechanize'

    post_agent = WWW::Mechanize.new
    post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
    puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page.parser.xpath('//[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')

All my attempts end with an empty string or an error.


I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:

After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.

But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.

2 Answers
迷人小祖宗
#2 · 2019-01-12 05:19

Radek. I'm going to show you how to fish.

When you call Mechanize::Page#parser, it gives you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:

puts  post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip

start with this:

puts post_page.parser.xpath('//table').to_html

This gets any tables, anywhere, and then prints them as HTML. Examine the HTML to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has CSS class "userdata", then try this:

puts post_page.parser.xpath("//table[@class='userdata']").to_html

Any time you get back nothing (an empty result instead of an array of nodes), you goofed up the xpath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:

puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html

If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.

And that's how you do it.

Viruses.
#3 · 2019-01-12 05:36

I think you copied this from Firebug. Firebug shows an extra tbody, which might not be there in the actual source, so my suggestion is to remove that tbody and try again. If it still doesn't work, follow Wayne Conrad's process; that's the best!
