How to read someone else's forum

Posted 2019-04-01 23:55

Question:

My friend has a forum, which is full of posts containing information. Sometimes she wants to review the posts in her forum, and come to conclusions. At the moment she reviews posts by clicking through her forum, and generates a not necessarily accurate picture of the data (in her brain) from which she makes conclusions. My thought today was that I could probably bang out a quick Ruby script that would parse the necessary HTML to give her a real idea of what the data is saying.

I am using Ruby's net/http library for the first time today, and I have encountered a problem. While my browser has no trouble viewing my friend's forum, it seems that the method Net::HTTP.new("forumname.net") produces the following error:

No connection could be made because the target machine actively refused it. - connect(2)

Googling that error, I have learned that it has to do with MySQL (or something like that) not wanting nosy guys like me remotely poking around in there: for security reasons. This makes sense to me, but it makes me wonder: how is it that my browser gets to poke around on my friend's forum, but my little Ruby script gets no poking rights. Is there some way for my script to tell the server that it is not a threat? That I only want reading rights and not writing rights?

Thanks guys,

z.

Answer 1:

Scraping a web site? Use mechanize:

#!/usr/bin/env ruby

require 'rubygems'   # only needed on Ruby 1.8
require 'mechanize'

# WWW::Mechanize was renamed to Mechanize in mechanize 1.0.
agent = Mechanize.new
page = agent.get("http://xkcd.com")
# Follow links by their visible text, as a browser user would.
page = page.link_with(:text => 'Forums').click
page = page.link_with(:text => 'Mathematics').click
page = page.link_with(:text => 'Math Books').click
#puts page.parser.to_html    # If you want to see the html you just got
# page.parser is the Nokogiri document; each post sits in a div with class "postbody".
posts = page.parser.xpath("//div[@class='postbody']")
posts.each do |post|
  title = post.at_xpath('h3//text()').to_s
  author = post.at_xpath("p[@class='author']//a//text()").to_s
  # Join every text node of the post content, one per line.
  body = post.xpath("div[@class='content']//text()").map(&:to_s).join("\n")
  puts '-' * 40
  puts "title: #{title}"
  puts "author: #{author}"
  puts "body:", body
end

The first part of the output:

----------------------------------------
title: Math Books
author: Cleverbeans
body:
This is now the official thread for questions about math books at any level, from high school through advanced college courses.
I'm looking for a good vector calculus text to brush up on what I've forgotten. We used Stewart's Multivariable Calculus as a baseline but I was unable to purchase the text for financial reasons at the time. I figured some things may have changed in the last 12 years, so if anyone can suggest some good texts on this subject I'd appreciate it.
----------------------------------------
title: Re: Multivariable Calculus Text?
author: ThomasS
body:
The textbooks go up in price and new pretty pictures appear. However, Calculus really hasn't changed all that much.
If you don't mind a certain lack of pretty pictures, you might try something like Widder's Advanced Calculus from Dover. it is much easier to carry around than Stewart. It is also written in a style that a mathematician might consider normal. If you think that you might want to move on to real math at some point, it might serve as an introduction to the associated style of writing.
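
Once each post is reduced to a title, author, and body like this, summarising the forum is ordinary Ruby. A minimal sketch, assuming the scraped posts were collected into an array of hashes; `SAMPLE_POSTS` and `posts_per_author` are illustrative names (not part of mechanize), and the sample records are just the two posts from the output above with their bodies shortened to the first sentence:

```ruby
# Hypothetical records shaped like the title/author/body fields printed above.
SAMPLE_POSTS = [
  { title: 'Math Books',
    author: 'Cleverbeans',
    body: "I'm looking for a good vector calculus text to brush up on what I've forgotten." },
  { title: 'Re: Multivariable Calculus Text?',
    author: 'ThomasS',
    body: 'The textbooks go up in price and new pretty pictures appear.' }
].freeze

# Count how many posts each author wrote.
def posts_per_author(posts)
  counts = Hash.new(0)
  posts.each { |post| counts[post[:author]] += 1 }
  counts
end

# Print authors, most active first.
posts_per_author(SAMPLE_POSTS).sort_by { |_author, count| -count }.each do |author, count|
  puts "#{author}: #{count}"
end
```

In a real run the hashes would be built inside the loop of the script above instead of being written by hand.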


Answer 2:

Some sites can only be accessed with the "www" subdomain, so that may be causing the problem.
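
One way to check that guess is to try both forms of the host name and see which one accepts a connection. A rough sketch: `candidate_hosts` and `first_reachable_host` are made-up helper names, `forumname.net` is the placeholder from the question, and the 5-second timeout is an arbitrary choice.

```ruby
require 'net/http'

# Return the host as given, followed by its counterpart with or without "www.".
def candidate_hosts(host)
  alternate = host.start_with?('www.') ? host.sub(/\Awww\./, '') : "www.#{host}"
  [host, alternate]
end

# Return the first candidate that accepts an HTTP connection, or nil if none does.
def first_reachable_host(host, port = 80)
  candidate_hosts(host).find do |candidate|
    begin
      Net::HTTP.start(candidate, port, open_timeout: 5) { true }
    rescue SystemCallError, SocketError, Net::OpenTimeout
      # Covers refused connections, DNS failures, and timeouts.
      false
    end
  end
end

# Example (makes real network connections, so it is left commented out):
# puts first_reachable_host('forumname.net').inspect
```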

To make a GET request, build a Net::HTTP::Get object and send it over a connection:

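require 'net/http'
require 'uri'

url = URI.parse('http://www.forum.site/')
# request_uri keeps any query string and is "/" rather than empty for a bare host
req = Net::HTTP::Get.new(url.request_uri)
# Open the connection, send the request, and close the connection again.
res = Net::HTTP.start(url.host, url.port) { |http|
  http.request(req)
}
puts res.body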

You might also need to set the User-Agent header, which is passed as the second argument to Net::HTTP::Get.new:

req = Net::HTTP::Get.new(url.request_uri,
  {'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'})
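
Putting the pieces of this answer together, here is a hedged sketch of a fetch helper that sends a browser-like User-Agent and reports a refused connection instead of crashing. `build_get`, `fetch_body`, and the short `USER_AGENT` string are illustrative choices rather than anything the forum requires, and `www.forum.site` is the same placeholder host as above.

```ruby
require 'net/http'
require 'uri'

# Hypothetical User-Agent value; substitute whatever the site accepts.
USER_AGENT = 'Mozilla/5.0 (compatible; forum-reader)'

# Build a GET request for the URI with the User-Agent header set.
def build_get(uri, user_agent = USER_AGENT)
  Net::HTTP::Get.new(uri.request_uri, { 'User-Agent' => user_agent })
end

# Fetch a page body; return nil on a non-2xx response or a failed connection.
def fetch_body(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_get(uri))
  end
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
rescue Errno::ECONNREFUSED, SocketError => e
  warn "Could not connect to #{url}: #{e.message}"
  nil
end

# Example (makes a real network request, so it is left commented out):
# puts fetch_body('http://www.forum.site/')
```

Returning nil keeps the caller simple; a script that needs to distinguish "refused" from "not found" could return the response object or re-raise instead.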