Question:

Is there a web crawler library available for PHP or Ruby? I'm looking for a library that can crawl depth-first or breadth-first, and that handles links correctly even when they are relative (e.g. href="../relative_path.html") or when a base URL is used.

Answer 1:

For a Ruby library, check out Mechanize.

Note that you would still be responsible for how your crawler traverses sites.
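To illustrate what "being responsible for the traversal" might look like, here is a minimal sketch using only Ruby's standard library. It does not use Mechanize and does not touch the network: the `PAGES` hash is a made-up in-memory site, and `resolve` and `crawl` are hypothetical helper names. Relative hrefs such as `../about.html` are resolved with `URI.join` against the URL of the page they were found on, and the only difference between the breadth-first and depth-first order is whether the frontier is used as a queue or as a stack.

```ruby
require "uri"
require "set"

# Hypothetical in-memory "site": page URL => hrefs found on that page.
# A real crawler would fetch and parse each page instead.
PAGES = {
  "http://example.com/docs/index.html" => ["a.html", "../about.html"],
  "http://example.com/docs/a.html"     => ["b.html", "index.html"],
  "http://example.com/about.html"      => ["/docs/b.html"],
  "http://example.com/docs/b.html"     => []
}.freeze

# Resolve an href (relative or absolute) against the page it appeared on.
def resolve(base, href)
  URI.join(base, href).to_s
end

# Visit every reachable page once and return the visit order.
# strategy: :bfs uses the frontier as a queue, :dfs uses it as a stack.
def crawl(start, strategy: :bfs)
  frontier = [start]
  seen = Set.new([start])
  order = []

  until frontier.empty?
    url = strategy == :bfs ? frontier.shift : frontier.pop
    order << url

    PAGES.fetch(url, []).each do |href|
      link = resolve(url, href)
      next if seen.include?(link) # skip pages already queued or visited
      seen << link
      frontier << link
    end
  end

  order
end

start = "http://example.com/docs/index.html"
puts crawl(start, strategy: :bfs)
puts "---"
puts crawl(start, strategy: :dfs)
```

If a page declares a base URL, the same idea should apply: resolve hrefs against that base rather than against the page's own URL.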

Answer 2:

For PHP, see PHPCrawl: http://phpcrawl.cuab.de/

Answer 3:

You can use Webrat or Watir in Ruby; I find both much easier than Mechanize.

Answer 4:

If you'd like to learn the basics of web crawling and search, you can start by looking at "luna engine".

Answer 5:

If you need to scrape web pages that use JavaScript, you can use Capybara with a driver that spins up a real browser, such as Poltergeist. It's usually used with a testing framework for acceptance testing, but it can also be used outside of one.