Is there a web crawler library available for PHP or Ruby? I need a library that can crawl depth-first or breadth-first and that handles links correctly even when they are relative (href="../relative_path.html") or when a base URL is used.
Question:
Answer 1:
For a Ruby library, check out Ruby Mechanize.
Note that you are still responsible for how your crawler traverses sites.
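Since the traversal is left to you, here is a minimal sketch of what that part could look like, using only Ruby's standard library. It is not Mechanize's API: the `PAGES` hash, `extract_links`, and `crawl` are hypothetical names made up for illustration, the in-memory "site" stands in for real HTTP fetching, and the regex-based link extraction is a simplification (a real crawler would use an HTML parser). Relative links such as `../b/page.html` are resolved with `URI.join`, against `<base href>` when the page declares one.

```ruby
require "uri"
require "set"

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
  "http://example.com/a/index.html" => '<a href="../b/page.html">B</a> <a href="c.html">C</a>',
  "http://example.com/b/page.html"  => '<base href="http://example.com/d/"><a href="e.html">E</a>',
  "http://example.com/a/c.html"     => '<a href="index.html">back</a>',
  "http://example.com/d/e.html"     => ""
}.freeze

# Resolve every href on a page against <base href> if present,
# otherwise against the URL of the page itself.
def extract_links(page_url, html)
  base = html[/<base\s+href="([^"]+)"/i, 1] || page_url
  html.scan(/<a\s[^>]*href="([^"]+)"/i).flatten.map do |href|
    URI.join(base, href).to_s
  end
end

# :breadth_first takes the next URL from the front of the frontier (queue),
# :depth_first takes it from the back (stack).
def crawl(start_url, strategy: :breadth_first, fetch: ->(url) { PAGES[url] })
  frontier = [start_url]
  seen = Set[start_url]
  order = []
  until frontier.empty?
    url = strategy == :breadth_first ? frontier.shift : frontier.pop
    html = fetch.call(url)
    next if html.nil? # page could not be fetched
    order << url
    extract_links(url, html).each do |link|
      # Set#add? returns nil for URLs already seen, so each page is queued once.
      frontier << link if seen.add?(link)
    end
  end
  order
end
```

Calling `crawl("http://example.com/a/index.html")` visits pages in breadth-first order; passing `strategy: :depth_first` switches to a stack-based depth-first order. To crawl a real site you would pass your own `fetch:` lambda that downloads the page body and returns nil on failure.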
Answer 2:
http://phpcrawl.cuab.de/
Answer 3:
You can use Webrat or Watir in Ruby; both are much easier than Mechanize.
Answer 4:
If you'd like to learn the basics of web crawling and search, you can start by looking at "luna engine".
Answer 5:
If you need to scrape web pages that use JavaScript, you can use Capybara with a driver that spins up a real browser, such as Poltergeist. It's usually used with a testing framework for acceptance testing, but it can also be used outside a testing framework.