Is there a web crawler library available for PHP or Ruby? I'm looking for one that can crawl depth-first or breadth-first and that resolves links correctly even when they are relative (e.g. href="../relative_path.html") or when the page sets a base URL.
If you need to scrape web pages that rely on JavaScript, you can use Capybara with a driver that spins up a real (headless) browser, such as Poltergeist. It's usually used with a testing framework for acceptance testing, but it can also be used on its own.
If you'd like to learn the basics of web crawling and search, you can start by looking at "luna engine".
For PHP, have a look at PHPCrawl: http://phpcrawl.cuab.de/
For a Ruby library, check out Mechanize.
Keep in mind that you would still be responsible for how your crawler traverses sites.
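To make the traversal part concrete, here is a minimal sketch in plain Ruby (standard library only) of a crawl loop that can run breadth-first or depth-first and that resolves relative links, including against a `<base href>`. The names `extract_links`, `crawl`, the `fetch:` callable and `max_pages` are all made up for this illustration and are not part of Mechanize or any other library mentioned here. Links are pulled out with a regex to keep the example short; a real crawler would more likely use a proper HTML parser.

```ruby
require "uri"
require "set"

# Return absolute http(s) URLs for every <a href> in `html`.
# Relative hrefs are resolved against <base href> if the page has one,
# otherwise against the URL the page was fetched from.
# NOTE: regex extraction is a simplification made for this sketch.
def extract_links(html, page_url)
  base = html[/<base\b[^>]*\bhref=["']([^"']+)["']/i, 1]
  base_url = base ? URI.join(page_url, base).to_s : page_url

  html.scan(/<a\b[^>]*\bhref=["']([^"'#]+)/i).flatten.filter_map do |href|
    begin
      uri = URI.join(base_url, href)       # handles "../x.html", "/x", "x.html"
      uri.to_s if uri.is_a?(URI::HTTP)     # URI::HTTPS is a subclass, so https passes too
    rescue URI::Error
      nil                                  # skip hrefs that cannot be parsed
    end
  end
end

# Visit pages starting at `start_url` and return the URLs in visit order.
# `fetch` is any callable that takes a URL and returns the HTML (or nil),
# so the traversal logic stays independent of the HTTP client.
def crawl(start_url, fetch:, strategy: :breadth_first, max_pages: 100)
  frontier = [start_url]
  seen = Set[start_url]
  visited = []

  until frontier.empty? || visited.size >= max_pages
    # Queue (shift) gives breadth-first, stack (pop) gives depth-first.
    url = strategy == :depth_first ? frontier.pop : frontier.shift
    html = fetch.call(url)
    next if html.nil?

    visited << url
    links = extract_links(html, url)
    # Reverse for the stack so the first link on the page is followed first.
    links = links.reverse if strategy == :depth_first
    links.each { |link| frontier << link if seen.add?(link) }
  end

  visited
end
```

For a live site, one option would be to pass something like `fetch: ->(url) { Net::HTTP.get(URI(url)) }` (after `require "net/http"`). Restricting the crawl to one host, throttling requests and error handling are left out and would be up to you.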
You can also go for Webrat or Watir in Ruby; both are much easier to use than Mechanize.