DRY search every page of a site with nokogiri

Posted 2019-04-10 13:49

Question:

I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.

So it starts very easily:

require 'nokogiri'
require 'open-uri'
page = 'http://example.com'
nf = Nokogiri::HTML(open(page))

links = nf.xpath '//a' #find all links on current page

main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq 

"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).

From here I can feed those links back into the same kind of code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:

main_links.each do |ml| 
visited_links = [] #new array of what is visited
np = Nokogiri::HTML(open(page + ml)) #load the first main_link
visited_links.push(ml) #push the page we're on
np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
main_links.push(np_links).compact.uniq #remove duplicates after pushing?
end

I'm still working out this last bit... but does this seem like the proper approach?

Thanks.

Answer 1:

Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:

"[…] but I don't know the best way to ensure I don't repeat myself"

Recursion is the key here. Something like the following code:

require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'

def crawl_site( starting_at, &each_page )
  files = %w[png jpeg jpg gif svg txt js css zip gz]
  starting_uri = URI.parse(starting_at)
  seen_pages = Set.new                      # Keep track of what we've seen

  crawl_page = ->(page_uri) do              # A re-usable mini-function
    unless seen_pages.include?(page_uri)
      seen_pages << page_uri                # Record that we've seen this
      begin
        doc = Nokogiri.HTML(open(page_uri)) # Get the page
        each_page.call(doc,page_uri)        # Yield page and URI to the block

        # Find all the links on the page
        hrefs = doc.css('a[href]').map{ |a| a['href'] }

        # Make these URIs, throwing out problem ones like mailto:
        uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

        # Pare it down to only those pages that are on the same site
        uris.select!{ |uri| uri.host == starting_uri.host }

        # Throw out links to files (this could be more efficient with regex)
        uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

        # Remove #foo fragments so that sub-page links aren't differentiated
        uris.each{ |uri| uri.fragment = nil }

        # Recursively crawl the child URIs
        uris.each{ |uri| crawl_page.call(uri) }

      rescue OpenURI::HTTPError # Guard against 404s
        warn "Skipping invalid link #{page_uri}"
      end
    end
  end

  crawl_page.call( starting_uri )   # Kick it all off!
end

crawl_site('http://phrogz.net/') do |page,uri|
  # page here is a Nokogiri HTML document
  # uri is a URI instance with the address of the page
  puts uri
end

In short:

  • Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
  • Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.
  • Use recursion to keep crawling every link on every page, bailing out if you've already seen the page.


Answer 2:

You are missing some things.

A local reference can start with "/", but it can also start with ".", "..", or no special prefix at all, all of which resolve relative to the current page's directory.
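For example, here's a small sketch (the URLs are made up) of how Ruby's URI.join resolves each of those forms against the page they appear on:

require 'uri'

base = 'http://example.com/articles/index.html'

# Every one of these is an in-site link, but only the first starts with "/".
%w[/about ./part2.html ../images/logo.png page2.html].each do |href|
  puts URI.join(base, href)
end
# http://example.com/about
# http://example.com/articles/part2.html
# http://example.com/images/logo.png
# http://example.com/articles/page2.html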

JavaScript can also be used as a link, so you'll need to search throughout your document and find tags being used as buttons, then parse out the URL.
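If you do go down that road, a rough sketch might look like the following; the onclick markup and the regex here are assumptions about how one particular site might wire up its buttons, not a general solution:

require 'nokogiri'

html = %q{<button onclick="window.location='/reports/2019'">Reports</button>}
doc  = Nokogiri::HTML(html)

# Look for elements with inline onclick handlers and pull out a quoted path.
doc.css('*[onclick]').each do |node|
  path = node['onclick'][/['"](\/[^'"]+)['"]/, 1]
  puts path if path   # => /reports/2019
end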

This:

links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq 

can be better written:

nf.search('a[href^="/"]').map{ |a| a['href'] }.uniq

In general, don't do this:

....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq

because it is very awkward. The conditional inside the map produces nil entries in the resulting array, which you then have to strip out with compact. Instead, use select or reject to reduce the set to the links that meet your criteria, and then use map to transform them. In your case, pre-filtering with ^= in the CSS selector makes it even easier.
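When the filter can't be expressed in the selector itself, the select-then-map version of the question's code looks like this (reusing the links node set from the question):

main_links = links
  .select { |a| a['href'].to_s =~ /^\// }   # keep only hrefs that start with "/"
  .map { |a| a['href'] }                    # then turn the nodes into href strings
  .uniq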

Don't store the links in memory. You'll lose all progress if your code crashes or you stop it. Instead, at a minimum, use something like a SQLite database on disk as a data-store. Create an "href" field with a unique constraint to avoid repeatedly hitting the same page.
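A minimal sketch of that idea using the sqlite3 gem; the table layout and column names are just placeholders:

require 'sqlite3'

db = SQLite3::Database.new('crawler.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS pages (
    href    TEXT UNIQUE,        -- the unique constraint blocks duplicates
    fetched INTEGER DEFAULT 0   -- 0 = queued, 1 = already crawled
  )
SQL

# INSERT OR IGNORE silently skips hrefs that are already recorded.
db.execute('INSERT OR IGNORE INTO pages (href) VALUES (?)', ['http://example.com/about'])

# Pull the next page that still needs to be fetched, if any.
next_href = db.get_first_value('SELECT href FROM pages WHERE fetched = 0 LIMIT 1')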

Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work, and will do things the right way when you start encoding and decoding queries, normalizing parameters to check for uniqueness, extracting and manipulating paths, and so on.

Many sites use session IDs in the URL query string to identify the visitor. That ID can make every link look different if you stop and restart the crawl, or if you aren't returning the cookies the site sends you. So return the cookies, and figure out which query parameters are significant and which will throw off your code: keep the first and throw away the second when you store the links for later parsing.
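For example, here's a sketch of dropping a session parameter before storing a link; the "sessionid" name is invented, and which parameters are actually insignificant is something you have to work out per site:

require 'uri'

uri = URI.parse('http://example.com/page?sessionid=abc123&id=42')

if uri.query
  params = URI.decode_www_form(uri.query)
  params.reject! { |name, _| name == 'sessionid' }   # discard the throwaway parameter
  uri.query = params.empty? ? nil : URI.encode_www_form(params)
end

puts uri   # => http://example.com/page?id=42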

Use an HTTP client like Typhoeus with Hydra to retrieve multiple pages in parallel and store them in your database, with a separate process that parses them and feeds the URLs it finds back into the database. This can make a huge difference in your overall processing time.
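A bare-bones sketch of the fetching half with Typhoeus and Hydra; the URLs and the concurrency limit are arbitrary, and storage and parsing are left to separate code:

require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 10)
urls  = ['http://example.com/', 'http://example.com/about']   # would come from your database queue

urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    if response.success?
      # Hand response.body to your data-store for a separate parser to process.
      puts "#{url}: #{response.body.bytesize} bytes"
    else
      warn "Failed to fetch #{url} (HTTP #{response.code})"
    end
  end
  hydra.queue(request)
end

hydra.run   # runs every queued request, up to 10 at a time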

Honor the site's robots.txt file, and throttle your requests so you don't beat up their server. Nobody likes bandwidth hogs, and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed and then banned, at which point your crawler's throughput drops to zero.
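Here's a deliberately crude sketch of both ideas; a real crawler should use a proper robots.txt parser, since this one ignores per-user-agent groups and wildcard rules:

require 'open-uri'
require 'uri'

site = URI.parse('http://example.com/')

# Collect every Disallow path from robots.txt, regardless of user agent.
disallowed = []
begin
  URI.open(site.merge('/robots.txt')).read.each_line do |line|
    disallowed << $1 if line =~ /^Disallow:\s*(\S+)/i
  end
rescue OpenURI::HTTPError
  # No robots.txt, so nothing is explicitly disallowed.
end

def allowed?(uri, disallowed)
  disallowed.none? { |path| uri.path.start_with?(path) }
end

# Inside the crawl loop:
#   next unless allowed?(uri, disallowed)
#   sleep 1   # throttle: at most one request per second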



Answer 3:

It's a more complicated problem than you seem to realize. Using a library along with Nokogiri is probably the way to go. Unless you're using Windows (like me), you might want to look into Anemone.