How do I ignore file types in a web crawler?

Posted 2020-03-31 07:47

Question:

I'm writing a web crawler and want to ignore URLs which link to binary files:

$exclude = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml)

How can I check the URI against one of these endings?

@url = URI.parse(url)

should only be set if the URL doesn't end in any of the suffixes above.

Answer 1:

Use URI#path:

unless URI.parse(url).path =~ /\.(\w+)$/ && $exclude.include?($1)
  puts "downloading #{url}..."
end
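Note that the check above is case-sensitive, so a URL ending in `.JPG` would not be excluded. A minimal sketch of the same idea wrapped in a predicate is shown below; the helper name `skip_url?`, the constant `EXCLUDE`, and the shortened extension list are assumptions for illustration, not part of the original answer.

```ruby
require 'uri'

# Shortened list for illustration; the full list from the question works the same way.
EXCLUDE = %w[flv swf png jpg gif zip pdf exe].freeze

# Hypothetical helper: true when the URL's path ends in an excluded extension.
def skip_url?(url)
  # URI#path can be nil for some schemes (e.g. mailto:), hence to_s.
  path = URI.parse(url).path.to_s
  match = path.match(/\.(\w+)\z/)
  !match.nil? && EXCLUDE.include?(match[1].downcase)
rescue URI::InvalidURIError
  true # assumption: URLs that cannot be parsed are skipped, not fetched
end

puts "downloading..." unless skip_url?('http://foo.com/index.html')
```

Because only `URI#path` is inspected, a query string such as `?file=a.png` does not affect the result.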


Answer 2:

Ruby lacks a really useful module that Perl has, called Regexp::Assemble. Ruby's Regexp.union comes nowhere near it. Here's how to use Regexp::Assemble in Perl, and its result:

use Regexp::Assemble;

my @extensions = sort qw(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml);

my $ra = Regexp::Assemble->new;
$ra->add(@extensions);

print $ra->re, "\n";

Which outputs:

(?-xism:(?:m(?:p(?:[234]|e?g)|[1o]v|k[av]|3u)|a(?:s[fx]|iff|ac|c3|pe|vi)|p(?:p[st]|df|ng)|r(?:a[rw]|ss)|w(?:m[av]|av)|x(?:ls|ml|sd)|j(?:ar|pg|s)|d(?:oc|td)|g(?:if|z)|f[4l]v|bin|css|exe|ico|ogg|swf|tar|zip|7z))

Perl supports the s flag and Ruby doesn't (Ruby's m flag already makes . match newlines), so s needs to be taken out of ?-xism. We also want to ignore character case, so the i needs to be moved in front of the hyphen, resulting in ?i-xm.

Plug that into a Ruby script as the regular expression:

REGEX = /(?i-xm:(?:m(?:p(?:[234]|e?g)|[1o]v|k[av]|3u)|a(?:s[fx]|iff|ac|c3|pe|vi)|p(?:p[st]|df|ng)|r(?:a[rw]|ss)|w(?:m[av]|av)|x(?:ls|ml|sd)|j(?:ar|pg|s)|d(?:oc|td)|g(?:if|z)|f[4l]v|bin|css|exe|ico|ogg|swf|tar|zip|7z))/

@url = URI.parse(url)

puts @url.path[REGEX]

uri = URI.parse('http://foo.com/bar.jpg')
uri.path        # => "/bar.jpg"
uri.path[REGEX] # => "jpg"

See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more about using Regexp::Assemble from Ruby.
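One caveat: the pattern above is not anchored, so it can also match in the middle of a path, for example the `css` in `/css/index.html`. As a rough pure-Ruby sketch, `Regexp.union` can build a less compact alternation from the same list, which can then be anchored to a trailing extension. The constant name `EXTENSION_RE` and the shortened list are assumptions for illustration.

```ruby
require 'uri'

# Shortened list for illustration.
extensions = %w[flv swf png jpg gif css js zip pdf exe]

# Regexp.union escapes each string and joins them with "|".
# Require a literal dot before the extension and end-of-string after it,
# and ignore case.
EXTENSION_RE = /\.(?:#{Regexp.union(extensions).source})\z/i

URI.parse('http://foo.com/bar.JPG').path.match?(EXTENSION_RE)        # => true
URI.parse('http://foo.com/css/index.html').path.match?(EXTENSION_RE) # => false
```

The generated alternation is not optimized the way Regexp::Assemble's output is, but for a list of this size that is unlikely to matter.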



Answer 3:

You can strip off the URL's file extension with a regular expression or split (I've shown the latter here, but beware that this will also match some URLs that don't point at a file, such as http://foo.exe), then use Array#include? to check for membership:

@url = URI.parse(url) unless $exclude.include?(url.split('.').last)
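Splitting the raw URL string on `.` also looks at the host name and the query string, which can give false positives. The sketch below contrasts that with taking the extension from the parsed path using `File.extname`; the helper names `extension_by_split` and `extension_by_path` and the sample URLs are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: the approach shown above, on the raw string.
def extension_by_split(url)
  url.split('.').last
end

# Hypothetical helper: look only at the path component of the URL.
def extension_by_path(url)
  File.extname(URI.parse(url).path.to_s).delete_prefix('.')
end

urls = [
  'http://foo.com/page.html?img=a.png', # query string ends in "png"
  'http://foo.exe',                      # host only, no path
  'http://foo.com/movie.mp4'
]

urls.each do |url|
  puts "#{url}: split=#{extension_by_split(url).inspect} " \
       "path=#{extension_by_path(url).inspect}"
end
```

For the first two URLs the split-based version reports `png` and `exe`, while the path-based version reports `html` and an empty string; both agree on `mp4` for the last one.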