I'm using Scrapy to crawl a German forum: http://www.musikerboard.de/forum
It follows all subforums and extracts information from threads.
The problem: during crawling it gives me an error on multiple thread links:
2015-09-26 14:01:59 [scrapy] DEBUG: Ignoring response <404 http://www.musiker-board.de/threads/spotify-premium-paket.621224/%0A%09%09>: HTTP status code is not handled or not allowed
The URL is fine except for the trailing part /%0A%09%09
It gives a 404 error.
I don't know why the program keeps adding this code to the end of the URL.
Here's my code:
def urlfunc(value):
    value = value.replace("%0A", "")
    value = value.replace("%09", "")
    return value

class spidermider(CrawlSpider):
    name = 'memberspider'
    allowed_domains = ["musiker-board.de"]
    start_urls = ['http://www.musiker-board.de/forum/'
                  # 'http://www.musiker-board.de/'
                  ]  # urls from which the spider will start crawling
    rules = (
        Rule(LinkExtractor(allow=(r'forum/\w*',))),
        Rule(LinkExtractor(allow=(r'threads/\w+',), deny=(r'threads/\w+/[\W\d]+'), process_value=urlfunc), callback='parse_thread'),
    )
Does someone have an explanation for why this keeps happening (and a solution to it)?
EDIT: updated code
If you do some manual debugging and research, you will find that the values at the end of the URL are meta-characters: %0A is a line feed and %09 is a horizontal tab (see http://www.w3schools.com/tags/ref_urlencode.asp).
Then, if you enrich your urlfunc function with manual debug statements (and raise the log level to INFO to see the results better), you will see that the URLs do not end with these percent-sequences as literal text; the raw line feed and tabs are only converted to them when the URL is requested. This results in the following output:
The lines between the first result and the second one appear in the output because the values contain the meta-characters.
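To make the point about meta-characters concrete, here is a small sketch using only the Python standard library. It assumes the href in the page source ends with a newline and two tabs (the `raw_href` value is a made-up reconstruction based on the logged URL), and it uses `urllib.parse.quote` as a stand-in for whatever percent-encoding Scrapy applies before sending the request.

```python
from urllib.parse import quote, unquote

# Hypothetical href as it might appear in the page source: the URL is
# followed by a newline and two tabs before the closing quote.
raw_href = "http://www.musiker-board.de/threads/spotify-premium-paket.621224/\n\t\t"

# Percent-encoding turns the invisible characters into %0A and %09.
encoded = quote(raw_href, safe=":/")
print(encoded)

# Decoding the escapes gives back the control characters.
print(repr(unquote("%0A")), repr(unquote("%09")))

# The replace-based urlfunc searches for the literal text "%0A" / "%09",
# which is not present in the raw value, so it changes nothing.
unchanged = raw_href.replace("%0A", "").replace("%09", "")
print(unchanged == raw_href)
```

This would explain why the `replace` calls in the question have no effect: at the time `urlfunc` runs, the value still holds the raw whitespace rather than its encoded form.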
So the solution is to strip the values. In that case you no longer get debug messages telling you that the page was not found.
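A minimal sketch of what a stripping version of `urlfunc` could look like is shown here. It keeps the function name from the question; the `dirty` sample value is an assumption modelled on the logged URL.

```python
def urlfunc(value):
    """Remove leading and trailing whitespace (spaces, tabs, newlines)."""
    return value.strip()


# Assumed example of an extracted link value with trailing newline and tabs.
dirty = "http://www.musiker-board.de/threads/spotify-premium-paket.621224/\n\t\t"
clean = urlfunc(dirty)
print(repr(clean))
```

`str.strip()` with no arguments only removes whitespace at both ends, so characters inside the URL are left untouched.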
This may happen if whitespace and tabs are in the HTML code around the link.
You could clean the URL by using the process_value argument of LinkExtractor, for example with a function that strips whitespace from each extracted value.
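One way to wire such a cleaning function into the rules from the question is sketched below. The names `clean_url` and `build_rules` are hypothetical, the regular expressions are copied from the question, and the import paths assume Scrapy 1.0 or later. Scrapy is imported inside `build_rules` only so that the helper can be tried on its own.

```python
def clean_url(value):
    # Hypothetical helper: drops surrounding spaces, tabs and newlines.
    return value.strip()


def build_rules():
    # Imported lazily so clean_url can be tested without Scrapy installed.
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule

    return (
        # Follow subforum links, cleaning each extracted value first.
        Rule(LinkExtractor(allow=(r'forum/\w*',), process_value=clean_url)),
        # Parse thread pages; patterns are taken from the question.
        Rule(
            LinkExtractor(
                allow=(r'threads/\w+',),
                deny=(r'threads/\w+/[\W\d]+',),
                process_value=clean_url,
            ),
            callback='parse_thread',
        ),
    )
```

In the spider class this would be used as `rules = build_rules()`. Applying the same `process_value` to both extractors is a design choice here, on the assumption that forum links may carry the same stray whitespace as thread links.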