I'm using Scrapy to crawl a German forum: http://www.musiker-board.de/forum
It follows all subforums and extracts information from threads.
The problem: during crawling it gives me an error on multiple thread links:
2015-09-26 14:01:59 [scrapy] DEBUG: Ignoring response <404 http://www.musiker-board.de/threads/spotify-premium-paket.621224/%0A%09%09>: HTTP status code is not handled or not allowed
The URL is fine except for this trailing part: /%0A%09%09
It gives a 404 error.
I don't know why the program keeps appending these characters to the end of the URL.
Here's my code:
def urlfunc(value):
    value = value.replace("%0A", "")
    value = value.replace("%09", "")
    return value
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class spidermider(CrawlSpider):
    name = 'memberspider'
    allowed_domains = ["musiker-board.de"]
    start_urls = ['http://www.musiker-board.de/forum/'
                  # 'http://www.musiker-board.de/'
                  ]  # urls from which the spider will start crawling

    rules = (
        Rule(LinkExtractor(allow=(r'forum/\w*',))),
        Rule(LinkExtractor(allow=(r'threads/\w+',), deny=(r'threads/\w+/[\W\d]+',),
                           process_value=urlfunc),
             callback='parse_thread'),
    )
Does someone have an explanation for why this keeps happening? (And a solution to it?)
EDIT:
updated code
If you do some manual debugging and research, you will find that the values at the end of the URL are percent-encoded meta-characters: %0A is a line feed and %09 is a horizontal tab (see http://www.w3schools.com/tags/ref_urlencode.asp).
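You can verify this yourself; a minimal sketch (assuming Python 2, matching the print statements used below) that percent-decodes the suspicious suffix:

import urllib

# Decoding the trailing part of the failing URL shows it is plain whitespace:
print repr(urllib.unquote('/%0A%09%09'))  # -> '/\n\t\t'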
If you then enrich your urlfunc function with debug statements (and raise the log level to INFO so the results are easier to see), you will notice that the URLs do not end with these characters as literal text; they contain raw whitespace characters that only get percent-encoded when the URL is requested.
def urlfunc(value):
    print 'original: ', value
    value = value.replace('%0A', '').replace('%09', '')
    print 'replaced: ', value
    return value
This results in the following output:
original: http://www.musiker-board.de/posts/7609325/
replaced: http://www.musiker-board.de/posts/7609325/
original: http://www.musiker-board.de/members/martin-hofmann.17/
replaced: http://www.musiker-board.de/members/martin-hofmann.17/
The lines between the first result and the second one show up in the output as blank lines because those URLs contain the meta-characters.
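To make the invisible characters visible yourself, you could print the repr() of the value inside urlfunc (a hypothetical debug variant, not part of the original code):

def urlfunc(value):
    # repr() shows the whitespace escaped, e.g. u'.../spotify-premium-paket.621224/\n\t\t'
    print repr(value)
    return value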
So the solution is to strip the values:

def urlfunc(value):
    return value.strip()
With this change you no longer get any debug messages telling you that the site was not found.
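strip() only removes leading and trailing whitespace; if whitespace could ever occur in the middle of an extracted URL as well (an assumption, not observed here), a slightly more defensive variant removes every whitespace character:

import re

def urlfunc(value):
    # remove all whitespace characters anywhere in the URL, not just at the ends
    return re.sub(r'\s+', '', value)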
This may happen if there is whitespace (line feeds and tabs) in the HTML source.
You could clean the URL by using the process_value argument of LinkExtractor and do something like:
...
Rule(LinkExtractor(allow=(r'threads/\w+',)), callback='parse_thread', process_value=clean_url)
...

def clean_url(value):
    value = value.replace(u'%0A', '')
    value = value.replace(u'%09', '')
    return value
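As the answer above notes, the extracted values contain raw whitespace rather than the literal text %0A/%09, so a variant that covers both cases (a sketch combining the two approaches, not from the original answer) would be:

def clean_url(value):
    # drop the percent-encoded text if it ever appears literally,
    # then strip the raw whitespace characters themselves
    value = value.replace(u'%0A', '').replace(u'%09', '')
    return value.strip()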