Scrapy: URL error, Program adds unnecessary charac

Posted 2019-09-14 15:40

Question:

I'm using Scrapy to crawl a German forum: http://www.musikerboard.de/forum

It follows all subforums and extracts information from threads.

The problem: during crawling it gives me an error on multiple thread links:

2015-09-26 14:01:59 [scrapy] DEBUG: Ignoring response <404 http://www.musiker-board.de/threads/spotify-premium-paket.621224/%0A%09%09>: HTTP status code is not handled or not allowed

The URL is fine except for this part /%0A%09%09

It gives a 404 error.

I don't know why the program keeps adding this code to the end of the URL.

Here's my code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def urlfunc(value):
    value = value.replace("%0A", "")
    value = value.replace("%09", "")
    return value


class spidermider(CrawlSpider):
    name = 'memberspider'
    allowed_domains = ["musiker-board.de"]
    start_urls = ['http://www.musiker-board.de/forum/'
                  # 'http://www.musiker-board.de/'
                  ]  # urls from which the spider will start crawling
    rules = (
        # Follow links to subforums.
        Rule(LinkExtractor(allow=(r'forum/\w*',))),
        # Parse thread pages, skipping links that match the deny pattern.
        Rule(LinkExtractor(allow=(r'threads/\w+',), deny=(r'threads/\w+/[\W\d]+',), process_value=urlfunc),
             callback='parse_thread'),
    )

Does anyone have an explanation for why this keeps happening (and a solution)?

EDIT: updated code

Answer 1:

A bit of manual debugging and research shows that the values at the end of the URL are percent-encoded whitespace characters: %0A is a line feed and %09 is a horizontal tab (see http://www.w3schools.com/tags/ref_urlencode.asp).
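To confirm what the trailing sequence stands for, the escapes can be decoded with Python's standard library. This is only a minimal sketch in Python 3; the variable names are illustrative and not part of the original spider.

```python
from urllib.parse import quote, unquote

# The suffix that appeared at the end of the failing URL in the log.
suffix = "%0A%09%09"

# unquote() turns percent-escapes back into the characters they stand for.
decoded = unquote(suffix)
print(repr(decoded))  # shows one line feed followed by two tabs

# quote() goes the other way: raw whitespace becomes percent-escapes again.
print(quote(decoded))
```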

If you then add debug print statements to your urlfunc function (and raise the log level to INFO so the output is easier to read), you will see that the extracted values do not contain %0A or %09 as text. They end with a literal line feed and tabs, which are percent-encoded only when the URL is requested, so the replace() calls never match anything.

def urlfunc(value):
    # Print the value before and after the replacement to compare them.
    print('original: ', value)
    value = value.replace('%0A', '').replace('%09', '')
    print('replaced: ', value)
    return value

This results in the following output:

original:  http://www.musiker-board.de/posts/7609325/

replaced:  http://www.musiker-board.de/posts/7609325/

original:  http://www.musiker-board.de/members/martin-hofmann.17/
replaced:  http://www.musiker-board.de/members/martin-hofmann.17/

The blank line inside the first pair of results is there because that value still ends with the whitespace characters (a line feed followed by tabs).

So the solution is to strip the values:

def urlfunc(value):
    return value.strip()

With this change, the debug messages reporting that the page was not found no longer appear.
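The difference between the two cleanup approaches can also be checked outside Scrapy. The sketch below assumes the extracted href ends with a literal line feed and two tabs, and uses urllib.parse.quote as a stand-in for the percent-encoding applied when the request is sent; Scrapy's own URL handling may differ in detail. The helper names are hypothetical.

```python
from urllib.parse import quote

# Assumed raw attribute value: the thread URL followed by "\n\t\t".
raw_href = "http://www.musiker-board.de/threads/spotify-premium-paket.621224/\n\t\t"


def replace_escapes(value):
    # The original approach: looks for escapes that are not in the string yet.
    return value.replace("%0A", "").replace("%09", "")


def strip_whitespace(value):
    # The fix: remove the literal leading and trailing whitespace.
    return value.strip()


def encode_for_request(url):
    # Stand-in for the encoding step; keeps ":" and "/" unescaped.
    return quote(url, safe=":/")


# The replace-based cleanup leaves the whitespace in, so it gets encoded.
print(encode_for_request(replace_escapes(raw_href)))
# After strip() there is nothing left to encode.
print(encode_for_request(strip_whitespace(raw_href)))
```

The first call leaves the string unchanged because the escapes only come into existence at the encoding step, which is why stripping has to happen on the raw value.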



Answer 2:

This can happen if the HTML source contains whitespace, such as line breaks and tabs, inside the link's href attribute.
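For illustration, markup like the following would produce such a value. The HTML snippet is a made-up example rather than something taken from the forum, the class name is hypothetical, and the sketch uses Python's built-in html.parser instead of Scrapy's link extractor.

```python
from html.parser import HTMLParser

# Hypothetical markup: the closing quote of href sits on the next line,
# so the attribute value ends with a line feed and two tabs.
SAMPLE_HTML = '<a href="http://www.musiker-board.de/threads/example.1/\n\t\t">Thread</a>'


class HrefCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# The whitespace is part of the attribute value as written in the page.
print(repr(collector.hrefs[0]))
# Stripping it yields a usable URL.
print(repr(collector.hrefs[0].strip()))
```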

You could clean the URL with the process_value argument of LinkExtractor, for example:

...
Rule(LinkExtractor(allow=(r'threads/\w+',), process_value=clean_url), callback='parse_thread')
...

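def clean_url(value):
    # Remove literal whitespace (line feeds, tabs) around the raw href value.
    value = value.strip()
    # Also drop the escapes in case the value is already percent-encoded.
    value = value.replace(u'%0A', '')
    value = value.replace(u'%09', '')
    return value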