The HTML I am trying to crawl is badly formatted:
<html>
<head>...</head>
<body>
My items here...
My items here...
My items here...
Pagination here...
</body>
</head>
</html>
The problem is the second </head>. I have to replace the HTML in my spider before I can use my XPath expressions:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.bar']
    start_urls = ['http://foo.bar/index.php?page=1']

    rules = (
        Rule(SgmlLinkExtractor(allow=('\?page=\d',)),
             callback="parse_start_url",
             follow=True),
    )

    def parse_start_url(self, response):
        # Remove the second </head> here
        # Extract my items
Now I want to use the restrict_xpaths argument in my rule, but I can't because of the badly formatted HTML: the link extractor sees the original response, on which the replacement has not yet been performed. Do you have any ideas?
What I would do is write a downloader middleware and use, for instance, the BeautifulSoup package to fix and prettify the HTML contained inside response.body; response.replace() might be handy in this case. Note that if you go with BeautifulSoup, choose the parser carefully: each parser has its own way of handling broken HTML, and some are more lenient than others. lxml.html would be the best in terms of speed, though.
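Here is a minimal sketch of such a middleware. The class name and the string-based repair are my own; for this particular page, simply dropping any stray </head> after </body> is enough, but you could swap in BeautifulSoup's prettify() for a full reparse:

```python
def fix_stray_head_close(html):
    """Drop any </head> tags that appear after </body>."""
    body_end = html.find('</body>')
    if body_end == -1:
        return html
    head, tail = html[:body_end], html[body_end:]
    return head + tail.replace('</head>', '')

class FixBrokenHtmlMiddleware(object):
    """Downloader middleware: repair the body before Scrapy parses it,
    so restrict_xpaths in the Rule sees well-formed markup."""
    def process_response(self, request, response, spider):
        fixed = fix_stray_head_close(response.body.decode(response.encoding))
        # response.replace() returns a new Response with the repaired body
        return response.replace(body=fixed.encode(response.encoding))
```

With the middleware enabled, the Rule's restrict_xpaths operates on the repaired markup, so the callback no longer needs to patch the HTML itself.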
As an example of a custom middleware that modifies the downloaded HTML, see the scrapy-splash middleware.
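If you do write your own middleware, remember to register it under the DOWNLOADER_MIDDLEWARES setting; the module path, class name, and priority number below are placeholders for your own:

```python
# settings.py -- the path and the priority number are illustrative
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FixBrokenHtmlMiddleware': 543,
}
```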