Scrapy: restrict_css with badly formatted HTML

Posted 2019-05-27 05:01

The HTML I am trying to crawl is badly formed:

<html>
<head>...</head>
<body>
    My items here...
    My items here...
    My items here...

    Pagination here...
</body>
</head>
</html>

The problem is the second </head>. I have to patch the HTML inside my spider before I can use XPath expressions:

class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.bar']
    start_urls = ['http://foo.bar/index.php?page=1']
    # Raw string so that \? and \d reach the regex engine unchanged.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d',)),
             callback="parse_start_url",
             follow=True),
    )

    def parse_start_url(self, response):
        # Remove the second </head> here
        # Build my item
        pass
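
The "remove the second </head>" step mentioned in the comment could be sketched roughly as follows. This is only an illustration using the standard library; the helper name strip_extra_head_close is hypothetical and not part of Scrapy, and it assumes the only defect is one or more surplus </head> tags after the first.

```python
import re

# Matches a closing </head> tag, tolerating inner whitespace and any letter case.
_HEAD_CLOSE = re.compile(rb"</head\s*>", re.IGNORECASE)


def strip_extra_head_close(body: bytes) -> bytes:
    """Keep the first </head> and drop every later one (hypothetical helper)."""
    first = _HEAD_CLOSE.search(body)
    if first is None:
        # No </head> at all: nothing to repair.
        return body
    head, tail = body[:first.end()], body[first.end():]
    # Remove any further </head> tags from the remainder of the document.
    return head + _HEAD_CLOSE.sub(b"", tail)
```

Inside a callback, the cleaned body could then be swapped in with something like response.replace(body=strip_extra_head_close(response.body)).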

Now I want to use the restrict_xpaths argument in my rule, but I can't because the HTML is malformed: the link extractor runs on the raw response, before my callback has performed the replacement.

Do you have any ideas?

1 answer
Animai°情兽
#2 · 2019-05-27 05:51

I would write a downloader middleware that uses, for instance, the BeautifulSoup package to repair the HTML contained in response.body. response.replace() is handy for swapping in the fixed body.

Note that if you go with BeautifulSoup, you should choose the parser carefully: each parser has its own way of handling broken HTML, and some are more lenient than others. lxml.html would be the best in terms of speed, though.

Example:

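from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse

class MyMiddleware(object):
    def process_response(self, request, response, spider):
        # Leave non-HTML responses (images, JSON, ...) untouched.
        if not isinstance(response, HtmlResponse):
            return response
        # Let the lxml parser rebuild a well-formed tree from the broken markup.
        soup = BeautifulSoup(response.body, "lxml")
        # Return a copy of the response that carries the repaired HTML.
        response = response.replace(body=soup.prettify())

        return response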
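
The middleware only takes effect once it is registered in the project settings. A minimal sketch, assuming the class lives in a hypothetical module myproject/middlewares.py; the priority 543 is just an illustrative value:

```python
# settings.py (sketch)
# "myproject.middlewares" is an assumed module path; adjust it to your project.
# The number is the middleware's order; 543 is an arbitrary example value.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyMiddleware": 543,
}
```

Because downloader middlewares process the response before it reaches the spider, the link extractor in the rule then works on the repaired HTML, so restrict_xpaths should behave as expected.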

For an example of a custom middleware that modifies the downloaded HTML, see the scrapy-splash middleware.
