How to assign the url that's being scraped from

I'm pretty new to Python and Scrapy and this site has been an invaluable resource so far for my project, but now I'm stuck on a problem that seems like it'd be pretty simple. I'm probably thinking about it the wrong way. What I want to do is add a column to my output CSV that lists the URL that each row's data was scraped from. In other words, I want the table to look like this:

item1    item2    item_url
a        1        http://url/a
b        2        http://url/a
c        3        http://url/b
d        4        http://url/b

I'm using psycopg2 to get a bunch of URLs stored in a database, which I then scrape. The code looks like this:

class MySpider(CrawlSpider):
    name = "spider"

    # querying the database here...

    #getting the urls from the database and assigning them to the rows list
    rows = cur.fetchall()

    allowed_domains = ["www.domain.com"]

    start_urls = []

    for row in rows:

        #adding the urls from rows to start_urls
        start_urls.append(row)

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("a bunch of xpaths here...")
            items = []
            for site in sites:
                item = SettingsItem()
                # a bunch of items and their xpaths...
                # here is my non-working code
                item['url_item'] = row
                items.append(item)
            return items

As you can see, I wanted to make an item that just takes the url that the parse function is currently on. But when I run the spider, it gives me "exceptions.NameError: global name 'row' is not defined." I think that this is because Python doesn't recognize row as a variable within the XPathSelector function, or something like that? (Like I said, I'm new.) Anyway, I'm stuck, and any help would be much appreciated.
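A minimal sketch of the scoping rule behind that error, independent of Scrapy. The ScopeDemo class, its method names, and the URLs are made up for illustration. Names bound in a class body, including a loop variable such as row, end up as class attributes; a method that refers to a bare row looks for a local or global of that name and finds neither.

```python
class ScopeDemo:
    rows = ["http://url/a", "http://url/b"]

    for row in rows:
        pass  # 'row' is rebound in the class namespace on every pass

    def broken(self):
        # A bare name inside a method is looked up as a local, then a global.
        # The class body is skipped, so this raises NameError.
        return row

    def last_row(self):
        # The loop variable survives as a class attribute holding the final value.
        return self.row


def try_broken():
    """Call the broken method and report which exception type it raises."""
    try:
        ScopeDemo().broken()
    except NameError as exc:
        return type(exc).__name__
    return "no error"
```

Note that even self.row would only ever give the last URL from the loop, not the URL of the page currently being parsed, so the value has to come from the response itself.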

Tags: python scrapy

1 answer

chillily

Reply #2 · 2019-08-19 11:26

Generate the start requests in start_requests() instead of in the class body, and read the URL from response.url inside parse():

class MySpider(CrawlSpider):

    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # querying the database here...

        # fetch the rows returned by the query; each row is a tuple of columns
        rows = cur.fetchall()

        for row in rows:
            # assumption: the URL is the first column of each row
            url = row[0]
            yield self.make_requests_from_url(url)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")

        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # response.url is the URL of the page currently being parsed
            item['url_item'] = response.url

            yield item
