I'm pretty new to Python and Scrapy and this site has been an invaluable resource so far for my project, but now I'm stuck on a problem that seems like it'd be pretty simple. I'm probably thinking about it the wrong way. What I want to do is add a column to my output CSV that lists the URL that each row's data was scraped from. In other words, I want the table to look like this:
item1  item2  item_url
a      1      http://url/a
b      2      http://url/a
c      3      http://url/b
d      4      http://url/b
I'm using psycopg2 to get a bunch of URLs stored in a database, which I then scrape from. The code looks like this:
class MySpider(CrawlSpider):
    name = "spider"
    # querying the database here...
    # getting the urls from the database and assigning them to the rows list
    rows = cur.fetchall()
    allowed_domains = ["www.domain.com"]
    start_urls = []
    for row in rows:
        # adding the urls from rows to start_urls
        start_urls.append(row)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")
        items = []
        for site in sites:
            item = SettingsItem()
            # a bunch of items and their xpaths...
            # here is my non-working code
            item['url_item'] = row
            items.append(item)
        return items
As you can see, I wanted to make an item that just takes the URL that the parse function is currently on. But when I run the spider, it gives me "exceptions.NameError: global name 'row' is not defined." I think this is because Python doesn't recognize row as a variable within the parse function, or something like that? (Like I said, I'm new.) Anyway, I'm stuck, and any help would be much appreciated.
Put the start request generation not in the class body but in start_requests():
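Here's a minimal sketch of that approach, matching the old-style imports in your code. The connection string, the SELECT query, and the items-module path are placeholders you'd swap for your own:

import psycopg2
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

from myproject.items import SettingsItem  # hypothetical items module path

class MySpider(CrawlSpider):
    name = "spider"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        # Hypothetical connection string and query -- substitute your own.
        conn = psycopg2.connect("dbname=mydb user=me")
        cur = conn.cursor()
        cur.execute("SELECT url FROM urls")
        # fetchall() returns tuples, so unpack the single column
        for (url,) in cur.fetchall():
            yield Request(url, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("a bunch of xpaths here...")
        for site in sites:
            item = SettingsItem()
            # ... the rest of your fields and their XPaths ...
            item['url_item'] = response.url  # the URL this row was scraped from
            yield item

With this in place you don't need the row variable at all: inside parse, response.url is the URL the current response was fetched from, so every item scraped from that page gets tagged with it. If you ever need to carry extra database columns along with a request, you can also stash them in the Request's meta dict and read them back from response.meta in the callback.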