I parse websites and it works fine, but I need to add a new column with IDs to the output. The IDs are stored in a CSV file together with the URLs:
https://www.ceneo.pl/48523541, 1362
https://www.ceneo.pl/46374217, 2457
Code of my spider:
import scrapy
from ceneo.items import CeneoItem
import csv

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        start_urls = []
        f = open('urls.csv', 'r')
        for i in f:
            u = i.split(',')
            start_urls.append(u[0])
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        all_prices = response.xpath('(//td[@class="cell-price"] /a/span/span/span[@class="value"]/text())[position() <= 10]').extract()
        all_sellers = response.xpath('(//tr/td/div/ul/li/a[@class="js_product-offer-link"]/text())[position()<=10]').extract()
        f = open('urls.csv', 'r')
        id = []
        for i in f:
            u = i.split(',')
            id.append(u[1])
        x = len(all_prices)
        i = 0
        while (i < x):
            all_sellers[i] = all_sellers[i].replace('Opinie o ', '')
            i += 1
        for urlid, price, seller in zip(id, all_prices, all_sellers):
            yield {'urlid': urlid.strip(), 'price': price.strip(), 'seller': seller.strip()}
In the results I get wrong data because (maybe due to the zip function?) the IDs are taken alternately:
urlid,price,seller
1362,109,eMAG
1457,116,electro.pl
1362,597,apollo.pl
1457,597,allegro.pl
And it should output:
urlid,price,seller
1362,109,eMAG
1362,116,electro.pl
1457,597,apollo.pl
1457,597,allegro.pl
You can read the ID in start_requests and attach it to the request using meta={'id': id_}. Later, in parse, you can get the ID back with response.meta['id']. This way parse always has the correct ID for the page it is processing.

I use a string, data, instead of a file to create a working example. BTW: there is a standard function id(), so I use the variable id_ instead of id.