To learn Scrapy, I am trying to crawl some inner URLs from a list of start_urls. The problem is that not all elements of start_urls have inner URLs (here I would like to return NaN). So, how can I return the following two-column dataframe (**)?
visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://www.extracted-link3.com
So far, I have tried:

In:
# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

from toy_project.items import ToyCrawlerItem  # hypothetical items module


class ToySpider(scrapy.Spider):
    name = "toy_example"
    allowed_domains = ["www.example.com"]
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a")
        lis_ = []
        for l in links:
            item = ToyCrawlerItem()
            item['visited_link'] = response.url
            item['extracted_link'] = l.xpath('@href').extract_first()
            yield item
            lis_.append(item)
        df = pd.DataFrame(lis_)
        print('\n\n\n\n\n', df, '\n\n\n\n\n')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
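To show the behavior I am seeing, here is a minimal illustration (plain Python, no Scrapy needed) of why pages without matching links produce no row at all: when the link selection is empty, the loop body never runs, so no item is ever yielded for that page.

```python
# Mirrors `for l in links:` in the spider, with a plain list standing in
# for the SelectorList returned by response.xpath(...).
def collect_rows(visited_url, links):
    rows = []
    for href in links:
        rows.append({'visited_link': visited_url, 'extracted_link': href})
    return rows

# A page with one inner link produces one row...
print(collect_rows('https://www.example3.com',
                   ['https://www.extracted-link3.com']))
# ...but a page with no inner links produces NO rows, not a NaN row.
print(collect_rows('https://www.example1.com', []))
```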
However, the above code is returning only:

Out:
extracted_link,visited_link
https://www.extracted-link.com,https://www.example1.com
I tried to handle the None values with:

if l == None:
    item['visited_link'] = 'NaN'
else:
    item['visited_link'] = response.url
But it is not working. Any idea of how to get (**)?

** Yes, a dataframe; I know that I can export with -o, but I will need to do dataframe operations afterwards.
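(For context on why I still want a dataframe: even with a -o CSV export, the file can be loaded back into pandas for further operations, and empty fields become NaN automatically. A minimal sketch, with io.StringIO standing in for the exported file:)

```python
import io
import pandas as pd

# Simulated contents of a CSV exported by `scrapy crawl toy_example -o crawled.csv`;
# the empty field on the first row represents a page with no inner link.
csv_data = io.StringIO(
    "visited_link,extracted_link\n"
    "https://www.example1.com,\n"
    "https://www.example3.com,https://www.extracted-link3.com\n"
)
df = pd.read_csv(csv_data)  # empty fields are parsed as NaN
print(df)
```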
UPDATE
After reading @rrschmidt's answer, I tried:
def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    lis_ = []
    for l in links:
        item = ToyItem()
        if len(l) == 0:
            item['visited_link'] = 'NaN'
        else:
            item['visited_link'] = response.url
        #item['visited_link'] = response.url
        item['extracted_link'] = l.xpath('@href').extract_first()
        yield item
        print('\n\n\n Aqui:\n\n', item, "\n\n\n")
        lis_.append(item)
    df = pd.DataFrame(lis_)
    print('\n\n\n\n\n', df, '\n\n\n\n\n')
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
Nevertheless, it still returns the same wrong output. Could anybody help me clarify this issue?