In order to learn Scrapy, I am trying to crawl some inner URLs from a list of start_urls. The problem is that not all elements of start_urls have inner URLs (here I would like to return NaN). So, how can I return the following 2-column dataframe (**):
visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://www.extracted-link3.com
So far, I tried to:
In:
# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

# import path assumed; the item defines visited_link and extracted_link fields
from toy_example.items import ToyCrawlerItem


class ToySpider(scrapy.Spider):
    name = "toy_example"
    allowed_domains = ["www.example.com"]
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a")
        lis_ = []
        for l in links:
            item = ToyCrawlerItem()
            item['visited_link'] = response.url
            item['extracted_link'] = l.xpath('@href').extract_first()
            yield item
            lis_.append(item)
        df = pd.DataFrame(lis_)
        print('\n\n\n\n\n', df, '\n\n\n\n\n')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
However, the above code only returns:
Out:
extracted_link,visited_link
https://www.extracted-link.com,https://www.example1.com
I tried to handle the None values with:
if l == None:
    item['visited_link'] = 'NaN'
else:
    item['visited_link'] = response.url
But it is not working. Any idea of how to get (**)?

(**) Yes, a dataframe; I know that I can use -o, but I will do dataframe operations afterwards.
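For reference, the -o option I mean is Scrapy's built-in feed export from the command line, e.g.:

scrapy crawl toy_example -o crawled_table.csv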
UPDATE
After reading @rrschmidt's answer, I tried:
def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    lis_ = []
    for l in links:
        item = ToyItem()
        if len(l) == 0:
            item['visited_link'] = 'NaN'
        else:
            item['visited_link'] = response.url
        #item['visited_link'] = response.url
        item['extracted_link'] = l.xpath('@href').extract_first()
        yield item
        print('\n\n\n Here:\n\n', item, "\n\n\n")
        lis_.append(item)
    df = pd.DataFrame(lis_)
    print('\n\n\n\n\n', df, '\n\n\n\n\n')
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
Nevertheless, it still returns the same wrong output. Could anybody help me clarify this issue?
As far as I can see there are two problems with your scraper:

First, parse is called for every element in start_urls, and since you are creating and saving a new dataframe for each link, the dataframes you generate keep overwriting each other. That's why you will always have only one result in crawled_table.csv.

Solution for this: create the dataframe only one time and push all items into the same dataframe object. Then save the dataframe in each parse call, just in case the scraper has to stop before finishing.

Second, if l == None: won't work, as response.xpath returns an empty list if no matches were found. Doing if len(l) == 0: instead should do it.

In a gist, here's how I would structure the scraper (code not tested!):
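Something along these lines (an untested sketch; the ToyItem import path is an assumption, and the fields follow the visited_link / extracted_link names from the question):

import scrapy
import pandas as pd

# import path is an assumption; ToyItem defines visited_link and extracted_link
from toy_example.items import ToyItem


class ToySpider(scrapy.Spider):
    name = "toy_example"
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # one shared list for all parse() calls, so results are not overwritten
        self.collected = []

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")

        if len(links) == 0:
            # no inner links on this page: record the visit with NaN
            item = ToyItem()
            item['visited_link'] = response.url
            item['extracted_link'] = 'NaN'
            self.collected.append(dict(item))
            yield item
        else:
            for link in links:
                item = ToyItem()
                item['visited_link'] = response.url
                item['extracted_link'] = link.xpath('@href').extract_first()
                self.collected.append(dict(item))
                yield item

        # save on every parse() call, in case the scraper has to stop early
        df = pd.DataFrame(self.collected)
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)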