In order to learn Scrapy, I am trying to crawl some inner URLs from a list of start_urls. The problem is that not all elements of start_urls have inner URLs (here I would like to return NaN). So, how can I return the following 2-column dataframe (**):
visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://www.extracted-link3.com
So far, I tried to:
In:
# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

# import path assumed; the item defines visited_link and extracted_link fields
from toy_example.items import ToyCrawlerItem


class ToySpider(scrapy.Spider):
    name = "toy_example"
    allowed_domains = ["www.example.com"]
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a")
        lis_ = []
        for l in links:
            item = ToyCrawlerItem()
            item['visited_link'] = response.url
            item['extracted_link'] = l.xpath('@href').extract_first()
            yield item
            lis_.append(item)
        df = pd.DataFrame(lis_)
        print('\n\n\n\n\n', df, '\n\n\n\n\n')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
However, the above code only returns:
Out:
extracted_link,visited_link
https://www.extracted-link.com,https://www.example1.com
I tried to handle the None values with:
if l == None:
    item['visited_link'] = 'NaN'
else:
    item['visited_link'] = response.url
But it is not working. Any idea of how to get (**)?

(**) Yes, a dataframe; I know that I can use -o, but I will do dataframe operations afterwards.
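For reference, the -o option I mean is Scrapy's built-in feed export from the command line, e.g.:

scrapy crawl toy_example -o crawled_table.csv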
UPDATE
After reading @rrschmidt's answer, I tried:
def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    lis_ = []
    for l in links:
        item = ToyItem()
        if len(l) == 0:
            item['visited_link'] = 'NaN'
        else:
            item['visited_link'] = response.url
        #item['visited_link'] = response.url
        item['extracted_link'] = l.xpath('@href').extract_first()
        yield item
        print('\n\n\n Here:\n\n', item, "\n\n\n")
        lis_.append(item)
    df = pd.DataFrame(lis_)
    print('\n\n\n\n\n', df, '\n\n\n\n\n')
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
Nevertheless, it still returns the same wrong output. Could anybody help me clarify this issue?
As far as I can see there are two problems with your scraper:

First, parse is called for every element in start_urls, and since you are creating and saving a new dataframe for each link, the dataframes you generate keep overwriting each other. That's why you will always have only one result in crawled_table.csv.

Solution for this: create the dataframe only one time and push all items into the same dataframe object. Then save the dataframe in each parse call, just in case the scraper has to stop before finishing.

Second, if l == None: won't work, as response.xpath returns an empty list if no matches were found. Doing if len(l) == 0: instead should do it.

In a gist, here's how I would structure the scraper (code not tested!):
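Something along these lines (an untested sketch; the ToyItem import path is an assumption, and the fields follow the visited_link / extracted_link names from the question):

import scrapy
import pandas as pd

# import path is an assumption; ToyItem defines visited_link and extracted_link
from toy_example.items import ToyItem


class ToySpider(scrapy.Spider):
    name = "toy_example"
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # one shared list for all parse() calls, so results are not overwritten
        self.collected = []

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")

        if len(links) == 0:
            # no inner links on this page: record the visit with NaN
            item = ToyItem()
            item['visited_link'] = response.url
            item['extracted_link'] = 'NaN'
            self.collected.append(dict(item))
            yield item
        else:
            for link in links:
                item = ToyItem()
                item['visited_link'] = response.url
                item['extracted_link'] = link.xpath('@href').extract_first()
                self.collected.append(dict(item))
                yield item

        # save on every parse() call, in case the scraper has to stop early
        df = pd.DataFrame(self.collected)
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)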