Problems while trying to crawl links inside visited pages

Posted 2019-06-09 14:08

Question:

In order to learn Scrapy, I am trying to crawl some inner URLs from a list of start_urls. The problem is that not all elements of start_urls have inner URLs (for those I would like to return NaN). So, how can I return the following 2-column dataframe (**):

visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://www.extracted-link3.com

So far, I have tried:

In:

# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

from ..items import ToyCrawlerItem  # assuming ToyCrawlerItem is defined in the project's items.py


class ToySpider(scrapy.Spider):
    name = "toy_example"

    allowed_domains = ["www.example.com"]

    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']


    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a")

        lis_ = []

        for l in links:
            item = ToyCrawlerItem()
            item['visited_link'] = response.url
            item['extracted_link'] = l.xpath('@href').extract_first()
            yield item

        lis_.append(item)
        df = pd.DataFrame(lis_)

        print('\n\n\n\n\n', df, '\n\n\n\n\n')

        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

However, the above code returns only:

Out:

extracted_link,visited_link
https://www.extracted-link.com,https://www.example1.com

I tried to handle the None values with:

            if l == None:
                item['visited_link'] = 'NaN'
            else:
                item['visited_link'] = response.url

But it is not working. Any idea how to get (**)?

* Yes, a dataframe; I know that I can use -o, but I will need to do dataframe operations afterwards.
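For completeness, here is a minimal sketch of that route (assuming the spider above is named toy_example): export the items with Scrapy's built-in feed exporter via -o, then load the CSV with pandas for the dataframe operations.

# Shell: let Scrapy's feed exporter write the scraped items to CSV
#   scrapy crawl toy_example -o crawled_table.csv

# Python: load the exported file for further dataframe work
import pandas as pd

df = pd.read_csv('crawled_table.csv')   # columns: visited_link, extracted_link
print(df.head())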

UPDATE

After reading @rrschmidt's answer, I tried:

def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")

    lis_ = []

    for l in links:

        item = ToyItem()

        if len(l) == 0:
            item['visited_link'] = 'NaN'
        else:
            item['visited_link'] = response.url

        #item['visited_link'] = response.url

        item['extracted_link'] = l.xpath('@href').extract_first()

        yield item

        print('\n\n\n Aqui:\n\n', item, "\n\n\n")

    lis_.append(item)
    df = pd.DataFrame(lis_)

    print('\n\n\n\n\n', df, '\n\n\n\n\n')

    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

Nevertheless, it still returned the same wrong output. Could anybody help me clarify this issue?

Answer 1:

As far as I can see there are two problems with your scraper:

  1. Since parse is called for every element in start_urls and you are creating and saving a new dataframe on each call, the dataframes you generate keep overwriting each other.

That's why you always end up with only one result in your crawled_table.csv.

Solution for this: create the dataframe only once and push all items into that same dataframe object.

Then save the dataframe on each parse call, just in case the scraper has to stop before finishing. (An alternative that writes the file only once, when the spider closes, is sketched after the code below.)

  2. if l == None: won't work, because response.xpath returns an empty list when no matches are found, so checking if len(links) == 0: does the job (see the short check sketched below).
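To see the point in isolation, here is a tiny check (the XPath expression is just a placeholder):

links = response.xpath("//span/a[@class='no-such-link']")   # placeholder selector
print(links)        # [] -- an empty SelectorList, never None
print(len(links))   # 0, so checking len(links) == 0 detects "no matches"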

In a gist, here's how I would structure the scraper (code not tested!):

# -*- coding: utf-8 -*-
import scrapy
import pandas as pd

from ..items import ToyItem  # assuming ToyItem is defined in the project's items.py

class ToySpider(scrapy.Spider):
    name = "toy_example"

    allowed_domains = ["www.example.com"]

    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    df = pd.DataFrame()

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        items = []

        if len(links) == 0:
            item = ToyItem()
            # build item with visited_link = NaN here
            item['visited_link'] = response.url
            item['extracted_link'] = 'NaN'
            items.append(item)
        else:
            for l in links:
                item = ToyItem()
                # build the item as you previously did here
                item['visited_link'] = response.url
                item['extracted_link'] = l.xpath('@href').extract_first()
                items.append(item)

        items_df = pd.DataFrame(items)
        # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
        self.df = pd.concat([self.df, items_df], ignore_index=True)

        print('\n\n\n\n\n', self.df, '\n\n\n\n\n')
        self.df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

        return items
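
As a variation on the saving logic mentioned above, one could also write the CSV only once, when the spider finishes, by defining a closed() method on the spider (Scrapy calls it when the crawl ends). The sketch below is untested and hypothetical; the class name, paths, and fields simply mirror the examples above:

import scrapy
import pandas as pd


class ToyCsvOnceSpider(scrapy.Spider):
    """Hypothetical variant: collect plain dicts and write the CSV once on close."""
    name = "toy_example_csv_once"
    start_urls = ['https://www.example1.com',
                  'https://www.example2.com',
                  'https://www.example3.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.items = []  # collected as plain dicts; turned into a dataframe at the end

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        if not links:  # empty SelectorList -> no inner links on this page
            self.items.append({'visited_link': response.url, 'extracted_link': 'NaN'})
        for l in links:
            self.items.append({'visited_link': response.url,
                               'extracted_link': l.xpath('@href').extract_first()})

    def closed(self, reason):
        # Called once when the spider finishes; write the file a single time.
        df = pd.DataFrame(self.items)
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)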