How to tidy up CSV output from Scrapy when using FilesPipeline

Posted 2019-06-14 13:12

After a lot of help from the SO community I have a Scrapy crawler which saves the webpage of the site it crawls, but I'd like to clean up the CSV file that gets created by --output.

A sample row currently looks like this:

"[{'url': 'http://example.com/page', 'path': 'full/hashedfile', 'checksum': 'checksumvalue'}]",http://example.com/page,2016-06-20 16:10:24.824000,http://example.com/page,My Example Page

How do I get the CSV file to contain the details of one file per line (without the extra url: entry), with the path value including an extension such as .html or .txt?

My items.py is as follows:

import scrapy


class MycrawlerItem(scrapy.Item):
    title = scrapy.Field()
    crawldate = scrapy.Field()
    pageurl = scrapy.Field()
    # file_urls is consumed by FilesPipeline, which populates files
    files = scrapy.Field()
    file_urls = scrapy.Field()

My rules callback is:

import datetime

from bs4 import BeautifulSoup
# MycrawlerItem is imported from the project's items module


def scrape_page(self, response):
    # Parse the page and grab its <title> text
    page_soup = BeautifulSoup(response.body, "html.parser")
    ScrapedPageTitle = page_soup.title.get_text()
    item = MycrawlerItem()
    item['title'] = ScrapedPageTitle
    item['crawldate'] = datetime.datetime.now()
    item['pageurl'] = response.url
    # FilesPipeline downloads every URL listed in file_urls
    item['file_urls'] = [response.url]
    yield item
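
For context, the file_urls/files pair is what Scrapy's FilesPipeline operates on; it has to be enabled in settings.py for the files field in the log below to be populated. A minimal sketch, with an illustrative storage path:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/path/to/file/store'  # illustrative; point this at your storage directory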

The crawler log shows:

2016-06-20 16:10:26 [scrapy] DEBUG: Scraped from <200 http://example.com/page>
{'crawldate': datetime.datetime(2016, 6, 20, 16, 10, 24, 824000),
 'file_urls': ['http://example.com/page'],
 'files': [{'checksum': 'checksumvalue',
            'path': 'full/hashedfile',
            'url': 'http://example.com/page'}],
 'pageurl': 'http://example.com/page',
 'title': u'My Example Page'}

The ideal structure for each CSV line would be:

crawldate,file_url,file_path,title
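
Using the sample data above, such a row would look something like this (the .html extension on the path is the desired outcome, not what the pipeline currently produces):

2016-06-20 16:10:24.824000,http://example.com/page,full/hashedfile.html,My Example Page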

Tags: scrapy

2 Answers

在下西门庆 · 2019-06-14 13:53

I was able to avoid the need to clean the CSV data by specifying the XML output option instead of CSV.

Outputting as .xml and then importing into Excel gave me a cleaner dataset, with one row per page and no extra punctuation characters to preprocess.
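
For example (the spider name mycrawler is an assumption; Scrapy infers the export format from the file extension):

scrapy crawl mycrawler -o pages.xml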

我命由我不由天 · 2019-06-14 13:55

If you want custom formats and such, you probably want to just use good ol' Scrapy item pipelines.

In the pipeline's process_item or close_spider methods you can write your item to a file, like:

import csv


class CsvWriterPipeline(object):  # illustrative class name, e.g. in pipelines.py
    def process_item(self, item, spider):
        # Only write when the spider is run with -a csv=True
        if not getattr(spider, 'csv', False):
            return item
        with open('{}.csv'.format(spider.name), 'a') as f:
            writer = csv.writer(f)
            writer.writerow([item['crawldate'], item['title']])
        return item

This will write out a <spider_name>.csv file every time you run the spider with the csv flag, e.g. scrapy crawl twitter -a csv=True
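
To get the exact crawldate,file_url,file_path,title layout from the question, a variant of the same method can flatten the files list that FilesPipeline fills in. A sketch, assuming this pipeline is registered after FilesPipeline in ITEM_PIPELINES so that files is already populated:

def process_item(self, item, spider):
    if not getattr(spider, 'csv', False):
        return item
    with open('{}.csv'.format(spider.name), 'a') as f:
        writer = csv.writer(f)
        # One CSV row per downloaded file; each entry in item['files']
        # is a dict with 'url', 'path' and 'checksum' keys (see the log above)
        for file_info in item.get('files', []):
            writer.writerow([item['crawldate'], file_info['url'],
                             file_info['path'], item['title']])
    return item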

You can make this more efficient if you open the file in the open_spider method and close it in close_spider, but it's the same thing otherwise.
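
A minimal sketch of that variant (again with an illustrative class name):

import csv


class CsvWriterPipeline(object):
    def open_spider(self, spider):
        # Open the file once when the crawl starts
        self.file = open('{}.csv'.format(spider.name), 'a')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['crawldate'], item['title']])
        return item

    def close_spider(self, spider):
        # Close the file once when the crawl ends
        self.file.close()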
