After a lot of help from the SO community I have a Scrapy crawler which saves the webpage of the site it crawls, but I'd like to clean up the CSV file that gets created with --output.
A sample row currently looks like
"[{'url': 'http://example.com/page', 'path': 'full/hashedfile', 'checksum': 'checksumvalue'}]",http://example.com/page,2016-06-20 16:10:24.824000,http://example.com/page,My Example Page
How do I get the CSV file to contain the details of one file per line (without the extra url: entry), and with the path value including an extension like .html or .txt?
My items.py is as follows:
import scrapy

class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    crawldate = scrapy.Field()
    pageurl = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()
My rules callback is:
def scrape_page(self, response):
    page_soup = BeautifulSoup(response.body, "html.parser")
    ScrapedPageTitle = page_soup.title.get_text()
    item = MycrawlerItem()
    item['title'] = ScrapedPageTitle
    item['crawldate'] = datetime.datetime.now()
    item['pageurl'] = response.url
    item['file_urls'] = [response.url]
    yield item
The crawler log shows:
2016-06-20 16:10:26 [scrapy] DEBUG: Scraped from <200 http://example.com/page>
{'crawldate': datetime.datetime(2016, 6, 20, 16, 10, 24, 824000),
 'file_urls': ['http://example.com/page'],
 'files': [{'checksum': 'checksumvalue',
            'path': 'full/hashedfile',
            'url': 'http://example.com/page'}],
 'pageurl': 'http://example.com/page',
 'title': u'My Example Page'}
The ideal structure for each CSV line would be:
crawldate,file_url,file_path,title
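For illustration, using the values from the log entry above and assuming the stored file were given an .html extension, such a row would look like:

2016-06-20 16:10:24.824000,http://example.com/page,full/hashedfile.html,My Example Page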
I was able to avoid the need to clean up the CSV data by specifying the XML output option instead of CSV.
Outputting as .xml and then importing into Excel gave me a cleaner dataset of one row per page, without extra punctuation characters to preprocess.
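For reference, the feed export format follows from the output file's extension, so switching to XML is just a change to the output flag (the spider name mycrawler below is a placeholder; use your own):

scrapy crawl mycrawler -o output.xml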
If you want custom formats and such, you probably just want to use good ol' Scrapy item pipelines.
In the pipeline methods process_item or close_spider you can write your item to a file (see the sketch below). This will write out a <spider_name>.csv file every time you run the spider with a csv flag, i.e. scrapy crawl twitter -a csv=True.
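The answer's original code isn't reproduced here, so the following is only a minimal sketch of what such a pipeline could look like, assuming the item fields from the question and a hypothetical CsvWriterPipeline class name:

import csv

# hypothetical pipeline sketch, not the original answer's code
class CsvWriterPipeline(object):

    def process_item(self, item, spider):
        # only write when the spider was started with -a csv=True
        if getattr(spider, 'csv', False):
            # append one row per downloaded file, flattening the 'files' list
            with open('{}.csv'.format(spider.name), 'a') as f:
                writer = csv.writer(f)
                for file_info in item.get('files', []):
                    writer.writerow([item['crawldate'],
                                     file_info['url'],
                                     file_info['path'],
                                     item['title']])
        return item

The pipeline still has to be enabled via ITEM_PIPELINES in settings.py for Scrapy to call it.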
You can make this more efficient if you open the file in open_spider and close it in close_spider, but it's the same thing otherwise.
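A sketch of that variant, under the same assumptions as above (hypothetical class name, csv spider argument):

import csv

# hypothetical sketch: open the file once per crawl instead of once per item
class CsvWriterPipeline(object):

    def open_spider(self, spider):
        self.file = None
        if getattr(spider, 'csv', False):
            self.file = open('{}.csv'.format(spider.name), 'w')
            self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        if self.file:
            for file_info in item.get('files', []):
                self.writer.writerow([item['crawldate'],
                                      file_info['url'],
                                      file_info['path'],
                                      item['title']])
        return item

    def close_spider(self, spider):
        if self.file:
            self.file.close()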