Problems writing Scraped data to csv with Slavic c

Posted 2019-09-14 09:16

Question:

Intention / Wanted result:

To scrape the link titles (i.e. the text of the links with each item) from a Czech website:

https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha

And write the results to a CSV file, preferably as a list, so that I can later manipulate the data in another Python data analytics model.

Result / Problem:

I am getting a UnicodeEncodeError and a TypeError. I suspect this has to do with the non-ASCII characters that exist in the Czech language. Please see the tracebacks below.

Traceback:

TypeError Traceback:

2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n                                Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
          b'tna, Praha - Kr\xc4\x8d                            '}
Traceback (most recent call last):
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
    self.exporter.export_item(item)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 193, in export_item
    self._write_headers_and_set_fields_to_export(item)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 217, in _write_headers_and_set_fields_to_export
    self.csv_writer.writerow(row)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 718, in write
    return self.writer.write(data)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 376, in write
    data, consumed = self.encode(object, self.errors)
TypeError: Can't convert 'bytes' object to str implicitly

UnicodeEncodeError Traceback:

2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n                                Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
          b'tna, Praha - Kr\xc4\x8d                            '}
Traceback (most recent call last):
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
    self.exporter.export_item(item)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 198, in export_item
    self.csv_writer.writerow(values)
  File "C:\Users\phili\Anaconda3\envs\py35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' in position 37: character maps to <undefined>
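
Both errors can be reproduced without Scrapy. The following is a minimal sketch, assuming only the title bytes shown in the log above: it decodes them as UTF-8, then tries the cp1252 codec named in the second traceback, and finally mixes bytes with str as in the first traceback.

```python
# Title bytes copied from the log above (surrounding whitespace trimmed).
raw_title = b'Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9btna, Praha - Kr\xc4\x8d'

# The bytes are valid UTF-8 and decode to Czech text.
title = raw_title.decode('utf-8')

# cp1252 has no mapping for U+011B, hence the UnicodeEncodeError.
try:
    title.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent every character, so the round trip is lossless.
utf8_ok = title.encode('utf-8') == raw_title

# Mixing bytes with str raises a TypeError on Python 3.
try:
    'title: ' + raw_title
    concat_ok = True
except TypeError:
    concat_ok = False

print(cp1252_ok, utf8_ok, concat_ok)
```

In other words, the scraped data itself is fine; the failures come from handing bytes to a text writer and from a writer that falls back to the Windows default code page.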

Situation / Process:

I am running scrapy crawl bezrealitky (bezrealitky being the name of the spider). I configured the pipeline with a CsvItemExporter example I found on the internet and tried to adapt it to use UTF-8 encoding when opening the file (I also tried at first without specifying UTF-8, but got the same error).

My pipeline code:

from scrapy.conf import settings
from scrapy.exporters import CsvItemExporter
import codecs


class CsvPipeline(object):
    def __init__(self):
        self.file = codecs.open("booksdata.csv", 'wb', encoding='UTF-8')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

My settings file:

BOT_NAME = 'bezrealitky'

SPIDER_MODULES = ['bezrealitky.spiders']
NEWSPIDER_MODULE = 'bezrealitky.spiders'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bezrealitky.pipelines.CsvPipeline': 300,
}

My spider code:

class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]
    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            item['title'] = response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1].encode('utf-8')
            items.append(item)
        return(items)

Solutions tried so far:

  • Adding and removing .encode('utf-8') on the extract() call, and also in pipelines.py, but it didn't work.
  • Adding # -*- coding: utf-8 -*- to the top of the file, which didn't work either.
  • Switching the console and Python I/O encoding to UTF-8 with:

    chcp 65001

    set PYTHONIOENCODING=utf-8

Conclusion:

I cannot get the scraped data to write to the CSV file. The CSV is created, but there is nothing in it. In the shell I can see that the data is scraped, but it isn't decoded / encoded properly, and an error is thrown before it is written to the file.

I am a complete beginner with this, just trying to pick up Scrapy. Would really appreciate any help I can get!

Answer 1:

To scrape Czech websites and avoid these errors, I use the unidecode module, which produces an ASCII transliteration of Unicode text.

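# -*- coding: utf-8 -*-
import scrapy
from unidecode import unidecode

# Assumes the default Scrapy project layout, with the item class in items.py.
from bezrealitky.items import BezrealitkyItem


class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        for record in response.xpath('//*[starts-with(@class,"record")]'):
            # Query relative to the current record, not the whole response.
            title = record.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract_first()
            if title is None:
                continue
            # New item per record; unidecode takes str, so no .encode() here.
            item = BezrealitkyItem()
            item['title'] = unidecode(title.strip())
            yield item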
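
To see what the transliteration does to a title, here is a rough sketch that uses only the standard library. Note that it swaps in unicodedata decomposition as a stand-in and does not call unidecode itself. The helper name strip_diacritics is made up for illustration, and it only handles characters that decompose into a base letter plus accents (which covers Czech), whereas unidecode handles many more scripts.

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: drop combining accents, keep the base letters."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sample title taken from the log in the question, written with escapes.
title = 'Ob\u011bt\xed 6. kv\u011btna, Praha - Kr\u010d'
ascii_title = strip_diacritics(title).strip()
print(ascii_title)
```

The result contains only ASCII characters, so any codec (including cp1252) can write it, at the cost of losing the original diacritics.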

Since I use an ItemLoader, my code looks more like this:

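# -*- coding: utf-8 -*-
from itemloaders.processors import MapCompose  # scrapy.loader.processors in older Scrapy releases
from scrapy.loader import ItemLoader
from unidecode import unidecode


class BaseItemLoader(ItemLoader):
    # Transliterate every extracted title value to ASCII on input.
    title_in = MapCompose(unidecode)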
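
If the diacritics should be kept instead of transliterated, another option is to write the file as UTF-8 explicitly. The following is a minimal sketch with the standard csv module, not the Scrapy exporter. The file name titles.csv, the temporary directory, and the single sample row are assumptions made for the demo.

```python
import csv
import os
import tempfile

# Sample value taken from the log in the question, written with escapes.
titles = ['Ob\u011bt\xed 6. kv\u011btna, Praha - Kr\u010d']

# Hypothetical output path, placed in a temp directory for the demo.
path = os.path.join(tempfile.mkdtemp(), 'titles.csv')

# Text mode with an explicit encoding, so the platform default
# (cp1252 on many Windows setups) is never used.
# newline='' lets the csv module control line endings.
with open(path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    for t in titles:
        writer.writerow([t])

# Read the file back with the same encoding to check the round trip.
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
```

The key points are passing str values (not bytes) to the writer and naming the encoding at the point where the file is opened.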