Intention / Wanted result:
To scrape the link titles (i.e. the text of the link for each listing) from a Czech website:
https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha
and write the result to a CSV file, preferably as a simple list of titles so that I can later load and manipulate the data in another Python data analytics script.
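The desired end result can be sketched with the standard library alone. This is only an illustration of the target output, not part of the Scrapy project: the file name `titles.csv`, the helper names `write_titles` and `read_titles`, and the single sample title are all placeholders.

```python
import csv
import os
import tempfile


def write_titles(path, titles):
    # newline='' lets the csv module control line endings;
    # encoding='utf-8' keeps Czech characters such as 'ě' and 'č' intact.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title'])
        for title in titles:
            # Scraped text often carries surrounding whitespace, so strip it.
            writer.writerow([title.strip()])


def read_titles(path):
    # Read the CSV back into a plain Python list of strings.
    with open(path, newline='', encoding='utf-8') as f:
        return [row['title'] for row in csv.DictReader(f)]


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'titles.csv')
    write_titles(path, ['\n Obětí 6. května, Praha - Krč '])
    print(read_titles(path))
```

The point of the sketch is that the titles stay as `str` all the way through, and the encoding is chosen once, when the file is opened.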
Result / Problem:
I am getting a UnicodeEncodeError and a TypeError. I suspect this has to do with the non-ASCII characters (such as ě and č) used in the Czech language. Please see below for the tracebacks.
Traceback:
TypeError Traceback:
2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
b'tna, Praha - Kr\xc4\x8d '}
Traceback (most recent call last):
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
self.exporter.export_item(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 193, in export_item
self._write_headers_and_set_fields_to_export(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 217, in _write_headers_and_set_fields_to_export
self.csv_writer.writerow(row)
File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 718, in write
return self.writer.write(data)
File "C:\Users\phili\Anaconda3\envs\py35\lib\codecs.py", line 376, in write
data, consumed = self.encode(object, self.errors)
TypeError: Can't convert 'bytes' object to str implicitly
UnicodeEncodeError Traceback:
2017-01-19 08:00:18 [scrapy] ERROR: Error processing {'title': b'\n Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9b'
b'tna, Praha - Kr\xc4\x8d '}
Traceback (most recent call last):
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\twisted\internet\defer.py", line 651, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\phili\Documents\Python Scripts\Scrapy Spiders\bezrealitky\bezrealitky\pipelines.py", line 24, in process_item
self.exporter.export_item(item)
File "C:\Users\phili\Anaconda3\envs\py35\lib\site-packages\scrapy\exporters.py", line 198, in export_item
self.csv_writer.writerow(values)
File "C:\Users\phili\Anaconda3\envs\py35\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' in position 37: character maps to <undefined>
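Both exception types can be reproduced outside Scrapy with a few lines of standard-library Python. The sketch below is not the Scrapy code path itself; it only assumes that, somewhere along the way, bytes reach a file object that expects text, and text gets encoded with cp1252. The helper names and the temporary file are made up for illustration.

```python
import codecs
import os
import tempfile

# The bytes from the log line above; decoded as UTF-8 they give the Czech title.
raw = b'\n Ob\xc4\x9bt\xc3\xad 6. kv\xc4\x9btna, Praha - Kr\xc4\x8d '
title = raw.decode('utf-8')


def bytes_into_text_file(path):
    # A file opened via codecs.open(..., encoding=...) expects str on write();
    # handing it bytes raises TypeError.
    with codecs.open(path, 'wb', encoding='utf-8') as f:
        try:
            f.write(raw)
        except TypeError as exc:
            return exc
    return None


def encode_with_cp1252(text):
    # cp1252 has no mapping for 'ě' (U+011B), so encoding fails.
    try:
        text.encode('cp1252')
    except UnicodeEncodeError as exc:
        return exc
    return None


path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
print(type(bytes_into_text_file(path)).__name__)
print(type(encode_with_cp1252(title)).__name__)
```

The exact wording of the TypeError message differs between Python versions, but the exception types match the two tracebacks.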
Situation / Process:
I am running scrapy crawl bezrealitky (bezrealitky is the name of the spider). I have configured the pipeline with a CsvItemExporter example I found on the internet, and tried to adapt it to use UTF-8 when opening the file (at first I tried without specifying UTF-8, but I got the same error).
My pipeline code:
from scrapy.conf import settings
from scrapy.exporters import CsvItemExporter
import codecs


class CsvPipeline(object):
    def __init__(self):
        self.file = codecs.open("booksdata.csv", 'wb', encoding='UTF-8')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
My settings file:
BOT_NAME = 'bezrealitky'
SPIDER_MODULES = ['bezrealitky.spiders']
NEWSPIDER_MODULE = 'bezrealitky.spiders'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bezrealitky.pipelines.CsvPipeline': 300,
}
My spider code:
class BezrealitkySpider(scrapy.Spider):
    name = 'bezrealitky'
    start_urls = [
        'https://www.bezrealitky.cz/vypis/nabidka-prodej/byt/praha'
    ]

    def parse(self, response):
        item = BezrealitkyItem()
        items = []
        for records in response.xpath('//*[starts-with(@class,"record")]'):
            item['title'] = response.xpath('.//div[@class="details"]/h2/a[@href]/text()').extract()[1].encode('utf-8')
            items.append(item)
        return(items)
Solutions tried so far:
- Adding and removing .encode('utf-8') on the extract() call, and also in pipelines.py, but neither worked.
- Adding # -*- coding: utf-8 -*- to the top of the files, which didn't work either.
- Switching the Windows console to UTF-8 with these commands:
chcp 65001
set PYTHONIOENCODING=utf-8
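A quick way to see which encodings Python is actually using is sketched below. As far as I understand, PYTHONIOENCODING only affects stdin, stdout and stderr, while a file opened without an explicit encoding falls back to `locale.getpreferredencoding(False)`, which on a Western European Windows setup is typically cp1252 (the codec named in the traceback). The helper name `encoding_report` is made up for this example.

```python
import locale
import sys


def encoding_report():
    # Collect the encodings that matter when text is printed or written.
    return {
        # Default used for text files opened without an explicit encoding.
        'preferred': locale.getpreferredencoding(False),
        # Encoding of the console stream; this is what PYTHONIOENCODING changes.
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding used for file names, not for file contents.
        'filesystem': sys.getfilesystemencoding(),
    }


for name, value in encoding_report().items():
    print(name, value)
```

If 'preferred' reports cp1252 while 'stdout' reports utf-8, the console settings took effect but the default file encoding did not change.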
Conclusion:
I cannot get the scraped data written to the CSV file: the file is created, but it is empty. In the shell I can see that the data is scraped, but it isn't decoded / encoded properly, and an error is thrown before anything is written to the file.
I am a complete beginner with this and just trying to pick up Scrapy. I would really appreciate any help I can get!