How can I get an output in UTF-8 encoded unicode f

Bear with me. I'm writing every detail because so many parts of the toolchain do not handle Unicode gracefully and it's not clear what is failing.

PRELUDE

We first set up and use a recent Scrapy.

source ~/.scrapy_1.1.2/bin/activate

Since the terminal defaults to ASCII rather than a Unicode encoding, we set the locale:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Since Python also falls back to ASCII by default, we override its I/O encoding:

export PYTHONIOENCODING="utf_8"

Now we're ready to start a Scrapy project.

scrapy startproject myproject
cd myproject
scrapy genspider dorf PLACEHOLDER

We're told we now have a spider.

Created spider 'dorf' using template 'basic' in module:
  myproject.spiders.dorf

We modify myproject/items.py to be:

# -*- coding: utf-8 -*-
import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()

ATTEMPT 1

Now we write the spider, relying on urllib.unquote:

# -*- coding: utf-8 -*-
import scrapy
import urllib
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'http://en.sistercity.info/']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = urllib.unquote(
            response.xpath('//title').extract_first().encode('ascii')
        ).decode('utf8')
        return item

And finally we use a custom item exporter (from all the way back in Oct 2011):

# -*- coding: utf-8 -*-
import json
from scrapy.exporters import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')

and add

FEED_EXPORTERS = {
    'json': 'myproject.exporters.UnicodeJsonLinesItemExporter',
}

to myproject/settings.py.

Now we run

~/myproject> scrapy crawl dorf -o dorf.json -t json

we get

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 25: ordinal not in range(128)
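The position reported in that message is a useful clue. As a minimal sketch (Python 3, standard library only; the title string is copied from the Attempt 2 output further down in this question, so it is an assumption about what the page returned), encoding the extracted title as ASCII fails at the same index. That suggests the `.encode('ascii')` call in the spider, rather than the exporter, is what raises:

```python
# Title text as it appears in the Attempt 2 output (assumed to match the page).
title = u"<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"

try:
    title.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# The first character outside ASCII is the u-umlaut at index 25,
# the same position the traceback reports.
print(failure.start, repr(title[failure.start]))
```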

ATTEMPT 2

Another solution (the candidate solution for Scrapy 1.2?) is to use the spider:

# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'http://en.sistercity.info/']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = response.xpath('//title')[0].extract()
        return item

and the custom item exporter

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)

with

FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}

in myproject/settings.py.

We get the following JSON file.

[
{"title": "<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"}
]

The non-ASCII characters are written as \uXXXX escape sequences rather than as UTF-8 encoded text. Although this is a trivial problem for a couple of characters, it becomes a serious issue if the entire output is in a foreign language.
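To make the distinction concrete, here is a small sketch using only the standard `json` module under Python 3 (not Scrapy itself; the sample item is illustrative). Both forms are valid JSON and decode to the same data; `ensure_ascii=False` only changes how non-ASCII characters are written out:

```python
import json

# Illustrative item; the title mirrors the output shown above.
item = {u"title": u"Sister cities of D\u00fcsseldorf \u2014 sistercity.info"}

escaped = json.dumps(item)                       # ensure_ascii=True is the default
readable = json.dumps(item, ensure_ascii=False)  # keep non-ASCII characters as-is

# What would land on disk if each string were written as UTF-8.
escaped_bytes = escaped.encode("utf-8")
readable_bytes = readable.encode("utf-8")

# Either representation parses back to the same Python object.
same_data = json.loads(escaped) == json.loads(readable)

print(escaped)
print(readable)
```

The escaped form is pure ASCII, while the readable form contains the multi-byte UTF-8 sequences for the umlaut and the dash.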

How can I get UTF-8 encoded Unicode output?

Tags: scrapy

2 Answers

成全新的幸福

#2 · 2020-04-08 14:15

Please try this with your Attempt 1 and let me know if it works (I've tested it without setting any of those environment variables):

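import urllib  # Python 2 module; provides urllib.unquote

import scrapy

# SimpleItem is not shown in this answer; it is assumed to be a
# scrapy.Item with 'title' and 'url' fields.


def to_write(uni_str):
    # Percent-decode the UTF-8 bytes, then decode the result back to unicode.
    return urllib.unquote(uni_str.encode('utf8')).decode('utf8')


class CitiesSpider(scrapy.Spider):
    name = "cities"
    allowed_domains = ["sistercity.info"]
    start_urls = (
        'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        # Yield the same item twice so the exporter is tested with a list.
        for i in range(2):
            item = SimpleItem()
            item['title'] = to_write(response.xpath('//title').extract_first())
            item['url'] = to_write(response.url)
            yield item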

The range(2) loop is only there to test the JSON exporter with more than one item. To get a list of dicts, you can use this exporter instead:

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter  # scrapy.contrib.exporter is the deprecated pre-1.0 path
from scrapy.utils.serialize import ScrapyJSONEncoder

class UnicodeJsonLinesItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(ensure_ascii=False, **kwargs)
        self.first_item = True
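The to_write helper above targets Python 2. As a rough Python 3 counterpart (a sketch, not part of the original answer; the percent-encoded URL is an assumed example of what `response.url` might contain), `urllib.parse.unquote` works on text directly and decodes the escapes as UTF-8 by default:

```python
from urllib.parse import unquote


def to_write(text):
    # unquote() interprets %XX escapes as UTF-8 unless told otherwise.
    return unquote(text)


# Assumed example: the start URL after percent-encoding of the umlaut.
url = "http://en.sistercity.info/sister-cities/D%C3%BCsseldorf.html"
decoded = to_write(url)
print(decoded)
```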

戒情不戒烟

#3 · 2020-04-08 14:26

In Scrapy 1.2+ there is a FEED_EXPORT_ENCODING setting. With FEED_EXPORT_ENCODING = "utf-8", non-ASCII characters in JSON output are no longer escaped.
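As a minimal sketch, assuming the project layout from the question, the setting would go in myproject/settings.py. With it in place, the custom exporters and the FEED_EXPORTERS override from the two attempts should no longer be needed:

```python
# myproject/settings.py (fragment)
# Requires Scrapy 1.2 or later; with older versions this setting has no effect.
FEED_EXPORT_ENCODING = "utf-8"
```

The same value can also be passed for a single run with Scrapy's `-s` option, for example `scrapy crawl dorf -o dorf.json -s FEED_EXPORT_ENCODING=utf-8`.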
