Python, Scrapy: bad UTF-8 characters written to file

Posted 2019-08-15 03:31

Question:

I want to scrape a webpage with charset iso-8859-1 using Scrapy, in Python 2.7. The text I'm interested in on the webpage is: tempête

Scrapy returns the response as a unicode string, with the characters apparently encoded correctly as UTF-8:

>>> response
u'temp\xc3\xaate'

Now I want to write the word tempête to a file, so I'm doing the following:

>>> import codecs
>>> file = codecs.open('test', 'a', encoding='utf-8')
>>> file.write(response) #response is the above var

When I open the file, the resulting text is tempÃªte. It seems that Python does not detect the proper encoding: instead of reading one character encoded on two bytes, it treats the two bytes as two separate one-byte characters.

How can I handle this simple use case?

Answer 1:

In your example, response is a (decoded) Unicode string with \xc3\xaa inside, so something is wrong at Scrapy's encoding detection level.

\xc3\xaa is the character ê encoded as UTF-8, so you should only see those two bytes in (encoded) non-Unicode str strings (in Python 2, that is).

Python 2.7 shell session:

>>> # what your input should look like
>>> tempete = u'tempête'
>>> tempete
u'temp\xeate'

>>> # UTF-8 encoded
>>> tempete.encode('utf-8')
'temp\xc3\xaate'
>>>
>>> # latin1 encoded
>>> tempete.encode('iso-8859-1')
'temp\xeate'
>>> 

>>> # back to your sample
>>> s = u'temp\xc3\xaate'
>>> print s
tempÃªte
>>>
>>> # if you use a non-Unicode string with those characters...
>>> s_raw = 'temp\xc3\xaate'
>>> s_raw.decode('utf-8')
u'temp\xeate'
>>> 
>>> # ... decoding from UTF-8 works
>>> print s_raw.decode('utf-8')
tempête
>>> 

So Scrapy is interpreting the page as iso-8859-1 encoded, while the bytes it received are actually UTF-8.
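
To see where the extra characters in the file come from, here is a minimal sketch that reproduces the problem without Scrapy. It assumes the page bytes are really UTF-8 but get decoded as iso-8859-1; the file name test_mojibake.txt is made up for illustration. It uses io.open so that it should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Bytes as the server would send them if the page is really UTF-8.
page_bytes = u"temp\xeate".encode("utf-8")    # 'temp\xc3\xaate'

# Decoding with the wrong charset turns every byte into one character.
wrong = page_bytes.decode("iso-8859-1")       # u'temp\xc3\xaate'
right = page_bytes.decode("utf-8")            # u'temp\xeate'

# Writing the wrongly decoded text as UTF-8 encodes each of the two
# bogus characters (U+00C3 and U+00AA) again, as two bytes each.
path = os.path.join(tempfile.mkdtemp(), "test_mojibake.txt")  # hypothetical name
with io.open(path, "w", encoding="utf-8") as f:
    f.write(wrong)

with io.open(path, "rb") as f:
    written = f.read()                        # 'temp\xc3\x83\xc2\xaate'
```

The file ends up with four bytes where two were expected, which is what a UTF-8 aware editor shows as Ãª.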

You can force the encoding by re-building a response from response.body:

>>> import scrapy.http
>>> hr1 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='latin1')
>>> hr1.body_as_unicode()
u'<html><body>temp\xc3\xaate</body></html>'
>>> hr2 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='utf-8')
>>> hr2.body_as_unicode()
u'<html><body>temp\xeate</body></html>'
>>> 

Build a new response:

newresponse = response.replace(encoding='utf-8')

and work with newresponse instead.



Answer 2:

You need to encode your response as iso-8859-1 first, then decode it as UTF-8, before writing it to a file opened as UTF-8:

response = u'temp\xc3\xaate'
r1 = response.encode('iso-8859-1')
r2 = r1.decode('utf-8')
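
A slightly more defensive version of this round trip could look like the sketch below. The helper name fix_mojibake is made up for illustration, and the file is written to a temporary directory rather than to the asker's test file. The encode step raises UnicodeEncodeError when the text contains characters outside iso-8859-1, and the decode step raises UnicodeDecodeError when the resulting bytes are not valid UTF-8; in both cases the text was probably not mis-decoded, so the sketch returns it unchanged.

```python
import io
import os
import tempfile


def fix_mojibake(text):
    """Undo 'UTF-8 bytes decoded as iso-8859-1'; return text as-is if that fails."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


response = u"temp\xc3\xaate"      # the mis-decoded value from the question
fixed = fix_mojibake(response)    # u'temp\xeate'

# Append to a UTF-8 file, as in the question (io.open works on 2.7 and 3).
path = os.path.join(tempfile.mkdtemp(), "test")
with io.open(path, "a", encoding="utf-8") as f:
    f.write(fixed)

with io.open(path, "rb") as f:
    written = f.read()            # 'temp\xc3\xaate'
```

This is a repair applied after the fact. A correctly decoded string that happens to survive both steps would be altered, so fixing the encoding on the response itself is the safer route when that is possible.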

Interesting read: http://farmdev.com/talks/unicode/