Python unicode behaviour in Google App Engine

Posted 2019-04-17 14:08

Question:

I am completely confused by GAE. I have a script that makes a POST request (using urlfetch from the Google App Engine API), and the response is a cp1251-encoded HTML page.

Then I decode it using .decode('cp1251') and parse it with lxml.

My code works totally fine on my local machine:

import re
import leaf  # simple wrapper for lxml
# Russian weekday names mapped to day numbers
weekdaysD={u'понедельник':1, u'вторник':2, u'среда':3, u'четверг':4, u'пятница':5, u'суббота':6}
document = leaf.parse(leaf.strip_symbols(leaf.strip_accents(html_in_cp1251.decode('cp1251'))))
table=document.get('table')
trs=table('tr')  # leaf syntax
for tr in trs:
    tds=tr.xpath('td')
    for td in tds:
        if td.colspan=='3':
            # the word just before the hyphen, then the weekday name before the comma
            curweek=re.findall(r'\w+(?=\-)', td.text)[0]
            curday=weekdaysD[td.text.split(u',')[0]]

But when I deploy it to GAE, I get:

curday=weekdaysD[td.text.split(u',')[0]]
KeyError: u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

How did non-Unicode characters get in there at all? And why does everything work locally? I've tried every combination of decoding and encoding calls in my code, and nothing helped. I've been stuck on this for a few days now.

UPD: Also, if I add this to my script on GAE:

print type(weekdaysD.keys()[0]), type(td.text.split(u',')[0])

It reports both as 'unicode', so I believe the HTML was decoded correctly. Could it be something to do with lxml on GAE?

Answer 1:

The string in your error message has type unicode, but its contents are actually the bytes of the UTF-8 encoding of вторник, each byte stored as a separate code point. It would help if you showed us the code that makes the urlfetch call, since there is nothing wrong with the code you have shown.
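
A minimal sketch of how such a string can arise, assuming (the question does not confirm this) that UTF-8 bytes were decoded as Latin-1 somewhere between the fetch and the dictionary lookup. The variable names are illustrative only:

```python
# The expected dictionary key, written with escapes so the file stays ASCII.
expected = u'\u0432\u0442\u043e\u0440\u043d\u0438\u043a'

# Its UTF-8 encoding is 14 bytes: d0 b2 d1 82 d0 be d1 80 d0 bd d0 b8 d0 ba
utf8_bytes = expected.encode('utf-8')

# Assumption: some layer decodes those bytes as Latin-1, which turns every
# byte into the code point with the same numeric value.
mojibake = utf8_bytes.decode('latin1')

# The result is a text string, but it holds 14 code points instead of 7.
same_type = type(mojibake) is type(expected)
matches_error = (
    mojibake == u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'
)
print(same_type, matches_error, len(expected), len(mojibake))
```

Both values have the same type, which is why a type() check cannot tell them apart; comparing repr() output or lengths shows the difference.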



Answer 2:

Well, the workaround of adding .encode('latin1').decode('utf-8', 'ignore') did the trick. I wish I could explain why it behaves this way.
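
A possible explanation, sketched in code: Latin-1 maps code points U+0000 to U+00FF one-to-one onto bytes 0x00 to 0xFF, so encoding the broken string with it recovers the original bytes, which can then be decoded as UTF-8. The helper name repair_mojibake is hypothetical. Unlike the workaround above, it decodes strictly and falls back to the input instead of using 'ignore', which silently drops bytes that are not valid UTF-8:

```python
def repair_mojibake(text):
    """Undo a 'UTF-8 bytes read as Latin-1' mix-up (hypothetical helper).

    Returns the input unchanged when it does not look like that mix-up.
    """
    try:
        # Latin-1 turns each code point below 256 back into one raw byte;
        # those bytes are then decoded as the UTF-8 they originally were.
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above U+00FF (already proper
        # Cyrillic, for example) or the bytes are not valid UTF-8.
        return text


broken = u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'
correct = u'\u0432\u0442\u043e\u0440\u043d\u0438\u043a'

fixed = repair_mojibake(broken)
untouched = repair_mojibake(correct)
print(fixed == correct, untouched == correct)
```

This only treats the symptom; if the mix-up happens in the fetch or parse step, fixing the encoding there would be the cleaner solution.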