Python unicode behaviour in Google App Engine

Posted 2019-04-17 13:30

I'm completely confused by GAE. I have a script that sends a POST request (using urlfetch from the Google App Engine API); the response is a cp1251-encoded HTML page.

Then I decode it using .decode('cp1251') and parse it with lxml.

My code works totally fine on my local machine:

import re
import leaf #simple wrapper for lxml
weekdaysD={u'понедельник':1, u'вторник':2, u'среда':3, u'четверг':4, u'пятница':5, u'суббота':6}
document = leaf.parse(leaf.strip_symbols(leaf.strip_accents(html_in_cp1251.decode('cp1251'))))
table=document.get('table')
trs=table('tr') #leaf syntax
for tr in trs:
    tds=tr.xpath('td')
    for td in tds:
        if td.colspan=='3':
            curweek=re.findall(r'\w+(?=\-)', td.text)[0]
            curday=weekdaysD[td.text.split(u',')[0]]

But when I deploy it to GAE, I get:

curday=weekdaysD[td.text.split(u',')[0]]
KeyError: u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

How did non-Unicode characters get in there at all? And why does everything work locally? I've tried every combination of decode and encode calls in my code, and nothing helped. I've been stuck for a few days now.

UPD: Also, if I add this to my script on GAE:

print type(weekdaysD.keys()[0]), type(td.text.split(u',')[0]) 

It reports both as 'unicode'. So I believe the HTML was decoded correctly. Could it be something with lxml on GAE?

2 Answers
虎瘦雄心在
Reply #2 · 2019-04-17 14:01

Well, the workaround of adding .encode('latin1').decode('utf-8', 'ignore') did the trick. I wish I could explain why it behaves this way.
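
For anyone who wants to apply the same workaround in one place, here is a minimal sketch. The helper name repair_mojibake is made up for illustration, and unlike the one-liner above it decodes strictly and falls back to the original text instead of passing 'ignore', so a string that was never damaged is returned unchanged rather than silently truncated.

```python
# -*- coding: utf-8 -*-
# Sketch only: repair_mojibake is a hypothetical helper, not part of lxml, leaf or GAE.

def repair_mojibake(text):
    """Undo a latin-1 mis-decoding of UTF-8 bytes.

    If the text cannot be encoded as latin-1, or the resulting bytes are not
    valid UTF-8, the text is assumed to be fine already and returned as is.
    """
    try:
        # Turn each code point U+0000..U+00FF back into the byte it came from,
        # then decode those bytes as the UTF-8 they really are.
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The key from the KeyError in the question.
broken = u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'
fixed = repair_mojibake(broken)

# A correctly decoded Cyrillic string cannot be encoded as latin-1,
# so it passes through untouched.
untouched = repair_mojibake(u'вторник')
```

With this, something like weekdaysD[repair_mojibake(td.text.split(u',')[0])] would look up the intended key. It treats the symptom only; it does not explain where the wrong decoding happens.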

Bombasti
Reply #3 · 2019-04-17 14:19

The string in your error message has type unicode, but its contents are actually the bytes of the UTF-8 encoding of вторник, with each byte stored as a separate code point. It would help if you showed us the code that makes the urlfetch call, since there is nothing wrong with the code you posted.
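
To make that concrete, here is a small sketch that reproduces the exact key from the KeyError. It only shows what the string contains. Where in the real pipeline the bytes get treated as latin-1 is an open question; the decode('latin1') call below is just one way to build such a string and is my assumption, not something the question shows.

```python
# -*- coding: utf-8 -*-
# Sketch: how a unicode string can end up holding UTF-8 bytes as code points.

word = u'вторник'                        # 7 code points, all Cyrillic
utf8_bytes = word.encode('utf-8')        # 14 bytes, two per Cyrillic letter

# Decoding UTF-8 bytes as latin-1 never fails: every byte maps to the
# code point with the same number, so the result is a 14-character string.
mojibake = utf8_bytes.decode('latin1')

# Trimmed-down version of the dictionary from the question.
weekdays = {u'понедельник': 1, u'вторник': 2}

# Both values are text strings (unicode in Python 2), which is why the
# type() check in the question looks fine, yet they are different keys.
same_type = type(word) is type(mojibake)
found = mojibake in weekdays
```

Here mojibake is exactly u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba', same_type is True and found is False. Printing repr() of the value instead of its type would have exposed the problem right away.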
