Converting a Latin string to Unicode in Python

I am working with Scrapy. I scraped some sites and stored the items from the scraped pages in JSON files, but some of the strings come out in the following format.

l = ["Holding it Together",
     "Fowler RV Trip",
     "S\u00e9n\u00e9gal - Mali - Niger","H\u00eatres et \u00e9tang",
     "Coll\u00e8ge marsan","N\u00b0one",
     "Lines through the days 1 (Arabic) \u0633\u0637\u0648\u0631 \u0639\u0628\u0631 \u0627\u0644\u0623\u064a\u0627\u0645 1",
     "\u00cdndia, Tail\u00e2ndia &amp; Cingapura"]

I expect the list to contain strings in different formats, but I want to convert them and store the strings in the list with their original names, like below:

l = ["Holding it Together",
     "Fowler RV Trip",
     "Lines through the days 1 (Arabic) سطور عبر الأيام 1 | شمس الدين خ | Blogs",
     "Índia, Tailândia & Cingapura "]

Thanks in advance.

Tags: python unicode scrapy latin

2 Answers

爱情/是我丢掉的垃圾

#2 · 2019-02-11 02:01

i want to convert that and store the strings in the list with their original names like below

When you serialise to JSON, there may be a flag that lets you turn off the escaping of non-ASCII characters to \u sequences. If you are using the standard library json module, the flag is ensure_ascii:

>>> import json
>>> print json.dumps(u'Índia')
"\u00cdndia"
>>> print json.dumps(u'Índia', ensure_ascii=False)
"Índia"

However, be aware that with that safety measure removed you have to handle non-ASCII characters correctly yourself, or you'll get UnicodeErrors. For example, if you are writing the JSON to a file, you must explicitly encode the Unicode string to the charset you want (for example UTF-8).

import json
j = json.dumps(u'Índia', ensure_ascii=False)
with open('file.json', 'wb') as f:  # the with block closes the file for us
    f.write(j.encode('utf-8'))
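
On Python 3, the manual encode step can be replaced by opening the file in text mode with an explicit encoding. The following is only a minimal sketch, not the answerer's code: the `titles` list and the temporary `items.json` path are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample data; in Python 3 these are ordinary Unicode str objects.
titles = ["Holding it Together", "Sénégal - Mali - Niger", "Collège marsan"]

# Default behaviour escapes non-ASCII characters as \uXXXX sequences.
escaped = json.dumps(titles)
# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(titles, ensure_ascii=False)

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "items.json")

# Text mode with an explicit encoding: the file object does the encoding.
with open(path, "w", encoding="utf-8") as f:
    json.dump(titles, f, ensure_ascii=False)

# Reading it back with the same encoding returns the original strings.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(escaped)
print(readable)
```

Both `escaped` and `readable` are valid JSON and load back to the same list, so the \u sequences in an existing file are not data loss: `json.load` turns them back into the original characters. If the files come from Scrapy's feed exports rather than hand-written code, recent Scrapy versions document a `FEED_EXPORT_ENCODING` setting for the same purpose; check the documentation for the version you run.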


不美不萌又怎样

#3 · 2019-02-11 02:03

You have byte strings containing unicode escapes. You can convert them to unicode with the unicode_escape codec:

>>> print "H\u00eatres et \u00e9tang".decode("unicode_escape")
Hêtres et étang
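
The session above is Python 2, where a plain string literal is a byte string. A rough Python 3 equivalent is sketched below, assuming the text really contains literal backslash-u sequences (simulated here with a raw string). The variable names are illustrative only.

```python
# A raw string keeps the backslashes as literal characters, which mimics
# text read from a file that contains \uXXXX escape sequences.
raw = r"H\u00eatres et \u00e9tang"

# Python 3 str has no .decode(), so go through bytes first.
# Encoding as ASCII is safe here because the sample is pure ASCII.
decoded = raw.encode("ascii").decode("unicode_escape")

print(decoded)
```

One caveat: the `unicode_escape` codec interprets any non-escaped bytes as Latin-1, so it is best suited to input that is pure ASCII plus escapes, as in this sample.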

And you can encode it back to byte strings:

>>> s = "H\u00eatres et \u00e9tang".decode("unicode_escape")
>>> s.encode("latin1")
'H\xeatres et \xe9tang'
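
Latin-1 works for the French title, but it cannot represent every string in the question's list, for example the Arabic one. The sketch below (Python 3, with illustrative variable names) shows the difference and why UTF-8 is usually the safer target encoding.

```python
french = "Hêtres et étang"
arabic = "سطور عبر الأيام"

# Latin-1 covers Western European characters in one byte each.
latin1_bytes = french.encode("latin-1")
# UTF-8 uses two bytes for each of these accented characters.
utf8_bytes = french.encode("utf-8")

# Arabic letters are outside Latin-1, so encoding raises an error.
try:
    arabic.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can encode any Unicode string.
arabic_utf8 = arabic.encode("utf-8")

print(latin1_bytes)
print(utf8_bytes)
print(latin1_ok)
```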

You can filter and decode the non-unicode strings like this:

for s in l:
    # only byte strings still carry the escapes; unicode objects are left alone
    if not isinstance(s, unicode):
        print s.decode('unicode_escape')
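
The loop above relies on Python 2's `unicode` type. On Python 3 every `str` is already Unicode, so that type check no longer distinguishes anything. One possible alternative, sketched here under the assumption that some strings contain literal backslash-u sequences, is to replace only those sequences with a regular expression. This is a swapped-in technique rather than the `unicode_escape` codec used above, and the helper name `decode_escapes` and the sample `items` are hypothetical.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they name.

    Characters that are already decoded are left untouched. Surrogate
    pairs (characters beyond U+FFFF) are not recombined by this sketch.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Hypothetical mixed input: raw strings hold literal escapes,
# the other entries are already proper text.
items = [
    "Holding it Together",
    r"Coll\u00e8ge marsan",
    r"N\u00b0one",
    "Índia",
]

cleaned = [decode_escapes(s) for s in items]
print(cleaned)
```

Because only the escape sequences are touched, strings that mix escapes with already-decoded characters pass through without being mangled.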

