Python 3.4 hex to Japanese Characters

Posted 2019-04-12 11:23

I am currently writing a script to pull information off my site which contains Japanese characters. So far I have my script pulling out the data off the site.

The data is returned as a string:

"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf" 

Using an online hex-to-text tool, I get:

年に一度の晴れ姿

I know this phrase is correct, but my question is: how do I convert it in Python? When I run something like:

name = "\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
print(name)

I get this:

å¹´ã«ä¸åº¦ã®æ´ã姿

I've tried to

name.decode("hex")

But it seems like Python 3.4 doesn't have str.decode(), so I tried to convert it to a bytes object and decode it that way, which still failed.

Edit 1:

Follow-up question, if you don't mind: the solution Martijn Pieters gave works:

name = "\xe2\x80\x9c\xe5\xa4\x8f\xe7\xa5\xad\xe3\x82\x8a\xe3\x83\x87\xe3\x83\xbc\xe3\x83\x88\xe2\x80\x9d\xe7\xb5\xa2\xe7\x80\xac \xe7\xb5\xb5\xe9\x87\x8c"
name = name.encode('latin1')
print(name.decode('utf-8'))

However if I have what's in the quotes for name in a file and I do this:

with open('0N.txt', mode='r', encoding='utf-8') as f:
    name = f.read()
name = name.encode('latin1')
print(name.decode('utf-8'))

It doesn't work...any ideas?

1 Answer
爷、活的狠高调
Answered 2019-04-12 12:08

You are confusing the Python representation with the contents. You are being shown \xhh hex escapes, which Python uses in string literals to keep the displayed value ASCII-safe and reproducible.

You have UTF-8 data here:

>>> name = b"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
>>> name.decode('utf8')
'\u5e74\u306b\u4e00\u5ea6\u306e\u6674\u308c\u59ff'
>>> print(name.decode('utf8'))
年に一度の晴れ姿

Note that I used a bytes() string literal there, using b'...'. If your data is not a bytes object, you have a Mojibake and need to encode to bytes first:

name.encode('latin1').decode('utf8')

Latin-1 maps the first 256 codepoints one-to-one to bytes, so it's usually a safe bet to use in case of such data. It could be that your Mojibake involves a different codec; it depends on how you retrieved the data.
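A minimal sketch of that round trip, using the string from the question (the variable names are illustrative):

```python
# A str that was decoded with the wrong codec: each character maps
# one-to-one to an original UTF-8 byte (all codepoints are below U+0100).
mojibake = "\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"

# Encoding as latin1 turns those codepoints back into the raw bytes...
raw = mojibake.encode('latin1')

# ...which then decode correctly as UTF-8.
fixed = raw.decode('utf8')
print(fixed)  # 年に一度の晴れ姿
```

If the string contains any character above U+00FF, the encode('latin1') step will raise UnicodeEncodeError, which is a strong hint the data was not a byte-for-byte Mojibake in the first place.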

If you used open() to read the data from a file, you either specified the wrong encoding or relied on your platform default. Use open(filename, encoding='utf8') to remedy that.
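A self-contained sketch of reading such a file with the correct encoding; a temporary file stands in for 0N.txt here:

```python
import os
import tempfile

# Write the raw UTF-8 bytes to a file, standing in for the scraped data.
data = '年に一度の晴れ姿'.encode('utf8')
path = os.path.join(tempfile.mkdtemp(), '0N.txt')
with open(path, 'wb') as f:
    f.write(data)

# Opening with the correct encoding decodes the text in one step.
with open(path, encoding='utf8') as f:
    name = f.read()
print(name)  # 年に一度の晴れ姿
```

Note that once the file is decoded correctly this way, the string is already right, and applying the encode('latin1')/decode('utf-8') round trip on top of it will fail with UnicodeEncodeError (the kanji are not in Latin-1) — which may be why the follow-up snippet in the question doesn't work.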

If you used the requests library to load this from a website, take into account that the response.text attribute uses latin-1 as the default codec if (a) the site didn't specify a codec and (b) the response has a text/* MIME type. If the data comes from HTML, the codec is usually declared in the HTML headers instead. Use a library like BeautifulSoup to handle HTML (feeding it the response.content raw bytes), and it'll detect such information for you.

If all else fails, the ftfy library may still be able to fix a Mojibake; it uses specially constructed codecs to reverse common errors.
