I am currently writing a script to pull information off my site, which contains Japanese characters. So far my script pulls the data off the site.
It is returned as a string:
"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
Using an online hex to text tool, I get:
年に一度の晴れ姿
I know this phrase is correct, but my question is: how do I convert it in Python? When I run something like:
name = "\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
print(name)
I get this:
å¹´ã«ä¸åº¦ã®æ´ã姿
I've tried
name.decode("hex")
but it seems like Python 3.4 doesn't have str.decode(), so I tried converting it to a bytes object and decoding it that way, which still failed.
Edit 1:
Follow-up question, if you don't mind: the solution Martijn Pieters gave works:
name = "\xe2\x80\x9c\xe5\xa4\x8f\xe7\xa5\xad\xe3\x82\x8a\xe3\x83\x87\xe3\x83\xbc\xe3\x83\x88\xe2\x80\x9d\xe7\xb5\xa2\xe7\x80\xac \xe7\xb5\xb5\xe9\x87\x8c"
name = name.encode('latin1')
print(name.decode('utf-8'))
However if I have what's in the quotes for name in a file and I do this:
with open('0N.txt', mode='r', encoding='utf-8') as f:
    name = f.read()
name = name.encode('latin1')
print(name.decode('utf-8'))
It doesn't work...any ideas?
You are confusing the Python representation with the contents. You are shown \xhh
hex escapes, used in Python string literals to keep the displayed value ASCII-safe and reproducible.
You have UTF-8 data here:
>>> name = b"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
>>> name.decode('utf8')
'\u5e74\u306b\u4e00\u5ea6\u306e\u6674\u308c\u59ff'
>>> print(name.decode('utf8'))
年に一度の晴れ姿
Note that I used a bytes
string literal there, with the b'...'
prefix. If your data is not a bytes
object, you have a Mojibake and need to encode to bytes first:
name.encode('latin1').decode('utf8')
Latin-1 maps codepoints one-to-one to bytes, so it's usually a safe bet for such data. It could be that you have a Mojibake in a different codec; it depends on how you retrieved the data.
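As a quick check, that round trip can be exercised on the exact string from the question:

```python
# Mojibake: UTF-8 bytes that were mistakenly decoded as Latin-1 into a str
name = "\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"

# Re-encode with latin1 to recover the original bytes, then decode as UTF-8
fixed = name.encode('latin1').decode('utf8')
print(fixed)  # 年に一度の晴れ姿
```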
If you used open()
to read the data from a file, you either specified the wrong encoding
or relied on your platform default. Use open(filename, encoding='utf8')
to remedy that.
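This also explains the follow-up edit: if the file already contains properly encoded UTF-8 text, opening it with encoding='utf-8' yields the correct str directly, and the latin1 round trip is neither needed nor possible, because the Japanese codepoints don't exist in Latin-1. A small sketch (the file name 0N.txt is taken from the question; creating it here just makes the example self-contained):

```python
# Create a stand-in for 0N.txt containing real UTF-8 Japanese text
with open('0N.txt', 'w', encoding='utf8') as f:
    f.write('年に一度の晴れ姿')

# Reading it back with the correct encoding gives the proper str directly
with open('0N.txt', encoding='utf8') as f:
    name = f.read()

print(name)  # 年に一度の晴れ姿 -- no latin1 round trip needed
```

Trying name.encode('latin1') on that correctly decoded text raises UnicodeEncodeError, which is why the follow-up code failed.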
If you used the requests
library to load this from a website, take into account that the response.text
attribute uses latin-1
as the default codec if a) the site didn't specify a codec and b) the response has a text/*
MIME type. If this is sourced from HTML, the codec is usually part of the HTML headers instead. Use a library like BeautifulSoup to handle HTML (feeding it the response.content
raw bytes) and it'll detect such information for you.
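That latin-1 fallback can be sketched without a network call by decoding the same body bytes both ways; the body variable here simply stands in for response.content:

```python
# What the server actually sent: UTF-8 encoded bytes (stand-in for response.content)
body = '年に一度の晴れ姿'.encode('utf8')

# requests' latin-1 fallback for text/* responses without a declared charset;
# this mimics what response.text returns in that case -- a Mojibake
mojibake = body.decode('latin-1')

# Decoding the raw bytes with the real codec gives the correct text
print(body.decode('utf8'))  # 年に一度の晴れ姿
```

Since Latin-1 is a one-to-one byte mapping, the Mojibake is reversible: mojibake.encode('latin-1').decode('utf8') recovers the original text.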
If all else fails, the ftfy
library may still be able to fix a Mojibake; it uses specially constructed codecs to reverse common errors.