In a text file, there is a string "I don't like this".
However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I use
f1 = open(file1, "r")
text = f1.read()
to do the reading.
Now, is it possible to read the file in such a way that the resulting string is "I don't like this", instead of "I don\xe2\x80\x98t like this"?
Second edit: I have seen some people use a mapping to solve this problem, but is there really no built-in conversion that does this kind of ANSI to Unicode (and vice versa) conversion?
Actually, U+2018 is the Unicode representation of the special character ‘ . If you want, you can convert instances of that character to U+0027 with this code:
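A minimal sketch of that replacement, assuming Python 3 and that the text has already been decoded into a str. The sample string is only illustrative:

```python
# Sample text containing U+2018 (LEFT SINGLE QUOTATION MARK).
text = "I don\u2018t like this"

# Replace every U+2018 with U+0027 (the plain ASCII apostrophe).
fixed = text.replace("\u2018", "'")

print(fixed)
```

The same call can be repeated for U+2019 if the file also contains right single quotation marks.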
In addition, what are you using to write the file?
f1.read() should return a string that looks like this:
If it's returning this string, the file is being written incorrectly:
But it really is "I don\u2018t like this" and not "I don't like this". The character u'\u2018' is a completely different character than "'" (and, visually, should correspond more to '`').
If you're trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.
There are an awful lot of punctuation characters in unicode, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you're reading.
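The mapping idea above could be sketched like this in Python 3. The table PUNCTUATION_MAP and the helper to_ascii_punctuation are hypothetical names, and the choice of characters is an assumption; extend it with whatever actually shows up in your documents:

```python
# Hypothetical mapping from a few Unicode punctuation code points to ASCII.
PUNCTUATION_MAP = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2026: "...",  # horizontal ellipsis
}


def to_ascii_punctuation(text):
    """Return text with the mapped punctuation replaced by ASCII equivalents."""
    # str.translate accepts a dict keyed by code point; values may be strings.
    return text.translate(PUNCTUATION_MAP)


converted = to_ascii_punctuation("I don\u2018t like \u201cthis\u201d\u2026")
print(converted)
```

Characters that are not in the table pass through unchanged, so the result is only guaranteed to be ASCII if the table covers everything in the input.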
There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:
This actually happened to me once before. You can use the unicode_escape codec to decode the string to unicode and then encode it to any format you want:
Ref: http://docs.python.org/howto/unicode
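For the unicode_escape case described above, a minimal Python 3 sketch might look like this. It assumes the string holds a literal backslash-u sequence and is otherwise pure ASCII; the variable names are illustrative:

```python
# A str holding a literal escape (the six characters \ u 2 0 1 8)
# rather than the single character U+2018.
raw = "I don\\u2018t like this"

# Encode to bytes, then let the unicode_escape codec interpret the escape.
# This assumes raw is pure ASCII; non-ASCII input needs more care.
decoded = raw.encode("ascii").decode("unicode_escape")

# Re-encode to whatever byte encoding is wanted, UTF-8 in this sketch.
utf8_bytes = decoded.encode("utf-8")

print(ascii(decoded))
print(utf8_bytes)
```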
Reading Unicode from a file is therefore simple:
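A self-contained sketch of such a read using codecs.open, written for Python 3. The temporary file and its contents are assumptions made so the example can run on its own:

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"I don\xe2\x80\x98t like this\n")

# codecs.open decodes the bytes while reading, so each line is Unicode text.
with codecs.open(path, encoding="utf-8") as f:
    lines = [line for line in f]

# ascii() shows non-ASCII characters as escapes, which is safe on any console.
print(ascii(lines[0]))
```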
It's also possible to open files in update mode, allowing both reading and writing:
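A sketch of update mode, again assuming Python 3 and a throwaway temporary file. The sample text is illustrative:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "update.txt")

# mode="w+" creates (or truncates) the file and allows writing and reading.
f = codecs.open(path, encoding="utf-8", mode="w+")
try:
    f.write("\u4500 blah blah blah\n")
    f.seek(0)                      # rewind to the start before reading back
    first_char = f.readline()[:1]  # first decoded character of the line
finally:
    f.close()

print(ascii(first_char))
```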
EDIT: I'm assuming that your intended goal is just to be able to read the file properly into a string in Python. If you're trying to convert to an ASCII string from Unicode, then there's really no direct way to do so, since the Unicode characters won't necessarily exist in ASCII.
If you're trying to convert to an ASCII string, try one of the following:
1. Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example.
2. Use the unicodedata module's normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python).
It is also possible to read an encoded text file using the python 3 read method:
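A minimal sketch of the Python 3 approach, passing the encoding straight to the built-in open. The temporary file is an assumption made to keep the example runnable:

```python
import os
import tempfile

# Create a small UTF-8 file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"I don\xe2\x80\x98t like this")

# In Python 3 the built-in open() takes an encoding argument in text mode,
# so read() returns already-decoded text.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(ascii(text))
```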
With this variation, there is no need to import any additional libraries
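For the unicodedata route listed in the options above, a best-effort sketch in Python 3 could look like the following. The helper name fold_to_ascii is hypothetical. Note that this folds accented letters to their base letters, but a character such as U+2018 has no ASCII decomposition and is simply dropped, so it does not replace the quote with an apostrophe:

```python
import unicodedata


def fold_to_ascii(text):
    """Best-effort ASCII folding: decompose, then drop what ASCII cannot hold."""
    # NFKD splits characters like e-acute into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" discards every code point that has no ASCII encoding.
    return decomposed.encode("ascii", "ignore").decode("ascii")


accented = fold_to_ascii("caf\u00e9 na\u00efve")
quoted = fold_to_ascii("I don\u2018t like this")

print(accented)
print(quoted)
```

If the curly quote must become an apostrophe rather than disappear, replace it explicitly before folding.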
There are a few points to consider.
A \u2018 sequence may appear only as a fragment of the representation of a unicode string in Python, e.g. if you write:
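A small illustration of that point, assuming Python 3, where ascii() plays roughly the role that repr() played for unicode strings in Python 2:

```python
# A string holding exactly one character, U+2018.
text = "\u2018"

# The string itself is one character long...
length = len(text)

# ...but its escaped representation spells the character out as \u2018.
escaped = ascii(text)

print(length)
print(escaped)
```

The backslash, the u, and the digits exist only in the printed representation, not in the string data.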
Now if you simply want to print the unicode string prettily, just use unicode's encode method:
To make sure that every line from any file is read as unicode, you'd better use the codecs.open function instead of just open, which allows you to specify the file's encoding:
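The following Python 3 sketch contrasts a raw binary read with a decoding read through codecs.open, and then uses encode to turn the text back into bytes for output. The temporary file is an assumption made so the example is self-contained:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"I don\xe2\x80\x98t like this")

# Reading in binary mode returns the undecoded UTF-8 bytes.
with open(path, "rb") as f:
    raw_bytes = f.read()

# codecs.open with an explicit encoding decodes them into Unicode text.
with codecs.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# encode() converts the text back into bytes in a chosen encoding.
encoded = text.encode("utf-8")

print(raw_bytes)
print(ascii(text))
```

The \xe2\x80\x98 sequence from the question is what the three UTF-8 bytes of U+2018 look like before decoding.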