I have to write a script that supports reading a file that may have been saved as either Unicode or ANSI (using MS Notepad).
The file itself gives no indication of its encoding. How can I support both formats? (I am looking for a generic way of reading files without knowing the format in advance.)
MS Notepad gives the user a choice of 4 encodings, expressed in clumsy, confusing terminology:
"Unicode" is UTF-16, written little-endian. "Unicode big endian" is UTF-16, written big-endian. In both UTF-16 cases, this means that the appropriate BOM will be written. Use utf-16 to decode such a file.
"UTF-8" is UTF-8; Notepad explicitly writes a "UTF-8 BOM". Use utf-8-sig to decode such a file.
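The behaviour of these two codecs on BOM-prefixed data can be checked without Notepad. The following is a minimal sketch in Python 3 syntax (an assumption; the demo code in this answer targets Python 2). The byte strings are built by hand to imitate what Notepad writes, and the variable names are purely illustrative.

```python
import codecs

text = "fox \u00e0\u00e1"

# Hand-built imitations of Notepad output: a BOM followed by the encoded text.
unicode_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # "Unicode"
unicode_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # "Unicode big endian"
utf8_bom = codecs.BOM_UTF8 + text.encode("utf-8")            # "UTF-8"

# 'utf-16' reads the BOM to pick the byte order, then drops it.
decoded_le = unicode_le.decode("utf-16")
decoded_be = unicode_be.decode("utf-16")

# 'utf-8-sig' drops the BOM; plain 'utf-8' keeps it as a leading U+FEFF.
decoded_sig = utf8_bom.decode("utf-8-sig")
decoded_plain = utf8_bom.decode("utf-8")

print(ascii(decoded_le), ascii(decoded_be), ascii(decoded_sig), ascii(decoded_plain))
```

The leftover U+FEFF in the last result is why utf-8-sig is the better choice for Notepad's "UTF-8" files.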
"ANSI" is a shocker. This is MS terminology for "whatever the default legacy encoding is on this computer".
Here is a list of Windows encodings that I know of and the languages/scripts that they are used for:
cp874 Thai
cp932 Japanese
cp936 Simplified Chinese (P.R. China, Singapore)
cp949 Korean
cp950 Traditional Chinese (Taiwan, Hong Kong, Macao(?))
cp1250 Central and Eastern Europe
cp1251 Cyrillic ( Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian)
cp1252 Western European languages
cp1253 Greek
cp1254 Turkish
cp1255 Hebrew
cp1256 Arabic script
cp1257 Baltic languages
cp1258 Vietnamese
cp???? languages/scripts of India
If the file has been created on the computer where it is being read, then you can obtain the "ANSI" encoding by locale.getpreferredencoding(). Otherwise if you know where it came from, you can specify what encoding to use if it's not UTF-16. Failing that, guess.
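To see why the choice of "ANSI" code page matters, here is a small sketch (Python 3 syntax, which is an assumption) that decodes the same six bytes under two of the code pages listed above. The bytes are the accented line of the sample file used later in this answer, as it would be saved on a cp1252 machine.

```python
# Six bytes as an "ANSI" save on a cp1252 machine would store them.
raw = b"\xe0\xe1\xe2\xe3\xe4\xe5"

as_western = raw.decode("cp1252")   # accented Latin letters
as_cyrillic = raw.decode("cp1251")  # the same bytes read as Cyrillic letters

# ascii() keeps the output printable on any console encoding.
for name, decoded in (("cp1252", as_western), ("cp1251", as_cyrillic)):
    print(name, ascii(decoded))
```

Both decodes succeed without any error, so a wrong guess silently produces wrong text rather than an exception.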
Be careful using codecs.open() to read files on Windows. The docs say: """Note
Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.""" This means that your lines will end in \r\n and you will need/want to strip those off.
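The effect of newline translation can be illustrated with the built-in open() of Python 3 (available as io.open in Python 2.6+). This is a swapped-in alternative to codecs.open(), not something the answer itself uses: passing newline="" disables translation, which mimics the behaviour the docs describe, while the default translates \r\n to \n. The temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

# Write a small file with Windows line endings, as Notepad would.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("one\r\ntwo\r\n".encode("cp1252"))

# newline="" disables newline translation, so each line keeps its "\r\n".
with open(path, encoding="cp1252", newline="") as f:
    untranslated = list(f)

# The default (newline=None) translates "\r\n" to "\n" on reading.
with open(path, encoding="cp1252") as f:
    translated = list(f)

os.remove(path)
print(untranslated, translated)
```

Either way, line.rstrip('\r\n') gives the bare line content.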
Putting it all together:
Sample text file, saved with all 4 encoding choices, looks like this in Notepad:
The quick brown fox jumped over the lazy dogs.
àáâãäå
Here is some demo code:
import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    with open(filepath, 'rb') as f:
        data = f.read(3)
    if data[:2] in ('\xff\xfe', '\xfe\xff'):
        return 'utf-16'
    if data == u''.encode('utf-8-sig'):
        return 'utf-8-sig'
    # presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()

if __name__ == "__main__":
    import sys, glob, codecs
    defenc = sys.argv[1]
    for fpath in glob.glob(sys.argv[2]):
        print
        print (fpath, defenc)
        with open(fpath, 'rb') as f:
            print "raw:", repr(f.read())
        enc = guess_notepad_encoding(fpath, defenc)
        print "guessed encoding:", enc
        with codecs.open(fpath, 'r', enc) as f:
            for lino, line in enumerate(f, 1):
                print lino, repr(line)
                print lino, repr(line.rstrip('\r\n'))
And here is the output when run in a Windows "Command Prompt" window using the command \python27\python read_notepad.py "" t1-*.txt:
('t1-ansi.txt', '')
raw: 'The quick brown fox jumped over the lazy dogs.\r\n\xe0\xe1\xe2\xe3\xe4\xe5
\r\n'
guessed encoding: cp1252
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
('t1-u8.txt', '')
raw: '\xef\xbb\xbfThe quick brown fox jumped over the lazy dogs.\r\n\xc3\xa0\xc3
\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\r\n'
guessed encoding: utf-8-sig
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
('t1-uc.txt', '')
raw: '\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w
\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e
\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.
\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n\x00'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
('t1-ucb.txt', '')
raw: '\xfe\xff\x00T\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\
x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\
x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\
x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
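The demo above is Python 2. For readers on Python 3, the same BOM-sniffing idea might look like the sketch below. It is an adaptation under stated assumptions, not the answer's own code: the BOM constants come from the codecs module, read_notepad_lines is a hypothetical helper added for illustration, and the sample files are generated in a temporary location instead of being saved from Notepad.

```python
import codecs
import locale
import os
import tempfile


def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    """Return a codec name for a Notepad-saved file, based on its BOM."""
    with open(filepath, "rb") as f:
        data = f.read(3)
    if data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    if data == codecs.BOM_UTF8:
        return "utf-8-sig"
    # No BOM found: presumably "ANSI".
    return default_ansi_encoding or locale.getpreferredencoding(False)


def read_notepad_lines(filepath, default_ansi_encoding=None):
    """Hypothetical helper: decode the file and strip line endings."""
    enc = guess_notepad_encoding(filepath, default_ansi_encoding)
    with open(filepath, encoding=enc) as f:
        return [line.rstrip("\r\n") for line in f]


# Round trip: write the sample text in three encodings and read it back.
sample = ("The quick brown fox jumped over the lazy dogs.\r\n"
          "\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5\r\n")
results = {}
for enc in ("cp1252", "utf-8-sig", "utf-16"):
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write(sample.encode(enc))  # utf-8-sig and utf-16 both emit a BOM
    results[enc] = (guess_notepad_encoding(path, "cp1252"),
                    read_notepad_lines(path, "cp1252"))
    os.remove(path)

for enc, (guessed, lines) in results.items():
    print(enc, guessed, [ascii(line) for line in lines])
```

Passing "cp1252" as the default keeps the sketch independent of the locale of the machine it runs on.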
Things to be aware of:
(1) "mbcs" is a file-system pseudo-encoding which has no relevance at all to decoding the contents of files. On a system where the default encoding is cp1252, it behaves like latin1 for the bytes that cp1252 leaves undefined (aarrgghh!!); see below:
>>> all_bytes = "".join(map(chr, range(256)))
>>> u1 = all_bytes.decode('cp1252', 'replace')
>>> u2 = all_bytes.decode('mbcs', 'replace')
>>> u1 == u2
False
>>> [(i, u1[i], u2[i]) for i in xrange(256) if u1[i] != u2[i]]
[(129, u'\ufffd', u'\x81'), (141, u'\ufffd', u'\x8d'), (143, u'\ufffd', u'\x8f')
, (144, u'\ufffd', u'\x90'), (157, u'\ufffd', u'\x9d')]
>>>
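The mbcs codec exists only on Windows, but the five byte values in question can be reproduced anywhere. The sketch below (Python 3 syntax, an assumption) lists the bytes that Python's cp1252 codec treats as undefined; these are the positions where the session above shows mbcs returning the latin1-style control characters instead.

```python
all_bytes = bytes(range(256))

# 'replace' turns each undefined byte into U+FFFD, so positions stay aligned.
as_cp1252 = all_bytes.decode("cp1252", "replace")

undefined = [i for i in range(256) if as_cp1252[i] == "\ufffd"]
print([hex(i) for i in undefined])
```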
(2) chardet is very good at detecting encodings based on non-Latin scripts (Chinese/Japanese/Korean, Cyrillic, Hebrew, Greek) but not much good at Latin-based encodings (Western/Central/Eastern Europe, Turkish, Vietnamese) and doesn't grok Arabic at all.
Notepad saves Unicode files with a byte order mark. This means that the first bytes of the file will be:
- EF BB BF -- UTF-8
- FF FE -- "Unicode" (actually UTF-16 little-endian, looks like)
- FE FF -- "Unicode big-endian" (looks like UTF-16 big-endian)
Other text editors may or may not have the same behavior, but if you know for sure Notepad is being used, this will give you a decent heuristic for auto-selecting the encoding. All these sequences are valid in the ANSI encoding as well, however, so it is possible for this heuristic to make mistakes. It is not possible to guarantee that the correct encoding is used.
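A contrived case shows how the heuristic can be fooled. In this sketch (Python 3 syntax, an assumption; the text is made up), an ANSI file on a cp1252 machine happens to start with the two characters whose bytes are FF FE, so a BOM check would take it for UTF-16.

```python
import codecs

# Text beginning with y-diaeresis and thorn, saved as "ANSI" (cp1252).
ansi_bytes = "\u00ff\u00feabcd".encode("cp1252")

# The first two bytes are indistinguishable from a UTF-16 little-endian BOM.
looks_like_utf16 = ansi_bytes.startswith(codecs.BOM_UTF16_LE)

# Decoding as UTF-16 "succeeds" but pairs the bytes up into unrelated characters.
misread = ansi_bytes.decode("utf-16")
intended = ansi_bytes.decode("cp1252")

print(looks_like_utf16, ascii(misread), ascii(intended))
```

Such files are rare in practice, which is why the BOM check remains a reasonable heuristic even though it is not a guarantee.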