Using the NumPy loadtxt and savetxt functions fails whenever non-ASCII characters are involved. These functions are primarily meant for numeric data, but alphanumeric headers/footers are also supported.
Both loadtxt and savetxt seem to apply the latin-1 encoding, which I find very much at odds with the rest of Python 3, which is thoroughly Unicode-aware and always seems to use utf-8 as the default encoding.
Given that NumPy hasn't moved to utf-8 as the default encoding, can I at least change the encoding away from latin-1, either via some implemented function/attribute or a known hack, either just for loadtxt/savetxt or for NumPy in its entirety?
That this is not possible with Python 2 is forgivable, but it really should not be a problem when using Python 3. I've run into the problem with every combination of Python 3.x and the many recent NumPy versions I have tried.
Example code
Consider the file data.txt with the content
# This is π
3.14159265359
Trying to load this with
import numpy as np
pi = np.loadtxt('data.txt')
print(pi)
fails with a UnicodeEncodeError exception, stating that the latin-1 codec can't encode the character '\u03c0' (the π character).
This is frustrating because π is only present in a comment/header line, so there is no reason for loadtxt to even attempt to encode this character.
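The codec error itself can be reproduced in plain Python, without NumPy. This is only a minimal sketch, assuming (as the error message suggests) that the failing step amounts to a latin-1 encode of the header text: π has no latin-1 representation, whereas utf-8 encodes it as two bytes.

```python
# Reproduce the codec failure without NumPy.
# Assumption: the failing step inside loadtxt is equivalent to a latin-1 encode.
header = '# This is π'

try:
    header.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError as exc:
    latin1_ok = False
    print(exc)  # the message names the latin-1 codec and '\u03c0'

# utf-8 has no such limitation; π becomes the two bytes 0xCF 0x80.
utf8_bytes = header.encode('utf-8')
print(utf8_bytes)
```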
I can successfully read in the file by explicitly skipping the first row, using pi = np.loadtxt('data.txt', skiprows=1), but it is inconvenient to have to know the exact number of header lines.
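One way to avoid counting header lines might be to open the file yourself as utf-8 and hand loadtxt only the non-comment lines. The sketch below relies on loadtxt accepting a generator of lines; the temporary path and the assumption that every comment line starts with '#' are mine, and I have not verified that this sidesteps the error on the affected NumPy versions.

```python
import os
import tempfile

import numpy as np

# Hypothetical setup: recreate data.txt in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('# This is π\n3.14159265359\n')

# Decode the file as utf-8 ourselves and drop comment lines before
# loadtxt sees them, so the number of header lines need not be known.
with open(path, encoding='utf-8') as f:
    data_lines = (line for line in f if not line.lstrip().startswith('#'))
    pi = np.loadtxt(data_lines)

print(pi)
```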
The same exception is thrown if I try to write a Unicode character using savetxt:
np.savetxt('data.txt', [3.14159265359], header='This is π')  # savetxt prepends '# ' to the header itself
To accomplish this task successfully, I first have to write the header by some other means, and then save the data to a file object opened with the 'a+b' mode, e.g.
with open('data.txt', 'w', encoding='utf-8') as f:
    f.write('# This is π\n')
with open('data.txt', 'a+b') as f:
    np.savetxt(f, [3.14159265359])
which needless to say is both ugly and inconvenient.
Solution
I settled on the solution by hpaulj, which I thought would be nice to spell out fully. Near the top of my program I now do
import numpy as np
asbytes = lambda s: s if isinstance(s, bytes) else str(s).encode('utf-8')
asstr = lambda s: s.decode('utf-8') if isinstance(s, bytes) else str(s)
np.compat.py3k.asbytes = asbytes
np.compat.py3k.asstr = asstr
np.compat.py3k.asunicode = asstr
np.lib.npyio.asbytes = asbytes
np.lib.npyio.asstr = asstr
np.lib.npyio.asunicode = asstr
after which np.loadtxt and np.savetxt handle Unicode correctly.
Note that for newer versions of NumPy (I can confirm 1.14.3, but probably somewhat older versions as well) this trick is not needed, as it seems that Unicode is now handled properly by default.
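On those newer versions, loadtxt and savetxt also accept an encoding argument (added in NumPy 1.14, as far as I know), so the encoding can be stated explicitly instead of patched. A minimal sketch of the round trip, where the temporary path is my own choice for illustration:

```python
import os
import tempfile

import numpy as np

# Hypothetical location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

# savetxt adds the '# ' comment prefix to the header on its own.
np.savetxt(path, [3.14159265359], header='This is π', encoding='utf-8')

# Inspect the header line that was actually written.
with open(path, encoding='utf-8') as f:
    first_line = f.readline().rstrip('\n')
print(first_line)

# Read the value back; the comment line is skipped automatically.
pi = np.loadtxt(path, encoding='utf-8')
print(pi)
```

Passing the encoding explicitly also makes the code independent of whatever default a given NumPy version or platform happens to use.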