I have a CSV containing special characters. Some cells are arithmetic operations (like "(10/2)").
I would like to import these cells as strings in numpy using np.genfromtxt.
What I notice is that it actually imports them as UTF-8 bytes (if I understood correctly). For instance, every time I have a division symbol I get this code in the numpy array: \xc3\xb7
How can I import these arithmetic operations as readable strings?
Thank you!
Looks like the file may have the 'other' divide symbol, the one we learn in grade school:
In [185]: b'\xc3\xb7'
Out[185]: b'\xc3\xb7'
In [186]: _.decode()
Out[186]: '÷'
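To make the distinction concrete, here is a small stdlib-only sketch (the variable names are mine, just for illustration) showing that the grade-school divide sign is its own Unicode character and that its UTF-8 encoding is exactly the two bytes from the question:

```python
import unicodedata

# The "grade school" divide sign is U+00F7, not the ASCII slash.
divide = '\u00f7'
slash = '/'

# Encoding it as UTF-8 produces the two bytes reported in the question.
raw = divide.encode('utf8')

print(unicodedata.name(divide), raw)
print(unicodedata.name(slash), slash.encode('utf8'))
```

So the bytes in the array are not corruption, just the undecoded form of that character.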
Recent numpy version(s) handle encoding better. Earlier ones tried to work entirely in bytestring mode (for Py3) to be compatible with Py2. But now genfromtxt takes an encoding parameter.
In [68]: txt = '''(10/2), 1, 2
...: (10/2), 3,4'''
In [70]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',')
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
#!/usr/bin/python3
Out[70]:
array([(b'(10/2)', 1, 2), (b'(10/2)', 3, 4)],
dtype=[('f0', 'S6'), ('f1', '<i8'), ('f2', '<i8')])
In [71]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None)
Out[71]:
array([('(10/2)', 1, 2), ('(10/2)', 3, 4)],
dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])
Admittedly this simulated load from a list of strings is not the same as loading from a file. I don't have earlier numpy versions installed (and I'm not on Py2), so I can't show what happened before. But my gut feeling is that "(10/2)" shouldn't have given problems before, at least not in an ASCII file. There aren't any special characters in the string.
With the other divide:
In [192]: txt = '''(10÷2), 1, 2
...: (10÷2), 3,4'''
In [194]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding='utf8')
Out[194]:
array([('(10÷2)', 1, 2), ('(10÷2)', 3, 4)],
dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])
Same thing in a file:
In [200]: np.genfromtxt('stack49859957.txt', dtype=None, delimiter=',')
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
#!/usr/bin/python3
Out[200]:
array([(b'(10\xf72)', 1, 2), (b'(10\xf72)', 3, 4)],
dtype=[('f0', 'S6'), ('f1', '<i8'), ('f2', '<i8')])
In [199]: np.genfromtxt('stack49859957.txt', dtype=None, delimiter=',', encoding='utf8')
Out[199]:
array([('(10÷2)', 1, 2), ('(10÷2)', 3, 4)],
dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])
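For anyone who wants to reproduce the file case without my test file, here is a self-contained sketch. The temporary file name and variable names are placeholders of mine; it assumes a numpy recent enough to accept the encoding argument:

```python
import os
import tempfile

import numpy as np

# Two rows using the Unicode divide sign, written out as UTF-8.
content = '(10\u00f72), 1, 2\n(10\u00f72), 3,4\n'

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'divide.csv')  # hypothetical file name
    with open(path, 'w', encoding='utf8') as f:
        f.write(content)

    # Naming the encoding lets genfromtxt decode text fields to str.
    arr = np.genfromtxt(path, dtype=None, delimiter=',', encoding='utf8')

print(arr.dtype)
print(arr['f0'].tolist())
```

The first field should come back as a unicode string column rather than a bytestring one.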
In earlier versions, encoding could be implemented in a converter. I've helped with that task in previous SO questions.
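As a rough sketch of what such a converter could look like (decode_utf8 is a name I made up, and I haven't run this against an old numpy), the function only needs to turn the raw bytes of a field into text. genfromtxt's converters argument takes a dict mapping a column index to a callable like this, e.g. converters={0: decode_utf8}:

```python
def decode_utf8(value):
    """Return text for a field that may arrive as UTF-8 bytes.

    Hypothetical helper: bytestring-mode loaders hand the converter
    bytes, while newer ones may already pass str, so accept both.
    """
    if isinstance(value, bytes):
        return value.decode('utf8')
    return value


# A field as it would look in bytestring mode for a UTF-8 file.
raw_field = b'(10\xc3\xb72)'
text_field = decode_utf8(raw_field)
print(text_field)
```

Accepting str as well as bytes is just a precaution, since what the converter receives depends on the numpy version and on the encoding setting.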