I used df.to_csv() to convert a DataFrame to a CSV file. Under Python 3, the pandas docs state that to_csv defaults to UTF-8 encoding. However, when I run pd.read_csv() on the same file, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte

But pd.read_csv() with encoding="ISO-8859-1" works. What is the issue here, and how do I resolve it so that I can write and load files with a consistent encoding?
The original .csv you are trying to read is encoded in e.g. ISO-8859-1. That's why you get a UnicodeDecodeError: Python / pandas is trying to decode the source using the utf-8 codec, assuming by default that the source is Unicode. Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode it into the format used internally.
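A minimal sketch of the mismatch, without pandas at all. It assumes the offending character is the registered-trademark sign ® (which is byte 0xae in ISO-8859-1, matching the traceback above):

```python
# Write a CSV line containing (R) encoded as ISO-8859-1:
# the single byte 0xae, which is an invalid UTF-8 start byte.
raw = "name,mark\nAcme,\u00ae\n".encode("ISO-8859-1")

# Decoding with the utf-8 codec (pandas' default) fails on 0xae:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xae ...

# Naming the real source encoding succeeds:
text = raw.decode("ISO-8859-1")
print(text.splitlines()[1])  # Acme,(R)
```

The same logic applies inside pd.read_csv(..., encoding="ISO-8859-1"): it just tells pandas which codec actually produced the bytes.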
See the Python documentation on codecs and Unicode for more background.
Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not explicitly passed to pandas.to_csv: the file ends up containing byte 0x92, which is ’ (a right single quotation mark that looks like an apostrophe) in Windows-1252.
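That byte is easy to examine directly. A sketch, assuming the file was written under Windows' default ANSI code page (cp1252), since 0x92 is ’ there but a C1 control character in strict Latin-1:

```python
b = b"\x92"

# Under Windows-1252, 0x92 is the right single quotation mark:
assert b.decode("cp1252") == "\u2019"  # the ' character

# UTF-8 rejects it as an invalid start byte:
try:
    b.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)

# Latin-1 maps every possible byte to *some* character, so decoding
# never errors -- even when the byte really came from cp1252 and the
# resulting character is wrong (here, an invisible control character):
assert b.decode("latin-1") == "\x92"
```

This is also why encoding="Latin-1" "works" with read_csv: it can decode any byte sequence without raising, but the result may not be the character the writer intended.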
So it looks like you have to explicitly pass encoding="utf-8" to to_csv, even though the pandas docs say that is the default. Or use encoding="Latin-1" with read_csv. Even more frustrating...
I am using Windows 7, Python 3.5, pandas 0.19.2.
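One way to sidestep the guessing is to name the encoding explicitly on both the write and the read. A pandas-free sketch of that round trip using only the standard library, with ® standing in for any non-ASCII data (the file path and column names are made up for illustration):

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Write with an explicit encoding (analogous to df.to_csv(path, encoding="utf-8")):
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["name", "mark"], ["Acme", "\u00ae"]])

# Read back with the same explicit encoding (analogous to
# pd.read_csv(path, encoding="utf-8")):
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows[1])  # ['Acme', '(R)']
```

Pinning both sides to the same codec removes the platform-default ambiguity entirely.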
Please try to read the data using encoding='unicode_escape'.
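A caveat worth noting before reaching for this: unicode_escape decodes raw bytes the same way Latin-1 does, and additionally reinterprets backslash escape sequences. So it never raises on bytes like 0xae, but it can silently mangle text that contains literal backslashes (Windows paths, regexes):

```python
b = b"Acme\xae"

# Never raises: every byte maps to a character, as with Latin-1.
assert b.decode("unicode_escape") == b.decode("latin-1")

# But literal backslash sequences get reinterpreted:
# \n and \t below become a newline and a tab, corrupting the path.
assert rb"C:\new\table".decode("unicode_escape") != r"C:\new\table"
```

Prefer naming the actual source encoding (e.g. "cp1252" or "ISO-8859-1") when you know it.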