I used df.to_csv() to convert a DataFrame to a CSV file. Under Python 3, the pandas docs state that it defaults to UTF-8 encoding.
However, when I run pd.read_csv() on the same file, I get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte
But using pd.read_csv() with encoding="ISO-8859-1" works.
What is the issue here and how do I resolve it so I can write and load files with consistent encoding?
Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not passed explicitly to DataFrame.to_csv.
0x92 is ’ in Windows-1252 (the right single quotation mark, which looks like an apostrophe).
import pandas
# File names for the two variants being compared.
ERRORFILE = r'written_without_encoding_parameter.csv'
NO_ERRORFILE = r'written_WITH_encoding_parameter.csv'
df_dummy = pandas.DataFrame([u"Yo what's up", u"I like your sister’s friend"])
df_dummy.to_csv(ERRORFILE)  # no encoding argument
df_dummy.to_csv(NO_ERRORFILE, encoding="utf-8")  # explicit UTF-8
df_no_error_with_latin = pandas.read_csv(ERRORFILE, encoding="Latin-1")  # works
df_no_error = pandas.read_csv(NO_ERRORFILE)  # works
df_error = pandas.read_csv(ERRORFILE)  # raises the error below
>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
So it looks like you have to explicitly use encoding="utf-8" with to_csv, even though the pandas docs say it uses this by default. Or use encoding="Latin-1" with read_csv.
Even more frustrating...
df_error_even_with_utf8 = pandas.read_csv(ERRORFILE, encoding="utf-8")
>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
I am using Windows 7, Python 3.5, pandas 0.19.2.
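For what it's worth, the 0x92 byte from the traceback can be reproduced without pandas. The sketch below assumes the file was written with the Windows code page cp1252 (an assumption about the platform default, not something the traceback proves); all variable names are made up for illustration.

```python
# Assumption: the writer used cp1252, a common default code page on Western Windows.
text = "I like your sister\u2019s friend"  # \u2019 is the curly apostrophe

raw_cp1252 = text.encode("cp1252")

# In cp1252 the curly apostrophe is stored as the single byte 0x92.
apostrophe_byte = raw_cp1252[18:19]

# 0x92 is a continuation byte in UTF-8, so it cannot start a character
# and strict UTF-8 decoding fails on it.
try:
    raw_cp1252.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Latin-1 maps every byte to a code point, so decoding never fails,
# but 0x92 becomes the control character U+0092 instead of the apostrophe.
as_latin1 = raw_cp1252.decode("latin-1")

# Decoding with the codec that was used for writing restores the text.
as_cp1252 = raw_cp1252.decode("cp1252")
```

If that assumption holds, it would also explain why Latin-1 "works": it accepts any byte, even when it is not the encoding the file was written in.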
The original .csv you are trying to read is encoded in e.g. ISO-8859-1. That's why you get a UnicodeDecodeError: Python / pandas is trying to decode the source using the utf-8 codec, assuming by default that the source is UTF-8 encoded.
Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode into the format used internally.
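A minimal sketch of keeping the encoding consistent, assuming you control both the write and the read: pass the same encoding to to_csv and read_csv instead of relying on defaults. The file names and the "text" column are hypothetical, and the cp1252 file stands in for a CSV produced by some other tool.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"text": ["Yo what's up", "I like your sister\u2019s friend"]})

# A temporary directory keeps the sketch self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.csv")

    # State the encoding on both sides so writer and reader always agree.
    df.to_csv(path, index=False, encoding="utf-8")
    df_back = pd.read_csv(path, encoding="utf-8")

    # A file written elsewhere in cp1252 has to be read with that codec.
    legacy_path = os.path.join(tmp_dir, "legacy.csv")
    with open(legacy_path, "w", encoding="cp1252", newline="") as fh:
        fh.write("text\nI like your sister\u2019s friend\n")
    df_legacy = pd.read_csv(legacy_path, encoding="cp1252")
```

Once a legacy file has been read with its real encoding, writing it back out with encoding="utf-8" gives you a file that later reads can open without special handling.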
See python docs and more here. Also very good.
Please try to read the data using encoding='unicode_escape'.
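One thing worth checking before relying on unicode_escape: as far as the codec itself goes, it decodes ordinary bytes like Latin-1 and additionally interprets backslash sequences, so it avoids the error without necessarily recovering the original characters. A small standard-library sketch, assuming the bytes were written as cp1252:

```python
# Assumption: the data was written as cp1252, so the curly apostrophe is byte 0x92.
raw = "sister\u2019s".encode("cp1252")

# unicode_escape treats plain bytes like Latin-1, so decoding succeeds...
decoded = raw.decode("unicode_escape")

# ...but 0x92 comes back as the control character U+0092, not the apostrophe.
recovered_apostrophe = decoded == "sister\u2019s"

# Backslash sequences in the data are interpreted too: here \t turns into a tab.
path_like = b"C:\\temp".decode("unicode_escape")
```

So this option can silence the exception, but the text may still differ from what was originally written.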