Encoding error when reading csv file containing pa

Posted 2019-07-09 11:05

I used df.to_csv() to write a DataFrame to a CSV file. According to the pandas docs, the encoding defaults to utf-8 under Python 3.

However when I run pd.read_csv() on the same file, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte

But using pd.read_csv() with encoding="ISO-8859-1" works.

What is the issue here and how do I resolve it so I can write and load files with consistent encoding?

3 answers
疯言疯语
#2 · 2019-07-09 11:20

The .csv file you are trying to read is not actually encoded in UTF-8; it is in some other encoding such as ISO-8859-1. That is why you get a UnicodeDecodeError: pandas tries to decode the file with the utf-8 codec, which is its default, and byte 0xae is not a valid start byte in UTF-8.

Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode into the format used internally.
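
The decoding failure can be reproduced without pandas. The following is a minimal sketch, assuming a made-up byte string (the contents of the asker's file are not known); it only shows that byte 0xae cannot start a UTF-8 character but decodes cleanly as ISO-8859-1.

```python
# Byte 0xAE is the registered-trademark sign (®) in ISO-8859-1. In UTF-8 it is
# only valid as a continuation byte, so it can never start a character.
raw = b"Widget \xae 2000"  # hypothetical ISO-8859-1 bytes, not the asker's data

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # same kind of error that read_csv reports

# Decoding with the codec that matches the bytes succeeds.
latin1_text = raw.decode("iso-8859-1")

print(utf8_error)
print(latin1_text)
```

Passing encoding="ISO-8859-1" to pd.read_csv() asks pandas to do the equivalent of the second decode call.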

See python docs and more here. Also very good.

一纸荒年 Trace。
#3 · 2019-07-09 11:39

Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not passed explicitly to DataFrame.to_csv.

Byte 0x92 is ’ in Windows-1252 (a right single quotation mark, which looks like an apostrophe).

import pandas

# One file is written without an explicit encoding, the other with encoding="utf-8"
ERRORFILE = r'written_without_encoding_parameter.csv'
NO_ERRORFILE = r'written_WITH_encoding_parameter.csv'

# The second string contains ’ (U+2019), which is not an ASCII character
df_dummy = pandas.DataFrame([u"Yo what's up", u"I like your sister’s friend"])

df_dummy.to_csv(ERRORFILE)
df_dummy.to_csv(NO_ERRORFILE, encoding="utf-8")

df_no_error_with_latin = pandas.read_csv(ERRORFILE, encoding="Latin-1")
df_no_error = pandas.read_csv(NO_ERRORFILE)
df_error = pandas.read_csv(ERRORFILE)
# On my setup the last line raises:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

So, at least on this setup, it looks like you have to pass encoding="utf-8" to to_csv explicitly, even though the pandas docs say that is the default. Alternatively, pass encoding="Latin-1" to read_csv.
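
To keep writing and loading consistent, one option is to name the encoding on both sides instead of relying on defaults. The following is a rough sketch under that assumption; the file name roundtrip.csv, the column name text, and the temporary directory are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Same sample strings as above; \u2019 is the ’ character.
df = pd.DataFrame({"text": ["Yo what's up", "I like your sister\u2019s friend"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")  # hypothetical file name

    # Write and read with the same explicitly named encoding.
    df.to_csv(path, index=False, encoding="utf-8")
    with open(path, "rb") as fh:
        raw = fh.read()  # raw bytes, to check what was actually written
    df_back = pd.read_csv(path, encoding="utf-8")

# In UTF-8, ’ is stored as the three bytes E2 80 99 rather than the single byte 0x92.
has_utf8_quote = b"\xe2\x80\x99" in raw
roundtrip_ok = df_back["text"].tolist() == df["text"].tolist()
print(has_utf8_quote, roundtrip_ok)
```

Checking the raw bytes is a quick way to see which encoding a file was really written in when the defaults are in doubt.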

Even more frustrating...

# read_csv already defaults to utf-8, so naming it explicitly changes nothing
df_error_even_with_utf8 = pandas.read_csv(ERRORFILE, encoding="utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

I am using Windows 7, Python 3.5, pandas 0.19.2.

Melony?
#4 · 2019-07-09 11:44

Please try to read the data using encoding='unicode_escape'.
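
A small sketch of what this codec does may help when weighing the suggestion. It assumes a made-up byte string in Windows-1252: unicode_escape never fails on high bytes because it treats them as Latin-1, but it also interprets backslash sequences in the data.

```python
# Hypothetical cp1252 bytes: 0x92 is ’, and the text contains a literal
# backslash followed by "t" (as in a Windows path).
raw = b"sister\x92s path C:\\temp"

as_cp1252 = raw.decode("cp1252")          # the matching codec
as_escape = raw.decode("unicode_escape")  # the codec suggested above

print(repr(as_cp1252))
print(repr(as_escape))
```

With unicode_escape the read does not raise, but 0x92 comes back as the control character U+0092 rather than ’, and the backslash-t pair turns into a tab. Whether that is acceptable depends on the data.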
