I used df.to_csv() to convert a DataFrame to a CSV file. Under Python 3, the pandas docs state that to_csv defaults to UTF-8 encoding. However, when I run pd.read_csv() on the same file, I get the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte

But pd.read_csv() with encoding="ISO-8859-1" works. What is the issue here, and how do I resolve it so that I can write and load files with a consistent encoding?
The original .csv you are trying to read is encoded in e.g. ISO-8859-1. That's why you get a UnicodeDecodeError: Python / pandas is trying to decode the source using the utf-8 codec, assuming by default that the source is Unicode. Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode it into the format used internally.
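A minimal sketch of the mismatch, without pandas at all. It assumes the offending character is the registered-trademark sign ® (which is byte 0xae in ISO-8859-1, matching the traceback above):

```python
# Write a CSV line containing (R) encoded as ISO-8859-1:
# the single byte 0xae, which is an invalid UTF-8 start byte.
raw = "name,mark\nAcme,\u00ae\n".encode("ISO-8859-1")

# Decoding with the utf-8 codec (pandas' default) fails on 0xae:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xae ...

# Naming the real source encoding succeeds:
text = raw.decode("ISO-8859-1")
print(text.splitlines()[1])  # Acme,(R)
```

The same logic applies inside pd.read_csv(..., encoding="ISO-8859-1"): it just tells pandas which codec actually produced the bytes.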
See the Python documentation on codecs and Unicode for more background.
Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not explicitly passed to pandas.to_csv: the file ends up containing byte 0x92, which is ’ (a right single quotation mark that looks like an apostrophe) in Windows-1252.
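That byte is easy to examine directly. A sketch, assuming the file was written under Windows' default ANSI code page (cp1252), since 0x92 is ’ there but a C1 control character in strict Latin-1:

```python
b = b"\x92"

# Under Windows-1252, 0x92 is the right single quotation mark:
assert b.decode("cp1252") == "\u2019"  # the ' character

# UTF-8 rejects it as an invalid start byte:
try:
    b.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)

# Latin-1 maps every possible byte to *some* character, so decoding
# never errors -- even when the byte really came from cp1252 and the
# resulting character is wrong (here, an invisible control character):
assert b.decode("latin-1") == "\x92"
```

This is also why encoding="Latin-1" "works" with read_csv: it can decode any byte sequence without raising, but the result may not be the character the writer intended.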
So it looks like you have to explicitly pass encoding="utf-8" to to_csv, even though the pandas docs say that is the default. Or use encoding="Latin-1" with read_csv. Even more frustrating...
I am using Windows 7, Python 3.5, pandas 0.19.2.
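One way to sidestep the guessing is to name the encoding explicitly on both the write and the read. A pandas-free sketch of that round trip using only the standard library, with ® standing in for any non-ASCII data (the file path and column names are made up for illustration):

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Write with an explicit encoding (analogous to df.to_csv(path, encoding="utf-8")):
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["name", "mark"], ["Acme", "\u00ae"]])

# Read back with the same explicit encoding (analogous to
# pd.read_csv(path, encoding="utf-8")):
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows[1])  # ['Acme', '(R)']
```

Pinning both sides to the same codec removes the platform-default ambiguity entirely.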
Please try to read the data using encoding='unicode_escape'.
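A caveat worth noting before reaching for this: unicode_escape decodes raw bytes the same way Latin-1 does, and additionally reinterprets backslash escape sequences. So it never raises on bytes like 0xae, but it can silently mangle text that contains literal backslashes (Windows paths, regexes):

```python
b = b"Acme\xae"

# Never raises: every byte maps to a character, as with Latin-1.
assert b.decode("unicode_escape") == b.decode("latin-1")

# But literal backslash sequences get reinterpreted:
# \n and \t below become a newline and a tab, corrupting the path.
assert rb"C:\new\table".decode("unicode_escape") != r"C:\new\table"
```

Prefer naming the actual source encoding (e.g. "cp1252" or "ISO-8859-1") when you know it.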