I have a similar problem to the one mentioned here, but none of the suggested methods work for me.
I have a medium-sized utf-8 .csv file with a lot of non-ASCII characters.
I am splitting the file by a particular value from one of the columns, and then I'd like to save each of the obtained dataframes as an .xlsx file with the characters preserved.
This doesn't work, as I am getting an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 7: ordinal not in range(128)
Here is what I tried:
1. Using the xlsxwriter engine explicitly. This doesn't seem to change anything.

2. Defining a function (below) to change the encoding and throw away bad characters. This also doesn't change anything.

       def changeencode(data):
           cols = data.columns
           for col in cols:
               if data[col].dtype == 'O':
                   data[col] = data[col].str.decode('utf-8').str.encode('ascii', 'ignore')
           return data

3. Changing by hand all the offending characters to some others. Still no effect (the quoted error was obtained after this change).

4. Encoding the file as utf-16 (which, I believe, is the correct encoding, since I want to be able to manipulate the file from within Excel afterwards). This doesn't help either; the re-encoding step is sketched just after this list.
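The re-encoding was roughly this (a minimal sketch; the output path is a placeholder, and io.open is used only so the same code runs on Python 2 and 3):

    import io

    # re-save the utf-8 .csv as utf-16 in the hope that Excel handles the non-ASCII characters
    with io.open('path-to-file.csv', encoding='utf-8') as src:
        text = src.read()
    with io.open('path-to-file-utf16.csv', 'w', encoding='utf-16') as dst:
        dst.write(text)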
I believe that the problem is in the file itself (because of attempts 2 and 3), but I have no idea how to get around it. I'd appreciate any help. The beginning of the file is pasted below.
"Submitted","your-name","youremail","phone","miasto","cityCF","innemiasto","languagesCF","morelanguages","wiek","partnerCF","messageCF","acceptance-795","Submitted Login","Submitted From","2015-12-25 14:07:58 +00:00","Zózia kryś","test@tes.pl","4444444","Wrocław","","testujemy polskie znaki","Polski","testujemy polskie znaki","44","test","test","1","Justyna","99.111.155.132",
EDIT
Some code (one of the versions, without the splitting part):
import pandas as pd
import xlsxwriter

df = pd.read_csv('path-to-file.csv')
with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='sheet1')
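For completeness, the splitting step (omitted above) is roughly the following. This is a simplified sketch continuing from the snippet above; 'miasto' is just an example grouping column, not necessarily the one I actually split on:

    # write one .xlsx per unique value of the grouping column
    for value, part in df.groupby('miasto'):
        with pd.ExcelWriter('{}.xlsx'.format(value), engine='xlsxwriter') as writer:
            part.to_excel(writer, sheet_name='sheet1')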