I have a .csv file whose rows have varying numbers of columns.
import pandas as pd
df = pd.read_csv(infile, header=None)
returns the
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 8
error. I know I can use the
names=my_cols
option in the read_csv call, but surely there has to be something more 'pythonic' than that?? Also, this is not a duplicate question, since
error_bad_lines=False
causes lines to be skipped (which is not desired). The .csv looks like:
Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
One can do some manipulation with the csv before using pandas.
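One possible way to do that preprocessing, as a minimal sketch: scan the data once to find the widest row, then hand read_csv enough column names to cover it. The SAMPLE string and the io.StringIO buffers are stand-ins for the real file and are assumptions made only so the snippet is self-contained.

```python
import csv
import io

import pandas as pd

# Sample data copied from the question; io.StringIO stands in for the real file.
SAMPLE = """Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
"""

# First pass: find the number of fields in the widest row.
max_cols = max(len(row) for row in csv.reader(io.StringIO(SAMPLE)))

# Second pass: supply one column name per field of the widest row.
# Rows with fewer fields are padded with NaN instead of raising ParserError.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=list(range(max_cols)))
```

With a real file, the same two passes would open the path twice (or rewind the handle) instead of building two StringIO buffers.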
This is some rough Python, but it should work. I'll clean this up when I have time.
Or use the other answer, it's neat as it is.
OK, somewhat inspired by this related question: Pandas variable numbers of columns to binary matrix
So read in the csv but override the separator to a tab so it doesn't try to split the names:
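A minimal sketch of that read, assuming the data contains no tab characters. The shortened SAMPLE string and the io.StringIO buffer are stand-ins for the real file.

```python
import io

import pandas as pd

# Shortened sample from the question; io.StringIO stands in for the real file.
SAMPLE = """Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
George,
"""

# The data contains no tabs, so with sep="\t" every line is read as a
# single string and the frame has exactly one column (labelled 0).
df = pd.read_csv(io.StringIO(SAMPLE), header=None, sep="\t")
```

Because nothing is split at this stage, the parser never sees rows of different widths and no line is skipped.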
We can now use str.split with expand=True to expand the names into their own columns.
So, just to be clear: modify your read_csv line to use the tab separator, and then apply str.split as above.
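The split step could be sketched as follows. The small hand-built frame is a hypothetical stand-in for the one-column result of the tab-separated read, so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical one-column frame, shaped like the result of reading the
# file with sep="\t": each cell holds one whole, unsplit line.
df = pd.DataFrame({0: ["Beth,Caroline,David,Ernie", "Caroline,Hannah", "George,"]})

# Split each line on commas; expand=True spreads the pieces across columns,
# and shorter rows are padded with missing values.
expanded = df[0].str.split(",", expand=True)
```

One thing to watch: an empty field inside a line (such as the one after "George,") comes through as an empty string rather than a missing value, so it may need a follow-up replacement if the two should be treated the same.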