I have a .csv file whose rows have different numbers of columns.
import pandas as pd
df = pd.read_csv(infile, header=None)
returns the
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 8
error. I know I can use the
names=my_cols
option in the read_csv call, but surely there has to be something more 'pythonic' than that?? Also, this is not a duplicate question, since
error_bad_lines=False
causes lines to be skipped (which is not desired). The .csv looks like:
Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
OK, somewhat inspired by this related question: Pandas variable numbers of columns to binary matrix
So read in the csv but override the separator to a tab so it doesn't try to split the names:
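A minimal sketch of this read, together with the str.split step described next. It assumes the sample data from the question and uses io.StringIO as a stand-in for infile; the names data, raw and df are only illustrative.

```python
import io

import pandas as pd

# Sample data copied from the question; io.StringIO stands in for `infile`.
data = """Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
"""

# No tab occurs in the data, so each line is read as one single-column row.
raw = pd.read_csv(io.StringIO(data), sep="\t", header=None)

# Split that single column on commas; expand=True gives one column per name.
# Shorter rows are padded with missing values, so no line is skipped.
df = raw[0].str.split(",", expand=True)
print(df)
```

Note that this relies on the file containing no tab characters; if it might, another separator that cannot occur in the data would be needed.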
We can now use str.split with expand=True to expand the names into their own columns.
So, just to be clear: modify your read_csv line to use a tab separator (sep='\t'), and then do the str.split as above.

One can do some manipulation with the csv before using pandas.
This is some rough Python, but it should work. I'll clean this up when I have time.
Or use the other answer, it's neat as it is.
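The code this answer originally referred to is not shown here. As one possible sketch of that kind of pre-processing (not necessarily what the answer did), the rows could be read with the standard csv module and padded to a common width before building the DataFrame. The helper name read_ragged_csv and the sample string are hypothetical.

```python
import csv
import io

import pandas as pd


def read_ragged_csv(handle):
    """Hypothetical helper: load a CSV whose rows vary in length."""
    rows = list(csv.reader(handle))
    # Width of the longest row (0 if the file is empty).
    width = max((len(row) for row in rows), default=0)
    # Pad short rows with None so every row has the same number of fields.
    padded = [row + [None] * (width - len(row)) for row in rows]
    return pd.DataFrame(padded)


# Small stand-in for the real file; with a file on disk one would use
# `with open(infile, newline="") as f: df = read_ragged_csv(f)`.
sample = "Anne,Beth,Caroline\nCaroline,Hannah\nGeorge,\n"
df = read_ragged_csv(io.StringIO(sample))
print(df)
```

This reads the file in full before handing it to pandas, which is fine for small files but may be slower than the tab-separator approach for large ones.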