Handling Variable Number of Columns with Pandas -

I have a data set that looks like this (at most 5 columns - but can be less)

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
....

I am trying to use pandas read_table to read this into a 5 column data frame. I would like to read this in without additional massaging.

If I try

import pandas as pd
my_cols=['A','B','C','D','E']
my_df=pd.read_table(path,sep=',',header=None,names=my_cols)

I get an error - "column names have 5 fields, data has 3 fields".

Is there any way to make pandas fill in NaN for the missing columns while reading the data?

标签： python pandas

3条回答

一纸荒年 Trace。

2楼-- · 2020-01-23 04:23

I'd also be interested to know if this is possible, from the doc it doesn't seem to be the case. What you could probably do is read the file line by line, and concatenate each reading to a DataFrame:

import pandas as pd

df = pd.DataFrame()

with open(filepath, 'r') as f:
    for line in f:
        df = pd.concat( [df, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True )

It works but not in the most elegant way, I guess...

0人赞添加讨论(0) 举报

可以哭但决不认输i

3楼-- · 2020-01-23 04:25

One way which seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Note that this approach requires that you give names to the columns you want, though. Not as general as some other ways, but works well enough when it applies.

0人赞添加讨论(0) 举报

▲ chillily

4楼-- · 2020-01-23 04:32

Ok. Not sure how efficient this is - but here is what I have done. Would love to hear if there is a better way to do this. Thanks !

from pandas import DataFrame

list_of_dicts=[]
labels=['A','B','C','D','E']
for line in file:
    line=line.rstrip('\n')
    list_of_dicts.append(dict(zip(labels,line.split(','))))
frame=DataFrame(list_of_dicts)

0人赞添加讨论(0) 举报

Handling Variable Number of Columns with Pandas -

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间