I have a csv file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.
import pandas as pd
csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""
f = open('foo.csv', 'w')
f.write(csv)
f.close()
df1 = pd.read_csv('foo.csv',
                  header=0,
                  names=["dummy", "date", "loc", "x"],
                  index_col=["date", "loc"],
                  usecols=["dummy", "date", "loc", "x"],
                  parse_dates=["date"])
print df1
# Ignore the dummy columns
df2 = pd.read_csv('foo.csv',
                  index_col=["date", "loc"],
                  usecols=["date", "loc", "x"],  # <----------- Changed
                  parse_dates=["date"],
                  header=0,
                  names=["dummy", "date", "loc", "x"])
print df2
I expect that df1 and df2 should be the same except for the missing dummy column, but the columns come in mislabeled. Also, the date isn't getting parsed as a date.
In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103
Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.
edit: fixed bad header usage.
Import csv first and use csv.DictReader; it's easy to process...
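A minimal sketch of that approach, assuming the foo.csv written in the question (the loop body just illustrates field access):

import csv

with open('foo.csv') as f:
    reader = csv.DictReader(f)  # maps each row to a dict keyed by the header row
    for row in reader:
        # skip row['dummy'] and use only the fields you care about
        print row['date'], row['loc'], row['x']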
If your csv file contains extra data, columns can be deleted from the DataFrame after import.
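For instance, a sketch in the question's Python 2 / pandas 0.10-era style, again assuming the foo.csv written above: read all four columns, then drop the unwanted one.

import pandas as pd

df = pd.read_csv('foo.csv',
                 header=0,
                 index_col=["date", "loc"],
                 parse_dates=["date"])
del df['dummy']  # drop the extra column after the read

print df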
Which gives us:
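                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5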
This code achieves what you want --- also it's weird and certainly buggy:
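Something along these lines, assuming the foo.csv from the question (a reconstruction based on the observations below; the shifted indices and the placeholder entry in names reflect the buggy 0.10.x parser, not a supported API):

import pandas as pd

df = pd.read_csv('foo.csv',
                 index_col=[0, 1],    # relative to the used columns, not the file
                 usecols=[1, 2, 3],   # positions in the file: date, loc, x
                 parse_dates=[0],     # again relative to the used columns
                 header=0,
                 names=["date", "loc", "", "x"])  # adapted to mirror the shift

print df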
I observed that it works when:

a) you specify the index_col relative to the number of columns you really use -- so it's three columns in this example, not four (you drop dummy and start counting from then onwards)

b) same for parse_dates

c) not so for usecols ;) for obvious reasons

d) here I adapted the names to mirror this behaviour

which prints:
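                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5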
The answer by @chip completely misses the point of two keyword arguments. names is only needed when there is no header row in your file and you want to specify other arguments using column names rather than integer indices. usecols is supposed to provide a column filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading. So because you have a header row, passing header=0 is sufficient, and additionally passing names confuses pd.read_csv. Removing names from the second call fixes it.
This solution corrects those oddities:
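That is, the question's second read_csv call with names removed, sketched against the foo.csv from the question:

import pandas as pd

df = pd.read_csv('foo.csv',
                 header=0,
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])

print df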
Which gives us:
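                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5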