I have a csv file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.
import pandas as pd
csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""
f = open('foo.csv', 'w')
f.write(csv)
f.close()
df1 = pd.read_csv('foo.csv',
                  header=0,
                  names=["dummy", "date", "loc", "x"],
                  index_col=["date", "loc"],
                  usecols=["dummy", "date", "loc", "x"],
                  parse_dates=["date"])
print df1
# Ignore the dummy columns
df2 = pd.read_csv('foo.csv',
                  index_col=["date", "loc"],
                  usecols=["date", "loc", "x"],  # <----------- Changed
                  parse_dates=["date"],
                  header=0,
                  names=["dummy", "date", "loc", "x"])
print df2
I expect that df1 and df2 should be the same except for the missing dummy column, but the columns come in mislabeled. Also, the date isn't getting parsed as a date.
In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103
Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.
edit: fixed bad header usage.
Import csv first and use csv.DictReader; it's easy to process...
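A minimal sketch of that approach, assuming the foo.csv written in the question (the loop body just illustrates field access):

import csv

with open('foo.csv') as f:
    reader = csv.DictReader(f)  # maps each row to a dict keyed by the header row
    for row in reader:
        # skip row['dummy'] and use only the fields you care about
        print row['date'], row['loc'], row['x']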
If your csv file contains extra data, columns can be deleted from the DataFrame after import.
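For instance, a sketch in the question's Python 2 / pandas 0.10-era style, again assuming the foo.csv written above: read all four columns, then drop the unwanted one.

import pandas as pd

df = pd.read_csv('foo.csv',
                 header=0,
                 index_col=["date", "loc"],
                 parse_dates=["date"])
del df['dummy']  # drop the extra column after the read

print df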
Which gives us:
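                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5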
This code achieves what you want --- also it's weird and certainly buggy:
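Something along these lines, assuming the foo.csv from the question (a reconstruction based on the observations below; the shifted indices and the placeholder entry in names reflect the buggy 0.10.x parser, not a supported API):

import pandas as pd

df = pd.read_csv('foo.csv',
                 index_col=[0, 1],    # relative to the used columns, not the file
                 usecols=[1, 2, 3],   # positions in the file: date, loc, x
                 parse_dates=[0],     # again relative to the used columns
                 header=0,
                 names=["date", "loc", "", "x"])  # adapted to mirror the shift

print df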
I observed that it works when:

a) you specify the index_col relative to the number of columns you really use -- so it's three columns in this example, not four (you drop dummy and start counting from then onwards)

b) same for parse_dates

c) not so for usecols ;) for obvious reasons

d) here I adapted the names to mirror this behaviour

which prints:
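                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5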
The answer by @chip completely misses the point of two keyword arguments. names is only needed when there is no header row in your file and you want to specify other arguments using column names rather than integer indices. usecols is supposed to provide a column filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading. So because you have a header row, passing header=0 is sufficient, and additionally passing names confuses pd.read_csv. Removing names from the second call fixes it.
This solution corrects those oddities:
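That is, the question's second read_csv call with names removed, sketched against the foo.csv from the question:

import pandas as pd

df = pd.read_csv('foo.csv',
                 header=0,
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])

print df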
Which gives us:
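                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5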