Pandas ExcelFile.parse() reading file in as dict i

2019-09-16 12:31发布

I am new to python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an excel file as a dataframe. The file admittedly is pretty... ugly. There is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues)

I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:

xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df

Results:

{'Security exposure - 21 day lag':          Percent of Total Holdings  \
 0      KMNFC vs. 3 Month LIBOR AUD   
 1                        04-OCT-16   
 2                Australian Dollar   
 3                              NaN   
 4                              NaN   
 5                              NaN   
 6                              NaN   
 7                              NaN   
 8          Long/Short Net Exposure   
 9                            Total   
 10                             NaN   
 11                            Long   
 12                             NaN   
 13                             NaN   
 14                             NaN   
 15                             NaN   
 16                             NaN   
 17                             NaN  

(This goes on for 20-30 more rows and 5-6 more columns)

I am using Anaconda, and Spyder, which has a 'Variable Explorer'. It shows the variable df to be a dict of the DataFrame type:

enter image description here

However, I cannot use iloc:

df.iloc[:,1]

Traceback (most recent call last):

  File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
    df.iloc[:,1]

AttributeError: 'dict' object has no attribute 'iloc'

Any thoughts? What am I missing?

EDIT:

To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around it seems to be not a very popular way to do things, or not the 'correct' way. I understand why indexing by column names, or keys, is better, but in this situation, I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.

EDIT (2):

Per a suggestion, I tried 'read_excel', with the same results:

df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):

  File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
    df.loc[1]

AttributeError: 'dict' object has no attribute 'loc'


df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):

  File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
    df.loc[1]

AttributeError: 'dict' object has no attribute 'loc'

1条回答
时光不老,我们不散
2楼-- · 2019-09-16 12:47

The problem was here:

sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]

which returned a single element list. I changed it to the following:

sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]

which returns a string, and the code then performs as expected.

All thanks to ayhan for pointing this out.

查看更多
登录 后发表回答