The following is only the beginning for an Coursera assignment on Data Science. I hope this is not to trivial for. But I am lost on this and could not find an answer.
I am asked to import an Excelfile into a panda dataframe and to manipulate it afterwards. The file can be found here: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
What makes it difficult for me is
a) there is an 'overhead' of 17 lines and a footer
b) the first two columns are empty
c) the index column has no header name
After hours if seraching and reading I came up with this useless line:
energy=pd.read_excel('Energy Indicators.xls',
sheetname='Energy',
header=16,
skiprows=[17],
skipfooter=38,
skipcolumns=2
)
This seems to produce a multindex dataframe. Though the command energy.head() returns nothing.
I have two questions:
- what did I wrong. Up to this exercise I thought I understand the dataframe. But now I am totally clueless and lost :-((
- How do I have to tackle this? What do I have to do to get this Exceldata into a datafrae with the index consisting of the countries?
Thanks.
I think you need add parameters:
index_col
for convert column to index
usecols
- parse columns by positions
- change header position to
15
energy=pd.read_excel('Energy Indicators.xls',
sheetname='Energy',
skiprows=[17],
skipfooter=38,
header=15,
index_col=[0],
usecols=[2,3,4,5]
)
print (energy.head())
Energy Supply Energy Supply per capita \
Afghanistan 321 10
Albania 102 35
Algeria 1959 51
American Samoa ... ...
Andorra 9 121
Renewable Electricity Production
Afghanistan 78.669280
Albania 100.000000
Algeria 0.551010
American Samoa 0.641026
Andorra 88.695650
I installed xlrd
package, with pip install xlrd
and then loaded the file successfully as follows:
In [17]: df = pd.read_excel(r"http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls",
...: sheetname='Energy',
...: header=16,
...: skiprows=[17],
...: skipfooter=38,
...: skipcolumns=2)
In [18]: df.shape
Out[18]: (227, 3)
In [19]: df.head()
Out[19]:
Energy Supply Energy Supply per capita \
NaN Afghanistan Afghanistan 321 10
Albania Albania 102 35
Algeria Algeria 1959 51
American Samoa American Samoa ... ...
Andorra Andorra 9 121
Renewable Electricity Production
NaN Afghanistan Afghanistan 78.669280
Albania Albania 100.000000
Algeria Algeria 0.551010
American Samoa American Samoa 0.641026
Andorra Andorra 88.695650
In [20]: pd.__version__
Out[20]: u'0.20.3'
In [21]: df.columns
Out[21]:
Index([u'Energy Supply', u'Energy Supply per capita',
u'Renewable Electricity Production'],
dtype='object')
Notice that I am using the last version of pandas 0.20.3
make sure you have the latest version on your system.
I modified your code and was able to get the data into the dataframe. Instead of skipcolumns
(which did not work), I used the argument usecols
as follows
energy=pd.read_excel('Energy_Indicators.xls',
sheetname='Energy',
header=16,
skiprows=[16],
skipfooter=38,
usecols=[2,3,4,5]
)
Unnamed: 2 Petajoules Gigajoules %
0 Afghanistan 321 10 78.669280
1 Albania 102 35 100.000000
2 Algeria 1959 51 0.551010
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.695650
In order to make the countries as the index, you can do the following
# Rename the column Unnamed: 2 to Country
energy = energy.rename(columns={'Unnamed: 2':'Country'})
# Change the index to country column
energy.index = energy['Country']
# Drop the extra country column
energy = energy.drop('Country', axis=1)