How to load an excel sheet and clean the data in p

2020-08-09 04:16发布

站内文章 / Python

61 0

走好不送

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Load the energy data from the file Energy Indicators.xls, which is a list of indicators of energy supply and renewable electricity production from the United Nations for the year 2013, and should be put into a DataFrame with the variable name of energy.

Keep in mind that this is an Excel file, and not a comma separated values file. Also, make sure to exclude the footer and header information from the datafile. The first two columns are unneccessary, so you should get rid of them, and you should change the column labels so that the columns are:

['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'] Convert Energy Supply to gigajoules (there are 1,000,000 gigajoules in a petajoule). For all countries which have missing data (e.g. data with "...") make sure this is reflected as np.NaN values.

Rename the following list of countries (for use in later questions): "Republic of Korea": "South Korea", "United States of America": "United States", "United Kingdom of Great Britain and Northern Ireland": "United Kingdom", "China, Hong Kong Special Administrative Region": "Hong Kong"

There are also several countries with numbers and/or parenthesis in their name. Be sure to remove these, e.g. 'Bolivia (Plurinational State of)' should be 'Bolivia', 'Switzerland17' should be 'Switzerland'.

Next, load the GDP data from the file world_bank.csv, which is a csv containing countries' GDP from 1960 to 2015 from World Bank. Call this DataFrame GDP. Make sure to skip the header, and rename the following list of countries: "Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"

Finally, load the Sciamgo Journal and Country Rank data for Energy Engineering and Power Technology from the file scimagojr-3.xlsx, which ranks countries based on their journal contributions in the aforementioned area. Call this DataFrame ScimEn.

Join the three datasets: GDP, Energy, and ScimEn into a new dataset (using the intersection of country names). Use only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15).

The index of this DataFrame should be the name of the country, and the columns should be ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'].

This function should return a DataFrame with 20 columns and 15 entries.

I have tried the following code for this question, but it is returning only 12 rows instead of 15:

import pandas as pd

from pandas import ExcelWriter

from pandas import ExcelFile

pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)

Energy = pd.read_excel('Energy Indicators.xls')

Energy.drop(Energy.columns[[0,1]],axis=1,inplace=True)

Energy.columns=['Country','Energy Supply','Energy Supply per capita','% Renewable']

Energy['Energy Supply']*=1000000

Energy['Country'] = Energy['Country'].str.replace(r"\(.*\)","")

Energy['Country'] = Energy['Country'].str.replace("[0-9()]+$", "")

Energy.replace('Republic of Korea','South Korea', inplace = True)

Energy.replace('United States of America','United States', inplace = True)

Energy.replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom', inplace = True)

Energy.replace('China, Hong Kong Special Administrative Region','Hong Kong', inplace = True)

import pandas as pd

GDP = pd.read_csv('world_bank.csv', index_col=0, header=None)

GDP = GDP.drop(['Data Source'])

GDP = GDP.dropna()

GDP = GDP.reset_index()

GDP.columns = GDP.iloc[0]

GDP.drop(GDP.index[[0,3]], inplace=True)

GDP = GDP.rename(columns={'Country Name': 'Country'})

GDP.replace(',','-', inplace=True)

GDP = GDP.replace('Korea, Rep.','South Korea')

GDP = GDP.replace('Iran, Islamic Rep.','Iran')

GDP = GDP.replace('Hong Kong SAR, China','Hong Kong')


import pandas as pd

from pandas import ExcelWriter

from pandas import ExcelFile

pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)

ScimEn = pd.read_excel('scimagojr-3.xlsx')


b = pd.merge(pd.merge(Energy,GDP,on='Country'),ScimEn,on='Country')

a = pd.merge(pd.merge(Energy,GDP,on='Country'),ScimEn,on='Country')

a = a.sort(['Rank'], ascending=[True])

a = a[a["Rank"] < 16]

a=a.rename(columns = {'2006.0':'abc'})

a.columns.values[53] = "2006"

a.columns.values[54] = "2007"

a.columns.values[55] = "2008"

a.columns.values[56] = "2009"

a.columns.values[57] = "2010"

a.columns.values[58] = "2011"

a.columns.values[59] = "2012"

a.columns.values[60] = "2013"

a.columns.values[61] = "2014"

a.columns.values[62] = "2015"


a = a[['Country','Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]


a = a.set_index('Country')


def ans():

    return a

ans()

回答1:

import numpy as np 
import pandas as pd 


def energy():
    energy=pd.ExcelFile('Energy Indicators.xls').parse('Energy')
    energy=energy.iloc[16:243][['Environmental Indicators: Energy','Unnamed: 3','Unnamed: 4','Unnamed: 5']].copy()
    energy.columns=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']

    energy = energy.replace('...', np.nan)
    energy['Energy Supply']=energy['Energy Supply']*1000000

    energy = energy.replace("Republic of Korea", "South Korea")
    energy = energy.replace("United States of America", "United States")
    energy = energy.replace("United Kingdom of Great Britain and Northern Ireland","United Kingdom")
    energy = energy.replace("China, Hong Kong Special Administrative Region", "Hong Kong")

    energy['Country'] = energy['Country'].str.extract('(^[a-zA-Z\s]+)', expand=False).str.strip()   

    energy=energy.reset_index()
    energy=energy[['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']]
    return energy.iloc[43]

def GDP():
    GDP=pd.read_csv('world_bank.csv')
    s=(GDP.iloc[3].values)[:4].astype(str).tolist()+(GDP.iloc[3].values)[4:].astype(int).astype(str).tolist()
    GDP=GDP.iloc[4:]
    GDP.columns=s
    GDP=GDP[['Country Name','2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
    GDP.columns=['Country','2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
    GDP=GDP.replace("Korea, Rep.", "South Korea",regex=False)

    GDP=GDP.replace("Iran, Islamic Rep.","Iran")

    GDP=GDP.replace("Hong Kong SAR, China","Hong Kong",regex=False)
    return GDP

def ScimEn():
    ScimEn=pd.ExcelFile('scimagojr-3.xlsx').parse('Sheet1')

    return ScimEn

def result():   
    e= energy()
    G=GDP()
    S=ScimEn()
    tdf=pd.merge(e,G,on='Country')
    tdf=pd.merge(tdf,S,on='Country')
    res = tdf.sort_values(by=['Rank'], inplace = True)
    res = tdf.head(15)
    res=res.set_index('Country', inplace=False)
    return res

回答2:

Please ask your question one by one. However, I came up with a solution for some of the questions you asked above. You can load excel file using ex = pd.ExcelFile('Yourfilename.xls') and then check your sheet name in excel file using ex.sheet_name. next you can use the following to read that sheet you have in your excel file,

en = ex.parse('sheetname', skiprows = 2, skip_footer =True,..)

then, you can replace the special characters using the following syntax

en.replace('$%^',np.NaN, inplace =True)

now you can check your dataframe's header and see and then check how many rows you still have to skip and other details.

to drop columns, you can use the following syntax

en.drop([dol for col in ['colname1', 'colname2', ...] if col in en], axis =1, inplace =True)

This is it for now from the big question. Work on this and if this does what you want, make is as the answer for the parts to read excel file, skip rows, replace NaN, and skip columns.

回答3:

def answer_one():

    import pandas as pd
    import numpy as np

    energy=pd.read_excel('Energy Indicators.xls',skiprows=9)
    energy=energy.drop(['Unnamed: 0','Unnamed: 1'],axis=1)
    energy=energy.drop([0],axis=0)
    energy=energy.rename(columns={'Country':'Country','Energy Supply':'Energy Supply','Energy Supply per capita':'Energy Supply per Capita','Renewable Electricity Production':'% Renewable'})
    energy['Energy Supply'] *=1000000
    energy=energy.replace('...',np.nan)

    energy['Country']=energy['Country'].str.replace(r"( \(.*\))","")
    energy['Country']=energy['Country'].str.replace(r"([0-9])","")

    energy=energy.replace('Republic of Korea','South Korea')                   #Method1
    energy.replace('United States of America','United States',inplace=True)    #Method2
    energy['Country'] = energy['Country'].replace({'China, Hong Kong Special Administrative Region':'Hong Kong','United Kingdom of Great Britain and Northern Ireland':'United Kingdom','Iran (Islamic Republic of)':'Iran'}) #Method3

    cmp=energy['Country']
    energy=energy[energy['Country']== cmp.values]
    energy.reset_index()

    ##############################################################################################
    GDP=pd.read_csv('world_bank.csv',skiprows=4)
    GDP=GDP.replace('Korea, Rep.','South Korea')
    GDP=GDP.replace('Iran, Islamic Rep.','Iran')
    GDP=GDP.replace('Hong Kong SAR, China','Hong Kong')

    GDP [ GDP['Country Name']=='Hong Kong' ]
    GDP=GDP[['Country Name','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']]
    GDP=GDP.rename(columns={'Country Name':'Country'})
    ##############################################################################################

    ScimEn=pd.read_excel('scimagojr-3.xlsx')
    ScimEn=ScimEn.nsmallest(15,'Rank')

    ##############################################################################################

    merge_1=pd.merge(energy,ScimEn,how='inner',left_on='Country',right_on='Country')
    final_merge=pd.merge(merge_1,GDP,on='Country')
    final_merge=final_merge[['Country','Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']]
    final_merge=final_merge.set_index('Country')


    return final_merge

answer_one()

回答4:

def answer_one():
    import pandas as pd
    energy=pd.read_excel('Energy Indicators.xls', skiprows=2)
    energy.columns=['a','b','Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
    del energy['a']
    del energy['b']
    energy['Energy Supply']*=1000000
    energy['Country'] = energy['Country'].str.replace(r"\(.*\)","")
    energy['Country'] = energy['Country'].str.replace("[0-9()]+$", "")
    energy.replace('Republic of Korea','South Korea', inplace = True)
    energy.replace('United States of America','United States', inplace = True)
    energy.replace('United Kingdom of Great Britain and Northern Ireland','United Kingdom', inplace = True)
    energy.replace('China, Hong Kong Special Administrative Region','Hong Kong', inplace = True)
    GDP=pd.read_csv('world_bank.csv',skiprows=4)
    GDP.replace('Korea, Rep.','South Korea')
    GDP.replace('Iran, Islamic Rep.','Iran')
    GDP.replace('Hong Kong SAR, China' , 'Hong Kong')
    ScimEn=pd.read_excel('scimagojr-3.xlsx')
    GDP.columns=['Country', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015']
    for i in ['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005']:
        del GDP[i]
    ScimEn=ScimEn[ScimEn['Rank']<16]
    x=pd.merge(GDP,ScimEn,how='inner',left_on='Country',right_on='Country')
    y=pd.merge(x,energy,how='inner',left_on='Country',right_on='Country')
    y=y.set_index('Country')
    del y['Country Code']
    del y['Indicator Name']
    del y['Indicator Code']
    return y

answer_one()