Let's say I want to create and fill an empty DataFrame with values from a loop.
import pandas as pd
import numpy as np
years = [2013, 2014, 2015]
dn = pd.DataFrame()
for year in years:
    df1 = pd.DataFrame({'Incidents': ['C', 'B', 'A'],
                        year: [1, 1, 1],
                        }).set_index('Incidents')
    print(df1)
    dn = dn.append(df1, ignore_index=False)
The append gives a diagonal matrix even when ignore_index is False:
>>> dn
           2013  2014  2015
Incidents
C             1   NaN   NaN
B             1   NaN   NaN
A             1   NaN   NaN
C           NaN     1   NaN
B           NaN     1   NaN
A           NaN     1   NaN
C           NaN   NaN     1
B           NaN   NaN     1
A           NaN   NaN     1
[9 rows x 3 columns]
It should look like this:
>>> dn
           2013  2014  2015
Incidents
C             1     1     1
B             1     1     1
A             1     1     1
[3 rows x 3 columns]
Is there a better way of doing this? And is there a way to fix the append?
I have pandas version '0.13.1-557-g300610e'
As far as I know, you should avoid adding to the DataFrame line by line, because it is slow.
What I usually do is:
yields
Note that calling pd.concat once outside the loop is more time-efficient than calling pd.concat with each iteration of the loop.

Each time you call pd.concat, new space is allocated for a new DataFrame, and all the data from each component DataFrame is copied into the new DataFrame. If you call pd.concat from within the for-loop, then you end up doing on the order of n**2 copies, where n is the number of years.

If you accumulate the partial DataFrames in a list and call pd.concat once outside the loop, then pandas only needs to perform n copies to make dn.
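
The approach described above (accumulate the partial DataFrames in a list, then call pd.concat once outside the loop) can be sketched as follows. This is a minimal sketch, not necessarily the exact code the answer used: the list name `frames` and the choice of `axis=1` are assumptions, picked so that the per-year frames line up side by side on the shared 'Incidents' index. It also avoids DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

years = [2013, 2014, 2015]

# Assumption: collect the per-year frames in a plain Python list.
# Appending to a list does not copy any DataFrame data.
frames = []
for year in years:
    df1 = pd.DataFrame({
        'Incidents': ['C', 'B', 'A'],
        year: [1, 1, 1],
    }).set_index('Incidents')
    frames.append(df1)

# Concatenate once, outside the loop. axis=1 places the frames next to
# each other as columns, aligned on the 'Incidents' index, instead of
# stacking them as extra rows.
dn = pd.concat(frames, axis=1)
print(dn)
```

With the three years in the question, concatenating inside the loop would copy roughly 1 + 2 + 3 = 6 per-year blocks in total, whereas the single call here copies 3.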