I am using python and multiple libaries like pandas and scipy to prepare data so I can start deeper analysis. For the preparation purpose I am for instance creating new columns with the difference of two dates.
My code is providing the expected results but is really slow so I cannot use it for a table with like 80K rows. The run time would take ca. 80 minutes for the table just for this simple operation.
The problem is definitely related with my writing operation:
tableContent[6]['p_test_Duration'].iloc[x] = difference
Moreover python is providing a Warning:
complete code example for date difference:
import time
from datetime import date, datetime
tableContent[6]['p_test_Duration'] = 0
#for x in range (0,len(tableContent[6]['p_test_Duration'])):
for x in range (0,1000):
p_test_ZEIT_ANFANG = datetime.strptime(tableContent[6]['p_test_ZEIT_ANFANG'].iloc[x], '%Y-%m-%d %H:%M:%S')
p_test_ZEIT_ENDE = datetime.strptime(tableContent[6]['p_test_ZEIT_ENDE'].iloc[x], '%Y-%m-%d %H:%M:%S')
difference = p_test_ZEIT_ENDE - p_test_ZEIT_ANFANG
tableContent[6]['p_test_Duration'].iloc[x] = difference
the correct result table:
Take away the loop, and apply the functions to the whole series.
ZEIT_ANFANG = tableContent[6]['p_test_ZEIT_ANFANG'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
ZEIT_ENDE = tableContent[6]['p_test_ZEIT_ENDE'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
tableContent[6]['p_test_Duration'] = ZEIT_ENDE - ZEIT_ANFANG
You can vectorize the conversion of dates by using pd.to_datetime
and avoid using apply
unnecessarily.
tableContent[6]['p_test_Duration'] = (
pd.to_datetime(tableContent[6]['p_test_ZEIT_ENDE']) -
pd.to_datetime(tableContent[6]['p_test_ZEIT_ANFANG'])
)
Also, you were getting the SettingWithCopy
warning because of the chained indexing assingnment
tableContent[6]['p_test_Duration'].iloc[x] = difference
Which you don't have to worry about if you go about it in the way I suggested.
The other answers are fine, but I would recommend that you avoid chained indexing in general. The pandas docs explicitly discourage chained indexing as it either produces unreliable results or is slow (due to multiple calls to __getitem__). Assuming your data frame is multi-indexed, you might replace:
tableContent[6]['p_test_Duration'].iloc[x] = difference
with:
tableContent.loc[x, (6, 'p_test_Duration')] = difference
You can sometimes get around this issue, but why not learn the method least likely to cause problems in the future?