I am using python and multiple libaries like pandas and scipy to prepare data so I can start deeper analysis. For the preparation purpose I am for instance creating new columns with the difference of two dates.
My code is providing the expected results but is really slow so I cannot use it for a table with like 80K rows. The run time would take ca. 80 minutes for the table just for this simple operation.
The problem is definitely related with my writing operation:
tableContent[6]['p_test_Duration'].iloc[x] = difference
Moreover python is providing a Warning:
complete code example for date difference:
import time
from datetime import date, datetime
tableContent[6]['p_test_Duration'] = 0
#for x in range (0,len(tableContent[6]['p_test_Duration'])):
for x in range (0,1000):
p_test_ZEIT_ANFANG = datetime.strptime(tableContent[6]['p_test_ZEIT_ANFANG'].iloc[x], '%Y-%m-%d %H:%M:%S')
p_test_ZEIT_ENDE = datetime.strptime(tableContent[6]['p_test_ZEIT_ENDE'].iloc[x], '%Y-%m-%d %H:%M:%S')
difference = p_test_ZEIT_ENDE - p_test_ZEIT_ANFANG
tableContent[6]['p_test_Duration'].iloc[x] = difference
You can vectorize the conversion of dates by using
pd.to_datetime
and avoid usingapply
unnecessarily.Also, you were getting the
SettingWithCopy
warning because of the chained indexing assingnmentWhich you don't have to worry about if you go about it in the way I suggested.
Take away the loop, and apply the functions to the whole series.
The other answers are fine, but I would recommend that you avoid chained indexing in general. The pandas docs explicitly discourage chained indexing as it either produces unreliable results or is slow (due to multiple calls to __getitem__). Assuming your data frame is multi-indexed, you might replace:
with:
You can sometimes get around this issue, but why not learn the method least likely to cause problems in the future?