I am using python and multiple libaries like pandas and scipy to prepare data so I can start deeper analysis. For the preparation purpose I am for instance creating new columns with the difference of two dates.
My code is providing the expected results but is really slow so I cannot use it for a table with like 80K rows. The run time would take ca. 80 minutes for the table just for this simple operation.

The problem is definitely related with my writing operation:

tableContent[6]['p_test_Duration'].iloc[x] = difference

Moreover python is providing a Warning:

complete code example for date difference:

import time
from datetime import date, datetime

tableContent[6]['p_test_Duration'] = 0

#for x in range (0,len(tableContent[6]['p_test_Duration'])):
for x in range (0,1000):
    p_test_ZEIT_ANFANG = datetime.strptime(tableContent[6]['p_test_ZEIT_ANFANG'].iloc[x], '%Y-%m-%d %H:%M:%S')
    p_test_ZEIT_ENDE = datetime.strptime(tableContent[6]['p_test_ZEIT_ENDE'].iloc[x], '%Y-%m-%d %H:%M:%S')
    difference = p_test_ZEIT_ENDE - p_test_ZEIT_ANFANG

    tableContent[6]['p_test_Duration'].iloc[x] = difference

the correct result table:

标签： python python-3.x pandas chained-assignment

3条回答

我命由我不由天

2楼-- · 2019-07-18 10:21

You can vectorize the conversion of dates by using pd.to_datetime and avoid using apply unnecessarily.

tableContent[6]['p_test_Duration'] = (
    pd.to_datetime(tableContent[6]['p_test_ZEIT_ENDE']) -
    pd.to_datetime(tableContent[6]['p_test_ZEIT_ANFANG'])
)

Also, you were getting the SettingWithCopy warning because of the chained indexing assingnment

tableContent[6]['p_test_Duration'].iloc[x] = difference

Which you don't have to worry about if you go about it in the way I suggested.

0人赞添加讨论(0) 举报

倾城　Initia

3楼-- · 2019-07-18 10:24

Take away the loop, and apply the functions to the whole series.

ZEIT_ANFANG = tableContent[6]['p_test_ZEIT_ANFANG'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
ZEIT_ENDE = tableContent[6]['p_test_ZEIT_ENDE'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
tableContent[6]['p_test_Duration'] = ZEIT_ENDE - ZEIT_ANFANG

0人赞添加讨论(0) 举报

虎瘦雄心在

4楼-- · 2019-07-18 10:33

The other answers are fine, but I would recommend that you avoid chained indexing in general. The pandas docs explicitly discourage chained indexing as it either produces unreliable results or is slow (due to multiple calls to __getitem__). Assuming your data frame is multi-indexed, you might replace:

tableContent[6]['p_test_Duration'].iloc[x] = difference

with:

tableContent.loc[x, (6, 'p_test_Duration')] = difference

You can sometimes get around this issue, but why not learn the method least likely to cause problems in the future?

0人赞添加讨论(0) 举报

python performance problems using loops with big t

Moreover python is providing a Warning:

complete code example for date difference:

the correct result table:

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间