How to get count of values based on datetime in Py

2020-08-09 06:01发布

问题:

I have written the following code which creates two dataframes nq and cmnt.
nq contains the UserId and corresponding time of Badge Attainment date.
cmnt contains OwnerUserId and the time when the User made a comment CreationDate.
I want to get a count of the comments made for all days before and after 1 week of badge attainment so that I can create a time series line plot out of it.

The following code perform the same but produces a KeyError. Please provide a code that performs this operations for all users.

nq

 UserId |   date 
     1      2009-10-17 17:38:32.590
     2      2009-10-19 00:37:23.067
     3      2009-10-20 08:37:14.143
     4      2009-10-21 18:07:51.247
     5      2009-10-22 21:25:24.483

cmnt

OwnerUserId | CreationDate
1             2009-10-16 17:38:32.590
1             2009-10-18 17:38:32.590
2             2009-10-18 00:37:23.067
2             2009-10-17 00:37:23.067
2             2009-10-20 00:37:23.067
3             2009-10-19 08:37:14.143
4             2009-10-20 18:07:51.247
5             2009-10-21 21:25:24.483

code

 nq.date = pd.to_datetime(nq.date)
 cmnt.CreationDate = pd.to_datetime(cmnt.CreationDate)

 count= []
   
 for j in range(len(nq)): 
      for i in range(-7,8):
        
          check_date = nq.date.iloc[j] + timedelta(days=i)
          
          count = cmnt.loc[(cmnt.OwnerUserId == nq.UserId.iloc[j]) & (cmnt.CreationDate == check_date)].shape[0]
          nq.iloc[j].append({nq[i]:count})

expected output

UserId     |   date                 |-7|-6|-5|-4|-3|-2|-1|0 |1 |2 |3 |4 |5 |6 |7
     1      2009-10-17 17:38:32.590 |0 |0 |0 |0 |0 |0 |1 |0 |1 |0 |0 |0 |0 |0 |0  
     2      2009-10-19 00:37:23.067 |0 |0 |0 |0 |0 |1 |1 |0 |1 |0 |0 |0 |0 |0 |0    
     3      2009-10-20 08:37:14.143 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0 
     4      2009-10-21 18:07:51.247 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0 
     5      2009-10-22 21:25:24.483 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0 

Here column -1 means comment made 1 day before badge attainment and 1 means comment made one day after badge attainment and so on.

Note There can be a completely alternately way to do this. My main objective is to draw a time series line plot which shows the number of comments made by the users before and after attainment of the badge.

回答1:

Here's a way to do that:

t = pd.merge(nq, cmnt, left_on="UserId", right_on = "OwnerUserId")
t["days_diff"] = (t["CreationDate"] - t["date"]).dt.days
t["count"] = t.groupby(["UserId", "days_diff"]).OwnerUserId.transform("count")

all_days = pd.DataFrame(itertools.product(t.UserId.unique(), range(-7, 8)), )
all_days.columns = ["UserId", "day"]

t = pd.merge(t, all_days, left_on=["UserId", "days_diff"], right_on=["UserId", "day"], how = "right")
t = pd.pivot_table(t, index="UserId", columns="day", values="count", dropna=False)

res = pd.merge(nq, t, left_on="UserId", right_index=True)

print(res)

The output is:

   UserId                    date  -7  -6  -5  -4  -3   -2   -1   0    1   2   3   4   5   6   7
0       1 2009-10-17 17:38:32.590 NaN NaN NaN NaN NaN  NaN  1.0 NaN  1.0 NaN NaN NaN NaN NaN NaN
1       2 2009-10-19 00:37:23.067 NaN NaN NaN NaN NaN  1.0  1.0 NaN  1.0 NaN NaN NaN NaN NaN NaN
2       3 2009-10-20 08:37:14.143 NaN NaN NaN NaN NaN  NaN  1.0 NaN  NaN NaN NaN NaN NaN NaN NaN
3       4 2009-10-21 18:07:51.247 NaN NaN NaN NaN NaN  NaN  1.0 NaN  NaN NaN NaN NaN NaN NaN NaN
4       5 2009-10-22 21:25:24.483 NaN NaN NaN NaN NaN  NaN  1.0 NaN  NaN NaN NaN NaN NaN NaN NaN