Python dataframe interaction

Posted 2019-08-28 05:34

Question:

I have the following dataframe:

topic  student level week
1      a       1     1
1      b       2     1
1      a       3     1
2      a       1     2
2      b       2     2
2      a       3     2
2      b       4     2

The new dataframe should represent the interactions between students within a topic. It should contain four columns: "student source", "student destination", "week" and "reply count".

Student destination is a student who shares a topic with the student source.

Reply count is the number of times the student destination "directly" replied to the student source.

The new dataframe should look like:

st_source st_dest  week  reply_count
    a        b       1        1
    a        b       2        2
    b        a       1        1
    b        a       2        1

Reply count is easier to explain with an example.

Suppose student A starts a thread (by sending a message at level 1), B replies to A (a message at level 2), and C replies to B (a message at level 3). Then B "directly" replied to A, and C "directly" replied to B, but C's reply to A is not direct, so we don't count it.
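To make the rule concrete, here is a minimal sketch in plain Python. The helper name `direct_replies` is hypothetical, and it assumes the authors of one thread are listed in level order, so each message is a reply to the one just before it.

```python
def direct_replies(authors):
    """Return (source, destination) pairs for one thread.

    `authors` lists the message authors in level order (level 1 first).
    Each message is treated as a direct reply to the previous message only.
    """
    return list(zip(authors, authors[1:]))


pairs = direct_replies(["A", "B", "C"])
print(pairs)  # [('A', 'B'), ('B', 'C')]
```

The pair ('A', 'C') never appears, because C's message does not immediately follow A's.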

Does anyone have an idea how to do this?

Thank you in advance!

Answer 1:

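# Within each week, pair every message's author (st_dest) with the author
# of the message just before it (st_source), then count each pair.
# Rows are assumed to be ordered by level within each thread.
result = (df.groupby('week').apply(
        lambda g: g.groupby([g.student.shift(), g.student])
        # .agg({'reply_count': 'count'}) relied on dict renaming, removed in pandas 1.0
        .size().to_frame('reply_count')
        .rename_axis(['st_source', 'st_dest'])
    ).reset_index())

result[['st_source', 'st_dest', 'week', 'reply_count']].sort_values(['st_source', 'st_dest'])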

#   st_source st_dest  week  reply_count
# 0         a       b     1            1
# 2         a       b     2            2
# 1         b       a     1            1
# 3         b       a     2            1
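The same counts can probably be reached without `apply`. The following is a sketch under two assumptions that go beyond the original post: `level` gives the order of messages within a `topic`, and replies should be paired per topic, so that the first message of one thread is never treated as a reply to the last message of another. The sample `df` is rebuilt from the table in the question so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2],
    "student": ["a", "b", "a", "a", "b", "a", "b"],
    "level":   [1, 2, 3, 1, 2, 3, 4],
    "week":    [1, 1, 1, 2, 2, 2, 2],
})

# Order messages inside each thread, then look up the previous author
# within the same topic; that previous author is the reply's source.
pairs = (
    df.sort_values(["topic", "level"])
      .assign(st_source=lambda d: d.groupby("topic")["student"].shift())
      .rename(columns={"student": "st_dest"})
      .dropna(subset=["st_source"])  # thread starters reply to nobody
)

# Count how often each (source, destination) pair occurs per week.
result = (
    pairs.groupby(["st_source", "st_dest", "week"])
         .size()
         .reset_index(name="reply_count")
)
print(result)
```

On the sample data this gives the four rows requested in the question. If a topic can span several weeks, the shift is still taken per topic, while the count is attributed to the week of the replying message.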