Difference of sets of multiple values for single c

I've got some grouped tabular data, and in this data there's a column for which each data point can actually have a set of different values. I'm trying to calculate the difference of that set from that of its preceding data point in the group it's a member of. For example, given the data below, I'm trying to calculate the difference of the values of Tokens^† for Timestep value n from the values of Tokens for the row with Timestamp value n - 1 for each Dyad,Participant combination:

| Dyad | Participant | Timestep | Tokens            |
|------|-------------|----------|-------------------|
| 1    | A           | 1        | apple,banana      |
| 1    | B           | 1        | apple,orange      |
| 1    | A           | 2        | banana            |
| 1    | B           | 2        | orange,kumquat    |
| 1    | A           | 3        | orange            |
| 1    | B           | 3        | orange,pear       |
| 2    | A           | 1        | orange,pear       |
| 2    | B           | 1        | apple,banana,pear |
| 2    | A           | 2        | banana,persimmon  |
| 2    | B           | 2        | apple             |
| 2    | A           | 3        | banana            |
| 2    | B           | 3        | apple             |

How can I best accomplish this using pandas?

Expected result

I ultimately want to create a new column with the output of a function token_overlap(data) which computes the ratio of Token values that overlap with the values of the datapoint preceding it:

| Dyad | Participant | Timestep | Tokens            | TokenOverlap |
|------|-------------|----------|-------------------| -------------|
| 1    | A           | 1        | apple,banana      | (no value)   |
| 1    | B           | 1        | apple,orange      | (no value)   |
| 1    | A           | 2        | banana            | 0.5          |
| 1    | B           | 2        | orange,kumquat    | 0.333        |
| 1    | A           | 3        | orange            | 0            |
| 1    | B           | 3        | orange,pear       | 0.333        |            
| 2    | A           | 1        | orange,pear       | (no value)   |
| 2    | B           | 1        | apple,banana,pear | (no value)   |
| 2    | A           | 2        | banana,persimmon  | 0            |
| 2    | B           | 2        | apple             | 0.333        |
| 2    | A           | 3        | banana            | 0.5          |
| 2    | B           | 3        | apple             | 1            |

Current approach

I convert the multi-values into a frozenset by using the converters keyword of pandas.read_csv(...):

def parse_set(cell_value: str) -> FrozenSet[str]:
    return frozenset(cell_value.split(','))

round_tokens = pandas.read_csv(inpath, converters={"Tokens": parse_set})

I then create groups of Dyad,Participant datapoints using pandas.DataFrame.groupby(..):

round_tokens.sort_values(["Dyad", "Timestep"])
dyad_participants = round_tokens.groupby(["Dyad", "Participant"])

However, I am unsure how to get each row and it's precedessor's Tokens value (which should be a frozenset): I have some function which tries to do so but I'm unsure if the function itself is wrong or if I'm extracting the row data incorrectly.

def token_overlap(data):
    own_relevant_tokens = data["Tokens"]
    prev_tokens = data.shift(-1)["Tokens"]
    overlap = own_relevant_tokens.intersection(prev_tokens)
    union = own_relevant_tokens.union(prev_tokens)
    return len(overlap) / len(union)

round_tokens["TokenOverlap"] = dyad_participants.apply(token_overlap)

However, this doesn't actually work: The actual error is

AttributeError: 'Series' object has no attribute 'union'

but I know I'm not using/understanding/grokking the pandas API correctly, hence the extremely long-winded question. How can I group my data and then, inside each group, calculate a metric using set-like values from one row and the same column's values for the row preceding it?

^†In the real data, there are over 1,000 possible values for Tokens, so, at least to me, this task would be even harder if I enumerated the presence of each token as Boolean values, e.g. Token_Apple, Token_Banana, etc.

标签： python pandas dataframe group-by set

1条回答

Melony?

2楼-- · 2019-07-23 10:59

Setup

df
    Dyad Participant  Timestep             Tokens
0      1           A         1       apple,banana
1      1           B         1       apple,orange
2      1           A         2             banana
3      1           B         2     orange,kumquat
4      1           A         3             orange
5      1           B         3        orange,pear
6      2           A         1        orange,pear
7      2           B         1  apple,banana,pear
8      2           A         2   banana,persimmon
9      2           B         2              apple
10     2           A         3             banana
11     2           B         3              apple

tokens = df.Tokens.str.split(',', expand=False).apply(frozenset) 

tokens
0           (apple, banana)
1           (orange, apple)
2                  (banana)
3         (orange, kumquat)
4                  (orange)
5            (orange, pear)
6            (orange, pear)
7     (apple, banana, pear)
8       (persimmon, banana)
9                   (apple)
10                 (banana)
11                  (apple)
Name: Tokens, dtype: object

# union logic - https://stackoverflow.com/a/46402781/4909087
df =  df.assign(Tokens=tokens)\
        .groupby(['Dyad', 'Participant']).apply(\
               lambda x: (x.Tokens.str.len() - 
                      x.Tokens.diff().str.len()) \
                    / pd.Series([len(k[0].union(k[1])) 
   for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index))\
        .reset_index(level=[0, 1], name='TokenOverlap')\
        .assign(Timestep=df.Timestep, Tokens=df.Tokens)\
        .sort_values(['Dyad', 'Timestep', 'Participant'])\
        .fillna('(no value)')\
         [['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]

df

    Dyad Participant  Timestep             Tokens TokenOverlap
0      1           A         1       apple,banana   (no value)
1      1           B         1       apple,orange   (no value)
2      1           A         2             banana          0.5
3      1           B         2     orange,kumquat     0.333333
4      1           A         3             orange            0
5      1           B         3        orange,pear     0.333333
6      2           A         1        orange,pear   (no value)
7      2           B         1  apple,banana,pear   (no value)
8      2           A         2   banana,persimmon            0
9      2           B         2              apple     0.333333
10     2           A         3             banana          0.5
11     2           B         3              apple            1

In a nutshell, what this code is doing is, grouping by Dyad and Participant, and then finding pairwise ratio. This needs some complicated groupby and apply, since we need to do a few set union and difference operations. The The core logic is inside the groupby.apply, while the rest is just prettification.

This code runs in:

10 loops, best of 3: 19.2 ms per loop

Breakdown

df2 = df.assign(Tokens=tokens)
df2 = df2.groupby(['Dyad', 'Participant']).apply(\
                   lambda x: (x.Tokens.str.len() - 
                          x.Tokens.diff().str.len()) \
                        / pd.Series([len(k[0].union(k[1])) 
       for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index)) # the for loop is part of this huge line

df2 = df2.reset_index(level=[0, 1], name='TokenOverlap')    
df2 = df2.assign(Timestep=df.Timestep, Tokens=df.Tokens)
df2 = df2.sort_values(['Dyad', 'Timestep', 'Participant']).fillna('(no value)')    
df2 = df2[['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]

0人赞添加讨论(0) 举报