Difference of sets of multiple values for single c

2019-07-23 10:32发布

I've got some grouped tabular data, and in this data there's a column for which each data point can actually have a set of different values. I'm trying to calculate the difference of that set from that of its preceding data point in the group it's a member of. For example, given the data below, I'm trying to calculate the difference of the values of Tokens for Timestep value n from the values of Tokens for the row with Timestamp value n - 1 for each Dyad,Participant combination:

| Dyad | Participant | Timestep | Tokens            |
|------|-------------|----------|-------------------|
| 1    | A           | 1        | apple,banana      |
| 1    | B           | 1        | apple,orange      |
| 1    | A           | 2        | banana            |
| 1    | B           | 2        | orange,kumquat    |
| 1    | A           | 3        | orange            |
| 1    | B           | 3        | orange,pear       |
| 2    | A           | 1        | orange,pear       |
| 2    | B           | 1        | apple,banana,pear |
| 2    | A           | 2        | banana,persimmon  |
| 2    | B           | 2        | apple             |
| 2    | A           | 3        | banana            |
| 2    | B           | 3        | apple             |

How can I best accomplish this using pandas?

Expected result

I ultimately want to create a new column with the output of a function token_overlap(data) which computes the ratio of Token values that overlap with the values of the datapoint preceding it:

| Dyad | Participant | Timestep | Tokens            | TokenOverlap |
|------|-------------|----------|-------------------| -------------|
| 1    | A           | 1        | apple,banana      | (no value)   |
| 1    | B           | 1        | apple,orange      | (no value)   |
| 1    | A           | 2        | banana            | 0.5          |
| 1    | B           | 2        | orange,kumquat    | 0.333        |
| 1    | A           | 3        | orange            | 0            |
| 1    | B           | 3        | orange,pear       | 0.333        |            
| 2    | A           | 1        | orange,pear       | (no value)   |
| 2    | B           | 1        | apple,banana,pear | (no value)   |
| 2    | A           | 2        | banana,persimmon  | 0            |
| 2    | B           | 2        | apple             | 0.333        |
| 2    | A           | 3        | banana            | 0.5          |
| 2    | B           | 3        | apple             | 1            |

Current approach

I convert the multi-values into a frozenset by using the converters keyword of pandas.read_csv(...):

def parse_set(cell_value: str) -> FrozenSet[str]:
    return frozenset(cell_value.split(','))

round_tokens = pandas.read_csv(inpath, converters={"Tokens": parse_set})

I then create groups of Dyad,Participant datapoints using pandas.DataFrame.groupby(..):

round_tokens.sort_values(["Dyad", "Timestep"])
dyad_participants = round_tokens.groupby(["Dyad", "Participant"])

However, I am unsure how to get each row and it's precedessor's Tokens value (which should be a frozenset): I have some function which tries to do so but I'm unsure if the function itself is wrong or if I'm extracting the row data incorrectly.

def token_overlap(data):
    own_relevant_tokens = data["Tokens"]
    prev_tokens = data.shift(-1)["Tokens"]
    overlap = own_relevant_tokens.intersection(prev_tokens)
    union = own_relevant_tokens.union(prev_tokens)
    return len(overlap) / len(union)

round_tokens["TokenOverlap"] = dyad_participants.apply(token_overlap)

However, this doesn't actually work: The actual error is

AttributeError: 'Series' object has no attribute 'union'

but I know I'm not using/understanding/grokking the pandas API correctly, hence the extremely long-winded question. How can I group my data and then, inside each group, calculate a metric using set-like values from one row and the same column's values for the row preceding it?


In the real data, there are over 1,000 possible values for Tokens, so, at least to me, this task would be even harder if I enumerated the presence of each token as Boolean values, e.g. Token_Apple, Token_Banana, etc.

1条回答
Melony?
2楼-- · 2019-07-23 10:59

Setup

df
    Dyad Participant  Timestep             Tokens
0      1           A         1       apple,banana
1      1           B         1       apple,orange
2      1           A         2             banana
3      1           B         2     orange,kumquat
4      1           A         3             orange
5      1           B         3        orange,pear
6      2           A         1        orange,pear
7      2           B         1  apple,banana,pear
8      2           A         2   banana,persimmon
9      2           B         2              apple
10     2           A         3             banana
11     2           B         3              apple

tokens = df.Tokens.str.split(',', expand=False).apply(frozenset) 

tokens
0           (apple, banana)
1           (orange, apple)
2                  (banana)
3         (orange, kumquat)
4                  (orange)
5            (orange, pear)
6            (orange, pear)
7     (apple, banana, pear)
8       (persimmon, banana)
9                   (apple)
10                 (banana)
11                  (apple)
Name: Tokens, dtype: object

# union logic - https://stackoverflow.com/a/46402781/4909087
df =  df.assign(Tokens=tokens)\
        .groupby(['Dyad', 'Participant']).apply(\
               lambda x: (x.Tokens.str.len() - 
                      x.Tokens.diff().str.len()) \
                    / pd.Series([len(k[0].union(k[1])) 
   for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index))\
        .reset_index(level=[0, 1], name='TokenOverlap')\
        .assign(Timestep=df.Timestep, Tokens=df.Tokens)\
        .sort_values(['Dyad', 'Timestep', 'Participant'])\
        .fillna('(no value)')\
         [['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]

df

    Dyad Participant  Timestep             Tokens TokenOverlap
0      1           A         1       apple,banana   (no value)
1      1           B         1       apple,orange   (no value)
2      1           A         2             banana          0.5
3      1           B         2     orange,kumquat     0.333333
4      1           A         3             orange            0
5      1           B         3        orange,pear     0.333333
6      2           A         1        orange,pear   (no value)
7      2           B         1  apple,banana,pear   (no value)
8      2           A         2   banana,persimmon            0
9      2           B         2              apple     0.333333
10     2           A         3             banana          0.5
11     2           B         3              apple            1

In a nutshell, what this code is doing is, grouping by Dyad and Participant, and then finding pairwise ratio. This needs some complicated groupby and apply, since we need to do a few set union and difference operations. The The core logic is inside the groupby.apply, while the rest is just prettification.

This code runs in:

10 loops, best of 3: 19.2 ms per loop

Breakdown

df2 = df.assign(Tokens=tokens)
df2 = df2.groupby(['Dyad', 'Participant']).apply(\
                   lambda x: (x.Tokens.str.len() - 
                          x.Tokens.diff().str.len()) \
                        / pd.Series([len(k[0].union(k[1])) 
       for k in zip(x.Tokens, x.Tokens.shift(1).fillna(''))], index=x.index)) # the for loop is part of this huge line

df2 = df2.reset_index(level=[0, 1], name='TokenOverlap')    
df2 = df2.assign(Timestep=df.Timestep, Tokens=df.Tokens)
df2 = df2.sort_values(['Dyad', 'Timestep', 'Participant']).fillna('(no value)')    
df2 = df2[['Dyad', 'Participant', 'Timestep', 'Tokens', 'TokenOverlap']]
查看更多
登录 后发表回答