I've got some grouped tabular data, and in this data there's a column for which each data point can actually have a set of different values. I'm trying to calculate the difference of that set from that of its preceding data point in the group it's a member of. For example, given the data below, I'm trying to calculate the difference of the values of Tokens
† for Timestep
value n from the values of Tokens
for the row with Timestamp
value n - 1 for each Dyad,Participant
combination:
| Dyad | Participant | Timestep | Tokens |
|------|-------------|----------|-------------------|
| 1 | A | 1 | apple,banana |
| 1 | B | 1 | apple,orange |
| 1 | A | 2 | banana |
| 1 | B | 2 | orange,kumquat |
| 1 | A | 3 | orange |
| 1 | B | 3 | orange,pear |
| 2 | A | 1 | orange,pear |
| 2 | B | 1 | apple,banana,pear |
| 2 | A | 2 | banana,persimmon |
| 2 | B | 2 | apple |
| 2 | A | 3 | banana |
| 2 | B | 3 | apple |
How can I best accomplish this using pandas?
Expected result
I ultimately want to create a new column with the output of a function token_overlap(data)
which computes the ratio of Token
values that overlap with the values of the datapoint preceding it:
| Dyad | Participant | Timestep | Tokens | TokenOverlap |
|------|-------------|----------|-------------------| -------------|
| 1 | A | 1 | apple,banana | (no value) |
| 1 | B | 1 | apple,orange | (no value) |
| 1 | A | 2 | banana | 0.5 |
| 1 | B | 2 | orange,kumquat | 0.333 |
| 1 | A | 3 | orange | 0 |
| 1 | B | 3 | orange,pear | 0.333 |
| 2 | A | 1 | orange,pear | (no value) |
| 2 | B | 1 | apple,banana,pear | (no value) |
| 2 | A | 2 | banana,persimmon | 0 |
| 2 | B | 2 | apple | 0.333 |
| 2 | A | 3 | banana | 0.5 |
| 2 | B | 3 | apple | 1 |
Current approach
I convert the multi-values into a frozenset
by using the converters
keyword of pandas.read_csv(...)
:
def parse_set(cell_value: str) -> FrozenSet[str]:
return frozenset(cell_value.split(','))
round_tokens = pandas.read_csv(inpath, converters={"Tokens": parse_set})
I then create groups of Dyad,Participant
datapoints using pandas.DataFrame.groupby(..)
:
round_tokens.sort_values(["Dyad", "Timestep"])
dyad_participants = round_tokens.groupby(["Dyad", "Participant"])
However, I am unsure how to get each row and it's precedessor's Tokens
value (which should be a frozenset
): I have some function which tries to do so but I'm unsure if the function itself is wrong or if I'm extracting the row data incorrectly.
def token_overlap(data):
own_relevant_tokens = data["Tokens"]
prev_tokens = data.shift(-1)["Tokens"]
overlap = own_relevant_tokens.intersection(prev_tokens)
union = own_relevant_tokens.union(prev_tokens)
return len(overlap) / len(union)
round_tokens["TokenOverlap"] = dyad_participants.apply(token_overlap)
However, this doesn't actually work: The actual error is
AttributeError: 'Series' object has no attribute 'union'
but I know I'm not using/understanding/grokking the pandas API correctly, hence the extremely long-winded question. How can I group my data and then, inside each group, calculate a metric using set-like values from one row and the same column's values for the row preceding it?
†In the real data, there are over 1,000 possible values for Tokens
, so, at least to me, this task would be even harder if I enumerated the presence of each token as Boolean values, e.g. Token_Apple
, Token_Banana
, etc.
Setup
In a nutshell, what this code is doing is, grouping by
Dyad
andParticipant
, and then finding pairwise ratio. This needs some complicatedgroupby
andapply
, since we need to do a few setunion
anddifference
operations. The The core logic is inside thegroupby.apply
, while the rest is just prettification.This code runs in:
Breakdown