I have two columns in my DataFrame in which I store sets.
I want to take the set union of the two columns using a fast vectorized operation:
df['union'] = df.set1 | df.set2
but this raises
TypeError: unsupported operand type(s) for |: 'set' and 'bool'
because both columns contain np.nan values.
Is there a good solution to overcome this?
For operations like this, a pure-Python list comprehension may be more efficient than apply:
%timeit pd.Series([set1.union(set2) for set1, set2 in zip(df['A'], df['B'])])
10 loops, best of 3: 43.3 ms per loop
%timeit df.apply(lambda x: x.A.union(x.B), axis=1)
1 loop, best of 3: 2.6 s per loop
If we could use +, it would probably take about half the time (subclassing set to support it may not be worth it), judging by set difference, where - does work element-wise:
%timeit df['A'] - df['B']
10 loops, best of 3: 22.1 ms per loop
%timeit pd.Series([set1.difference(set2) for set1, set2 in zip(df['A'], df['B'])])
10 loops, best of 3: 35.7 ms per loop
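To illustrate the inheritance idea mentioned above, here is a minimal sketch. UnionSet is a hypothetical subclass invented for this example (it is not part of pandas or the standard library), and it assumes pandas applies + element-wise on object columns the same way it does for - in the timing above. Every value would have to be wrapped first, which is part of why it may not be worth it.

```python
import pandas as pd


class UnionSet(set):
    """Hypothetical set subclass whose + operator performs a union."""

    def __add__(self, other):
        # Wrap the result so chained additions keep working.
        return UnionSet(self | other)


df = pd.DataFrame({
    'A': [UnionSet({1, 2}), UnionSet({3})],
    'B': [UnionSet({2, 3}), UnionSet({4})],
})

# Object-dtype arithmetic falls back to calling __add__ on each pair.
union = df['A'] + df['B']
print(union.tolist())
```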
DataFrame for timings:
import pandas as pd
import numpy as np
l1 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]
l2 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]
df = pd.DataFrame({'A': l1, 'B': l2})
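None of the timings above cover the np.nan entries from the question. A minimal sketch of one way to handle them, assuming missing values should count as empty sets; as_set is a hypothetical helper, and the set1/set2 column names follow the question:

```python
import numpy as np
import pandas as pd


def as_set(value):
    # Hypothetical helper: anything that is not a set (e.g. np.nan)
    # is treated as an empty set.
    return value if isinstance(value, set) else set()


df = pd.DataFrame({
    'set1': [{1, 2}, np.nan, {5}, np.nan],
    'set2': [{2, 3}, {4}, np.nan, np.nan],
})

# Same list-comprehension approach as above, with NaN replaced on the fly.
df['union'] = [as_set(a) | as_set(b) for a, b in zip(df['set1'], df['set2'])]
print(df['union'].tolist())
```

If rows where both values are missing should stay NaN instead of becoming an empty set, the comprehension would need an extra check for that case.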
This is the best I could come up with:
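# method 1: row-wise union via apply
df.apply(lambda x: x.set1.union(x.set2), axis=1)
# method 2: turn each set into a list, concatenate the lists per row, convert back to a set
# (on pandas >= 2.1, applymap is deprecated in favor of DataFrame.map)
df.applymap(list).sum(1).apply(set)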
Wow!
I expected method 2 to be quicker. Not so!
Example
df = pd.DataFrame([[{1, 2, 3}, {3, 4, 5}] for _ in range(3)],
                  columns=list('AB'))
df
# the columns here are named A and B, not set1 and set2
df.apply(lambda x: x.A.union(x.B), axis=1)
0 {1, 2, 3, 4, 5}
1 {1, 2, 3, 4, 5}
2 {1, 2, 3, 4, 5}