I have this Pandas DataFrame that has a column with lists:
>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]
I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame. So in this example the correct answer would be 3.

Now this is just an example; in my actual data df['m'] can contain more than 9 million rows, and the lists are actually lists of strings with up to about 20 elements. Some more details if it matters: the elements of v contain no duplicates and neither do the lists of m, so they could be sets instead of lists.
The first iteration of my program iterated over each row, checked all(e in data['m'][i] for e in v), and incremented a counter if that was True. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and the same job can usually be done much faster.
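For reference, a minimal sketch of that row-by-row version (not the exact code, just the loop described above):

count = 0
for i in range(len(df)):
    # check whether every element of v appears in the list in row i
    if all(e in df['m'][i] for e in v):
        count += 1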
So for my next iteration I added a column to the DataFrame that contains a copy of the list v:
>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]
and a helper function that simply computes the same containment boolean as before:
def all_helper(l1, l2):
    # True if every element of l2 is contained in l1
    return all(v in l1 for v in l2)
which I can then use with np.vectorize to add a column with the boolean value:
df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])
And lastly, I calculate the sum of these booleans with a simple df['bool'].sum().
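Put together, the vectorised version is just these three lines (the same code as above, collected in one place):

df['V'] = [[2, 3]] * len(df)
df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])
count = df['bool'].sum()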
I also tried to use .apply():
df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()
but this was slower than the vectorisation.
Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column with identical values, using a helper function in such a way). So my question is (performance is key): is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps something like checking len(union(df['m'], df['V'])) == len(df['m']) per row could work, but I don't know how, or whether that's the best solution.
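For what it's worth, one way to express that set idea without the extra V column (just a sketch, I haven't benchmarked it):

vs = set(v)
# v is contained in a row's list exactly when adding v's elements adds nothing new
count = df['m'].apply(lambda m: len(vs.union(m)) == len(m)).sum()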
Edit: Since somebody asked, here's an example with strings instead of integers:
>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
                     m
0  ["aa", "ab", "ac"]
1  ["aa", "ac", "ad"]
2        ["ba", "bb"]
3  ["ac", "ca", "cc"]
4              ["aa"]
5  ["ac", "da", "aa"]
>>> count_occurrence(df, v)
3
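(Here count_occurrence just stands for whichever counting implementation is being tested; a minimal version consistent with the vectorised approach above would be something like:)

def count_occurrence(df, v):
    # sketch: wraps the np.vectorize approach from above, assumes all_helper is defined
    df = df.copy()
    df['V'] = [list(v)] * len(df)
    return np.vectorize(all_helper)(df['m'], df['V']).sum()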
But if you want a more extensive DataFrame, you can generate it with this:
import string
n = 10000
# rows are lists of up to 9 unique three-letter strings drawn from the letters a-e
df = pd.DataFrame({
    'm': [
        list(set(
            ''.join(np.random.choice(list(string.ascii_lowercase)[:5],
                                     np.random.randint(3, 4)))
            for _ in range(np.random.randint(1, 10))
        ))
        for _ in range(n)
    ]
})
v = ["abc", "cde"]
print(count_occurrence(df, v))
Edit: Neither Divakar's nor Vaishali's solution was faster than the one that uses np.vectorize. I wonder if anyone can beat it.
Jon Clements came up with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I'm still looking for faster implementations, but this is a step in the right direction.
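For completeness, that one-liner with the set spelled out explicitly:

vs = set(v)                               # build the set of v once
count = df['m'].apply(vs.issubset).sum()  # True where every element of v appears in the row's list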