Pandas counting occurrence of list contained in co

2019-02-23 19:15发布

问题:

I have this Pandas DataFrame that has a column with lists:

>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]

I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame. So in this example the correct answer would be 3. Now this is just an example, in my actual data the df['m'] can contain more than 9 million rows and the lists are actually lists of strings with up to about 20 elements. Some more details if it matters: The elements of v contain no duplicates and neither do the lists of m, so they can be sets instead of lists.

The first iteration of my program iterated over each row and checked all(e in data['m'][i] for e in v) and if that's True, I increment a counter. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and can be done much faster.

So for my next iteration I added a column to the DataFrame that contains a copy of the list v:

>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]

and a helper function that simply returns the containment boolean like I did before:

def all_helper(l1, l2):
    return all(v in l1 for v in l2)

which I can then use with np.vectorize to add a column with the boolean value:

df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])

And lastly, calculate the sum of these booleans with a simple df['bool'].sum()

I also tried to use .apply():

df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()

but this was slower than the vectorisation.

Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column with identical values, using a helper function in such a way). So my question, performance is key, is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps the check if len(union(df['m'], df['V'])) == len(df['m']) or something, but I don't know how and if that's the best solution.

Edit: Since somebody asked; here's an example with strings instead of integers:

>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
                    m
0  ["aa", "ab", "ac"]
1  ["aa", "ac", "ad"]
2        ["ba", "bb"]
3  ["ac", "ca", "cc"]
4              ["aa"]
5  ["ac", "da", "aa"]

>>> count_occurrence(df, v)
3

But if you want a more extensive DataFrame, you can generate it with this:

import string

n = 10000
df = pd.DataFrame({'m': [list(set([''.join(np.random.choice(list(string.ascii_lowercase)[:5], np.random.randint(3, 4))) for _ in range(np.random.randint(1, 10))])) for _ in range(n)]})
v = ["abc", 'cde']
print(count_occurrence(df, v))

Edit: Neither Divakar's or Vaishali's solution was faster than the one that uses np.vectorize. Wonder if anyone can beat it.

Jon Clements came with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I continue looking for faster implementations, but this is a step in the right direction.

回答1:

You can utilise DataFrame.apply along with the builtin set.issubset method and then .sum() which all operate at a lower level (normally C level) than Python equivalents do.

subset_wanted = {2, 3}
count = df.m.apply(subset_wanted.issubset).sum()

I can't see shaving more time off that than writing a custom C-level function which'd be the equivalent of a custom sum with a check there's a subset to determine 0/1 on a row by row basis. At which point, you could have run this thousands upon thousands of times anyway.



回答2:

Since you are looking more a set-like behavior

(df.m.apply(lambda x: set(x).intersection(set([2,3]))) == set([2,3])).sum()

Returns

3