Question:

I am trying to understand the code below (which I found online but do not fully understand). I essentially want to remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem and, if so, can you explain how filter combined with the lambda achieves this? I have the following:

df.groupby('userName').filter(lambda x: len(x) > 4)

I am also open to alternative solutions/approaches that are easy to understand.

Answer 1:

See the filtration section of the pandas groupby documentation.
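As for how filter and the lambda work together: the lambda is called once per group with that group's sub-DataFrame, and every row of a group is kept when the call returns True. The sketch below rebuilds that behaviour by hand on a small made-up frame (the data and the names `keep` and `manual` are only for illustration). It also shows that `> 4` keeps names with 5 or more rows, so "at least 4 times" would need `>= 4`.

```python
import pandas as pd

# Hypothetical sample: 'a' appears 5 times, 'b' 3 times, 'c' exactly 4 times
df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 4})

# Strictly more than 4 rows: only 'a' survives
more_than_4 = df.groupby('userName').filter(lambda x: len(x) > 4)

# At least 4 rows: 'a' and 'c' survive
at_least_4 = df.groupby('userName').filter(lambda x: len(x) >= 4)

# The same thing done by hand: evaluate the condition once per group,
# then keep the rows whose group evaluated to True
keep = {name: len(group) >= 4 for name, group in df.groupby('userName')}
manual = df[df['userName'].map(keep)]

print(more_than_4['userName'].unique())
print(at_least_4['userName'].unique())
print(manual.equals(at_least_4))
```

The surviving rows keep their original index and order, so nothing else in the frame is modified.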

A faster solution for larger DataFrames uses transform with boolean indexing:

df[df.groupby('userName')['userName'].transform('size') > 4]
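To make the one-liner above easier to follow, it can be split into steps. This is a minimal sketch on a made-up frame; the `score` column and the names `sizes` and `mask` are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical frame with an extra column to show other columns are untouched
df = pd.DataFrame({
    'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6,
    'score': range(14),
})

# transform('size') returns one value per original row:
# the size of the group that row belongs to
sizes = df.groupby('userName')['userName'].transform('size')

# Boolean mask aligned with df's index
mask = sizes > 4

# Boolean indexing keeps only the rows where the mask is True
result = df[mask]

print(sizes.tolist())
print(result)
```

Because `sizes` has the same index as `df`, the mask can be applied directly without any merge or re-alignment.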

Sample:

import pandas as pd

df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})

print (df.groupby('userName').filter(lambda x: len(x) > 4))
   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c

print (df[df.groupby('userName')['userName'].transform('size') > 4])
   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c
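Since the question also asks for alternatives that are easy to understand, one more option (not part of the original answers, offered only as a sketch) is to count the names with `value_counts` and select rows with `isin`. The names `counts` and `frequent` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})

# Number of rows per user name
counts = df['userName'].value_counts()

# Names that occur more than 4 times
frequent = counts[counts > 4].index

# Keep only the rows whose name is in that set
result = df[df['userName'].isin(frequent)]

print(result)
```

This should select the same rows as the two approaches above on this sample.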

Timings:

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)

In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop

In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop

In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
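The `%timeit` lines above only work inside IPython. A rough equivalent for a plain Python script, using the standard `timeit` module, might look like the sketch below. The smaller `N`, the `threshold` value, and the `candidates` dictionary are assumptions chosen so it runs quickly; the numbers it prints will differ from the timings quoted above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # assumption: smaller than the 1,000,000 rows used above
L = np.random.randint(1000, size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
threshold = 100  # assumption: scaled down together with N

# The two approaches being compared, wrapped as zero-argument callables
candidates = {
    'filter': lambda: df.groupby('userName').filter(lambda x: len(x) > threshold),
    "transform('size')": lambda: df[
        df.groupby('userName')['userName'].transform('size') > threshold
    ],
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats, 3 calls per repeat, reported per call
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    timings[label] = best
    print(f'{label}: {best * 1000:.1f} ms per loop')
```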

Answer 2:

Using NumPy:

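def pir(df, k):
    # Keep the rows whose userName occurs more than k times
    names = df.userName.values
    # f: one integer code per row; u: the unique names (not used further)
    f, u = pd.factorize(names)
    # c[i] is the number of rows that have code i
    c = np.bincount(f)
    # Look up each row's count and compare it with k
    m = c[f] > k
    return df[m]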

pir(df, 4)

   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c
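To see why pir works, it can help to look at the intermediate arrays on a tiny input. The sketch below uses a made-up array of names; the variable names `codes`, `uniques`, `counts`, and `per_row` are chosen for readability and correspond to `f`, `u`, `c`, and `c[f]` in the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 'a' three times, 'b' twice, 'c' once
names = np.array(['a', 'b', 'a', 'c', 'a', 'b'], dtype=object)

# factorize maps each distinct name to an integer code, in order of first appearance
codes, uniques = pd.factorize(names)

# bincount counts how often each code occurs
counts = np.bincount(codes)

# Indexing counts by codes broadcasts each group's count back to its rows
per_row = counts[codes]

print(codes)    # [0 1 0 2 0 1]
print(counts)   # [3 2 1]
print(per_row)  # [3 2 3 1 3 2]
```

One caveat: by default `pd.factorize` assigns the code -1 to missing values, and `np.bincount` rejects negative inputs, so this approach assumes the column has no missing names.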

Timing, using @jezrael's large dataset:

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})

pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)

True

%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)

10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop