Removing usernames from a dataframe that do not ap

I am trying to understand the provided below (which I found online, but do not fully understand). I want to essentially remove user names that do not appear in my dataframe at least 4 times (other than removing this names, I do not want to modify the dataframe in any other way). Does the following code solve this problem and if so, can you explain how the filter combined with the lambda achieves this? I have the following:

df.groupby('userName').filter(lambda x: len(x) > 4)

I am also open to alternative solutions/approaches that are easy to understand.

标签： python pandas dataframe filtering grouping

2条回答

别忘想泡老子

2楼-- · 2019-07-15 01:34

You can check filtration.

Faster solution in bigger DataFrame is with transform and boolean indexing:

df[df.groupby('userName')['userName'].transform('size') > 4]

Sample:

df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})

print (df.groupby('userName').filter(lambda x: len(x) > 4))
   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c

print (df[df.groupby('userName')['userName'].transform('size') > 4])
   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c

Timings:

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)

In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop

In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop

In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop

0人赞添加讨论(0) 举报

来，给爷笑一个

3楼-- · 2019-07-15 01:42

Using numpy

def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)
    c = np.bincount(f)
    m = c[f] > k
    return df[m]

pir(df, 4)

   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c

Timing
@jezrael's large data

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})

pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)

True

%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)

10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop

0人赞添加讨论(0) 举报

Removing usernames from a dataframe that do not ap

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间