Removing usernames from a dataframe that do not ap

Posted 2019-07-15 00:59

Question:

I am trying to understand the code below (which I found online, but do not fully understand). I essentially want to remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves this? I have the following:

df.groupby('userName').filter(lambda x: len(x) > 4)

I am also open to alternative solutions/approaches that are easy to understand.

Answer 1:

See the Filtration section of the pandas groupby documentation.
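
To address the "how does it work" part of the question, here is a minimal sketch of what filter does with the lambda. The toy data (5 x 'a', 4 x 'b', 2 x 'c') is an assumption chosen for illustration, as are the variable names. Note that `> 4` keeps names that appear more than 4 times; if "at least 4 times" is the goal, `>= 4` is the comparison to use.

```python
import pandas as pd

# Hypothetical toy data: 'a' appears 5 times, 'b' 4 times, 'c' twice.
df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 4 + ['c'] * 2})

# groupby splits the frame into one sub-DataFrame per distinct userName.
# The lambda passed to filter receives each sub-DataFrame as x, so len(x)
# is the number of rows for that user name.
group_sizes = {name: len(group) for name, group in df.groupby('userName')}
print(group_sizes)  # {'a': 5, 'b': 4, 'c': 2}

# filter keeps all rows of every group for which the lambda returns True
# and drops all rows of the other groups. Nothing else is modified.
more_than_4 = df.groupby('userName').filter(lambda x: len(x) > 4)   # keeps 'a'
at_least_4 = df.groupby('userName').filter(lambda x: len(x) >= 4)   # keeps 'a', 'b'

print(more_than_4['userName'].unique())
print(at_least_4['userName'].unique())
```

The surviving rows keep their original index labels and order, which matches the "do not modify the dataframe in any other way" requirement.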

On larger DataFrames, a faster solution is transform combined with boolean indexing:

df[df.groupby('userName')['userName'].transform('size') > 4]

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})

print(df.groupby('userName').filter(lambda x: len(x) > 4))
   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c

print(df[df.groupby('userName')['userName'].transform('size') > 4])
   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c
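
The transform version is easier to follow when split into its intermediate steps. The following is a sketch using the sample data from this answer; the names `sizes`, `mask`, and `result` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})

# transform('size') returns a Series aligned with df's index: every row
# holds the size of the group that row belongs to.
sizes = df.groupby('userName')['userName'].transform('size')
print(sizes.tolist())  # [5, 5, 5, 5, 5, 3, 3, 3, 6, 6, 6, 6, 6, 6]

# Comparing against the threshold gives one boolean per row ...
mask = sizes > 4

# ... and boolean indexing keeps only the rows where the mask is True.
result = df[mask]
print(result.index.tolist())  # [0, 1, 2, 3, 4, 8, 9, 10, 11, 12, 13]
```

Because the mask is computed in one vectorized pass rather than by calling a Python lambda once per group, this form tends to scale better as the number of groups grows.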

Timings:

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print(df)

In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop

In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop

In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
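
The `%timeit` lines above only run inside IPython. As a rough sketch, the same comparison can be reproduced in a plain script with the standard `timeit` module. The smaller `N`, the `threshold` value, and the `candidates` dictionary are assumptions made here to keep the run short; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # assumption: 10x smaller than the data above, to keep the run short
df = pd.DataFrame({'userName': np.random.randint(1000, size=N).astype(str)})
threshold = N // 1000  # roughly the average group size

# The three expressions timed in this answer, wrapped as zero-argument callables.
candidates = {
    "filter(lambda)": lambda: df.groupby('userName').filter(lambda x: len(x) > threshold),
    "transform(len)": lambda: df[df.groupby('userName')['userName'].transform(len) > threshold],
    "transform('size')": lambda: df[df.groupby('userName')['userName'].transform('size') > threshold],
}

timings = {}
for label, fn in candidates.items():
    # Best of 3 single runs, mirroring the "best of 3" reported by %timeit.
    timings[label] = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{label:>18}: {timings[label] * 1000:.1f} ms")
```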


Answer 2:

Using NumPy:

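def pir(df, k):
    # Keep only the rows whose userName occurs more than k times.
    names = df.userName.values
    f, u = pd.factorize(names)  # f: integer code per row, u: unique names
    c = np.bincount(f)          # c[i]: number of rows carrying code i
    m = c[f] > k                # per-row boolean mask
    return df[m]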

pir(df, 4)

   userName
0         a
1         a
2         a
3         a
4         a
8         c
9         c
10        c
11        c
12        c
13        c
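
To show what each line of `pir` computes, here is a short sketch on a tiny hypothetical array (the values and the names `codes`, `uniques`, `counts`, `per_row` are chosen here for illustration). It assumes there are no missing user names: `pd.factorize` marks missing values with the code -1 by default, which `np.bincount` would reject.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 'a' twice, 'b' once, 'c' three times.
names = np.array(['a', 'a', 'b', 'c', 'c', 'c'], dtype=object)

# factorize assigns each distinct value an integer code in order of appearance.
codes, uniques = pd.factorize(names)
print(codes.tolist())    # [0, 0, 1, 2, 2, 2]
print(list(uniques))     # ['a', 'b', 'c']

# bincount counts how often each code occurs: counts[i] is the frequency
# of uniques[i].
counts = np.bincount(codes)
print(counts.tolist())   # [2, 1, 3]

# Indexing counts by codes broadcasts each frequency back to its rows.
per_row = counts[codes]
print(per_row.tolist())  # [2, 2, 1, 3, 3, 3]

# Rows whose name occurs more than 2 times.
mask = per_row > 2
print(mask.tolist())     # [False, False, False, True, True, True]
```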


Timing, using @jezrael's large data:

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})

pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)

True

%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)

10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop