I am trying to understand the code below (which I found online, but do not fully understand). I essentially want to remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves this? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
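You can check the filtration section of the pandas groupby documentation. groupby('userName') splits the rows into one sub-DataFrame per user name, and filter calls the lambda once per group: when the lambda returns True, all rows of that group are kept, otherwise they are all dropped. The remaining rows keep their original order and index. Note that len(x) > 4 keeps only names that appear at least 5 times; for "at least 4 times" use len(x) >= 4.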
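To make the per-group behaviour concrete, here is a minimal sketch that reproduces the filter call with an explicit loop. The small DataFrame and the names kept, manual and via_filter are made up for illustration; this is not how pandas implements filter internally, only an equivalent result on this data.

```python
import pandas as pd

# Illustrative data: 'a' appears 5 times, 'b' 3 times, 'c' 6 times
df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})

kept = []
for name, group in df.groupby('userName'):
    # group is the sub-DataFrame holding every row for this user name;
    # len(group) is its row count, the same value the lambda sees as len(x)
    if len(group) > 4:
        kept.append(group)

# Stitch the surviving groups back together in the original row order
manual = pd.concat(kept).sort_index()

via_filter = df.groupby('userName').filter(lambda x: len(x) > 4)
print(manual.equals(via_filter))  # True
```

Changing the condition to len(group) >= 4 (and likewise in the lambda) gives the "at least 4 times" behaviour.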
A faster solution for a bigger DataFrame is to use transform with boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
import pandas as pd

df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
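The transform one-liner may be easier to follow when split into its intermediate steps. The sketch below uses the same sample data; the variable names sizes, mask, result and at_least_4 are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})

# transform('size') returns one value per row: the size of that row's group
sizes = df.groupby('userName')['userName'].transform('size')
print(sizes.tolist())  # [5, 5, 5, 5, 5, 3, 3, 3, 6, 6, 6, 6, 6, 6]

# Boolean Series aligned with df's index: True where the group is big enough
mask = sizes > 4

# Boolean indexing keeps only the rows where the mask is True
result = df[mask]
print(result.index.tolist())  # [0, 1, 2, 3, 4, 8, 9, 10, 11, 12, 13]

# "At least 4 times" would use >= instead of >
at_least_4 = df[sizes >= 4]
```

Because the mask has the same index as df, no rows are reordered or otherwise changed; rows are only dropped.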
Timings:
import numpy as np

np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
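def pir(df, k):
    names = df.userName.values
    # f: integer code per row, u: the unique names
    f, u = pd.factorize(names)
    # c[i]: number of rows whose code is i
    c = np.bincount(f)
    # keep rows whose name occurs more than k times
    m = c[f] > k
    return df[m]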
pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
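For readers unfamiliar with factorize and bincount, this sketch prints the intermediate arrays from the function above on the small sample. The longer variable names (codes, uniques, counts, per_row) are chosen here for readability and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})

# factorize maps each distinct name to an integer code, in order of first appearance
codes, uniques = pd.factorize(df['userName'].to_numpy())
print(codes.tolist())   # [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(list(uniques))    # ['a', 'b', 'c']

# bincount counts how often each code occurs: one count per distinct name
counts = np.bincount(codes)
print(counts.tolist())  # [5, 3, 6]

# Indexing counts by codes broadcasts each name's count back onto its rows
per_row = counts[codes]

# Same boolean mask as in the transform approach
mask = per_row > 4
print(df[mask].index.tolist())  # [0, 1, 2, 3, 4, 8, 9, 10, 11, 12, 13]
```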
Timing
@jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop