I have a dataframe (df) like this:
v1 v2 v3
0 -30 -15
0 -30 -7.5
0 -30 -11.25
0 -30 -13.125
0 -30 -14.0625
0 -30 -13.59375
0 -10 -5
0 -10 -7.5
0 -10 -6.25
0 -10 -5.625
0 -10 -5.9375
0 -10 -6.09375
0 -5 -2.5
0 -5 -1.25
0 -5 -1.875
Rows belong to the same chunk if they share the same v1 and v2 values. In this case, the chunks are the rows with ([0, -30], [0, -10], [0, -5]). I want to split the rows into chunks and count the number of rows in each chunk. If a chunk does not have exactly 6 rows, remove the whole chunk; otherwise, keep it.
My rough code:

    v1_ls = df.v1.unique()
    v2_ls = df.v2.unique()
    for i in v1_ls:
        for j in v2_ls:
            chunk = df[(df['v1'] == i) & (df['v2'] == j)]
            if len(chunk) != 6:
                df = df.drop(chunk.index)
expected output:
v1 v2 v3
0 -30 -15
0 -30 -7.5
0 -30 -11.25
0 -30 -13.125
0 -30 -14.0625
0 -30 -13.59375
0 -10 -5
0 -10 -7.5
0 -10 -6.25
0 -10 -5.625
0 -10 -5.9375
0 -10 -6.09375
Thanks!
Use groupby + size (or count):

Then use the mask to filter df.

Note that count does not count NaNs, while size does. Use whichever is appropriate for you.

You can use the filter groupby method:

I think there are no NaNs in v1 and v2, so use transform + size:

Detail:

Unfortunately filter is really slow, so if you need better performance use transform.

Caveat
The results do not address performance given the number of groups, which will affect timings a lot for some of these solutions.
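The code for the groupby + size answer above was lost in extraction; here is a minimal sketch of that approach, assuming the df from the question (keep and mask are hypothetical helper names, not from the original answer):

```python
import pandas as pd

# Sample data from the question; the (0, -5) chunk has only 3 rows
df = pd.DataFrame({
    'v1': [0] * 15,
    'v2': [-30] * 6 + [-10] * 6 + [-5] * 3,
    'v3': [-15, -7.5, -11.25, -13.125, -14.0625, -13.59375,
           -5, -7.5, -6.25, -5.625, -5.9375, -6.09375,
           -2.5, -1.25, -1.875],
})

# Count rows per (v1, v2) group; size counts NaNs, count does not
sizes = df.groupby(['v1', 'v2']).size()

# Keep the (v1, v2) pairs whose group has exactly 6 rows,
# then build a boolean mask over the original rows
keep = sizes[sizes == 6].index
mask = df.set_index(['v1', 'v2']).index.isin(keep)

out = df[mask]  # only the (0, -30) and (0, -10) chunks remain
```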
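The filter answer's code is also missing; a sketch of how groupby.filter would be used here, assuming the same df as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'v1': [0] * 15,
    'v2': [-30] * 6 + [-10] * 6 + [-5] * 3,
    'v3': [-15, -7.5, -11.25, -13.125, -14.0625, -13.59375,
           -5, -7.5, -6.25, -5.625, -5.9375, -6.09375,
           -2.5, -1.25, -1.875],
})

# filter drops every (v1, v2) group for which the lambda returns False,
# i.e. every chunk that does not have exactly 6 rows
out = df.groupby(['v1', 'v2']).filter(lambda g: len(g) == 6)
```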
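The transform + size answer's code did not survive either; a sketch, again assuming the df from the question. transform('size') broadcasts each group's row count back to every row of that group, so a plain boolean mask suffices:

```python
import pandas as pd

df = pd.DataFrame({
    'v1': [0] * 15,
    'v2': [-30] * 6 + [-10] * 6 + [-5] * 3,
    'v3': [-15, -7.5, -11.25, -13.125, -14.0625, -13.59375,
           -5, -7.5, -6.25, -5.625, -5.9375, -6.09375,
           -2.5, -1.25, -1.875],
})

# Per-row group size, aligned with df's index; keep rows whose chunk has 6 rows
out = df[df.groupby(['v1', 'v2'])['v3'].transform('size') == 6]
```

This avoids the per-group Python function calls that make filter slow, which is why the answer recommends it for performance.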