Pandas not counting rows properly

Posted 2020-03-30 02:56

So I have this dataframe:

         filename  width  height    class  xmin  ymin  xmax  ymax
0      128782.JPG    640     512    Panel    36   385   119   510
1      128782.JPG    640     512    Panel   124   388   207   510
2      128782.JPG    640     512    Panel   210   390   294   511
3      128782.JPG    640     512    Panel   294   395   380   510
4      128782.JPG    640     512    Panel   379   398   466   511
5      128782.JPG    640     512    Panel   465   402   553   510
6      128782.JPG    640     512     P+SD   552   402   638   510
7      128782.JPG    640     512     P+SD   558   264   638   404
...
...
57170     128782.JPG    640     512     P+SD    36   242   121   383
57171     128782.JPG    640     512  HS+P+SD    36    97   122   242
57172     128782.JPG    640     512     P+SD   214   106   304   250

The column called "class" contains the unique values "Panel", "P+SD" and "HS+P+SD". I want to count how many rows there are with each of these values, so I tried this:

print(len(split_df[split_df["class"].str.contains('Panel')]))
print(len(split_df[split_df["class"].str.contains('HS+P+SD')]))
print(len(split_df[split_df["class"].str.contains('P+SD')]))

This gave me this output:

56988
0
0

This is incorrect, as you can clearly see from the snippet of the DataFrame above. Why is everything counted properly for "Panel", while nothing is counted for the other two "class" names?

Here's the output of split_df.info():

RangeIndex: 57172 entries, 0 to 57171
Data columns (total 8 columns):
filename    57172 non-null object
width       57172 non-null int64
height      57172 non-null int64
class       57172 non-null object
xmin        57172 non-null int64
ymin        57172 non-null int64
xmax        57172 non-null int64
ymax        57172 non-null int64
dtypes: int64(6), object(2)
memory usage: 3.5+ MB

I cannot for the life of me figure out what is wrong. Any help is appreciated.

3 answers
chillily
#2 · 2020-03-30 03:51

A simple list comprehension with the in operator also works:

sum(['HS+P+SD' in x for x in df['class']])

As for timing, the list comprehension was faster in this test:

df=pd.concat([df]*100)
%timeit df['class'].str.contains('HS+P+SD', regex=False).sum()
1000 loops, best of 3: 410 µs per loop
%timeit sum(['HS+P+SD' in x for x in df['class']])
10000 loops, best of 3: 123 µs per loop
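
One caveat with the loop, sketched below on a made-up column: the in operator raises a TypeError as soon as it meets a missing value, whereas str.contains can be told to treat missing values as non-matches. The info output in the question shows no nulls in "class", so this only matters for other data; the Series here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, for illustration only.
col = pd.Series(['HS+P+SD', np.nan, 'P+SD'], dtype=object)

# The plain loop fails when it reaches the float NaN.
try:
    sum('HS+P+SD' in x for x in col)
    loop_failed = False
except TypeError:
    loop_failed = True

# Guarding on the type keeps the loop usable.
loop_count = sum(isinstance(x, str) and 'HS+P+SD' in x for x in col)

# The vectorised version counts missing values as non-matches with na=False.
vector_count = int(col.str.contains('HS+P+SD', regex=False, na=False).sum())

print(loop_failed, loop_count, vector_count)
```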
趁早两清
#3 · 2020-03-30 03:54

pd.Series.str.contains has regex=True by default. Since + is a special character in regex, use regex=False, re.escape, or \ escaping:

import re
s = pd.Series(['HS+P+SD', 'AB+CD+EF'])

s.str.contains('HS+P+SD').sum()               # 0
s.str.contains('HS+P+SD', regex=False).sum()  # 1
s.str.contains(re.escape('HS+P+SD')).sum()    # 1
s.str.contains(r'HS\+P\+SD').sum()            # 1
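
To see why the unescaped pattern finds nothing, it may help to look at what it means as a regex. In the sketch below, 'HS+P+SD' is read as "H, one or more S, one or more P, then S, D", so it never matches a literal plus sign. The string 'HSSPPSD' is made up for the demonstration and does not come from the question's data.

```python
import re

import pandas as pd

pattern = 'HS+P+SD'  # as a regex: 'H', one or more 'S', one or more 'P', 'S', 'D'

# The regex has no way to consume a literal '+', so the real label is not found...
literal_hit = re.search(pattern, 'HS+P+SD') is not None
# ...while a made-up string with repeated letters and no '+' is.
repeat_hit = re.search(pattern, 'HSSPPSD') is not None

s = pd.Series(['HS+P+SD', 'HSSPPSD'])
regex_mask = s.str.contains(pattern).tolist()               # regex=True is the default
plain_mask = s.str.contains(pattern, regex=False).tolist()  # plain substring search

print(literal_hit, repeat_hit)
print(regex_mask)
print(plain_mask)
```

'Panel' contains no regex metacharacters, which is why that count came out right in the question.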

From the question: "I want to count how many rows there are with these values"

If this is your core problem and you don't want a 'P+SD' count to include 'HS+P+SD', don't use str.contains. Check for equality instead and use value_counts on the values you wish to count:

L = ['Panel', 'HS+P+SD', 'P+SD']
counts = df.loc[df['class'].isin(L), 'class'].value_counts()

Or for all counts just use df['class'].value_counts().
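
A small sketch of the difference, using a made-up five-row frame rather than the real data: a substring search for 'P+SD' also picks up the 'HS+P+SD' row, while value_counts compares whole values.

```python
import pandas as pd

# Hypothetical miniature of the "class" column, for illustration only.
df = pd.DataFrame({'class': ['Panel', 'Panel', 'P+SD', 'HS+P+SD', 'P+SD']})

# Substring search: 'P+SD' is also found inside 'HS+P+SD'.
substring_count = int(df['class'].str.contains('P+SD', regex=False).sum())

# Exact match per label.
exact_counts = df['class'].value_counts()

print(substring_count)
print(exact_counts.to_dict())
```

Here the substring search reports 3 rows for 'P+SD', while the exact count is 2.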

Viruses.
#4 · 2020-03-30 03:54

Try:

print(len(split_df[split_df["class"] == 'HS+P+SD']))
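
A minimal sketch of exact-equality counting, assuming a small made-up split_df: the comparison is applied to the column itself (not to its .str accessor), which yields a boolean mask that can be used for filtering or summed directly.

```python
import pandas as pd

# Hypothetical stand-in for split_df, for illustration only.
split_df = pd.DataFrame({'class': ['Panel', 'P+SD', 'HS+P+SD', 'P+SD']})

# Element-wise comparison gives one True/False per row.
mask = split_df['class'] == 'HS+P+SD'

count_via_len = len(split_df[mask])  # filter, then count the rows
count_via_sum = int(mask.sum())      # True counts as 1

print(count_via_len, count_via_sum)
```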