How to filter a pandas DataFrame by string?

Posted 2020-04-10 00:52

I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:

df[df[col].str.contains('test')]

But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?

EDIT (to add samples):

data = pd.read_csv(/...csv)

data has 5 cols, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the Business Description col, so I used:

filtered = data[data['BusinessDescription'].str.contains('dental')==True]

and I get an empty dataframe, with just the header names of the 5 cols.

3 answers
chillily
#2 · 2020-04-10 01:08

It seems you need the flags parameter of str.contains:

import re

filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]

Another solution (thanks, Anton vBR) is to convert to lowercase first:

filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]

Example:
Going forward, I'd recommend using the name df instead of data when referring to DataFrames. That is the common convention on SO.

import pandas as pd

data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]

  BusinessDescription
0        dental fluss
1              DENTAL
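
A third option, not used in the question or in the two approaches above, is the case parameter of str.contains. It avoids both the re import and the extra lowercase pass. A minimal sketch on the same three sample rows:

```python
import pandas as pd

# Same three sample rows as in the example above
df = pd.DataFrame({'BusinessDescription': ['dental fluss', 'DENTAL', 'Dentist']})

# case=False makes the match case-insensitive
mask = df['BusinessDescription'].str.contains('dental', case=False)
filtered = df[mask]
print(filtered)
```

This should select the same rows as the lowercase version.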

Timings:

d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)

#print (data)

In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop

In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop

Caveat:

Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
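
Since the numbers depend on the data, it may be worth re-running the comparison on your own frame. Below is a rough sketch using the standard timeit module outside IPython. The helper names with_flags and with_lower are made up for illustration, and the data is the same repeated sample as above.

```python
import re
import timeit

import pandas as pd

# Same 30,000-row sample as in the timings above
d = dict(BusinessDescription=['dental fluss', 'DENTAL', 'Dentist'])
data = pd.concat([pd.DataFrame(d)] * 10000).reset_index(drop=True)
col = data['BusinessDescription']


def with_flags():
    # Case-insensitive regex match
    return data[col.str.contains('dental', flags=re.IGNORECASE)]


def with_lower():
    # Lowercase first, then a plain match
    return data[col.str.lower().str.contains('dental')]


# Sanity check: both approaches should select the same rows
assert with_flags().equals(with_lower())

for fn in (with_flags, with_lower):
    # Best of 3 runs, 10 calls each, reported per call
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1000:.1f} ms per call')
```

Absolute timings will differ by machine and pandas version, so only the relative difference on your own data is meaningful.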

Viruses.
#3 · 2020-04-10 01:11

Enclose the column name in quotes (unless col is a variable that holds the name):

df[df['col'].str.contains('test')]

Thanks

三岁会撩人
#4 · 2020-04-10 01:16

It also works if you add an explicit comparison:

df[df['col'].str.contains('test') == True]
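
One situation where the == True comparison can matter is a column with missing values. Depending on the pandas version and column dtype, str.contains may return NaN for those rows, and a mask containing NaN may be rejected by boolean indexing. Here is a minimal sketch with a made-up column that has one missing entry. The na=False argument of str.contains is shown as an alternative that serves the same purpose.

```python
import pandas as pd

# Hypothetical data: the second row has no value
df = pd.DataFrame({'col': ['a test row', None, 'other']})

# The missing row may come back as NaN rather than False
raw = df['col'].str.contains('test')

# Comparing with True turns a NaN entry into False
by_compare = df[raw == True]

# na=False tells str.contains to treat missing values as "no match"
by_na = df[df['col'].str.contains('test', na=False)]

print(by_compare)
print(by_na)
```

Both filters should keep only the first row.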