I have to filter a data frame using as criterion those row in which is contained the string RTB
.
I'm using dplyr
.
d.del <- df %.%
group_by(TrackingPixel) %.%
summarise(MonthDelivery = as.integer(sum(Revenue))) %.%
arrange(desc(MonthDelivery))
I know I can use the function filter
in dplyr
but I don't exactly how to tell it to check for the content of a string.
In particular I want to check the content in the column TrackingPixel
. If the string contains the label RTB
I want to remove the row from the result.
The answer to the question was already posted by the @latemail in the comments above. You can use regular expressions for the second and subsequent arguments of filter
like this:
dplyr::filter(df, !grepl("RTB",TrackingPixel))
Since you have not provided the original data, I will add a toy example using the mtcars
data set. Imagine you are only interested in cars produced by Mazda or Toyota.
mtcars$type <- rownames(mtcars)
dplyr::filter(mtcars, grepl('Toyota|Mazda', type))
mpg cyl disp hp drat wt qsec vs am gear carb type
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter
command looks like this:
dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))
Solution
It is possible to use str_detect
of the stringr
package included in the tidyverse
package. str_detect
returns True
or False
as to whether the specified vector contains some specific string. It is possible to filter using this boolean value. See Introduction to stringr for details about stringr
package.
library(tidyverse)
# ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
# ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
# ✔ tibble 1.4.2 ✔ dplyr 0.7.4
# ✔ tidyr 0.7.2 ✔ stringr 1.2.0
# ✔ readr 1.1.1 ✔ forcats 0.3.0
# ─ Conflicts ───────────────────── tidyverse_conflicts() ─
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag() masks stats::lag()
mtcars$type <- rownames(mtcars)
mtcars %>%
filter(str_detect(type, 'Toyota|Mazda'))
# mpg cyl disp hp drat wt qsec vs am gear carb type
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
# 3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
# 4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
The good things about Stringr
We should use rather stringr::str_detect()
than base::grepl()
. This is because there are the following reasons.
- The functions provided by the
stringr
package start with the prefix str_
, which makes the code easier to read.
- The first argument of the functions of
stringr
package is always the data.frame (or value), then comes the parameters.(Thank you Paolo)
object <- "stringr"
# The functions with the same prefix `str_`.
# The first argument is an object.
stringr::str_count(object) # -> 7
stringr::str_sub(object, 1, 3) # -> "str"
stringr::str_detect(object, "str") # -> TRUE
stringr::str_replace(object, "str", "") # -> "ingr"
# The function names without common points.
# The position of the argument of the object also does not match.
base::nchar(object) # -> 7
base::substr(object, 1, 3) # -> "str"
base::grepl("str", object) # -> TRUE
base::sub("str", "", object) # -> "ingr"
Benchmark
The results of the benchmark test are as follows. For large dataframe, str_detect
is faster.
library(rbenchmark)
library(tidyverse)
# The data. Data expo 09. ASA Statistics Computing and Graphics
# http://stat-computing.org/dataexpo/2009/the-data.html
df <- read_csv("Downloads/2008.csv")
print(dim(df))
# [1] 7009728 29
benchmark(
"str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},
"grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},
replications = 10,
columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 2 grepl 10 16.480 1.513 16.195 0.248
# 1 str_detect 10 10.891 1.000 9.594 1.281
This answer similar to others, but using preferred stringr::str_detect
and dplyr rownames_to_column
.
library(tidyverse)
mtcars %>%
rownames_to_column("type") %>%
filter(stringr::str_detect(type, 'Toyota|Mazda') )
#> type mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 3 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 4 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Created on 2018-06-26 by the reprex package (v0.2.0).
If you want to find the string in any given column, have a look at
Remove row if any column contains a specific string
It is bascially about using filter_at
or filter_all