Speedy test on R data frame to see if row values i

2019-06-24 00:14发布

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.

  • Variable
  • FO.variable

Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX

Please let me know if there's a better way of sharing this data.

All I want to do is create an additional binary keep column which should be:

  • 1 if FO.variable is inside Variable
  • 0 if FO.Variable is not inside Variable

Seems like a simple thing...in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this and R and failing.

Here's what I've tried:

  1. Using grepl for pattern matching. I've used grepl before but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl and ifelse resulting in grepl using the first value in the column instead of the entire thing.

  2. My next attempt was to use transform and grep based off another post on SO. I didn't think this would give me my exact answer but I figured it would get me close enough for me to figure it out from there...the code ran for a while than errored because invalid subscript.

    transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])

  3. My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?

    kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))

  4. EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for-loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right not sure if I screwed up the syntax:

for(i in 1:nrow(dd)){ if(dd[i,4] %in% dd[i,2]) dd$test[i] <- 1 }

As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".

A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?

This turned out to not nearly be as simple as I would of thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.

3条回答
Summer. ? 凉城
2楼-- · 2019-06-24 00:21

Here is a data.table approach that I think is very similar in spirit to Martin's:

require(data.table)

dt <- data.table(df)
dt[,`:=`(
    fch = as.character(FO.variable),
    rn  = 1:.N
)]

dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]

dt[,c("fch","rn"):=NULL]

The idea is to

  1. identify all pairs of rn & variable (saved in dtvars) and
  2. see which of these pairs match with rn & F0.variable pairs (in the original table, dt).
查看更多
Lonely孤独者°
3楼-- · 2019-06-24 00:32

I read the data

df = dget("http://dpaste.com/2SJ6DPX.txt")

then split the 'variable' column into its parts and figured out the lengths of each entry

v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v)    ## sapply(v, length) in R-3.1.3

Then I unlisted v and created an index that maps the unlisted v to the row from which it came from

uv = unlist(v)
idx = rep(seq_along(v), len)

Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable

test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE

Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))

f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)

    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}

(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)

f1 = function(dd) 
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)

f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable), 
           as.character(dd$variable), fixed=TRUE)

f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))

with

> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100

Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.

查看更多
Bombasti
4楼-- · 2019-06-24 00:42

I would go with a simple mapply in your case, as you correctly said, by row operations will be very slow. Also, (as suggested by Martin) setting fixed = TRUE and apriori converting to character will significantly improve performance.

transform(dd, Keep = mapply(grepl, 
                            as.character(FO.variable), 
                            as.character(variable), 
                            fixed = TRUE))

#    VisitorIDTrue                        variable value      FO.variable FO.value  Keep
# 22      44888657 Direct / Unknown,Organic Search     1 Direct / Unknown        1  TRUE
# 2       44888657   Direct / Unknown,System Email     1 Direct / Unknown        1  TRUE
# 6       44888657             Direct / Unknown,TV     1 Direct / Unknown        1  TRUE
# 10      44888657     Organic Search,System Email     1 Direct / Unknown        1 FALSE
# 18      44888657               Organic Search,TV     1 Direct / Unknown        1 FALSE
# 14      44888657                 System Email,TV     1 Direct / Unknown        1 FALSE
# 24      44888657 Direct / Unknown,Organic Search     1   Organic Search        1  TRUE
# 4       44888657   Direct / Unknown,System Email     1   Organic Search        1 FALSE
...
查看更多
登录 后发表回答