Find the longest match of 2 integers in R

2019-04-11 09:48发布

I have 2 lists with numbers and I need to match the values of one list with the other. The match has to be done based on the beginning of the number. It has to return the row_id of the longest match that is possible.

lookup value: 12345678

find_list:
a   1
b   12
c   123
d   124
e   125
f   1234
g   1235

In this example we would have a match with a,b,c,f and R must return f. Since f is the longest and therefore the best match.

I now have used the startsWith function in R. From that answer I choose the value that is the longest. But the problem is that the lists are huge. I have 18.5 Million lookup values and 300,000 possible values in the find_list and R crashes after a while.

Is there a smarter way to do this?

标签: r match
4条回答
淡お忘
2楼-- · 2019-04-11 10:22

Here is one method in base R.

# construct a vector of all possible matches for the lookup value
lookupVec <- floor(lookup * (10 ^ (-1 * (0:(nchar(lookup)-1)))))

This returns

lookupVec
[1] 1234567  123456   12345    1234     123      12       1

# find the value of the first variable that matches the maximum value
# lower values in the vector

dat$V1[which.min(match(dat$V2, lookupVec))]
[1] f
Levels: a b c d e f g

You can probably speed this up by replacing base R's match function with the fastmatch function from the package of the same name as it will hash the table values if you search over these a second time.

data

dat <-
structure(list(V1 = structure(1:7, .Label = c("a", "b", "c", 
"d", "e", "f", "g"), class = "factor"), V2 = c(1L, 12L, 123L, 
124L, 125L, 1234L, 1235L)), .Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -7L))

lookup <- 12345678
查看更多
太酷不给撩
3楼-- · 2019-04-11 10:23
find_list$X[which.max(sapply(find_list$find_list, function(myX)
    attr(gregexpr(myX, lookup_value)[[1]], "match.length")))]
#[1] "f"

DATA

find_list = structure(list(X = c("a", "b", "c", "d", "e", "f", "g"), find_list = c(1L, 
12L, 123L, 124L, 125L, 1234L, 1235L)), .Names = c("X", "find_list"
), class = "data.frame", row.names = c(NA, -7L))

lookup_value = 12345678
查看更多
祖国的老花朵
4楼-- · 2019-04-11 10:27

Maybe there is a smarter way of doing what you want but the following produces the result in the question.
You will need package stringi installed.
First, the data in the question.

lookup <- "12345678"
find_list <- read.table(text = "
a   1
b   12
c   123
d   124
e   125
f   1234
g   1235
")
find_list$V2 <- as.character(find_list$V2)

Now the code.

inx <- which(stringi::stri_detect(lookup, regex = find_list$V2))
inx <- inx[which.max(nchar(find_list$V2[inx]))]
find_list[inx, ]
#  V1   V2
#6  f 1234
查看更多
Anthone
5楼-- · 2019-04-11 10:42

Here is an option in case you can convert your find_list into a data.table:

y <- 123456789

x <- data.table(sample(1:1000000, 1000000, replace = T))  # find list
n <- round(log(y, base = 10)) + 1  # number of digits
z <- floor(y/(10^(1:(n))))  # split up into all possible integers

x[V1 == x[V1 %in% z, max(.SD),],, which = T]

This returns also multiple row id's in case there are duplicates. Instead of just returning row numbers, you could have a second column with IDs to be returned.

For a list of 20 million integers it takes much less than a second.

Unit: seconds
                                           expr        min          lq       mean      median         uq       max neval
 x[V1 == x[V1 %in% z, max(.SD), ], , which = T] 0.00076113 0.000871416 0.02571112 0.000945884 0.00109958 0.6195882    25
查看更多
登录 后发表回答