I have two lists of numbers and I need to match the values of one list against the other. The match is based on the beginning of the number, and it has to return the row_id of the longest possible match.
lookup value: 12345678
find_list:
a 1
b 12
c 123
d 124
e 125
f 1234
g 1235
In this example we would have a match with a, b, c and f, and R must return f, since f is the longest and therefore the best match.
So far I have used the startsWith function in R, and from its result I choose the longest matching value. The problem is that the lists are huge: I have 18.5 million lookup values and 300,000 possible values in the find_list, and R crashes after a while.
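A minimal sketch of that startsWith approach, for illustration only. The names `find_list`, `lookup` and `longest_match` and the extra lookup values are assumptions, not the actual code or data:

```r
# Hypothetical data mirroring the example above
find_list <- data.frame(
  row_id = c("a", "b", "c", "d", "e", "f", "g"),
  prefix = c("1", "12", "123", "124", "125", "1234", "1235"),
  stringsAsFactors = FALSE
)
lookup <- c("12345678", "12599", "999")

# For one lookup value, test every prefix and keep the longest hit
longest_match <- function(x, tbl) {
  hits <- startsWith(x, tbl$prefix)  # x is recycled against all prefixes
  if (!any(hits)) return(NA_character_)
  cand <- tbl[hits, ]
  cand$row_id[which.max(nchar(cand$prefix))]
}

result <- vapply(lookup, longest_match, character(1),
                 tbl = find_list, USE.NAMES = FALSE)
result
# "f" "e" NA
```

Every lookup value is compared against every prefix, so the work grows with the product of the two list sizes, which is what becomes a problem at 18.5 million by 300,000.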
Is there a smarter way to do this?
Here is one method in base R.
This returns
You can probably speed this up by replacing base R's match function with the fmatch function from the fastmatch package, as it will hash the table values, which helps if you search over these a second time.
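Since the original code is not shown here, the following is only a sketch of one way a match-based lookup could work in base R, not necessarily the code this answer used. It truncates each lookup value to every prefix length that occurs in the table, longest first, and looks the truncated value up with match. The names `find_list`, `lookup` and `longest_prefix_match` and the extra lookup values are assumptions:

```r
# Hypothetical data mirroring the example in the question
find_list <- data.frame(
  row_id = c("a", "b", "c", "d", "e", "f", "g"),
  prefix = c("1", "12", "123", "124", "125", "1234", "1235"),
  stringsAsFactors = FALSE
)
lookup <- c("12345678", "12599", "999")

longest_prefix_match <- function(lookup, prefixes) {
  out <- rep(NA_integer_, length(lookup))
  # Try the longest prefix lengths first, so the first hit is the best one
  for (n in sort(unique(nchar(prefixes)), decreasing = TRUE)) {
    todo <- is.na(out) & nchar(lookup) >= n
    if (!any(todo)) next
    # Exact lookup of the first n characters among the table values
    out[todo] <- match(substr(lookup[todo], 1, n), prefixes)
  }
  out
}

idx <- longest_prefix_match(lookup, find_list$prefix)
find_list$row_id[idx]
# "f" "e" NA
```

This does one vectorised match per distinct prefix length instead of one comparison per lookup/prefix pair. If the fastmatch package is installed, `fastmatch::fmatch` should work as a drop-in replacement for `match` here, because the same table is searched on every pass.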
DATA
Maybe there is a smarter way of doing what you want, but the following produces the result in the question. You will need the stringi package installed.
First, the data in the question.
Now the code.
Here is an option in case you can convert your find_list into a data.table:
This also returns multiple row IDs in case there are duplicates. Instead of just returning row numbers, you could have a second column with IDs to be returned.
For a list of 20 million integers it takes much less than a second.
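The original code for this answer is not shown here, so the following is only a rough sketch of a data.table lookup along these lines, assuming the data.table package is installed. It is not the answer's implementation and makes no claim about matching its timing. The names `find_dt`, `lookup_dt` and `best`, and the extra lookup values, are assumptions. Unlike the behaviour described above, it keeps only the first row when a prefix is duplicated (`mult = "first"`).

```r
library(data.table)

# Hypothetical data mirroring the example in the question
find_dt <- data.table(
  row_id = c("a", "b", "c", "d", "e", "f", "g"),
  prefix = c("1", "12", "123", "124", "125", "1234", "1235")
)
lookup_dt <- data.table(value = c("12345678", "12599", "999"),
                        best  = NA_character_)

# Try the longest prefix lengths first; only unmatched rows are retried
for (n in sort(unique(nchar(find_dt$prefix)), decreasing = TRUE)) {
  todo <- which(is.na(lookup_dt$best))
  if (length(todo) == 0L) break
  keys <- data.table(prefix = substr(lookup_dt$value[todo], 1, n))
  # Join returns one row per key, with NA row_id where nothing matches
  hit <- find_dt[keys, on = "prefix", mult = "first"]$row_id
  set(lookup_dt, i = todo, j = "best", value = hit)
}

lookup_dt$best
# "f" "e" NA
```

Returning `row_id` rather than a row number corresponds to the suggestion above of having a second column with the IDs to be returned.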