I have a data.table dt with 3 columns:
- id
- name as string
- threshold as num
A sample is:
dt <- <- data.table(nid = c("n1","n2", "n3", "n4"), rname = c("apple", "pear", "banana", "kiwi"), maxr = c(0.5, 0.8, 0.7, 0.6))
nid | rname | maxr
n1 | apple | 0.5
n2 | pear | 0.8
n3 | banana | 0.7
n4 | kiwi | 0.6
I have a second table dt.ref with 2 columns:
- id
- name as string
A sample is:
dt.ref <- <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"), cname = c("apple", "maple", "peer", "dear", "bonobo", "kiwis"))
cid | cname
c1 | apple
c2 | maple
c3 | peer
c4 | dear
c5 | bonobo
c6 | kiwis
For each rname of dt, I would like to compute the Levenshtein ratio with each cname of dt.ref as such:
Lr = 1 - (stringdist(cname, rname, method = "lv") / pmax(nchar(cname),nchar(rname)))
Then, I would like to find max(Lr) over the cname for each rname of dt and get as an output the following data.table:
nid | rname | maxr | maxLr | cid
n1 | apple | 0.5 | 1 | c1
n2 | pear | 0.8 | 0.75 | c3
n2 | pear | 0.8 | 0.75 | c4
n3 | banana | 0.7 | 0.33 | c5
n4 | kiwi | 0.6 | 0.8 | c6
Basically, we take dt and add 2 columns, the maximum Levenshtein ratio and the corresponding cid, knowing that ties are all added, 1 per row as for n2.
I use data.table
but the solution can use dplyr
or any other package.
You can try something like this:
Results: