check if all characters of one string exist in ano

Posted 2019-01-28 12:07

I am trying to compare strings like PRABHAKAR SHARMA and SHARMA KUMAR PRABHAKAR. The intention is to check whether all the characters of the shorter string exist in the other string. If they do, I should get a 100% match; otherwise, I should get the percentage of characters that matched.

I tried levenshteinSim from the RecordLinkage package, but it returns a similarity score based on the Levenshtein edit distance (the number of changes needed to turn one string into the other), so the different word order lowers the score:

install.packages("RecordLinkage")  # one-time install
library(RecordLinkage)
levenshteinSim("PRABHAKAR SHARMA","SHARMA KUMAR PRABHAKAR")

#[1] 0.3636364

I want a 100% match in such a case. Also, this has to be replicated for over 1,000,000 records.

2 answers
骚年 ilove
#2 · 2019-01-28 12:49

If only letters need to be considered, you could use:

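comp <- function(s1, s2) {
    # TRUE for each of a-z that occurs in the (lower-cased) string
    in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
    in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
    # share of the distinct letters of s1 that also occur in s2
    sum(in1 & in2) / sum(in1)
}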
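A quick check of comp on the strings from the question may help. Note that comp divides by the letters of its first argument, so the order of the arguments matters. The wrapper comp_shorter below is a hypothetical helper, not part of the answer above, that always uses the shorter string as the reference. A minimal sketch, assuming only the letters a-z matter:

```r
# comp() as defined above: share of the distinct letters of s1 found in s2
comp <- function(s1, s2) {
    in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
    in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
    sum(in1 & in2) / sum(in1)
}

# Hypothetical helper: make the shorter string the reference
comp_shorter <- function(s1, s2) {
    if (nchar(s1) <= nchar(s2)) comp(s1, s2) else comp(s2, s1)
}

comp("PRABHAKAR SHARMA", "SHARMA KUMAR PRABHAKAR")
#[1] 1
comp_shorter("SHARMA KUMAR PRABHAKAR", "PRABHAKAR SHARMA")
#[1] 1
comp("abc", "abd")
#[1] 0.6666667
```

Multiply the result by 100 if a percentage is needed.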
Juvenile、少年°
#3 · 2019-01-28 13:05

Here is one approach

s1 <- "PRABHAKAR SHARMA"
s2 <- "SHARMA KUMAR PRABHAKAR"

compare <- function(s1, s2) {
    c1 <- unique(strsplit(s1, "")[[1]])
    c2 <- unique(strsplit(s2, "")[[1]])
    length(intersect(c1,c2))/length(c1)
}

compare(s1,s2)
#[1] 1

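It may be a little slow, though, and it also counts the space character as a character. Note that it measures the share of distinct characters of s1 found in s2, so pass the shorter string first. Use Vectorize to apply it to a column: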

dat <- data.frame(small=c("a", "b"), big=c("aa", "cc"), stringsAsFactors=FALSE)
vcomp <- Vectorize(compare)
dat <- transform(dat, comp=vcomp(small, big))
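For the million-row case in the question, Vectorize still calls compare once per row. A possible alternative, sketched here under the assumptions that spaces should be ignored and that the result should be a percentage, relies on strsplit being vectorized over its input. The name compare_all is hypothetical, and no timing has been measured:

```r
compare_all <- function(small, big) {
    # Drop spaces so they do not count as characters (an assumption)
    small <- gsub(" ", "", small, fixed = TRUE)
    big   <- gsub(" ", "", big, fixed = TRUE)
    # strsplit() processes the whole column at once and returns a list
    c1 <- lapply(strsplit(small, ""), unique)
    c2 <- lapply(strsplit(big, ""), unique)
    # Percentage of the distinct characters of `small` present in `big`
    mapply(function(a, b) 100 * sum(a %in% b) / length(a),
           c1, c2, USE.NAMES = FALSE)
}

dat <- data.frame(small = c("a", "b", "PRABHAKAR SHARMA"),
                  big   = c("aa", "cc", "SHARMA KUMAR PRABHAKAR"),
                  stringsAsFactors = FALSE)
dat$comp <- compare_all(dat$small, dat$big)
dat$comp
#[1] 100   0 100
```

As with compare, the first argument is the reference, so the shorter string should go in the first column.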