I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However, I see that the LCS function in this package actually finds all characters from string1 which are present in string2, even if they are not contiguous.
To explain, if the strings are string1 : "hello" string2 : "hel12345lo" I expect the output to be hel, however I get the output as hello. I must be doing something wrong. Please see my code below.
library(qualV)
a= "hello"
b="hel123l5678o"
sapply(seq_along(a), function(i)
paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
collapse = ""))
I have also tried the Rlibstree method but I still get substrings which are not contiguous. Also, the length of the substring is also off from my expectation.s Please see below.
> a = "hello"
> b = "h1e2l3l4o5"
> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"
> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
21
Here are three possible solutions.
There are also
adist()
andagrep()
in base R, and thestringdist
package has a few functions that run the LCS method. Here's a look atstringsidt
. It returns the number of unpaired characters.Now that I've explored this a bit more, I think
adist()
might be the way to go. If we setcounts=TRUE
we get a sequence of Matches, Insertions, etc. So if you give that tostri_locate()
we can use that matrix to get the matches from a to b.So the
M
values denote straight across matches. We can go and get the substrings withstri_sub()
Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.
Leveraging @RichardScriven's insight that
adist
could be used, but this function combines it all,EDIT This was tricky because we needed to get the
longest_string
in two contexts, so I made this function:This combines @RichardSriven's work using the library...
We can rewrite it to avoid the use of any external libraries (but still using the
adist
)...All of identify only
'hello '
as the longest common substring, instead of 'hello r'
:EDIT And since the edit, works regardless of which argument is the longer of the two:
LAST EDIT But this is only good if you accept this behavior. Notice this result:
I was expecting
'hello'
, but since the transformation actually moves (via deletion) the world to become the hello, so only the hell part is considered a match according to theM
:This behavior is observed using [this Levenstein tool] -- it gives two possible solutions, equivalent to these two transforms; can we tell adist which one we prefer? (the one with the greater number of consecutive
M
)Finally, don't forget adist allows you to pass in
ignore.case = TRUE
(FALSE
is the default)I'm not sure what you did to get your output of "hello". Based on trial-and-error experiments below, it appears that the
LCS
function will (a) not regard a string as an LCS if a character follows what would otherwise be an LCS; (b) find multiple, equally-long LCS's (unlike sub() that finds just the first); (c) the order of the elements in the strings doesn't matter -- which has no illustration below; and (b) the order of the string in the LCS call doesn't matter -- also not shown.So, your "hello" of a had no LCS in b since the "hel" of b was followed by a character. Well, that's my current hypothesis.
Point A above:
Point B above: