I have two character variables (names of objects) and I want to extract the largest common substring.
a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')
I want the following as a result:
[1] "ABC" "DEF"
These vectors as input should give the same result:
a <- c('textABCxx', 'textDEFxx')
b <- c('zzABCblah', 'zzDEFblah')
These examples are representative. The strings contain identifying elements, and the remainder of the text in each vector element is common, but unknown.
Is there a solution, in one of the following places (in order of preference):
Base R
Recommended Packages
Packages available on CRAN
The answer to the supposed-duplicate does not fulfill these requirements.
Because I have too many things I don't want to do, I did this instead:
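For illustration, a brute-force search of this kind can be sketched in base R. This is a minimal sketch under stated assumptions, not the lcstring function referred to in this answer: the name lcs_pair is hypothetical, and unlike lcstring it returns only the first longest match rather than all of them.

```r
# Hypothetical helper (an assumption, not the original lcstring):
# longest common substring of two single strings, found by brute force.
lcs_pair <- function(x, y) {
  n <- nchar(x)
  if (n == 0L || nchar(y) == 0L) return("")
  # Try candidate lengths from longest to shortest,
  # so the first hit is a longest common substring.
  for (len in seq(n, 1L)) {
    starts <- seq_len(n - len + 1L)
    candidates <- substring(x, starts, starts + len - 1L)
    # fixed = TRUE treats each candidate as literal text, not a regex.
    hit <- vapply(candidates, grepl, logical(1), x = y, fixed = TRUE,
                  USE.NAMES = FALSE)
    if (any(hit)) return(candidates[which(hit)[1L]])
  }
  ""  # no character in common
}

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')

# Apply the helper to each pair of elements.
mapply(lcs_pair, a, b, USE.NAMES = FALSE)
# [1] "ABC" "DEF"
```

The search checks every substring of x against y, so it is only reasonable for short strings like the ones in the question; a suffix-tree approach would scale better.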
Anyone care to do a statistical estimate of the actual distribution of matching strings? (lcstring is just a brute-force, home-rolled function; its output contains all maximal-length strings, which is why I only look at the first list element.)
If you don't mind using Bioconductor packages, you can use Rlibstree. The installation is pretty straightforward. Then you can do:
On a side note: I'm not quite sure whether Rlibstree uses libstree 0.42 or libstree 0.43. Both libraries are present in the source package. I remember running into a memory leak (and hence an error) on a huge array in Perl that was using libstree 0.42. Just a heads up.
Here's a CRAN package for that: