R - Longest common substring

2019-01-22 23:27发布

问题:

Does anyone know of an R package that solves the longest common substring problem? I am looking for something fast that could work on vectors.

回答1:

Check out the "Rlibstree" package on omegahat: http://www.omegahat.org/Rlibstree/.

This uses http://www.icir.org/christian/libstree/.



回答2:

You should look at the LCS function of qualV package. It is C-implemented, therefore quite efficient.



回答3:

The question here is not totally clear on the intended application of the solution to the longest common substring problem. A common application that I encounter is matching between names in different datasets. The stringdist package has a useful function amatch() which I find suitable for this task.

In brief, amatch() will take as input two vectors, the first is x the vector of strings that you want to find matches from (this can also just be a single string), the second is table, which is the vector of strings you want to make comparisons to and choose the match with the longest common substring. amatch() will then return a vector whose length equals that of x - each element of this result will be an index in table that contains the best match.

Details: amatch() takes a method argument, which you specify to be lcs if you want matching on longest common substring. There are many other options for different string matching techniques (e.g. Levenshtein distance). There is also a mandatory maxDist argument. If all strings in table are greater "distance" from a given string in x, then amatch() will return NA for that element of its output. "distance" is defined differently depending on the string matching algorithm you choose. For lcs, it (more or less) just means how many different (non-matched) characters there are. See documentation for details.

Parallelization: another nice feature of amatch() is that it will automatically parallelize the operation for you, making reasonable guesses about system resources to use. If you want more control over this, you can toggle the nthread argument.

Example application:

library(stringdist)

Names1 = c(
"SILVER EAGLE REFINING, INC. (SW)",
"ANTELOPE REFINING",
"ANTELOPE REFINING (DOUGLAS FACILITY)"
)

Names2 = c(
"Mobile Concrete, Inc.",
"Antelope Refining, LLC. ",
"Silver Eagle Refining Inc."
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)
Match_Idx
# [1] 3 2 2

Matches = data.frame(Names1, Names2[Match_Idx])
Matches

#                                 Names1          Names2.Match_Idx.
# 1     silver eagle refining, inc. (sw) silver eagle refining inc.
# 2                    antelope refining   antelope refining, llc. 
# 3 antelope refining (douglas facility)   antelope refining, llc. 

### Compare Matches:

Matches$Distance = stringdist(Matches$Names1, Matches$Match, method = 'lcs')

Also, unlike functions like LCS from qualV, this will not consider "subsequence" matches that involve ignoring intermediate characters in order to form a match (as discussed here). For instance, see this:

Names1 = c(
"hello"
)

Names2 = c(
"hel123l5678o",
"hell"
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)

Matches = data.frame(Names1, Match = Names2[Match_Idx])
Matches

# 1  hello  hell


回答4:

I don't know R, but I used to implement Hirschberg's algorithm which is fast and don't consume too much space.

As I remember it is only 2 or 3 recursively called short functions.

Here is a link: http://wordaligned.org/articles/longest-common-subsequence

So don't hesitate to implement it in R, it worths the effort since it is a very interesting algorithm.