Finding repeated substrings with R

2020-08-02 03:26发布

问题:

I have the following code for finding out a pattern (consecutively repeated substring) in a string, say 0110110110000. The output patterns are 011 and 110, since they are both repeated within the string. What changes can be done to the following code?

I'd like to identify substrings that start from any position in a given string, and which repeat for at least a threshold number of times. In the above mentioned string, the threshold is three (th = 3). The repeated string should be the maximal repeated string. In the above string, 110 and 011 both satisfy these conditions.

Here's my attempt at doing this:

reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
for(k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th-1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
}
if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}

回答1:

You can do this with regex:

s <- '0110110110000'

thr <- 3

m <- gregexpr(sprintf('(?=(.+)(?:\\1){%s,})', thr-1), s, perl=TRUE)

unique(mapply(function(x, y) substr(s, x, x+y-1), 
              attr(m[[1]], 'capture.start'), 
              attr(m[[1]], 'capture.length')))

# [1] "011" "110" "0"

The pattern in the gregexpr uses a positive lookahead to prevent characters from being consumed by the match (and so allowing overlapping matches, such as with the 011 and 110). We use a repeated (at least thr - 1 times) backreference to the captured group to look for repeated substrings.

Then we can extract the matched substrings by taking start positions and lengths from the attributes of the result of gregexpr, i.e. the object m.

You didn't specify a minimum string length, so this returns 0 as one of the repeated substrings. If you have a minimum and/or maximum substring length in mind, you can modify the first subexpression of the regex. For example, the following would match only substrings with at least 2 characters.

sprintf('(?=(.{2,})(?:\\1){%s,})', thr-1)


标签: r