Identify a common pattern [duplicate]

Posted 2020-04-14 07:54

This question already has answers here:

Find common substrings between two character variables (3 answers)

Is there an easy way to identify a common pattern that two strings share? Here is a small example to make clear what I mean:

I have two variables containing a string. Both include the same pattern ("ABC") and also some "noise".

a <- "xxxxxxxxxxxABCxxxxxxxxxxxx"
b <- "yyyyyyyyyyyyyyyyyyyyyyyABC"

Let's say I don't know the common pattern and I want R to find out that both strings contain "ABC". How can I do this?

*edit

The first example was maybe a bit too simplistic. Here is an example from my real data.

a <- "DUISBURG-HAMBORNS"
b <- "DUISBURG (-31.7.29)S"

Both strings contain "DUISBURG", which I want the function to identify.

*edit

I tried the solution proposed in the link posted in the comments, but it still does not give me exactly what I want.

library(qualV)
LCS(strsplit(a[1], '')[[1]],strsplit(b[1], '')[[1]])$LCS

[1] "D" "U" "I" "S" "B" "U" "R" "G" "-" " " " " "S"

If the function is looking for the longest common subsequence of the two vectors, why does it not stop after "D" "U" "I" "S" "B" "U" "R" "G"?

a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"
A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]
L <- matrix(0, length(A), length(B))
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), ]
for(i in 1:nrow(ones)) {
  v <- ones[i, , drop = FALSE]
  L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
# [1] "DUISBURG"
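A note on why the two results differ: a longest common subsequence only has to keep the characters in order, not next to each other, so LCS() can keep matching characters that appear later in both strings. What the question asks for is the longest common substring, i.e. a contiguous run. The same idea as the code above can be wrapped in a reusable function. This is only a minimal sketch in base R; the name longest_common_substring is made up for illustration and does not come from qualV or any other package.

```r
# Hypothetical helper (not part of qualV or the original post):
# longest common *substring* (contiguous run) of two strings, base R only.
longest_common_substring <- function(a, b) {
  A <- strsplit(a, "")[[1]]
  B <- strsplit(b, "")[[1]]
  if (length(A) == 0 || length(B) == 0) return("")

  # prev[j + 1] is the length of the common run ending at A[i - 1] and B[j];
  # prev[1] is a zero pad so that position j = 1 has a "previous" entry.
  prev <- integer(length(B) + 1)
  best_len <- 0L
  best_end <- 0L  # index in A where the best run found so far ends

  for (i in seq_along(A)) {
    cur <- integer(length(B) + 1)
    hits <- which(B == A[i])
    # A match extends the run that ended one step back on the diagonal.
    cur[hits + 1] <- prev[hits] + 1L
    if (length(hits) > 0 && max(cur) > best_len) {
      best_len <- max(cur)
      best_end <- i
    }
    prev <- cur
  }

  if (best_len == 0) return("")
  paste(A[(best_end - best_len + 1):best_end], collapse = "")
}

longest_common_substring("DUISBURG-HAMBORNS", "DUISBURG (-31.7.29)S")
# [1] "DUISBURG"
```

If several runs share the maximum length, this sketch returns the one that ends first in the first argument. It keeps only two rows of the table at a time, so memory grows with the length of b rather than with the product of both lengths.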