String split with conditions in R

2020-05-22 00:45发布

问题:

I have this mystring with the delimiter _. The condition here is if there are two or more delimiters, I want to split at the second delimiter and if there is only one delimiter, I want to split at ".Recal" and get the result as shown below.

mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")

result

"MODY_60.2"  "MODY_116.21" "MODY_116.3"  "MODY_116.4"

回答1:

You can do this using gsubfn

library(gsubfn)
f <- function(x,y,z) if (z=="_") y else strsplit(x, ".ReCal", fixed=T)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4" 

This allows for cases when you have more than two "_", and you want to split on the second one, for example,

mystring<-c("MODY_60.2.ReCal.sort.bam",
            "MODY_116.21_C4U.ReCal.sort.bam",
            "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
            "MODY_116.4.ReCal.sort.bam",
            "MODY_116.4_asdfsadf_1212_asfsdf",
            "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",  # split by second "_", leaving ".ReCal"
            "MODY")

gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"            

In the function, f, x is the original string, y and z are the next matches. So, if z is not a "_", then it proceeds with the splitting by the alternative string.



回答2:

With the stringr package:

str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

It also works with more than two delimiters.



回答3:

Perl/PCRE has the branch reset feature that lets you reuse a group number when you have capturing groups in different alternatives, and is considered as one capturing group.

IMO, this feature is elegant when you want to supply different alternatives.

x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam', 
       'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
       'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')

sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl=T)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"  


回答4:

gsub('^(.*\\.\\d+).*','\\1',mystring)
[1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"


回答5:

^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$

You can simply do using gsub without using any complex regex.Just replace by \\1.See demo.

https://regex101.com/r/wL4aB6/1



回答6:

A little longer, but needs less regular expression knowledge:

library(stringr)
indx <- str_locate_all(mystring, "_")

for (i in seq_along(indx)) {
  if (nrow(indx[[i]]) == 1) {
    mystring[i] <- strsplit(mystring[i], ".ReCal")[[1]][1]
  } else {
    mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2] - 1)
  }
}


回答7:

gregexpr can search for a pattern in strings and give the location.

First, we use gregexpr to find the location of all _ in each element of mystring. Then, we loop through that output and extract the index of second _ within each element of mystring. If there is no second _, it'll return an NA (check inds in the example below).

After that, we can either extract the relevant part using substr based on the extracted index or, if there is NA, we can split the string at .ReCal and keep only the first part.

inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
       substr(mystring, 1, inds - 1), 
       sapply(strsplit(mystring, ".ReCal"), '[', 1))
#[1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"