String split with conditions in R

Posted 2020-05-22 00:13

I have a character vector, mystring, that uses _ as a delimiter. If an element contains two or more delimiters, I want to split it at the second delimiter; if it contains only one delimiter, I want to split it at ".ReCal". The result I want is shown below.

mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")

result

"MODY_60.2"  "MODY_116.21" "MODY_116.3"  "MODY_116.4"

7 answers
Bombasti
#2 · 2020-05-22 00:26
^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$

You can do this simply with gsub, without any complex regex: just replace the match with \\1. See the demo.

https://regex101.com/r/wL4aB6/1
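A minimal sketch of how the pattern above might be applied in base R. It assumes sub() with perl = TRUE (my assumption, so that the non-capturing group (?:...) is handled by PCRE); the names pattern and result are only illustrative.

```r
mystring <- c("MODY_60.2.ReCal.sort.bam",
              "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
              "MODY_116.4.ReCal.sort.bam")

# Capture group 1 holds everything before the second "_",
# or before ".ReCal" when there is no second "_".
pattern <- "^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$"

# Replace the whole string with the captured prefix.
result <- sub(pattern, "\\1", mystring, perl = TRUE)
result
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"
```

Since the pattern is anchored with ^ and $, it matches each string at most once, so sub() and gsub() should behave the same here.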

爷、活的狠高调
#3 · 2020-05-22 00:33

With the stringr package:

library(stringr)
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

It also works with more than two delimiters.

闹够了就滚
#4 · 2020-05-22 00:37
gsub('^(.*\\.\\d+).*','\\1',mystring)
[1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"
狗以群分
#5 · 2020-05-22 00:39

You can do this using gsubfn

library(gsubfn)
f <- function(x, y, z) if (z == "_") y else strsplit(x, ".ReCal", fixed = TRUE)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4" 

This allows for cases when you have more than two "_", and you want to split on the second one, for example,

mystring<-c("MODY_60.2.ReCal.sort.bam",
            "MODY_116.21_C4U.ReCal.sort.bam",
            "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
            "MODY_116.4.ReCal.sort.bam",
            "MODY_116.4_asdfsadf_1212_asfsdf",
            "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",  # split by second "_", leaving ".ReCal"
            "MODY")

gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"            

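In the function f, x is the full match (here the whole string, because the pattern ends in .*), and y and z are the two capture groups: y is the text captured by the first group and z is the single character that follows it. If z is "_", a second delimiter was found and y is returned. Otherwise the function falls back to splitting x at ".ReCal". A string with no "_" at all, such as "MODY", does not match the pattern and is returned unchanged.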
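For comparison, here is a rough base R sketch of the same branching logic without gsubfn. It is not the approach above, just one way the idea could be written. split_prefix is a hypothetical helper name, and the input is a subset of the test strings from this answer.

```r
# split_prefix() is a hypothetical helper, not part of any package.
split_prefix <- function(x) {
  # Try to capture everything before the second "_".
  m <- regmatches(x, regexec("^([^_]+_[^_]+)_", x))
  vapply(seq_along(x), function(i) {
    if (length(m[[i]]) == 2) {
      m[[i]][2]                                       # a second "_" was found
    } else {
      strsplit(x[i], ".ReCal", fixed = TRUE)[[1]][1]  # fall back to ".ReCal"
    }
  }, character(1))
}

split_prefix(c("MODY_60.2.ReCal.sort.bam",
               "MODY_116.4_asdfsadf_1212_asfsdf",
               "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",
               "MODY"))
# [1] "MODY_60.2"        "MODY_116.4"       "MODY_116.5.ReCal" "MODY"
```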

我欲成王,谁敢阻挡
#6 · 2020-05-22 00:48

gregexpr can search for a pattern in strings and give the location.

First, we use gregexpr to find the location of all _ in each element of mystring. Then, we loop through that output and extract the index of second _ within each element of mystring. If there is no second _, it'll return an NA (check inds in the example below).

After that, we can either extract the relevant part using substr based on the extracted index or, if there is NA, we can split the string at .ReCal and keep only the first part.

inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
       substr(mystring, 1, inds - 1), 
       sapply(strsplit(mystring, ".ReCal", fixed = TRUE), '[', 1))
#[1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4" 
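To make the intermediate step concrete, this small sketch prints inds for the original four strings, using the same gregexpr call as above:

```r
mystring <- c("MODY_60.2.ReCal.sort.bam",
              "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
              "MODY_116.4.ReCal.sort.bam")

# Position of the second "_" in each element; NA when there is no second "_".
inds <- sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
inds
# [1] NA 12 11 NA
```

The NA entries are the elements with only one "_", which is why ifelse sends them to the ".ReCal" split.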
劫难
#7 · 2020-05-22 00:51

A little longer, but needs less regular expression knowledge:

library(stringr)
indx <- str_locate_all(mystring, "_")

for (i in seq_along(indx)) {
  if (nrow(indx[[i]]) == 1) {
    mystring[i] <- strsplit(mystring[i], ".ReCal", fixed = TRUE)[[1]][1]
  } else {
    mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2] - 1)
  }
}