String split with conditions in R

I have the character vector mystring, with _ as the delimiter. If a string contains two or more delimiters, I want to split at the second delimiter; if it contains only one, I want to split at ".ReCal". In both cases I want to keep the first part, as shown below.

mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")

result

"MODY_60.2"  "MODY_116.21" "MODY_116.3"  "MODY_116.4"

Tags: regex r string split

7 answers

Bombasti

#2 · 2020-05-22 00:26

^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$

You can do this with gsub and the single pattern above, with no extra logic: just replace the match with \\1. See the demo:

https://regex101.com/r/wL4aB6/1
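
The answer gives only the pattern, so here is a minimal sketch of the call it describes. It assumes sub() with perl = TRUE, which is my assumption rather than part of the answer; it makes sure that (?:...) and \n inside a character class are read with PCRE semantics. The variable names pattern and result are illustrative.

```r
mystring <- c("MODY_60.2.ReCal.sort.bam",
              "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
              "MODY_116.4.ReCal.sort.bam")

# Group 1 captures the text before the second "_",
# or before ".ReCal" when there is no second "_".
pattern <- "^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$"

# perl = TRUE is an assumption, not part of the original answer.
result <- sub(pattern, "\\1", mystring, perl = TRUE)
print(result)
```

Because the pattern is anchored with ^ and $, sub() and gsub() behave the same here: there is at most one match per string.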

爷、活的狠高调

#3 · 2020-05-22 00:33

With the stringr package:

library(stringr)
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"

It also works with more than two delimiters.
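
To check that claim without installing stringr, the same pattern can be run in base R. This is only a sketch: it swaps str_extract for regmatches() plus regexpr() with perl = TRUE (needed for the lookaheads), and the last two test strings are made up for illustration.

```r
x <- c("MODY_60.2.ReCal.sort.bam",          # one "_": cut at ".ReCal"
       "MODY_116.21_C4U.ReCal.sort.bam",    # two "_": cut at the second
       "MODY_116.4_abc_1212_def.ReCal.bam", # hypothetical: four "_"
       "A_B_C_D_E")                         # hypothetical: no ".ReCal" at all

pattern <- ".*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)"

# regexpr() locates the first match in each string; regmatches() extracts it.
# Every string here has a match, so the result has one element per input.
out <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(out)
```

The lazy first alternative stops at the earliest point that is followed by a "_", so any further delimiters are never reached.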

闹够了就滚

#4 · 2020-05-22 00:37

gsub('^(.*\\.\\d+).*', '\\1', mystring)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"
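
Note that this pattern never looks at the "_" delimiters: it keeps everything up to the last "." that is followed by digits. That gives the desired result for the sample because the wanted prefix always ends in such a number. Here is a quick sketch of where it diverges from the stated rule, using a made-up file name (an assumption for illustration, not from the question):

```r
# Hypothetical name whose part after the second "_" also contains ".<digits>"
x <- "MODY_116.3_C2.5.ReCal.sort.bam"

# Same call as in the answer above
by_last_number <- gsub('^(.*\\.\\d+).*', '\\1', x)

# The rule from the question: cut at the second "_"
by_second_underscore <- sub("^([^_]*_[^_]*)_.*$", "\\1", x)

print(by_last_number)        # "MODY_116.3_C2.5"
print(by_second_underscore)  # "MODY_116.3"
```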

狗以群分

#5 · 2020-05-22 00:39

You can do this with the gsubfn package:

library(gsubfn)
# x: full match, y: first capture group, z: second capture group
f <- function(x, y, z) {
  if (z == "_") y else strsplit(x, ".ReCal", fixed = TRUE)[[1]][[1]]
}
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref = 2)
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"

This also handles strings with more than two "_", where you want to split on the second one. For example:

mystring<-c("MODY_60.2.ReCal.sort.bam",
            "MODY_116.21_C4U.ReCal.sort.bam",
            "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
            "MODY_116.4.ReCal.sort.bam",
            "MODY_116.4_asdfsadf_1212_asfsdf",
            "MODY_116.5.ReCal_asdfsadf_1212_asfsdf",  # split by second "_", leaving ".ReCal"
            "MODY")

gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2"        "MODY_116.21"      "MODY_116.3"       "MODY_116.4"      
# [5] "MODY_116.4"       "MODY_116.5.ReCal" "MODY"

In the function f, x is the full match, and y and z are the two capture groups. If z is "_", the string has a second delimiter and y (the text before it) is returned; otherwise the match is split at ".ReCal" instead.
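
To see what f receives, the same pattern can be inspected in base R. This sketch uses regexec() and regmatches() rather than gsubfn itself; it lists the full match and the two capture groups, which correspond to the three arguments passed to f when backref=2. Using perl = TRUE is an assumption, made so the greedy matching is unambiguous.

```r
x <- c("MODY_116.21_C4U.ReCal.sort.bam",  # has a second "_"
       "MODY_60.2.ReCal.sort.bam")        # has only one "_"

pattern <- "([^_]+_[^_]+)(.).*"

# Each list element is c(full match, group 1, group 2)
parts <- regmatches(x, regexec(pattern, x, perl = TRUE))

print(parts[[1]])
# Group 2 is "_", so f would return group 1: "MODY_116.21"

print(parts[[2]])
# Group 2 is "m" (the last character), so f would fall back to
# splitting the full match at ".ReCal"
```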

我欲成王，谁敢阻挡

#6 · 2020-05-22 00:48

gregexpr searches for a pattern in each string and returns the positions of all matches.

First, we use gregexpr to find the positions of every _ in each element of mystring. Then we loop through that output and extract the position of the second _ in each element. If there is no second _, the result is NA (see inds in the example below).

After that, we either extract the relevant part with substr, using that position, or, where the position is NA, split the string at .ReCal and keep only the first part.

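# Position of the second "_" in each string (NA if there is none)
inds <- sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
       substr(mystring, 1, inds - 1),
       # fixed = TRUE so the "." in ".ReCal" is matched literally
       sapply(strsplit(mystring, ".ReCal", fixed = TRUE), '[', 1))
# [1] "MODY_60.2"   "MODY_116.21" "MODY_116.3"  "MODY_116.4"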
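
The intermediate inds vector mentioned above is not printed in the answer. Here is a short sketch that shows it for the sample data (positions are 1-based character indices; the name pos is illustrative):

```r
mystring <- c("MODY_60.2.ReCal.sort.bam",
              "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam",
              "MODY_116.4.ReCal.sort.bam")

# gregexpr() returns one integer vector of match positions per string
pos <- gregexpr("_", mystring, fixed = TRUE)

# Position of the second "_"; indexing past the end of a vector yields NA
inds <- sapply(pos, function(x) x[2])
print(inds)
# [1] NA 12 11 NA
```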

劫难

#7 · 2020-05-22 00:51

A little longer, but needs less regular expression knowledge:

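library(stringr)
# One matrix per string, with a (start, end) row for each "_"
indx <- str_locate_all(mystring, "_")

# Note: this loop overwrites mystring in place
for (i in seq_along(indx)) {
  if (nrow(indx[[i]]) < 2) {
    # Fewer than two "_": keep the part before ".ReCal"
    mystring[i] <- strsplit(mystring[i], ".ReCal", fixed = TRUE)[[1]][1]
  } else {
    # Two or more "_": keep the part before the second one
    mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2, "start"] - 1)
  }
}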
