Question:
I have this vector mystring with the delimiter _. If there are two or more delimiters, I want to split at the second delimiter; if there is only one delimiter, I want to split at ".ReCal". The result I want is shown below.
mystring<-c("MODY_60.2.ReCal.sort.bam","MODY_116.21_C4U.ReCal.sort.bam","MODY_116.3_C2RX-1-10.ReCal.sort.bam","MODY_116.4.ReCal.sort.bam")
result
"MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
Answer 1:
You can do this using the gsubfn package:
library(gsubfn)
# x: full match, y: first capture group, z: second capture group
f <- function(x, y, z) if (z == "_") y else strsplit(x, ".ReCal", fixed = TRUE)[[1]][[1]]
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
This also handles cases with more than two "_", where you still want to split on the second one. For example:
mystring<-c("MODY_60.2.ReCal.sort.bam",
"MODY_116.21_C4U.ReCal.sort.bam",
"MODY_116.3_C2RX-1-10.ReCal.sort.bam",
"MODY_116.4.ReCal.sort.bam",
"MODY_116.4_asdfsadf_1212_asfsdf",
"MODY_116.5.ReCal_asdfsadf_1212_asfsdf", # split by second "_", leaving ".ReCal"
"MODY")
gsubfn("([^_]+_[^_]+)(.).*", f, mystring, backref=2)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
In the function f, x is the full match, and y and z are the first and second capture groups. If z is not "_", the string contains only one delimiter, so the function falls back to splitting on the alternative string, ".ReCal".
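To see what f receives, the match can be inspected with base R's regexec, which returns the full match followed by each capture group. This is only an illustrative sketch: the names s and parts are made up here, and perl = TRUE is an assumption used to pin down the regex engine. gsubfn is not needed to run it.

```r
# Illustrative only: list the full match and the two capture groups
# that gsubfn would pass to f as x, y and z.
s <- c("MODY_116.21_C4U.ReCal.sort.bam", "MODY_60.2.ReCal.sort.bam")
pattern <- "([^_]+_[^_]+)(.).*"
parts <- regmatches(s, regexec(pattern, s, perl = TRUE))

parts[[1]]  # full match, then "MODY_116.21" (y) and "_" (z)
parts[[2]]  # with a single "_", z is the last character "m", not "_"
```

In the second case the first group greedily runs to the end of the string and gives back one character, which is why z ends up as "m" and f takes the ".ReCal" branch.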
Answer 2:
With the stringr package:
library(stringr)
str_extract(mystring, '.*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)')
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
It also works with more than two delimiters.
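For readers without stringr, roughly the same extraction can be sketched in base R with regexpr and regmatches, using perl = TRUE for the lookaheads. The longer test vector x is borrowed from the other answers, and the variable names m and out are made up for this illustration.

```r
# Base R sketch of the same lookahead pattern (not the stringr call itself).
x <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam",
       "MODY_116.3_C2RX-1-10.ReCal.sort.bam", "MODY_116.4.ReCal.sort.bam",
       "MODY_116.4_asdfsadf_1212_asfsdf",
       "MODY_116.5.ReCal_asdfsadf_1212_asfsdf", "MODY")
pattern <- ".*?_.*?(?=_)|^.*?_.*(?=\\.ReCal)"

m <- regexpr(pattern, x, perl = TRUE)    # -1 where nothing matches
out <- rep(NA_character_, length(x))     # NA for non-matching elements
out[m != -1] <- regmatches(x, m)         # regmatches drops non-matches
out
```

Elements with no delimiter at all, such as "MODY", do not match the pattern and stay NA here.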
Answer 3:
Perl/PCRE has a branch reset feature, (?|...), that lets capturing groups in different alternatives share the same group number, so they are treated as a single capturing group.
IMO, this feature is elegant when you want to supply different alternatives.
x <- c('MODY_60.2.ReCal.sort.bam', 'MODY_116.21_C4U.ReCal.sort.bam',
'MODY_116.3_C2RX-1-10.ReCal.sort.bam', 'MODY_116.4.ReCal.sort.bam',
'MODY_116.4_asdfsadf_1212_asfsdf', 'MODY_116.5.ReCal_asdfsadf_1212_asfsdf', 'MODY')
sub('^(?|([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$', '\\1', x, perl = TRUE)
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
# [5] "MODY_116.4" "MODY_116.5.ReCal" "MODY"
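For comparison, here is a sketch of the same substitution without branch reset. With an ordinary non-capturing group the two alternatives get separate group numbers, so the replacement has to name both; this relies on an unmatched group being substituted as an empty string. The variable name res is made up for this illustration.

```r
# Same alternatives, but with an ordinary (?:...) group instead of (?|...):
# the captures are now numbered 1 and 2, and only one of them is set per match.
x <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam",
       "MODY_116.3_C2RX-1-10.ReCal.sort.bam", "MODY_116.4.ReCal.sort.bam",
       "MODY_116.4_asdfsadf_1212_asfsdf",
       "MODY_116.5.ReCal_asdfsadf_1212_asfsdf", "MODY")
res <- sub("^(?:([^_]*_[^_]*)_.*|(.*)\\.ReCal.*)$", "\\1\\2", x, perl = TRUE)
res
```

Branch reset removes the need to list every group in the replacement, which matters more as the number of alternatives grows.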
Answer 4:
gsub('^(.*\\.\\d+).*','\\1',mystring)
[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
Answer 5:
^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$
You can simply use gsub with this pattern, without any more complex regex. Just replace the match with \\1. See the demo:
https://regex101.com/r/wL4aB6/1
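Applied in R, the pattern might be used as follows. This is a sketch, not the answerer's own code: perl = TRUE is an assumption made here so that the non-capturing group and \n are handled by PCRE, and the variable name res is made up.

```r
# Keep only the first capture group: everything up to the second "_",
# or up to ".ReCal" when there is just one "_".
mystring <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam", "MODY_116.4.ReCal.sort.bam")
res <- gsub("^([^_\\n]*_[^_\\n]*)(?:_.*|\\.ReCal[^_]*)$", "\\1", mystring, perl = TRUE)
res
# [1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
```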
Answer 6:
A little longer, but needs less regular expression knowledge:
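library(stringr)
# One matrix of start/end positions of "_" per element
indx <- str_locate_all(mystring, "_")
for (i in seq_along(indx)) {
  if (nrow(indx[[i]]) < 2) {
    # Fewer than two delimiters: keep the part before the literal ".ReCal"
    mystring[i] <- strsplit(mystring[i], ".ReCal", fixed = TRUE)[[1]][1]
  } else {
    # Two or more delimiters: keep everything before the second "_"
    mystring[i] <- substr(mystring[i], start = 1, stop = indx[[i]][2, "start"] - 1)
  }
}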
Answer 7:
gregexpr can search for a pattern in strings and return the locations of the matches.
First, we use gregexpr to find the location of every _ in each element of mystring. Then we loop through that output and extract the index of the second _ within each element. If there is no second _, this returns NA (check inds in the example below).
After that, we either extract the relevant part with substr based on that index or, where the index is NA, split the string at .ReCal and keep only the first part.
# Position of the second "_" in each element (NA if there is none)
inds = sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
ifelse(!is.na(inds),
       substr(mystring, 1, inds - 1),
       sapply(strsplit(mystring, ".ReCal", fixed = TRUE), '[', 1))
#[1] "MODY_60.2" "MODY_116.21" "MODY_116.3" "MODY_116.4"
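To make the intermediate step concrete, the snippet below prints inds for the four strings from the question. It repeats the same gregexpr call on a fresh copy of the input so that it runs on its own.

```r
# Index of the second "_" in each string; NA where only one "_" exists.
mystring <- c("MODY_60.2.ReCal.sort.bam", "MODY_116.21_C4U.ReCal.sort.bam",
              "MODY_116.3_C2RX-1-10.ReCal.sort.bam", "MODY_116.4.ReCal.sort.bam")
inds <- sapply(gregexpr("_", mystring, fixed = TRUE), function(x) x[2])
inds
# [1] NA 12 11 NA
```

The first and last strings contain a single _, so indexing the second match position yields NA and ifelse falls through to the .ReCal split.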