Question:
I have a file containing a certain number of lines. Each line looks like this:
TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1
I would like to remove everything before the ":" character so that only PKMYT1, which is a gene name, remains.
Since I'm not an expert in regex scripting, can anyone help me do this using Unix (sed or awk) or in R?
Answer 1:
Here are two ways of doing it in R:
foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
# Remove all before and up to ":":
gsub(".*:","",foo)
# Extract everything behind ":":
regmatches(foo,gregexpr("(?<=:).*",foo,perl=TRUE))
Answer 2:
A simple regular expression used with gsub():
x <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
gsub(".*:", "", x)
"PKMYT1"
See ?regex or ?gsub for more help.
Answer 3:
There are certainly more than 2 ways in R. Here's another.
unlist(lapply(strsplit(foo, ':', fixed = TRUE), '[', 2))
If the strings have a constant length, I imagine substr would be faster than this or the regex methods.
Answer 4:
Using sed:
sed 's/.*://' < your_input_file > output_file
This will replace anything followed by a colon with nothing, so it'll remove everything up to and including the last colon on each line (because * is greedy by default).
As per Josh O'Brien's comment, if you wanted to only replace up to and including the first colon, do this:
sed "s/[^:]*://"
That will match anything that isn't a colon, followed by one colon, and replace with nothing.
Note that both of these patterns stop at the first match on each line. If you want the replacement to happen for every match on a line, add the 'g' (global) option to the end of the command.
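To make the difference between the three variants concrete, here is a small sketch run on a made-up line that contains two colons. The sample string and the variable names are assumptions for illustration and are not taken from the question's file.

```shell
# Hypothetical sample line with two colons
line='dir/file.adj:GENE1:extra'

# Greedy .* : removes everything up to and including the LAST colon
last=$(printf '%s\n' "$line" | sed 's/.*://')

# [^:]*: : removes everything up to and including the FIRST colon only
first=$(printf '%s\n' "$line" | sed 's/[^:]*://')

# Same pattern with g: each "non-colons followed by a colon" run is removed in turn
global=$(printf '%s\n' "$line" | sed 's/[^:]*://g')

printf '%s\n' "$last" "$first" "$global"
```

On this input the greedy and the global forms both leave only the text after the last colon, while the first-colon form keeps "GENE1:extra".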
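Also note that with GNU sed (the default on Linux) you can edit a file in place with -i. The BSD sed on macOS also supports -i, but it requires a backup-suffix argument, which may be empty (-i ''). For example, with GNU sed: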
sed -i 's/.*://' your_file
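The -i option differs between sed implementations. One form that, as far as I know, is accepted by both GNU sed and the BSD sed on macOS is to attach a backup suffix directly to -i. The sketch below tries this on a throwaway file; the directory, file name and contents are made up for the example.

```shell
# Work in a scratch directory so nothing real is modified
tmpdir=$(mktemp -d)
printf '%s\n' 'a/b.adj:PKMYT1' 'c/d.adj:TP53' > "$tmpdir/genes.txt"

# -i.bak edits genes.txt in place and keeps the original as genes.txt.bak
sed -i.bak 's/.*://' "$tmpdir/genes.txt"

edited=$(cat "$tmpdir/genes.txt")       # contents after the edit
backup=$(cat "$tmpdir/genes.txt.bak")   # untouched copy of the original
printf '%s\n' "$edited"
```

Once the result has been checked, the .bak file can simply be deleted.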
Answer 5:
You can use awk like this:
awk -F: '{print $2}' /your/file
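Note that $2 is only the second colon-separated field, so anything after a further colon is dropped. As a sketch of two alternatives, assuming a made-up line with two colons: $NF prints the last field, and sub() strips only the part up to the first colon.

```shell
# Hypothetical sample line with two colons
line='dir/file.adj:GENE1:extra'

# Second field only: text between the first and second colon
second=$(printf '%s\n' "$line" | awk -F: '{print $2}')

# Last field: text after the last colon
lastf=$(printf '%s\n' "$line" | awk -F: '{print $NF}')

# Remove everything up to and including the first colon, keep the rest
rest=$(printf '%s\n' "$line" | awk '{sub(/^[^:]*:/, ""); print}')

printf '%s\n' "$second" "$lastf" "$rest"
```

For lines with exactly one colon, as in the question, all three give the same result.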
Answer 6:
If you have GNU coreutils available, use cut:
cut -d: -f2 infile
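A short sketch of how cut behaves on less regular input, using made-up sample lines: -f2 returns only the second field, -f2- returns everything from the second field onward, and a line without any colon is passed through unchanged unless -s is given.

```shell
# Hypothetical sample line with two colons
line='dir/file.adj:GENE1:extra'

# Second field only
field2=$(printf '%s\n' "$line" | cut -d: -f2)

# Second field through end of line (everything after the first colon)
from2=$(printf '%s\n' "$line" | cut -d: -f2-)

# A line with no delimiter is printed as-is by default ...
nodelim=$(printf '%s\n' 'plainline' | cut -d: -f2)

# ... and suppressed when -s is used
suppressed=$(printf '%s\n' 'plainline' | cut -s -d: -f2)

printf '%s\n' "$field2" "$from2" "$nodelim"
```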
Answer 7:
I was working on a similar issue. John's and Josh O'Brien's advice did the trick. I started with this tibble:
library(dplyr)
my_tibble <- tibble(Col1=c("ABC:Content","BCDE:MoreContent","FG:Content:with:colons"))
It looks like:
  | Col1
1 | ABC:Content
2 | BCDE:MoreContent
3 | FG:Content:with:colons
I needed to create this tibble:
  | Col1                   | Col2 | Col3
1 | ABC:Content            | ABC  | Content
2 | BCDE:MoreContent       | BCDE | MoreContent
3 | FG:Content:with:colons | FG   | Content:with:colons
And did so with this code (R version 3.4.2).
my_tibble2 <- mutate(my_tibble
,Col2 = unlist(lapply(strsplit(Col1, ':',fixed = TRUE), '[', 1))
,Col3 = gsub("^[^:]*:", "", Col1))
Answer 8:
Below are 2 solutions, which are equivalent when each line contains exactly one colon:
The first uses perl's -a autosplit feature to split each line into fields using :, populate the @F fields array, and print the 2nd field $F[1] (counted starting from field 0)
perl -F: -lane 'print $F[1]' file
The second uses a regular expression substitution, s///, to replace everything from ^ (the beginning of the line) through .*: (any characters ending with a colon) with nothing
perl -pe 's/^.*://' file
Answer 9:
One very simple move that I missed in the best answer, by @Sacha Epskamp, is that the same substitution function can also be used to keep everything before the ":" (instead of removing it):
foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
# 1st, as in that answer, remove everything up to and including ":":
gsub(".*:","",foo)
# 2nd, keep only what comes before ":" (the colon and everything after it are removed):
gsub(":.*","",foo)
Basically, it's the same thing; just change the position of the ":" inside the pattern argument. Hope it helps.
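On the Unix side, the same before/after split can also be done without any external tool, using POSIX shell parameter expansion. This is a sketch of an alternative technique that none of the answers above proposed; the variable names are chosen only for illustration.

```shell
foo='TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1'

# ${var##pattern} strips the longest prefix matching "*:" -> text after the last colon
after=${foo##*:}

# ${var%%pattern} strips the longest suffix matching ":*" -> text before the first colon
before=${foo%%:*}

printf '%s\n' "$after" "$before"
```

This only handles one string at a time, so for a whole file the sed, awk or cut commands are likely the more practical choice.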